As agents become ubiquitous in virtual as well as physical worlds, learning from real-life human interaction is becoming increasingly important. Here we explore new learning and teaching strategies for an agent situated in a digital cinema environment that solves a language-vision translation problem by playing a multimodal memory game with humans. We discuss the challenges for machine learners, i.e., the learning architectures and algorithms required to deal with this kind of long-lasting, dynamic scenario. We also discuss the challenges human teachers face in addressing these new machine learning issues. Based on our preliminary experimental results using the hypernetwork learning architecture, we argue for self-teaching cognitive agents that actively interact with humans to generate queries and examples with which to evaluate and teach themselves.