This paper presents a self-supervised framework for perceptual learning based upon correlations in different sensory modalities. We demonstrate this with a system that has learned the vowel structure of American English ñ i.e., the number of vowels and their phonetic descriptions ñ by simultaneously watching and listening to someone speak. It is highly non-parametric, knowing neither the number of vowels nor their input distributions in advance, and it has no prior linguistic knowledge. This work is the first example of unsupervised phonetic acquisition of which we are aware, outside of that done by human infants. This system is based on the cross-modal clustering framework introduced by , which has been significantly enhanced here. This paper presents our results and focuses on the mathematical framework that enables this type of intersensory self-supervised learning.
Subjects: 12. Machine Learning and Discovery; 19.1 Perception