The act of imitating sounds—whether it’s mimicking an engine’s roar or a bee’s buzz—may seem simple to us, but it’s a complex task for machines.
A team of researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed a groundbreaking AI model that can imitate sounds with surprising accuracy.
The research tackles a longstanding challenge in artificial intelligence: getting machines to produce human-like vocal imitations of sounds. The researchers first built a model of the human vocal tract, which simulates how the throat, tongue, and lips shape vibrations from the voice box.
Then, using a cognitively inspired AI algorithm, they controlled this vocal tract model to produce imitations while accounting for the context-specific ways that humans choose to communicate sound.
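To make that analysis-by-synthesis idea concrete, here is a minimal Python sketch of the workflow: a toy "vocal tract" synthesizer driven by a few control parameters, and a search loop that tunes those controls so the output matches a target sound. The synthesizer, its three parameters, and the brute-force random search are invented stand-ins for illustration only; the CSAIL system uses a far more detailed articulatory model and a cognitively inspired controller.

```python
# Toy analysis-by-synthesis sketch using only NumPy.
# A "vocal tract" here is a crude source-filter synthesizer with three
# hypothetical controls (pitch, noisiness, brightness); imitation is cast as
# searching for the controls whose output best matches a target spectrum.

import numpy as np

SR = 16_000          # sample rate in Hz
DURATION = 0.5       # seconds of audio per candidate imitation


def toy_vocal_tract(pitch_hz: float, noisiness: float, brightness: float) -> np.ndarray:
    """Render audio from three control parameters (stand-ins for articulatory controls)."""
    t = np.arange(int(SR * DURATION)) / SR
    voiced = np.sin(2 * np.pi * pitch_hz * t)                   # periodic "voice box" source
    noise = np.random.default_rng(0).standard_normal(t.size)    # turbulent airflow source
    source = (1 - noisiness) * voiced + noisiness * noise
    # Crude brightness control: a longer smoothing window removes more high frequencies.
    win = max(1, int((1 - brightness) * 64))
    out = np.convolve(source, np.ones(win) / win, mode="same")
    return out / (np.max(np.abs(out)) + 1e-9)


def spectral_features(audio: np.ndarray, n_bands: int = 32) -> np.ndarray:
    """Cheap auditory stand-in: magnitude spectrum pooled into coarse log-energy bands."""
    mag = np.abs(np.fft.rfft(audio))
    bands = np.array_split(mag, n_bands)
    return np.log1p(np.array([b.mean() for b in bands]))


def imitate(target_audio: np.ndarray, n_candidates: int = 300) -> dict:
    """Random search over control parameters for the closest-sounding imitation."""
    target = spectral_features(target_audio)
    rng = np.random.default_rng(1)
    best, best_dist = None, np.inf
    for _ in range(n_candidates):
        params = dict(
            pitch_hz=rng.uniform(80, 400),
            noisiness=rng.uniform(0, 1),
            brightness=rng.uniform(0.05, 0.95),
        )
        dist = np.linalg.norm(spectral_features(toy_vocal_tract(**params)) - target)
        if dist < best_dist:
            best, best_dist = params, dist
    return best


if __name__ == "__main__":
    # Pretend target: a hiss-like noise burst standing in for a recorded snake hiss.
    target = np.random.default_rng(2).standard_normal(int(SR * DURATION))
    print("Best toy imitation controls:", imitate(target))
```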
The model can generate human-like imitations of various sounds, such as leaves rustling, a snake hissing, or an ambulance siren. It can also work in reverse, guessing real-world sounds from human vocal imitations, much like how computer vision systems can create images from sketches. For example, it can tell the difference between a human imitating a cat’s “meow” and its “hiss.”
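The reverse direction can be pictured as matching a recorded imitation against a library of reference sounds and returning the best match. The sketch below reduces this to nearest-neighbor comparison of coarse spectral features; the function names, the feature choice, and the tiny two-sound library are hypothetical simplifications, not the paper's method.

```python
# Toy "reverse" inference: guess which real-world sound an imitation refers to
# by nearest-neighbor matching of coarse spectral features.

import numpy as np


def spectral_features(audio: np.ndarray, n_bands: int = 32) -> np.ndarray:
    """Pool the magnitude spectrum into coarse log-energy bands."""
    mag = np.abs(np.fft.rfft(audio))
    bands = np.array_split(mag, n_bands)
    return np.log1p(np.array([b.mean() for b in bands]))


def guess_sound(imitation: np.ndarray, library: dict[str, np.ndarray]) -> str:
    """Return the label of the reference sound whose features sit closest to the imitation."""
    feat = spectral_features(imitation)
    return min(library, key=lambda name: np.linalg.norm(spectral_features(library[name]) - feat))


if __name__ == "__main__":
    t = np.linspace(0, 0.5, 8000, endpoint=False)
    library = {
        "cat_meow": np.sin(2 * np.pi * 500 * t) * np.hanning(t.size),   # tonal, pitched
        "cat_hiss": np.random.default_rng(0).standard_normal(t.size),   # broadband noise
    }
    # A noisy, hiss-like imitation should match the hiss reference, not the meow.
    imitation = np.random.default_rng(1).standard_normal(t.size)
    print(guess_sound(imitation, library))  # -> "cat_hiss"
```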
In the future, this model could help create more intuitive sound design tools, make AI characters in virtual reality more lifelike, and even assist students in learning new languages.
The art of imitation, in three parts
The team created three increasingly advanced models to get closer to the nuances of human sound imitation, revealing not just the complexity of the task but also how human behavior shapes the way we reproduce noises.
The team’s first model was relatively simple, aiming to generate imitations that closely matched real-world sounds. However, this baseline version didn’t align well with human behavior. It lacked the subtlety that makes human imitations unique. So, the researchers took a more focused approach.
They introduced a second model, dubbed the “communicative” model, which considered how a listener perceives a sound. For example, if you were to imitate the sound of a motorboat, you’d likely focus on the deep rumble of its engine, which is the most distinctive feature of the sound, even though the splash of the water might be louder. This model created more accurate imitations, but the researchers weren’t satisfied yet—they wanted to go even further.
The team added another reasoning layer to the model in their final version. MIT CSAIL PhD student Kartik Chandra SM ’23 explained, “Vocal imitations can sound different depending on how much effort you put into them. It’s more difficult to produce perfectly accurate sounds, and people naturally avoid making noises that are too rapid, loud, or high-pitched in regular conversation.”
This more refined model accounted for these human tendencies, leading to even more realistic imitations that mirrored the decisions people make when mimicking sounds.
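One way to picture the three-model progression is as a single scoring function that gains a term at each stage: the baseline keeps only the distance to the target sound, the communicative model weights that distance by what listeners find salient, and the final model also subtracts a penalty for effortful articulation. The Python sketch below is a loose illustration under those assumptions; the weights, feature vectors, and effort term are invented for this example and are not taken from the paper.

```python
# Hedged sketch of an effort-aware, listener-weighted imitation score.
# Dropping the effort term recovers the "communicative" model; dropping the
# salience weights as well recovers the simple baseline.

import numpy as np


def imitation_score(
    imitation_feats: np.ndarray,   # auditory features of a candidate imitation
    target_feats: np.ndarray,      # auditory features of the real-world sound
    salience: np.ndarray,          # listener-side weight per feature (e.g. engine-rumble bands)
    effort: float,                 # scalar articulatory effort of producing the candidate
    effort_weight: float = 0.5,    # trade-off between accuracy and ease of production
) -> float:
    """Higher is better: salience-weighted closeness to the target minus an effort cost."""
    weighted_error = np.sum(salience * (imitation_feats - target_feats) ** 2)
    return -weighted_error - effort_weight * effort


if __name__ == "__main__":
    target = np.array([0.9, 0.2, 0.1])       # toy features: [low rumble, mid, splash]
    salience = np.array([1.0, 0.3, 0.1])     # listeners care most about the rumble
    exact_but_strained = imitation_score(np.array([0.9, 0.2, 0.1]), target, salience, effort=3.0)
    close_and_easy = imitation_score(np.array([0.8, 0.2, 0.2]), target, salience, effort=0.5)
    print(exact_but_strained, close_and_easy)  # the easier, slightly-off imitation scores higher
```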
To test their model, the team set up a behavioral experiment where human judges evaluated AI-generated imitations alongside those made by humans. The results were striking: participants preferred the AI model 25 percent of the time overall, with even stronger preferences in some cases.
For example, the AI’s imitation of a motorboat was favored 75 percent of the time, while its imitation of a gunshot was preferred 50 percent of the time.
These results indicate that the AI model wasn’t just matching real-world sounds; it was doing so in a way that felt more natural and aligned with human vocal behavior.
Undergraduate researcher Matthew Caren, who is passionate about technology’s role in music and art, envisions a wide range of applications for the model. “This model could help artists better communicate sounds to computational systems,” said Caren. “Filmmakers and content creators could use it to generate AI sounds that are more nuanced to a specific context. Musicians might even use it to quickly search sound databases by simply imitating the sound they have in mind.”
But the team’s ambitions don’t stop there. The researchers are already exploring other potential applications for the model, including in the fields of language development, infant speech learning, and even animal imitation behaviors. The team is particularly interested in studying birds, like parrots and songbirds, whose vocal imitations are an intriguing parallel to human imitation.
While the model has made impressive strides, there are still challenges to address. For instance, it struggles with certain consonants, such as the “z” sound, leading to less accurate imitations of sounds like buzzing bees. It also can’t yet fully replicate how humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
Journal Reference:
- Matthew Caren, Kartik Chandra, Joshua B. Tenenbaum, Jonathan Ragan-Kelley, Karima Ma. Sketching With Your Voice: “Non-Phonorealistic” Rendering of Sounds via Vocal Imitation. DOI: 10.48550/arXiv.2409.13507