In late 2013, the Spike Jonze film Her imagined a future where people would form emotional connections with AI voice assistants. Nearly 12 years later, that fictional premise has veered closer to reality with the release of a new conversational voice model from AI startup Sesame that has left many users both fascinated and unnerved.
"I tried the demo, and it was genuinely startling how human it felt," wrote one Hacker News user who tested the system. "I'm almost a bit worried I will start feeling emotionally attached to a voice assistant with this level of human-like sound."
In late February, Sesame released a demo for the company's new Conversational Speech Model (CSM) that appears to cross what many consider the "uncanny valley" of AI-generated speech, with some testers reporting emotional connections to the male or female voice assistants ("Miles" and "Maya").
In our own evaluation, we spoke with the male voice for about 28 minutes, talking about life in general and how it decides what is "right" or "wrong" based on its training data. The synthesized voice was expressive and dynamic, imitating breath sounds, chuckles, interruptions, and even sometimes stumbling over words and correcting itself. These imperfections are intentional.
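Sesame hasn't published how CSM produces these scripted imperfections, but the general idea can be sketched: a text front end could probabilistically inject fillers and self-corrections before synthesis. The function, token names, and rates below are hypothetical illustrations, not Sesame's actual pipeline or API.

```python
import random

# Hypothetical filler tokens a TTS front end might voice; these names are
# invented for illustration and are not part of any published Sesame API.
FILLERS = ["uh,", "hmm,", "I mean,"]

def inject_disfluencies(text: str, rate: float = 0.15, seed: int = 0) -> str:
    """Probabilistically insert fillers and word stumbles before synthesis."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for word in text.split():
        if rng.random() < rate:
            out.append(rng.choice(FILLERS))
        if rng.random() < rate / 3:
            # Stumble: emit the first half of the word, then "correct" it.
            out.append(word[: max(1, len(word) // 2)] + "--")
        out.append(word)
    return " ".join(out)

print(inject_disfluencies("The synthesized voice was expressive and dynamic"))
```

At a rate of zero the text passes through untouched, so the same machinery can dial "human-ness" up or down per utterance.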
"At Sesame, our goal is to achieve 'voice presence'—the magical quality that makes spoken interactions feel real, understood, and valued," writes the company in a blog post. "We are creating conversational partners that do not just process requests; they engage in genuine dialogue that builds confidence and trust over time. In doing so, we hope to realize the untapped potential of voice as the ultimate interface for instruction and understanding."
Sometimes the model tries too hard to sound like a real human. In one demo posted online by a Reddit user called MetaKnowing, the AI model talks about craving "peanut butter and pickle sandwiches."
Which is important when interacting with someone like me. It asked, 'What's your name?' I said 'John' with my dystonic mouth. It responded, 'Trun. That's an interesting name.' A better response would have been, 'I beg your pardon, can you repeat that?'
So it's able to respond in a verbal and conversational way, but it can't tell when it's reaching beyond its understanding. I wonder how it handles background noise, or a dog barking.
AI can sound natural and human and pass the test on that front. Great, cool, whatever - the real value is that AI is AI, and can therefore be programmed to avoid certain things... like giving in to emotional temptation when someone calls in with an issue. A human might be swayed to act in certain ways or give the caller certain deals or discounts.
AI, being unfeeling, would probably be unfazed and less inclined (programmed) to deviate from the script.