In late 2013, the Spike Jonze film Her envisioned a world where individuals would develop emotional connections with AI voice assistants. Fast forward nearly 12 years, and this imaginative concept is inching closer to reality with the launch of a groundbreaking conversational voice model by AI startup Sesame. The new technology has captivated users, evoking a mix of fascination and unease. A user on Hacker News noted, "I tried the demo, and it was genuinely startling how human it felt." This sentiment reflects a broader concern that people might start forming emotional attachments to such advanced voice assistants.
In February 2025, Sesame unveiled a demo for its new Conversational Speech Model (CSM), which seemingly crosses the threshold into the so-called uncanny valley of AI-generated speech. Many testers reported forming emotional connections with the two voice assistants, named Miles and Maya. In our evaluation, we talked with the male voice for approximately 28 minutes, discussing various topics about life. The synthesized voice was impressively expressive, producing human-like breath sounds and chuckles, and even correcting itself when it stumbled over words. These intentional imperfections contribute to what Sesame describes as "voice presence," which aims to create interactions that feel real, understood, and valued.
Sesame's mission extends beyond merely processing user requests; the company aspires to foster genuine dialogue that cultivates confidence and trust over time. They recognize the untapped potential of voice technology as the ultimate interface for instruction and understanding. However, some testers noted that the AI occasionally tries too hard to imitate human speech. An example shared by a Reddit user showcased the AI expressing a craving for peanut butter and pickle sandwiches, demonstrating its quirky yet relatable personality.
Founded by Brendan Iribe, Ankit Kumar, and Ryan Brown, Sesame has garnered significant investment from notable venture capital firms, including Andreessen Horowitz, Spark Capital, and Matrix Partners. Browsing through online reactions reveals widespread astonishment at the model's realism. One Reddit user, reflecting on their experience, stated, "I've been into AI since I was a child, but this is the first time I've experienced something that made me definitively feel like we had arrived." While many users marveled at the technology, others expressed discomfort. Mark Hachman, a senior editor at PCWorld, described feeling unsettled after his interaction with Sesame's AI, noting that its voice eerily resembled an old friend.
Some users have compared Sesame's voice model to OpenAI's Advanced Voice Mode for ChatGPT, saying that Sesame's CSM produces more realistic voices. Others appreciated the model's ability to roleplay various characters, something ChatGPT has so far refrained from doing. In one example, Gavin Purcell, co-host of the AI for Humans podcast, shared a video on Reddit in which a human pretends to be an embezzler, showcasing how dynamically Sesame's AI responds.
Sesame's CSM achieves its remarkable realism through the integration of two AI models, a backbone and a decoder, both based on Meta's Llama architecture. This approach processes interleaved text and audio, allowing for a more cohesive generation of speech. Sesame trained three model sizes, with the largest using 8.3 billion parameters (an 8 billion parameter backbone plus a 300 million parameter decoder) and trained on approximately 1 million hours of primarily English audio. Unlike traditional two-stage text-to-speech systems, Sesame's CSM uses a single-stage, multimodal transformer-based model that generates speech more naturally.
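To make the idea of "interleaved text and audio" more concrete, here is a minimal sketch of how a conversation might be flattened into a single token sequence for one transformer to read. It is an illustration only, not Sesame's implementation: the `Token` class, the `interleave_turns` function, and the integer token IDs are all hypothetical, and the sketch assumes each turn has already been converted to text tokens and audio tokens by some tokenizer that is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """One element of the combined sequence (hypothetical structure)."""
    modality: str  # "text" or "audio"
    value: int     # made-up token ID, for illustration only


def interleave_turns(turns: List[Tuple[List[int], List[int]]]) -> List[Token]:
    """Flatten (text_ids, audio_ids) turns into one sequence.

    Each turn contributes its text tokens followed by its audio tokens,
    so a single model sees both modalities in conversational order.
    """
    sequence: List[Token] = []
    for text_ids, audio_ids in turns:
        sequence.extend(Token("text", t) for t in text_ids)
        sequence.extend(Token("audio", a) for a in audio_ids)
    return sequence


if __name__ == "__main__":
    # Two conversational turns with invented token IDs.
    conversation = [([1, 2], [100, 101, 102]), ([3], [200])]
    for token in interleave_turns(conversation):
        print(token.modality, token.value)
```

The point of the sketch is the ordering: because text and audio share one sequence, the model can condition each new piece of audio on everything said earlier in the conversation rather than handling text and speech in separate stages.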
While Sesame's CSM has demonstrated impressive capabilities, Brendan Iribe, co-founder of Sesame, acknowledged its current limitations. He pointed out that the system often exhibits inappropriate tone and pacing, as well as issues with interruptions and conversation flow. "Today, we're firmly in the valley, but we're optimistic we can climb out," he remarked. Despite these challenges, the potential for conversational AI is immense.
As advancements in conversational voice AI continue, concerns about deception and fraud grow. The ability to generate highly convincing human-like speech has already fueled voice phishing scams, allowing criminals to impersonate individuals with alarming realism. The addition of interactivity to these scams poses further risks, as future voice AI could eliminate recognizable signs of artificiality. To mitigate potential misuse, some individuals have begun sharing secret phrases with family members for identity verification.
Sesame plans to open-source key components of its research under an Apache 2.0 license, allowing other developers to build upon their advancements. Their roadmap includes scaling up model size, increasing dataset volume, and expanding language support to over 20 languages. Users can experience Sesame's demo on the company’s website, although availability may be limited due to high demand.