In late March 2024, OpenAI introduced a “small-scale preview” of an AI service called the Voice Engine, which the company claims can clone a person’s voice from just 15 seconds of recorded speech. Nearly a year later, the Voice Engine remains in preview, and OpenAI has given no indication of when it might launch or whether a broader rollout is planned.
OpenAI’s hesitation to fully deploy the Voice Engine may stem from concerns about misuse of the technology. The company may also be wary of inviting regulatory scrutiny, especially since critics have accused it of prioritizing “shiny products” over safety and of rushing releases to outpace competitors.
An OpenAI spokesperson told TechCrunch that the company continues to test the Voice Engine with a limited group of “trusted partners.” The spokesperson said, “We’re learning from how our partners are using the technology so we can improve the model’s usefulness and safety.” Applications cited by the spokesperson include speech therapy, language learning, customer support, video game characters, and AI avatars.
The Voice Engine powers the voices available in OpenAI’s text-to-speech API and ChatGPT’s Voice Mode, generating natural-sounding speech that closely mimics the original speaker. The tool converts written text into speech, subject to certain content restrictions. Its release, however, has been marked by delays and shifting timelines.
According to a June 2024 OpenAI blog post, the Voice Engine model predicts the most probable sounds a speaker will produce for a given text transcript, taking into account different accents and speaking styles. As a result, it can generate not only spoken versions of text but also “spoken utterances” that reflect how different types of speakers would deliver that text.
Originally, OpenAI aimed to integrate the Voice Engine, initially dubbed Custom Voices, into its API by March 7, 2024. A select group of up to 100 “trusted developers” was to receive early access, with a focus on projects that would provide a “social benefit” or exhibit “innovative and responsible” uses of the technology. OpenAI had even set pricing: $15 per million characters for “standard” voices and $30 per million characters for “HD quality” voices.
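To put those planned rates in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration based on the two per-million-character prices quoted above. The function name, the tier labels, and the 500,000-character example are assumptions made for this sketch and are not part of any OpenAI API. The prices were planned figures and may not reflect any eventual launch pricing.

```python
# Planned preview rates quoted above, in US dollars per one million characters.
# The tier keys ("standard", "hd") are labels chosen for this sketch only.
PRICE_PER_MILLION_CHARS = {"standard": 15.00, "hd": 30.00}


def estimate_cost(num_chars: int, tier: str = "standard") -> float:
    """Return the estimated cost in USD of synthesizing num_chars characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    if tier not in PRICE_PER_MILLION_CHARS:
        raise ValueError(f"unknown tier: {tier!r}")
    return num_chars / 1_000_000 * PRICE_PER_MILLION_CHARS[tier]


if __name__ == "__main__":
    # A hypothetical 500,000-character script, priced at both tiers.
    for tier in ("standard", "hd"):
        print(f"{tier}: ${estimate_cost(500_000, tier):.2f}")
```

Under these assumed rates, the cost scales linearly with text length, and the “HD quality” tier is exactly twice the price of the “standard” tier for the same text.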
However, the company postponed the announcement at the last minute, ultimately unveiling the Voice Engine weeks later without an open sign-up option. Access remains restricted to a small group of around ten developers the company began collaborating with in late 2023. OpenAI stated in its announcement post that it hopes to foster discussions about the responsible deployment of synthetic voices and how society can adapt to these advancements.
The Voice Engine has been in development since 2022. OpenAI says it demonstrated the tool to “global policymakers at the highest levels” in the summer of 2023 to illustrate both its potential and its risks. Several partners currently have access to the Voice Engine, including the startup Livox, which builds devices that enable people with disabilities to communicate more effectively.
Livox CEO Carlos Pereira told TechCrunch that although the company could not integrate the Voice Engine into a product because the tool requires an internet connection, the technology itself is “really impressive.” He noted, “The quality of the voice and the possibility of having the voices speaking in different languages is unique, especially for our customers with disabilities.” Pereira hopes that OpenAI will develop an offline version of the Voice Engine in the future.
OpenAI is acutely aware of the potential for abuse, particularly given the 2024 U.S. election cycle. In its June 2024 blog post, the company described several safety measures built into the Voice Engine, including watermarking to trace the origin of generated audio. Developers are required to obtain “explicit consent” from original speakers before using the Voice Engine and must disclose to their audience that the voices are AI-generated.
However, OpenAI has not detailed how it will enforce these policies, which could prove difficult at scale. The company also hinted at plans for a “voice authentication experience” to verify speakers and a “no-go” list to prevent the generation of voices that closely resemble prominent figures.
While OpenAI could release the Voice Engine at any moment, it remains uncertain whether the tool will ever be widely available. The company has repeatedly indicated that it prefers to keep the service limited in scope. Whatever the outcome, the Voice Engine’s preview has become one of the longest in OpenAI’s history, a sign of how difficult it is to balance innovation, safety, and ethics in AI voice cloning.