OpenAI built a voice cloning tool, but you can’t use it yet…

OpenAI is perfecting techniques for cloning voices as deepfakes proliferate, but the company insists it is doing so responsibly.

Today marks the debut of a preview of Voice Engine, OpenAI's voice cloning tool and an extension of the company's existing text-to-speech API. Roughly two years in development, Voice Engine lets users upload any 15-second speech sample to generate a synthetic copy of that voice. But there's no public release date yet, giving the company time to respond to how the model is used and abused.

“We want to make sure everyone is happy with how it’s being deployed — that we understand the dangers of this technology and that we have mitigations in place,” Jeff Harris, a member of the product staff at OpenAI, told TechCrunch in an interview.

Training the model

Harris said the generative AI model powering Voice Engine has been hiding in plain sight for some time.

The same model underpins the voice and “read aloud” features in ChatGPT, OpenAI’s AI chatbot, as well as the preset voices available in OpenAI’s text-to-speech API. And Spotify has been using it since early September to dub podcasts into other languages for well-known hosts like Lex Fridman.

I asked Harris where the model’s training data came from, a somewhat touchy subject. He would only say that the Voice Engine model was trained on a mix of licensed and publicly available data.

Models like the one powering Voice Engine are trained on an enormous number of examples (in this case, speech recordings), usually sourced from public sites and datasets around the web. Many generative AI vendors view training data as a competitive advantage and keep it, and the information pertaining to it, close to the chest. But training data details are also a potential source of intellectual property lawsuits, another disincentive to reveal much.

OpenAI has already been accused of violating intellectual property law by training its AI on copyrighted content, including photos, artwork, code, articles, and e-books, without crediting or paying the creators or owners.

OpenAI has licensing agreements with some content providers, such as Shutterstock and news publisher Axel Springer, and allows website administrators to block its web crawler from scraping their sites for training data. OpenAI also lets artists “opt out” of and remove their work from the datasets the company uses to train its image-generating models, including the latest DALL-E 3.

But OpenAI offers no such opt-out for its other products. And in a recent statement to the UK House of Lords, OpenAI said it was “impossible” to create useful AI models without copyrighted material, arguing that fair use — the legal doctrine that allows the use of copyrighted works to make secondary creations as long as they’re transformative — shields it where model training is concerned.

Synthesizing speech

Curiously, Voice Engine isn’t trained or fine-tuned on user data. That’s owed in part to the ephemeral way in which the model — a combination of a diffusion process and a transformer — generates speech.

“We take a small number of audio samples and text and generate realistic speech that matches the original speaker,” Harris said. “Once the request is completed, the audio used will be deleted.”

As he explains it, the model simultaneously analyzes the speech data it draws from and the text data meant to be read aloud, generating a matching voice without having to build a custom model for each speaker.

This is not a new technology. Many startups have been offering voice cloning products for years, from ElevenLabs to Replica Studios to Papercup to Deepdub to Respeecher. The same goes for big tech companies like Amazon, Google, and Microsoft—the last of which, by the way, is a major investor in OpenAI.

Harris claims that OpenAI’s approach delivers overall higher quality speech.

We also know that it’ll be priced aggressively. Although OpenAI removed Voice Engine’s pricing from the marketing materials it published today, in documents viewed by TechCrunch, Voice Engine is listed as costing $15 per one million characters, or roughly 162,500 words. That would cover Dickens’s Oliver Twist with a little room to spare. (An “HD” quality option costs twice that, but confusingly, an OpenAI spokesperson told TechCrunch there’s no difference between HD and non-HD voices. Make of that what you will.)

That translates to about 18 hours of audio, making the price a little under $1 per hour. That’s indeed cheaper than what one of its more popular competitors, ElevenLabs, charges — $11 per month for 100,000 characters. But it does come at the expense of some customization.
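The back-of-the-envelope math behind those figures can be sanity-checked. A minimal sketch — note that the ~150 words-per-minute narration pace is my assumption for illustration, not a figure from OpenAI or TechCrunch:

```python
# Rough check on Voice Engine's reported pricing.
PRICE_PER_MILLION_CHARS = 15.00    # USD, per the documents TechCrunch reviewed
WORDS_PER_MILLION_CHARS = 162_500  # ~6.15 characters per word, per the article
WORDS_PER_MINUTE = 150             # assumed narration pace (not an official figure)

minutes = WORDS_PER_MILLION_CHARS / WORDS_PER_MINUTE
hours = minutes / 60                              # ~18 hours of audio
cost_per_hour = PRICE_PER_MILLION_CHARS / hours   # ~$0.83/hour

print(f"{hours:.1f} hours, ${cost_per_hour:.2f}/hour")  # 18.1 hours, $0.83/hour

# The cited ElevenLabs plan, normalized to the same unit for comparison:
elevenlabs_per_million = 11.00 * (1_000_000 / 100_000)  # $110 per million characters
```

At a typical narration pace, a million characters is indeed about 18 hours of speech, which puts Voice Engine's per-character rate well under the cited ElevenLabs plan.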

Voice Engine doesn’t offer controls to adjust the tone, pitch, or cadence of a voice. In fact, it doesn’t offer any fine-tuning knobs or dials at the moment, although Harris notes that any expressiveness in the 15-second voice sample will carry through subsequent generations (for example, if you speak in an excited tone, the resulting synthetic voice will sound consistently excited). We’ll see how the quality of the readings compares with those of other models when they can be tested directly.

Voiceover talent as a commodity

Voice actor salaries on ZipRecruiter range from $12 to $79 an hour – significantly more expensive than Voice Engine, even on the lower end (actors with agents are priced much higher per project). If OpenAI’s tools become popular, it could commoditize voice work. So, where will the actors go from here?

To be sure, the voice acting industry won’t be caught off guard; it has been grappling with the existential threat of generative AI for some time. Voice actors are increasingly being asked to sign away the rights to their voices so that clients can use AI to generate synthetic versions that could eventually replace them. And voice work — particularly cheap, entry-level work — may be displaced by AI-generated speech.

Now, some AI voice platforms are trying to find a balance.

Replica Studios last year signed a contentious deal with SAG-AFTRA to create and license copies of the media artist union members’ voices. The organizations said the arrangement established fair and ethical terms to ensure performer consent while negotiating terms for the use of synthetic voices in new works, including video games.

ElevenLabs, meanwhile, hosts a marketplace for synthetic voices that lets users create a voice, verify it, and share it publicly. When others use a voice, the original creator receives compensation — a set dollar amount per 1,000 characters.

OpenAI will establish no such labor union deals or marketplaces, at least not in the near term. It will only require that users obtain “explicit consent” from the people whose voices are cloned, make “clear disclosures” indicating which voices are AI-generated, and agree not to use the voices of minors, deceased people, or political figures.

“How that intersects with the voice actor economy is something we’re watching closely and really curious about,” Harris said. “I think there’s going to be a lot of opportunity to expand the reach of voice actors with this technology. But this is all something we’re going to learn when people actually deploy and use this technology.”

Ethics and deepfakes

Voice cloning apps can be, and have been, abused in ways that go well beyond threatening actors’ livelihoods.

The notorious message board 4chan, known for its conspiratorial content, used ElevenLabs’ platform to share hateful messages mimicking celebrities such as Emma Watson. The Verge’s James Vincent was able to use AI tools to maliciously, and quickly, clone voices, generating samples containing everything from violent threats to racist and transphobic remarks. And at Vice, reporter Joseph Cox documented generating a voice clone convincing enough to fool a bank’s authentication system.

There are concerns that bad actors will try to influence elections with voice cloning. They’re not unfounded: in January, a phone campaign employed a deepfaked President Joe Biden to deter New Hampshire citizens from voting, prompting the Federal Communications Commission to move to make such campaigns illegal in the future.

So, policy-level deepfake bans aside, what steps, if any, is OpenAI taking to prevent Voice Engine from being misused? Harris mentioned a few.

First, Voice Engine is only open to a very small number of developers (about 100). Harris said that in addition to experimenting with “responsible” synthetic media, OpenAI is also prioritizing “low-risk” and “socially beneficial” use cases, such as those in healthcare and accessibility.

Some of Voice Engine’s early adopters include Age of Learning, an edtech company using the tool to generate voice-overs from previously cast actors, and HeyGen, a storytelling app leveraging Voice Engine for translation. Livox and Lifespan are using Voice Engine to create voices for people with speech impairments and disabilities, and Dimagi is building a Voice Engine-based tool to give feedback to health workers in their primary languages.

Here’s a sample of a voice generated by Lifespan, followed by one from Livox (audio embedded in the original article).
Second, clones created with Voice Engine are watermarked using technology OpenAI developed that embeds inaudible identifiers in recordings. (Other vendors, including Resemble AI and Microsoft, employ similar watermarks.) Harris stopped short of promising there’s no way around the watermark, but described it as “tamper-proof.”
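OpenAI hasn’t published how its watermark works, so as a rough illustration of the general idea, here’s a minimal sketch of one classic technique, spread-spectrum audio watermarking: add a keyed, pseudo-random noise pattern at inaudible amplitude, then detect it later by correlating against the same keyed pattern. Everything here (the functions, the parameters, the fixed detection threshold) is hypothetical and not OpenAI’s method:

```python
import numpy as np

def keyed_pattern(key: int, n: int) -> np.ndarray:
    """Pseudo-random noise pattern derived from a secret key."""
    return np.random.default_rng(key).standard_normal(n)

def embed(audio: np.ndarray, key: int, strength: float = 0.01) -> np.ndarray:
    """Mix the keyed pattern into the audio at low (inaudible) amplitude."""
    return audio + strength * keyed_pattern(key, audio.size)

def detect(audio: np.ndarray, key: int, threshold: float = 0.005) -> bool:
    """Correlate against the keyed pattern: near zero for unmarked audio,
    near `strength` when the mark is present. Fixed threshold for simplicity;
    a real detector would normalize for signal level."""
    score = float(np.dot(audio, keyed_pattern(key, audio.size))) / audio.size
    return score > threshold

rng = np.random.default_rng(0)
clean = 0.1 * rng.standard_normal(48_000)  # one second of noise at 48 kHz
marked = embed(clean, key=1234)

print(detect(marked, key=1234))  # True  (mark found with the right key)
print(detect(clean, key=1234))   # False (no mark present)
print(detect(marked, key=9999))  # False (wrong key can't find the mark)
```

The appeal of this family of schemes is that detection requires the key, which matches Harris’s description of a detector OpenAI keeps internal; the corresponding weakness is that resampling, compression, or added noise can degrade the correlation, which is why robustness claims are hard to make absolute.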

“If there’s an audio clip out there, it’s easy for us to look at that clip and determine that it was generated by our system and which developer actually made that generation,” Harris said. “As of now, it isn’t open source; we have it internally for now. We’re curious about making it publicly available, but obviously that comes with the added risk of exposing it and breaking it.”

Third, OpenAI plans to give members of its red teaming network — a contracted group of experts who help inform the company’s AI model risk assessments and mitigation strategies — access to Voice Engine to ferret out malicious uses.

Some experts argue that AI red teaming isn’t exhaustive enough and that it’s incumbent on vendors to develop tools to defend against the harms their AI might cause. OpenAI isn’t going quite that far with Voice Engine, but Harris asserts that the company’s “top principle” is releasing the technology safely.

General release

Depending on how the preview goes and how Voice Engine is received by the public, OpenAI may release the tool to the wider developer community, but at this time the company is reluctant to make any specific promises.

Harris did, however, give a sneak peek at Voice Engine’s roadmap: OpenAI is testing a security mechanism that has users read randomly generated text as proof that they’re present and aware of how their voice is being used. This could give OpenAI the confidence to bring Voice Engine to more people, Harris said, or it might just be the beginning.

“What moves us forward in terms of actual voice matching technology really depends on what we learn from the pilots, what safety issues we find and what mitigations we put in place,” he said. “We don’t want people confusing artificial voices and real human voices.”

On this last point we can agree.


