OpenAI and Google have reportedly transcribed YouTube videos to collect text for their artificial intelligence models, potentially violating the creators’ copyright.

Tech giants including Meta are allegedly cutting corners to get as much data as possible to train their artificial intelligence models, according to an investigation by The New York Times.

OpenAI researchers are said to have created a speech recognition tool called Whisper that can transcribe audio from YouTube videos, yielding new conversational text that could make an AI system smarter.

The investigation cited multiple sources saying that more than 1 million hours of YouTube videos had been transcribed, despite internal discussions about possible violations of YouTube's rules. The transcripts were then fed into GPT-4, the advanced artificial intelligence system that powers the latest version of the ChatGPT chatbot. YouTube parent company Google also reportedly transcribed videos to train its own artificial intelligence models.

OpenAI president Greg Brockman was among those personally involved in collecting the videos, the Times wrote.

OpenAI’s alleged use of YouTube videos may also violate Google’s policies, which prohibit the use of its content in “standalone” applications and the use of its videos in “automated ways” through methods such as bots, botnets or scrapers.

Are tech companies running out of training data?

The report also suggests that OpenAI exhausted its supply of useful data in 2021, so it discussed transcribing podcasts, audiobooks, and YouTube videos to train its next-generation models. By that time, the company was said to have mined the computer code repository GitHub and used databases of chess moves, as well as data describing high school exams and homework assignments from the Quizlet website.

The New York Times said that Google’s legal department asked the company’s privacy team to revise the wording of its policy to expand what the company can do with consumer data, including data from office tools such as Google Docs.

Meta is also facing a shortage of available training data, The Times reported. In recordings reviewed by the publication, its AI team can be heard discussing the unauthorized use of copyrighted material in an effort to keep up with OpenAI. After reportedly exhausting virtually all available English-language books, essays, poetry, and news articles on the Internet, the company considered steps such as paying for book licenses or buying a major publishing house outright.

Last week, YouTube CEO Neal Mohan said using videos from the platform to train AI models would be a “clear violation” of YouTube’s terms and conditions. His comments came after OpenAI’s chief technology officer reportedly “had no idea” whether one of the company’s tools had been trained on YouTube videos.

The advanced systems created by OpenAI, Google, and others require vast amounts of information to learn. This demand is depleting the reserves of high-quality public data on the Internet, especially as some data owners restrict AI companies' access. The Wall Street Journal has reported a 90% chance that demand for high-quality data will exceed supply by 2028.

OpenAI, Google and Meta have been contacted for further comment.

Featured Image: Canva
