OpenAI Transcribed Over One Million Hours of YouTube Videos to Train GPT-4

baoshi.rao

Recently, The Wall Street Journal reported that artificial intelligence companies are having difficulty collecting high-quality training data. The New York Times then detailed some of the methods companies are using to address the problem, several of which fall into the gray areas of AI copyright law.

The story begins with OpenAI. Urgently in need of training data, the company reportedly developed its Whisper audio transcription model and used it to transcribe over one million hours of YouTube videos to train GPT-4, its most advanced large language model. The New York Times reported that OpenAI knew the practice was legally questionable but believed it constituted fair use. OpenAI President Greg Brockman was reportedly personally involved in collecting the videos that were used. OpenAI spokesperson Lindsey Held told The Verge that the company curates "unique" datasets for each model, using "numerous sources, including public data and non-public data from partners." Held added that the company is considering generating its own synthetic data.

According to sources cited by The New York Times, Google has also collected transcripts from YouTube. Google spokesperson Matt Bryant stated that the company "trained models on some YouTube content in accordance with our agreements with YouTube creators."

Meta has similarly run up against the limited availability of high-quality training data. In its efforts to catch up with OpenAI, the company reportedly discussed using copyrighted works without permission, and also considered paying for book licenses or acquiring a major publisher outright.

All of these companies are struggling with the rapid depletion of model training data. The Wall Street Journal reported this week that by 2028, companies might exhaust all available new content. Potential solutions include training models on "synthetic" data generated by the models themselves or adopting "curriculum learning" approaches. The other option is for companies to use whatever data they can find, whether or not they have permission, which could expose them to challenges under copyright law.