7/4/2024

How tech giants cut corners to harvest data for AI

The New York Times published a bombshell report on how, behind the scenes, large AI companies find (sometimes illegal) ways to harvest content to train their large language models, the basis of generative AI.

The funniest part is Google turning a blind eye to OpenAI transcribing 1 million hours (!) of YouTube videos to power GPT-4 because Google itself was doing the same for Gemini. (The practice violates YouTube's terms of use.)

It gets better. Two days earlier, YouTube CEO Neal Mohan told Bloomberg that OpenAI's use of YouTube videos to train Sora, its astonishing AI video generator, would violate the platform’s terms of use.