Why data will always be a precious commodity in the AI world
The New York Times lawsuit against OpenAI late last year over the tech company’s use of the newspaper’s journalism to train its large language model (LLM) represented a major move in unprecedented times. It also could portend a shift in the Big Tech/content creator relationship—one that was fraught to begin with and might now turn increasingly litigious. At the heart of the suit is the question of data, and whether the companies behind LLMs can claim “fair use” in gobbling up that data.
When we consider the amount of data needed to train LLMs, it stands to reason that organizations will be protective of how their proprietary data is used and credited. LLMs require vast amounts of data, and despite OpenAI CEO Sam Altman’s recent claims, OpenAI and ChatGPT need access to a wide range of data to strengthen the model, and this may include both proprietary and non-copyrighted work. The high quality and reliability of New York Times content is precisely what strengthens ChatGPT outputs.
That was the company’s own position three weeks ago, according to The Telegraph, which shared a submission from OpenAI to the House of Lords communications and digital select committee. In the submission, the company admitted that it could not train LLMs like ChatGPT without access to copyrighted work. In fact, it would be “impossible.”
Data is the backbone of AI, and all models rely on patterns and correlations established by vast amounts of training data. Generative AI tools need high-quality training data, like copyrighted content from the New York Times and other notable publishers, to provide high-quality responses. A sufficient quantity of training data also reduces hallucinations, making responses more relevant.
While the New York Times’ case against OpenAI and Microsoft is probably the most visible challenge involving intellectual property implications of AI, it is hardly the only one. Plaintiffs have filed multiple lawsuits claiming the training process for AI programs infringed upon their copyrights in written and visual works. These include lawsuits by the Authors Guild and authors Paul Tremblay, Michael Chabon, Sarah Silverman, and others against OpenAI. Michael Chabon, Sarah Silverman, and other content creators have also initiated suits against Meta. There are proposed class action lawsuits against Alphabet Inc., Stability AI, and Midjourney, as well as a lawsuit by Getty Images against Stability AI.
As AI use continues to proliferate, there will be increasing pressure to resolve these copyright issues. And litigation involving intellectual property rights is just the tip of the iceberg. The number of cases centered on AI-related accuracy, safety, and discrimination is likely to rise.
Given the complexity and sheer volume of all of these cases, it will likely take years before these matters are resolved. For now, all we can say for sure is that ordinary companies rolling out AI tools would be wise to exercise care to track and monitor their use of the tech. Should a particular AI tool come under regulatory or judicial scrutiny and thus come off the market, companies will want to be able to adapt quickly and smoothly.