The AI arms race may soon center on a competition for ‘expert’ data
Training data scraped from the web may have been OK for the first wave of large language models, but as competition increases, more and better data will be needed.
Welcome to AI Decoded, Fast Company’s weekly newsletter that breaks down the most important news in the world of AI. You can sign up to receive this newsletter every week here.
The AI arms race will soon focus on competition for data
We use benchmarks such as MMLU and HellaSwag to measure large language models’ knowledge and problem-solving capability. But over the past six months it’s become clear that the performance gaps between well-known models are narrowing. A year ago, OpenAI’s GPT-4 was considered the undisputed champion among LLMs, but now models from Anthropic, Mistral, Meta, Cohere, and Google are producing similar or better scores, depending on the benchmark.
In the past, we’ve improved large models by giving them more training data and compute power. But many believe the performance returns from training on data scraped indiscriminately from the public web are diminishing. As a result, we’re left with a growing group of LLMs with roughly equal performance. Now, AI developers will likely try to gain an edge by acquiring stores of specialized data, such as health data.
“We built really great general-purpose machines that talk like humans, but just like humans [who] are not experts, they’re generalists,” says Ali Golshan, cofounder and CEO of the synthetic data company Gretel. “Now, what we’re saying is that these general-purpose machines need to become experts.” But “expert” training data is usually not public; it’s proprietary, held close by corporations. Gretel’s platform can be used to anonymize such data for use in training models.
We’ve already seen a number of AI developers strike deals to license content from publishers. Earlier this week, in fact, OpenAI said it had signed a deal to use content from the Financial Times. Reuters reported in February that Google was licensing data from the social platform Reddit. The New York Times Company sued OpenAI for using its content without permission, and the suit may well result in some form of licensing deal.
But as AI companies intensify their quest for specialized domain data, we may see deals that go well beyond licensing agreements. It’s very possible that AI companies will buy content companies outright, just for their training data. Stephen DeAngelis, founder and CEO of the reasoning AI company Enterra Solutions, believes that Wikipedia, WolframAlpha, or even Getty Images could be targets of this type of acquisition. Tech firms could also be eyeing lesser-known companies that possess the kind of data needed to fill a crucial gap in an LLM’s knowledge. Or, AI companies might try to tap into academic knowledge, DeAngelis says. “I could see these large firms saying to colleges, ‘We’ll pay you a lot of money so you can fuel your investment in research, and can we license a copy of that [research] content to put into our LLM,’” he says.
The budding AI industry is already seeing an alarming migration of research and engineering talent, and specialized computer chips, to powerful, deep-pocketed players, such as Microsoft, Meta, and, increasingly, Elon Musk’s X (formerly Twitter). This concentration of resources could also soon include training data, further entrenching the players with the most buying power.
California frontier AI bill is moving through the state Senate
A bill that would impose safety guidelines on AI companies developing large AI models has been moving through the state’s Senate and will get a full hearing May 6. The bill, called the Safe and Secure Innovation for Frontier Artificial Intelligence Systems Act (SB 1047), would require AI companies to study the safety implications before training large models, satisfy certain safety requirements, and report any safety incidents caused by the model. The bill would also establish a “Frontier Model Division” within the state’s Department of Technology that would collect information on large model development and oversee certification, and it proposes civil penalties for companies that violate the requirements of the Act.
It’s still unclear how the state would actually enforce such requirements. And the law doesn’t automatically hold developers accountable if a model causes harm, or even a catastrophe, as Dan Hendrycks of the Center for AI Safety points out on X. “The question is whether they took reasonable measures to prevent that,” he writes. “This bill could have used strict liability, where developers would be liable for catastrophic harms regardless of fault, but that’s not what the bill does.”
As the bill moves toward an eventual vote, some in the AI community are voicing fears that it could stifle the work of smaller AI startups and people working on open-source models.
The bill is unique for its focus on the development of the largest models, such as OpenAI’s GPT-4 and Google’s Gemini. The last major milestone in AI model safety came when Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI all pledged to the Biden administration in July 2023 to study the societal risks of new models (such as bias and privacy violations), proactively manage risk, and empower their internal safety teams.
For now, AI regulation is happening mostly at the state level, with a good deal of focus on deepfakes and employment discrimination. SB 1047 is especially important given that California is often a first mover on technology regulation, and laws that pass in Sacramento often serve as templates for other states.
Survey: AI is already changing the way people search the web
The Verge survey about “how Americans are using and thinking about AI” includes some notable findings around new chatbot users (there are fewer of them in 2024) and AI usage patterns (people are finding more, and increasingly advanced, ways of using the technology). But most interesting is what the survey turns up around search: “The first meaningful disruption in search in 20 years is coming into full view,” The Verge editors write.
The survey of 2,000 users asked: “Do you use AI tools in place of search engines (like Google) to find information about a topic?” 61% of Gen Z respondents said they did, along with 53% of millennials. And 63% of millennials and 52% of Gen Z said they “trust the veracity of information that AI provides,” compared to only 32% of baby boomers. More than half of the respondents said they think AI can do a better job on common search tasks, such as planning a family outing or discovering new recipes.
These levels of AI-native search adoption could push Google to make its own AI-native search, called Search Generative Experience (SGE), a regular part of its traditional search experience sooner than expected. The results also bode well for AI search upstarts like Perplexity.