New research: Twitter bot detection tools aren’t very good
A new paper suggests that the field of bot detection is built on a flawed premise: the data sets it relies on are of poor quality.
The research, presented this week at the Web Conference (where it was awarded best paper), found that bot detection tools often rely on flawed data sets that replicate one another's mistakes, rather than accurately identifying bots.
Zachary Schutzman, a researcher at the Massachusetts Institute of Technology, and his colleagues first investigated bot detection tools when they wanted to analyze conversations on Twitter and needed to strip their data set of bot-generated content. They found that the existing tools weren't very good.
“We downloaded one of the [bot] datasets from a website, trained a simple model in Python, and got, like, 99% accuracy,” says Schutzman. The team’s first reaction was that they had done something wrong: Their simple model couldn’t possibly be as accurate as the complicated neural networks their peers had deployed. “It turned out we didn’t make a silly mistake: A very simple model did work very well on this data,” he says.
They thought maybe it was a problem with the specific training data they had downloaded, which had been used for one bot-detection model. So they tried another data set. And got the same result. “Right off the shelf, we were getting a model that was getting a 95 or 98% accuracy,” he says. Their super-simple attempts at bot classification should not have come anywhere close to matching the accuracy of the highly complicated machine learning methods in vogue for bot detection.
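To make the anecdote concrete, the experiment Schutzman describes amounts to a few lines of off-the-shelf scikit-learn. The sketch below is illustrative only: the file name `bot_dataset.csv` and the `is_bot` label column are hypothetical stand-ins for whichever public bot data set gets downloaded.

```python
# A minimal version of the "off-the-shelf" experiment: train a simple model
# on a downloaded bot-detection data set and measure accuracy on held-out
# rows. File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("bot_dataset.csv")      # hypothetical labeled data set
X = df.drop(columns=["is_bot"])          # per-account numeric features
y = df["is_bot"]                         # 1 = bot, 0 = human

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```

On the data sets the team tried, even a model this simple scored well above 95%.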
They set out to understand why, and found numerous problems with how the data sets had been collected and labeled. In reaching for whizz-bang new AI technology, past researchers had inadvertently made the decisions their models had to make fiendishly simple. In one data set, it turned out that if an account had ever liked a tweet, it was labeled as human; if it had never liked a tweet, it was labeled as a bot. “We realized that this was a systemic issue in the data sets that are commonly used for bot detection,” Schutzman says.
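If the labels really do reduce to "ever liked a tweet," no model is needed at all; a one-line rule recovers them. A sketch, again with hypothetical file and column names:

```python
# The labeling artifact as a zero-learning "classifier": accounts that have
# never liked a tweet are predicted to be bots. On a data set labeled this
# way, the rule agrees with the labels almost perfectly.
import pandas as pd

df = pd.read_csv("bot_dataset.csv")      # hypothetical, as above
predicted_bot = df["like_count"] == 0    # never liked a tweet -> "bot"
print(f"agreement with labels: {(predicted_bot == df['is_bot']).mean():.2%}")
```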
The models developed to detect bots may have been very complicated, but the underlying data they were trained on was trivially simple. And because of the way academic research works, those data sets are reused across multiple models, replicating the errors along the way.
“This is a very interesting paper that exposes substantial issues around existing bot detection benchmarking data sets,” says Manoel Horta Ribeiro, a Ph.D. student at the Swiss Federal Institute of Technology, who was not involved with the study. The finding, Ribeiro says, makes sense: Bot detection models generalize poorly across different data sets, meaning that models trained on one set of data perform poorly on another. An army of spambots designed to shift the conversation around politics on Twitter will behave differently from spambots designed to sell you a cryptocurrency scam, for instance, yet bot detection tools tend to lump them together as one and the same.
“Data sets have artifacts that seem to hinder this generalization,” Ribeiro says. “Some data sets were collected using specific hashtags, and models can mistakenly capture these features, which may not be good discriminators of bots in the wild.”
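That generalization failure is straightforward to test, under the same hypothetical setup as the sketches above: train on one data set and evaluate on another.

```python
# Sketch of a cross-data-set check: a model that looks excellent on the data
# set it was trained on can fall apart on a different one. File names and
# the feature list are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_df = pd.read_csv("bot_dataset_a.csv")   # e.g., political spambots
test_df = pd.read_csv("bot_dataset_b.csv")    # e.g., crypto-scam spambots

features = ["like_count", "tweet_count", "follower_count"]
model = LogisticRegression(max_iter=1000).fit(train_df[features],
                                              train_df["is_bot"])

for name, d in [("in-domain", train_df), ("cross-data-set", test_df)]:
    acc = accuracy_score(d["is_bot"], model.predict(d[features]))
    print(f"{name} accuracy: {acc:.2%}")
```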
In their experiment, Schutzman and his colleagues got similarly high accuracy detecting bots by asking, for instance, whether an account had ever tweeted the word “earthquake” or had liked more than 16 tweets over its lifetime.
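Those checks are single-feature decision stumps. A sketch of what evaluating them might look like, with hypothetical column names and with the direction of each rule (which side predicts "human") assumed for illustration:

```python
# The paper's trivial discriminators as decision stumps. The thresholds come
# from the article; column names and rule direction are assumptions.
import pandas as pd

df = pd.read_csv("bot_dataset.csv")
rules = {
    "ever tweeted 'earthquake'": df["all_tweets_text"].str.contains(
        "earthquake", case=False, na=False
    ),
    "liked more than 16 tweets": df["like_count"] > 16,
}
for name, predicts_human in rules.items():
    acc = (predicts_human == (df["is_bot"] == 0)).mean()
    print(f"{name}: {acc:.2%} agreement with labels")
```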
That causes problems, reckons Schutzman. “People doing bot detection work, building algorithms to detect COVID misinformation bots or something on Twitter, need to be very careful about what data they’re using,” he says. And more than that, past research on the prevalence of bots on social media ought to be reconsidered.
Christopher Bouzy, the founder of Bot Sentinel, a third-party Twitter bot detection tool named in the paper, says that “while the paper raises important concerns, it falls short in several crucial aspects for a comprehensive understanding of the subject.” Bouzy says bot detection isn’t just limited to using machine learning. “There are other approaches, such as rule-based systems, network analysis, and behavioral analysis, which can also play a significant role in identifying bots. By not considering these alternative methods, the paper presents an incomplete picture of the bot detection landscape.”
Schutzman, for his part, doesn't blame those behind the data sets, nor those behind the models. He says Twitter itself is culpable: under Elon Musk, it has restricted access to high-quality data. Twitter, for instance, now charges $42,000 a month for a level of access to its API that academics previously got for free or at low cost.
“The reason that people are relying on the data sets that have been generated from other papers is because you can’t just go to twitter.com/bot-detection and download a bunch of high quality, clean data produced by the platform that’s trustworthy,” Schutzman says.