8/23/2020 How Kaggle built and deployed a spam filter in 8 days using AutoML | Cloud Blog


AI & MACHINE LEARNING

How Kaggle solved a spam problem in 8 days using AutoML

Will Cukierski Staff Developer Advocate and Head of Competitions, Kaggle

May 27, 2020

Kaggle is a data science community of nearly 5 million users. In September of 2019, we found ourselves under a sudden siege of spam traffic that threatened to overwhelm visitors to our site. We had to come up with an effective solution, fast. Using AutoML Natural Language on Google Cloud, Kaggle was able to train, test, and deploy a spam detection model to production in just eight days. In this post, we’ll detail our success story about using AutoML to rapidly solve an urgent business dilemma.

A spam dilemma

Malicious users were suddenly creating large numbers of Kaggle accounts in order to leave spammy search engine optimization (SEO) content in the user bio section. Search engines were indexing these bios, and our existing spam detection heuristics were failing to flag them. In short, we faced a growing and embarrassing predicament.

Our problem was context. Kaggle is a community focused on data science and machine learning. As a result of our topical data-science focus, a user bio that seems harmless in isolation may be the work of a spammer. Here is a real example of one such bio:

“I am a personal injury lawyer in Chicago. I help individuals and families in cases involving serious injuries and wrongful death. Many of my cases involve car accidents, nursing home abuse, and medical malpractice.”

Such a bio may fit in on a forum of legal professionals, but on the Kaggle site it’s a mark of an SEO spammer. This content also lacks the typical keywords and unsavory topics that one might expect to find in spam. This context meant that stopping the spam required more than a generic model; we needed a solution that could take our Kaggle-specific context into account.

We had the intuition that machine learning could handle this problem, but building natural language models to deal with spam was not anyone at Kaggle’s day job. We feared weeks of late nights slogging towards a good-enough solution—spam models require very high accuracy because of the high cost of miscategorizing a legitimate user. Even with a usable prototype running in R or Python, there was the looming frustration of deploying it in Kaggle’s C# codebase. As we planned out our options, we had an unconventional idea: what about trying AutoML?

Enter AutoML

True to its name, AutoML performs automated machine learning: evaluating huge numbers of neural network architectures to determine the most effective model for a problem. We first witnessed the potential of the AutoML suite of products when a Google team used it to take second place at the 2019 KaggleDays hackathon. On a whim, we decided to pass our bio problem through the AutoML Natural Language Classification API. We could readily generate a labeled training dataset because we had existing examples of bios belonging to known-legitimate users:
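As a rough illustration of what a labeled training set for a text classifier can look like, the sketch below serializes (bio, label) pairs into a two-column CSV. The column layout, the label names `spam` and `legitimate`, the helper `build_training_csv`, and the second example bio are all assumptions made for illustration; the import format that AutoML Natural Language actually expects should be checked against its documentation.

```python
import csv
import io


def build_training_csv(labeled_bios):
    """Serialize (bio_text, label) pairs as two-column CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for text, label in labeled_bios:
        # Collapse newlines and repeated whitespace so each bio stays on one row.
        writer.writerow([" ".join(text.split()), label])
    return buffer.getvalue()


# Hypothetical examples: the first bio is quoted from the article,
# the second is invented, and both label names are assumptions.
examples = [
    ("I am a personal injury lawyer in Chicago.", "spam"),
    ("Data scientist interested in\ncomputer vision.", "legitimate"),
]
training_csv = build_training_csv(examples)
```

The `csv` module quotes any field that contains a comma or quotation mark, so free-form bio text does not break the row structure.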


After uploading these bios, clicking the “Start Training” button, and waiting a few hours, we received an email that training was complete. Building models is normally a process that involves many failures, but the results were astoundingly impressive for a first attempt, with precision (how “accurate” the model is) and recall (how “thorough” the model is) above 99%.
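To make the two metrics concrete: precision is the share of bios flagged as spam that really are spam, and recall is the share of actual spam bios that get flagged. The sketch below computes both from confusion counts. The counts are hypothetical, chosen only to land above 99%, and are not Kaggle's evaluation numbers.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall for the spam class from raw counts."""
    predicted_spam = true_positives + false_positives
    actual_spam = true_positives + false_negatives
    # Guard against division by zero when nothing was predicted or present.
    precision = true_positives / predicted_spam if predicted_spam else 0.0
    recall = true_positives / actual_spam if actual_spam else 0.0
    return precision, recall


# Hypothetical counts: 995 spam bios caught, 5 legitimate bios wrongly
# flagged, and 5 spam bios missed.
precision, recall = precision_recall(
    true_positives=995, false_positives=5, false_negatives=5
)
```

For a spam filter, low precision means legitimate users get blocked, which is the costly mistake described earlier, while low recall means spam slips through.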


We manually inspected the performance, ran test examples through the model, and determined it would be immediately suitable to deploy in production. It successfully picked up on a wide variety of spammy content types (some identifying information and language is blurred out):

Returning to our previous example on the importance of context, the model gives the personal injury lawyer a 98% confidence of being spam:


Meanwhile, it has full confidence that the data scientist equivalent is allowable:


On top of being accurate, AutoML afforded a major advantage when the time came to deploy the model. When training was finished, the model was already hosted and exposed via an API. Kaggle simply had to write a quick shim to call this API from our application.
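A shim of this kind can be very small. The sketch below is written in Python for illustration only; it is not Kaggle's code, which lives in a C# codebase. It assumes the hosted model can be wrapped in a `predict` callable that returns a mapping from label to confidence score. The names `is_spam_bio`, `SPAM_LABEL`, and `SPAM_THRESHOLD`, the 0.9 cutoff, and the `fake_predict` stand-in are all hypothetical.

```python
from typing import Callable, Dict

SPAM_LABEL = "spam"    # hypothetical label name
SPAM_THRESHOLD = 0.9   # hypothetical confidence cutoff


def is_spam_bio(
    bio: str,
    predict: Callable[[str], Dict[str, float]],
    threshold: float = SPAM_THRESHOLD,
) -> bool:
    """Return True when the model's spam score meets the threshold."""
    if not bio.strip():
        return False  # nothing to classify, so skip the remote call
    scores = predict(bio)
    return scores.get(SPAM_LABEL, 0.0) >= threshold


def fake_predict(text: str) -> Dict[str, float]:
    # Stand-in for the network call to the hosted model; the scores are
    # invented to mirror the 98% example in the article.
    if "lawyer" in text:
        return {"spam": 0.98, "legitimate": 0.02}
    return {"spam": 0.01, "legitimate": 0.99}
```

Passing the prediction call in as a parameter keeps the decision logic testable without network access, and the threshold can be raised if blocking a legitimate user is judged more costly than missing some spam.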

It took only eight days from when we started working on this problem to when we deployed a model serving live traffic. It required no advanced skills in deep learning or natural language processing. The model has since made thousands of correct decisions and greatly reduced our spam-related traffic.

While this story was about spam detection, the takeaway isn’t just that you can use AutoML for spam. AutoML has the potential to replicate this success story across the thousands of bespoke image, text, or tabular problems that businesses face. AutoML can step in when off-the-shelf models are insufficient, when you want to test a hunch but don’t have months to dedicate to it, or if you’re simply not a deep learning expert. The combination of high accuracy, rapid iteration, and smooth deployment can make AutoML an attractive approach to developing machine learning solutions for a wide range of business problems and needs.

POSTED IN: AI & MACHINE LEARNING, AUTOML


https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-solved-a-spam-problem-using-automl