AI Starts with Data
Facing the challenges of data collection and annotation
eBook Series

Introduction
The unsung hero

Information is the lifeblood of artificial intelligence projects. Without sufficient quantities of high quality, accurately labeled data for model training, machine learning algorithms will struggle to make the right predictions, no matter how well they are built. Subsequently, any product that uses AI systems based on sub-par data will suffer too.

Defining the amount of data necessary for an AI project is like asking about the length of a piece of string – a sensible attitude is to assume that you can’t have too much training data. There are some open source datasets available, and when approaching a popular use case, you may be able to work with what’s already out there; but if you are looking to build something truly new and original, you will likely need to embark on a long data collection and annotation phase.

In this eBook, created in partnership with data specialists at Lionbridge AI (now acquired by TELUS International), we look at the human networks that power most artificial intelligence projects. We discuss the techniques employed in data collection and annotation, outline some of the potential pitfalls that can befall a project at this early stage, and meet the experts to discuss topics like privacy and bias – including representatives of Bloomberg, Unit 9, and Omdia.

We hope this guide will provide useful insight into the challenges of data preparation, helping ensure that your next artificial intelligence project is a success.

Alan Martin | Associate Editor | AI Business

Contents
Market perspective: Despite the impact of Covid-19, the AI market continues to grow
Getting the basics right: The importance of data collection and annotation
Finding the right approach: Six reasons to outsource data collection and annotation
How TELUS International - via Lionbridge AI - enables AI to flourish
Annotation in action: Helping businesses get to grips with their data
Best practices: Learning data annotation from Bloomberg
Stories from the field: Employing AI in interactive media
Eliminating bias and ensuring privacy: Dealing with the issues of AI – before data becomes the model

AI starts with data | telusinternational.com

Market perspective
Despite the impact of Covid-19, the AI market continues to grow
What does the future hold for developers of AI-based products and services?

You might assume that the crippling effect of Covid-19 on the world’s economies, supply chains, and working patterns would slow the progress of artificial intelligence, but the latest report from analyst firm Omdia suggests that’s not the case.

“Smaller companies and certain industries may postpone AI pilot programs and other investments during times of economic distress, but large companies with AI embedded into their business models will continue to leverage the benefits of AI,” Neil Dunay and Aditya Kaul said in the company’s quarterly AI software market forecast. “Once the global economy gets back on track, the AI software market should experience a strong recovery.”

It looks like businesses are only scratching the surface in terms of what’s possible with AI, as new and innovative applications emerge. “AI is proven to drive down costs, generate new revenue schemes and enhance customer experience,” states the report, which identifies 340 AI use cases across 23 industry sectors – with 201 considered “unique.”

Among the more popular use cases, 27 are projected to exceed $1bn in revenue by 2025. Five will be passing the $3bn mark, including voice recognition, video surveillance, and marketing-focused intelligent virtual assistants (IVAs). By the middle of the decade, the AI-based software market as a whole is expected to become a $99 billion industry.

The top eight industry verticals set to benefit are, according to Omdia, consumer, automotive, financial services, telecommunications, retail, business services, healthcare, and advertising. “Each industry in this category of top industries has many viable use cases, a general willingness among industry participants to use AI, and large volumes of data generation that AI algorithms can analyze to optimize processes, lower costs, and boost revenue,” the report explained.

What unites these different types of organizations is their growing thirst for data. Without well-sourced and accurately labeled data, any algorithm will ultimately be hobbled, and perform poorly compared to its rivals – something to be wary of in the current hyper-competitive environment.

“Poor data quality is enemy number one to the widespread, profitable use of machine learning,” Thomas Redman, president of Data Quality Solutions, wrote in the Harvard Business Review. “The quality demands of machine learning are steep, and bad data can rear its ugly head twice – first in the historical data used to train the predictive model, and second, in the new data used by that model to make future decisions.”

Anybody looking to earn their share of that $99 billion would do well to pay attention to the intricacies of data collection and annotation. There is still time to catch up and take part in the AI revolution – but latecomers can’t afford to learn about the impact of poorly labeled data the hard way.
AI Business eBook Series | aibusiness.com

Getting the basics right
The importance of data collection and annotation
Businesses have to consider the question of data quality to avoid common problems with AI

Products and services that incorporate machine learning can’t exist without training data. From virtual assistants like Siri and Alexa to medical-grade diagnostic algorithms, every AI-based system requires its creators to spend substantial time and effort on gathering the right kind of data and labeling it appropriately.

While it may not be the part of AI that most consumers see, it should not be underestimated, both in terms of importance, and how long it takes to get right. As IBM’s Arvind Krishna noted at the Wall Street Journal’s Future of Everything Festival, data collection and annotation can represent about 80% of the work on any given AI project.

This isn’t something everyone is prepared for, he added: “So you run out of patience along the way, because you spend your first year just collecting and cleansing the data, and you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And then you kind of bail on it.”

Why it matters

Data collection involves doing the legwork: literally collecting the information that will be used to train a machine learning model. This could be anything from thousands of pictures of rashes teaching AI to spot skin cancer, to hundreds of handwriting samples for an Optical Character Recognition algorithm. The important thing is to have as many examples as possible, whether that’s text, audio, images, or video.

Annotation, also known as labeling, is also reliant on human expertise. Trained annotators need to manually label the media so that machine learning algorithms can understand the samples they are fed. This comes in many different forms: it could be video annotation, text categorization, or semantic annotation, to name just a few possibilities. Without this, no machine learning algorithm will be able to compute the attributes relevant to its work.

In the skin cancer example used earlier, labeling would likely require specialist dermatological knowledge, but simpler labeling tasks can be handled by almost anyone. The familiar reCAPTCHA checks used online to distinguish humans from bots are a great example: when the system asks you to identify buses in a picture, you’re not just proving your humanity – you’re also helping train an image recognition algorithm.

Machine learning algorithms created with limited data will work in principle, but they simply won’t be reliable enough for production purposes – especially when facing less common edge cases. Likewise, if the data is labeled inaccurately, the algorithms will make the wrong predictions.

Let’s imagine a company that wants to build an AI-based system to identify breeds of dogs from photos. If your data collection process is limited, then the system may have no trouble recognizing more common breeds like Labradors or Terriers – but the same algorithm may end up mistaking a Chihuahua for a rodent, or a Chow Chow for a bear, given neither has much resemblance to the rest. If the data isn’t labeled correctly, the same errors might occur – but this time it would be down to humans not knowing the differences between an Alaskan Malamute and a Siberian Husky – or not labeling the key characteristic tells.

This is clearly a frivolous example – not least because plenty of existing projects can already tell the difference between dog breeds, including those from Microsoft and Google. But the principles are ultimately the same, whatever you want to achieve with AI: data collection and annotation are the all-important foundations on which good machine learning algorithms are built.
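The two failure modes in the dog breed example, too few examples of a breed and annotators who confuse similar breeds, can both be checked for before any model is trained. The following is a minimal sketch in Python, not a method described in this eBook: the sample records, the function names, and the threshold of three examples are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical annotations: (image_id, label_from_annotator_a, label_from_annotator_b).
# Every id and label below is invented for illustration.
ANNOTATIONS = [
    ("img_001", "labrador", "labrador"),
    ("img_002", "labrador", "labrador"),
    ("img_003", "labrador", "labrador"),
    ("img_004", "terrier", "terrier"),
    ("img_005", "terrier", "terrier"),
    ("img_006", "terrier", "terrier"),
    ("img_007", "chihuahua", "chihuahua"),
    ("img_008", "alaskan_malamute", "siberian_husky"),
]


def rare_labels(annotations, min_examples):
    """Return the sorted labels that have fewer than min_examples items.

    Counts are based on annotator A's labels only (an arbitrary choice).
    """
    counts = Counter(label_a for _, label_a, _ in annotations)
    return sorted(label for label, n in counts.items() if n < min_examples)


def disagreements(annotations):
    """Return the ids of items where the two annotators chose different labels."""
    return [item_id for item_id, a, b in annotations if a != b]


def agreement_rate(annotations):
    """Return the fraction of items on which both annotators agree."""
    agreed = sum(1 for _, a, b in annotations if a == b)
    return agreed / len(annotations)


if __name__ == "__main__":
    print("Under-represented labels:", rare_labels(ANNOTATIONS, min_examples=3))
    print("Items needing review:", disagreements(ANNOTATIONS))
    print("Agreement rate:", agreement_rate(ANNOTATIONS))
```

On this toy data the sketch flags the Chihuahua and Alaskan Malamute classes as under-represented and sends the Malamute/Husky image back for review. A real project would likely use a chance-corrected agreement measure and far larger samples; raw agreement is used here only to keep the example short.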
Finding the right approach
Six reasons to outsource data collection and annotation

While businesses, and especially startups, might be tempted to try and handle their data collection and annotation work internally, this can often prove short-sighted. The amount of resources required for the task should not be underestimated, and there are plenty of pitfalls facing [...]

[...] done, the sooner you can begin actually training your models.

Data security and personal information

The thirst for data exhibited by AI-based systems worries both privacy advocates and regulators.

[...] people from all walks of life. This can go a long way towards helping eliminate unintended bias.

It needn’t be as expensive as you think

The main reason for keeping