
AI Business eBook Series

AI starts with data

Facing the challenges of data collection and annotation

In collaboration with TELUS International

Introduction
The unsung hero

Information is the lifeblood of AI projects. Without sufficient quantities of high quality, accurately labeled data for model training, algorithms will struggle to make the right predictions, no matter how well they are built. Consequently, any product that uses AI systems based on sub-par data will suffer too.

Defining the amount of data necessary for an AI project is like asking about the length of a piece of string – a sensible attitude is to assume that you can’t have too much training data. There are some open source datasets available, and when approaching a popular use case, you may be able to work with what’s already out there; but if you are looking to build something truly new and original, you will likely need to embark on a long data collection and annotation phase.

In this eBook, created in partnership with data specialists at Lionbridge AI (now acquired by TELUS International), we look at the human networks that power most artificial intelligence projects. We discuss the techniques employed in data collection and annotation, outline some of the potential pitfalls that can befall a project at this early stage, and meet the experts to discuss topics like privacy and bias – including representatives of Bloomberg, Unit 9, and Omdia.

We hope this guide will provide useful insight into the challenges of data preparation, helping ensure that your next artificial intelligence project is a success.

Alan Martin | Associate Editor | AI Business

Contents

Market perspective: Despite the impact of Covid-19, the AI market continues to grow
Getting the basics right: The importance of data collection and annotation
Finding the right approach: Six reasons to outsource data collection and annotation
Powered by people: How TELUS International – via Lionbridge AI – enables AI to flourish
Annotation in action: Helping businesses get to grips with their data
Best practices: Learning data annotation from Bloomberg

Stories from the field: Employing AI in interactive media
Eliminating bias and ensuring privacy: Dealing with the issues of AI – before data becomes the model

Market perspective
Despite the impact of Covid-19, the AI market continues to grow
What does the future hold for developers of AI-based products and services?

You might assume that the crippling effect of Covid-19 on the world’s economies, supply chains, and working patterns would slow the progress of artificial intelligence, but the latest report from analyst firm Omdia suggests that’s not the case.

“Smaller companies and certain industries may postpone AI pilot programs and other investments during times of economic distress, but large companies with AI embedded into their business models will continue to leverage the benefits of AI,” Neil Dunay and Aditya Kaul said in the company’s quarterly AI software market forecast. “Once the global economy gets back on track, the AI software market should experience a strong recovery.”

It looks like businesses are only scratching the surface in terms of what’s possible with AI, as new and innovative applications emerge. “AI is proven to drive down costs, generate new revenue schemes and enhance customer experience,” states the report, which identifies 340 AI use cases across 23 industry sectors – with 201 considered “unique.”

Among the more popular use cases, 27 are projected to exceed $1bn in revenue by 2025. Five will be passing the $3bn mark, including voice recognition, video surveillance, and marketing-focused intelligent virtual assistants (IVAs). By the middle of the decade, the AI-based software market as a whole is expected to become a $99 billion industry.

The top eight industry verticals set to benefit are, according to Omdia, consumer, automotive, financial services, telecommunications, retail, business services, healthcare, and advertising. “Each industry in this category of top industries has many viable use cases, a general willingness among industry participants to use AI, and large volumes of data generation that AI algorithms can analyze to optimize processes, lower costs, and boost revenue,” the report explained.

What unites these different types of organizations is their growing thirst for data. Without well-sourced and accurately labeled data, any algorithm will ultimately be hobbled, and perform poorly compared to its rivals – something to be wary of in the current hyper-competitive environment.

“Poor data quality is enemy number one to the widespread, profitable use of machine learning,” Thomas Redman, president of Data Quality Solutions, wrote in the Harvard Business Review. “The quality demands of machine learning are steep, and bad data can rear its ugly head twice — first in the historical data used to train the predictive model, and second, in the new data used by that model to make future decisions.”

Anybody looking to earn their share of that $99 billion would do well to pay attention to the intricacies of data collection and annotation. There is still time to catch up and take part in the AI revolution – but latecomers can’t afford to learn about the impact of poorly labeled data the hard way.

Getting the basics right
The importance of data collection and annotation
Businesses have to consider the question of data quality to avoid common problems with AI

Products and services that incorporate machine learning can’t exist without training data. From virtual assistants like Siri and Alexa to medical-grade diagnostic algorithms, every AI-based system requires its creators to spend substantial time and effort on gathering the right kind of data and labeling it appropriately.

While it may not be the part of AI that most consumers see, it should not be underestimated, both in terms of importance, and how long it takes to get right. As IBM’s Arvind Krishna noted at the Wall Street Journal’s Future of Everything Festival, data collection and annotation can represent about 80% of the work on any given AI project.

This isn’t something everyone is prepared for, he added: “So you run out of patience along the way, because you spend your first year just collecting and cleansing the data, and you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And then you kind of bail on it.”

Why it matters
Data collection involves doing the legwork: literally collecting the information that will be used to train a machine learning model. This could be anything from thousands of pictures of rashes teaching AI to spot skin cancer, to hundreds of handwriting samples for an Optical Character Recognition algorithm. The important thing is to have as many examples as possible, whether that’s text, audio, images, or video.

Annotation, also known as labeling, is also reliant on human expertise. Trained annotators need to manually label the media so that machine learning algorithms can understand the samples they are fed. This comes in many different forms: it could be video annotation, text categorization, or semantic annotation, to name just a few possibilities. Without this, no machine learning algorithm will be able to compute the attributes relevant to its work.

In the skin cancer example used earlier, labeling would likely require specialist dermatological knowledge, but simpler labeling tasks can be handled by almost anyone. The familiar reCAPTCHA checks used online to distinguish humans from bots are a great example: when the system asks you to identify buses in a picture, you’re not just proving your humanity – you’re also helping train an image recognition algorithm.

Machine learning algorithms created with limited data will work in principle, but they simply won’t be reliable enough for production purposes – especially when facing less common edge cases. Likewise, if the data is labeled inaccurately, the algorithms will make the wrong predictions.

Let’s imagine a company that wants to build an AI-based system to identify breeds of dogs from photos. If your data collection process is limited, then the system may have no trouble recognizing more common breeds like Labradors or Terriers – but the same algorithm may end up mistaking a Chihuahua for a rodent, or a Chow Chow for a bear, given that neither bears much resemblance to the rest of the training examples. If the data isn’t labeled correctly, the same errors might occur – but this time it would be down to humans not knowing the differences between an Alaskan Malamute and a Siberian Husky – or not labeling the key characteristic tells.

This is clearly a frivolous example – not least because plenty of existing projects can already tell the difference between dog breeds, including those from the likes of Microsoft. But the principles are ultimately the same, whatever you want to achieve with AI: data collection and annotation are the all-important foundations on which good machine learning algorithms are built.
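To make that foundation concrete, here is a minimal sketch of how labeled examples drive predictions, using a tiny text categorization task. It assumes scikit-learn is installed, and the samples and labels are invented for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Each training example pairs raw data with a human-assigned label.
    samples = [
        "golden retriever playing fetch in the park",
        "labrador puppy chewing a toy",
        "tabby cat sleeping on the windowsill",
        "black cat chasing a laser pointer",
    ]
    labels = ["dog", "dog", "cat", "cat"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(samples, labels)

    # The model can only generalize from what the annotators taught it:
    print(model.predict(["grumpy cat sleeping all day"]))  # -> ['cat']

Mislabel a couple of samples, or leave whole categories out, and the predictions degrade in exactly the ways described above – the algorithm has no other source of truth.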

Finding the right approach
Six reasons to outsource data collection and annotation

While businesses, and especially startups, might be tempted to try and handle their data collection and annotation work internally, this can often prove short-sighted. The amount of resources required for the task should not be underestimated, and there are plenty of pitfalls facing newcomers.

Here are six reasons why you might want to outsource your data collection and annotation to a specialist.

Quality, consistency and accuracy
The most obvious reason to outsource data preparation tasks is straightforward: specialized companies have done this before, and they know exactly how to collect and label data. While these skills can be acquired in-house, the lack of experience will usually mean mistakes along the way. You shouldn’t underestimate how difficult it is to be consistent over a long period of time, even with a small dataset – and consistency is hugely important when it comes to the effectiveness of your AI-based systems.

The scale of the project
Even the most basic machine learning systems require vast amounts of data, and you may quickly discover that the sheer volume is too large for your team to deal with. Every annotation project, no matter how small, will be time-consuming, and you probably don’t want to pull your staff away from their actual jobs for weeks and months on end.

Speed
Businesses that exist to collect and annotate data for AI projects have dedicated staff to turn around assignments quickly. The faster this is done, the sooner you can begin actually training your models.

Data security and personal information
The thirst for data exhibited by AI-based systems worries both privacy advocates and regulators. While it is common practice to redact non-essential information when collecting data for annotation, no company wants to be left liable in case of a data breach. Dedicated data collection and annotation companies know this, and will have plenty of security measures in place, as well as confidentiality agreements with all staff tasked with handling personal information. It is one less thing to worry about.

Casting your net wider
Outsourcing data labeling to a third party means that the groundwork is undertaken by a diverse range of people from all walks of life. This can go a long way towards helping eliminate unintended bias.

It needn’t be as expensive as you think
The main reason for keeping data annotation in-house is the understandable fear of snowballing expenses, but there are low-cost solutions out there for companies running on a tight budget. If you have a dataset that hasn’t been labeled before, and doesn’t need expert knowledge for the labels to be applied (e.g. ‘does this picture contain a cat’ rather than ‘does this echocardiogram show restrictive cardiomyopathy or constrictive pericarditis’) then cloud-based services like Amazon Mechanical Turk can source low-cost workers to get it done. The very definition of ‘many hands make light work’.
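Consistency across many hands usually comes from redundancy: the same item is sent to several workers, and their answers are aggregated. Below is a minimal sketch of majority-vote aggregation – the worker IDs and labels are invented, but crowdsourcing platforms return similar (item, worker, label) records:

    from collections import Counter

    # Invented crowd judgments: (item, worker, label) triples.
    judgments = [
        ("img_001", "worker_a", "cat"),
        ("img_001", "worker_b", "cat"),
        ("img_001", "worker_c", "dog"),
        ("img_002", "worker_a", "dog"),
        ("img_002", "worker_b", "dog"),
        ("img_002", "worker_c", "dog"),
    ]

    def majority_vote(triples):
        by_item = {}
        for item, _, label in triples:
            by_item.setdefault(item, []).append(label)
        results = {}
        for item, labels in by_item.items():
            label, count = Counter(labels).most_common(1)[0]
            results[item] = (label, count / len(labels))  # winner plus agreement ratio
        return results

    print(majority_vote(judgments))
    # {'img_001': ('cat', 0.67), 'img_002': ('dog', 1.0)} (ratios shown rounded)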

Powered by people

How TELUS International – via Lionbridge AI – enables AI to flourish
What happens during a typical data collection and annotation project?

Lionbridge AI specializes in data preparation: both collecting and annotating large datasets, so they can be used by AI-based systems. The company emerged out of machine learning operations run by translation giant Lionbridge, and was later acquired by TELUS International.

“Data collection and data annotation are two different parts of the machine learning project cycle,” Aristotelis Kostopoulos, vice president of AI Product Solutions at Lionbridge AI, told AI Business. “Sometimes, it comes first: you collect the data and then you get to annotate. Other times, the customer already has the data, and you get to annotate it.”

Finding needles in the haystack
The difficulty of obtaining high quality data is one of the main reasons why businesses turn to experts like Lionbridge AI. While there are plenty of datasets readily available for common AI models, niche projects require localized sourcing knowledge. Kostopoulos provided an example: “Imagine that a customer comes and says, ‘I need to collect this data, and I want it to be done in a home-like environment, and I will send you prototypes.’ They want 700 Italian participants, and this needs to be a 50/50 male-female split. They also want five accents – northern, southern, Sardinian, whatever – and they need to be divided into age groups.”

In this scenario, Lionbridge AI would have to identify the right cities to visit – perhaps Milan for northern accents, and Rome for southern ones – and then take care of all the logistics.

“You have to rent a place that meets the requirements, and then you need to hire a local team that will be doing the moderation. Then you get the devices, arrange the setup, and start scheduling this type of collection. You source people and then you get them to attend at a specific time, do the recording for one hour, pay them, and it goes on,” Kostopoulos explained.
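Projects like this are usually run against explicit demographic quotas. As a back-of-the-envelope illustration of Kostopoulos’s hypothetical brief (700 participants, a 50/50 gender split, five accent groups), the sketch below tracks how many recruitment slots remain open – the category names, numbers, and helper function are invented, not Lionbridge AI’s actual tooling:

    from collections import Counter

    # Illustrative quotas: 700 participants, 50/50 gender, five accents.
    # Only three accents were named in the brief; the last two are placeholders.
    TARGETS = {
        "gender": {"female": 350, "male": 350},
        "accent": {"northern": 140, "southern": 140, "sardinian": 140,
                   "accent_4": 140, "accent_5": 140},
    }

    def open_slots(recruited, dimension):
        """How many more participants each category still needs."""
        counts = Counter(person[dimension] for person in recruited)
        return {category: target - counts.get(category, 0)
                for category, target in TARGETS[dimension].items()}

    recruited = [
        {"gender": "female", "accent": "northern"},
        {"gender": "male", "accent": "southern"},
    ]
    print(open_slots(recruited, "gender"))  # {'female': 349, 'male': 349}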

While doing this, you have to be hyper-aware of geographical differences, too – which is where local knowledge is essential. “You could do a speech data collection project across the United States, and in different locations people would order a pizza in a completely different way,” Suzanne Tucker, enterprise sales director at Lionbridge AI, said. “You have to collect all of those in order to represent the whole of the United States, right?”

Logistics challenges are multiplied when clients want a global solution that works across borders. Ricardo Rodrigues, global program director at Lionbridge AI, remembers one particularly large-scale challenge: collecting handwriting samples for an input recognition project. This required participants to use tablets to draw everything from math equations and cursive to diagrams and simple striking gestures.

“We collected and annotated as many as 100,000 of those for English and 600,000 for Chinese,” Rodrigues recalled. “It’s a lot of data to collect and annotate, but it’s understandable because you need very precise information: handwriting technology needs to recognize all the variations in writing styles.”

Annotation stations
Once the necessary data is collected, it is time to start the annotation process. In simple terms, this means describing key information contained in a sample, so that machine learning algorithms can understand the similarities and differences between bits of data, and make the resulting artificial intelligence model reliable and robust – ideally, without the need for further human intervention.

“We’ll use our own proprietary tools like the language AI annotation platform, or our customers’ tools,” Kostopoulos said. “We manage the workflow end-to-end in either case.”

Larger projects can result in ongoing work. “You can imagine that your job is never done when it comes to certain types of workflows, because it’s like trying to hit a moving target,” Tucker explained. “You’re never going to move as fast as information moves. And with every step you take, you actually add to the amount of data that’s out there.”

To make things more interesting, no matter how detailed the handbooks, there will always be an element of ambiguity in data annotation. “It’s important that we represent all the individual ideas behind the data,” Tucker said.

“Our senior project managers, our operations teams, our program directors – they help our clients, not only by delivering the workflows that they’re asking for, but by helping them evolve guidelines and best practices to make sure that we’re not putting the annotators into a position where they are going to create bias just because we don’t allow for free thought.

“This is especially important when dealing with subjective material. In that scenario, you don’t necessarily want a right or wrong answer. It has to be in line with the guidelines, but you want to allow the human interpretation of content as well.”

All of these are pretty compelling reasons for companies to outsource data preparation to specialists like Lionbridge AI, but some are tempted to keep things in-house – often because they believe that they would be saving money, or due to concerns about the security of sensitive projects.

Larger companies, Kostopoulos explained, generally see this as a false economy. “They are smart enough,” he said. “They have done this in the past and they know how it works, so they are mainly outsourcing this type of work.”

Smaller companies often learn the challenges of data collection and annotation firsthand. “They think that they can do it, but then they collect some data, they annotate using their own employees, they train, they don’t get the results and they realize that they need at least 20 times more data than they have.”

For organizations that are set on DIY, Lionbridge AI provides its software as a service. “If it is a small company that really wants to do a proof of concept, or they just want to try something new, they can license our tools,” Kostopoulos said. “They do the proof of concept, they see that it works, and then when they want to scale, they can come back to us and update to a managed service where we take care of the project.”

The community
Data collection and annotation requires a huge amount of manual work, and Lionbridge AI has been in the game long enough to have more than a million potential participants in its database. “We have talent pools that we have recruited, and we try to re-utilize,” Elenn Stirbu, Director of Global Community Sourcing at Lionbridge AI, explained. “It is, of course, good for us because we don’t need to constantly bring on new staff, but it’s also good for the people. We call them the Lionbridge AI community.”

The community – which includes native speakers of more than 500 languages and dialects – has a high net promoter score, meaning that participants often refer each other to become potential data workers. “That’s because we do our best to engage and be transparent as much as we can – and we pay on time and have decent rates,” Stirbu said. For most, this is extra work, rather than their main source of income.

More than half of Lionbridge AI’s contributors join via referral. With AI and data preparation being an ever-growing industry, the company will always be on the hunt for participants across social media, from Facebook and Instagram to Pinterest and Reddit.

The community members are, without doubt, the unsung heroes of the artificial intelligence revolution. “The one element that is not always talked about, but is crucial, is the people,” Rodrigues said. “And when you talk about the people, you talk about the community. You talk about the people that are there every day annotating, reading instructions, putting those instructions into action, and talking with our guys about their challenges.

“I have an awesome team, so that makes the long days a little bit easier to manage, because you know that you have people to rely on.”

Given that AI systems are often seen as cold and impersonal, there’s something innately comforting about the very human work that brings them to life.

Annotation in action

Helping businesses get to grips with their data

Good quality training data is the foundation of successful AI, but collection and annotation is a huge undertaking. Knowing this, businesses are increasingly looking to delegate the work to specialist firms that maintain their own global networks of labeling specialists.

Lionbridge AI has built up considerable expertise in both fields, running large-scale crowdsourcing projects for customers in industries including automotive, retail, finance, defense, healthcare and IT. The company’s data powers everything from online ads to drones to medical imaging systems. Below, we look at two of its recent projects.

Data gatherers
One party to take advantage of Lionbridge AI’s global network was the National Institute of Information and Communications Technology – or NICT. The organization offers a free translation service called VoiceTra which helps non-natives get around Japan. Visitors simply speak a sentence into their phone, and it is translated into Japanese to immediately break down communication barriers.

But the organization had a problem: around 80% of the world’s English speakers are non-native, and the app struggled to understand them. The Institute needed to update its database of non-native English speakers to improve recognition across a wide variety of accents.

Lionbridge AI was on hand to help. Thanks to its global team of professional contributors, the company was able to record, collect and transcribe a dataset with 300,000 samples of non-native English speakers, which was used to retrain and improve VoiceTra. The project worked so well that the two companies are looking for more ways to collaborate.

Label makers
Of course, many companies already have a huge amount of data on hand – they just don’t have the time, resources, or expertise to label it in-house. Traveloka, an Indonesian travel booking startup with ‘unicorn’ status, approached Lionbridge AI to assist with annotation. The aim of the project was to improve the search function on its website, and help visitors find the right product quicker, reducing the chances of them bouncing in search of alternative results.

The data annotation partner would have to map thousands of search queries to a large array of product and sub-product categories – and crucially, it would need to have a working knowledge of both English and Bahasa Indonesia.

From an initial pool of one million qualified contributors, Lionbridge AI handpicked 50 native speakers of both languages to annotate Traveloka’s data using a proprietary custom text classification workbench.

Over a cumulative 6,000 hours, the team was able to parse more than 200,000 text strings to improve search on the site. Traveloka’s search engine now lets users search 76 different product combinations with a single click, for a smoother user experience.

“Lionbridge AI initially stood out because of their capabilities in the regional language space, but we were also impressed by their flexible approach to data annotation,” Dr. Deb Goswami, Traveloka’s data science lead, explained.

“From our previous experience, we know how difficult it is to build out effective data annotation teams from scratch. We were extremely pleased with Lionbridge AI’s ability to scale without compromising on quality or speed, particularly when it came to acting on our feedback.”

In other words, no matter how niche your product, it’s probably worth seeing if external partners could help. That extra experience and network of contributors could make all the difference between a successful launch and an AI flop.

Best practices
Learning data annotation from Bloomberg
We look at some of the complexities of data annotation projects with Bloomberg’s Tina Tseng and Amanda Stent.

One of the most important functions of AI-based systems is dealing with the volumes of data that humans physically can’t manage. That describes Bloomberg’s experiences to a tee: the company handles hundreds of billions of data points from global capital markets every day, from stock information and earnings reports to foreign exchange rates and commodity prices.

Careful and accurate data annotation enables the company to manage this mountain of information, and it has recently published an open access Best Practices Guide for those looking to up their data labeling game.

“Annotation is often learned through practical experience rather than formal training,” Tina Tseng, legal analyst with Bloomberg Law who co-wrote the guide, told AI Business. “Having a guide that actually collects all the expertise and experience that we’ve acquired through projects in many different contexts and supporting many different types of products – financial, legal, government products – we thought it was a great opportunity for us.”

Garbage in, garbage out – squared
Annotation is such a vital part of the AI puzzle that the world should be full of best practice guides. And yet, it patently isn’t. “Machine learning engineers and data scientists typically are taught about every aspect of the machine learning workflow except how to design and manage annotation projects,” the Bloomberg guide laments.

“I think that’s partly because it’s hard – it’s hard and involves human beings,” Amanda Stent, NLP architect from Bloomberg’s office of the CTO, said. “It’s much easier to think about an algorithm or some evaluation metric.”

Tseng agreed: “Annotation involves soft skills that data scientists and machine learning engineers don’t necessarily focus on. Effective communication and collaboration is often needed because it’s such a team effort.”

The subject deserves attention, since the consequences of poor data annotation can be significant. “With traditional computer science, we would say ‘garbage in garbage out,’” Stent explained. “With machine learning-driven data science, it’s ‘garbage in, garbage out – squared.’

“Not only can your annotations be messy, but the way you’ve defined your problem may be incorrect. You may have biases in the way that your data was sampled, which means that your annotations are incomplete. And I don’t just mean biases in the sense that you’re only sampling data from men, or only sampling data from white people. You may be only sampling data from the last year when the important phenomena happened two years ago.”

Which isn’t to say that other more obvious biases don’t exist: they do, and can create a real problem. “If we sample all our data from people who are wealthy enough to be investors, then you will miss people who are not wealthy enough to be investors,” Stent said. “I think what we are seeing with the results of polling over the last two elections is you’re missing a large demographic somehow. It’s not that the math is wrong. It’s that you’re somehow missing something.”

The guide highlights the temptation to throw out data points that are difficult to annotate, or appear like outliers, which can be a huge mistake when it comes to accuracy. “Those are probably the most important to annotate,” Tseng noted. “You want your samples to be representative of the real world dataset because you want models to be able to perform well on the entirety of the dataset that you’re going to be looking at.”

The guide has lots of advice for avoiding common pitfalls, from defining goals clearly and planning a roadmap, to engaging in plenty of open communication along the way. One problem project managers should keep an eye out for is data drift.

“This can happen with any dataset, but especially with finance datasets where the distribution of your data in month A is quite different from the distribution of your data in month B,” Stent said. “Just to give one example, in December 2019, nobody was talking about COVID-19, and in March it was everything that you heard about. And if your document classifier didn’t have that data in there, then there’s no way it would ever help with that label.”
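A simple guard against this kind of drift is to compare label distributions across time windows and alert when they diverge. A minimal sketch – the labels and the alert threshold are invented, and production monitoring would use more robust statistics:

    from collections import Counter

    def label_distribution(labels):
        counts = Counter(labels)
        total = sum(counts.values())
        return {label: count / total for label, count in counts.items()}

    def total_variation(p, q):
        """Half the sum of absolute differences between two distributions."""
        return 0.5 * sum(abs(p.get(l, 0) - q.get(l, 0)) for l in set(p) | set(q))

    december = ["markets", "earnings", "markets", "politics"] * 25
    march = ["covid-19", "markets", "covid-19", "earnings"] * 25

    drift = total_variation(label_distribution(december), label_distribution(march))
    if drift > 0.2:  # illustrative threshold
        print(f"Label distribution shifted by {drift:.0%} - consider re-annotating")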

Managing expectations
Another key consideration is managing the expectations of stakeholders, who will often expect results a lot faster than they can actually emerge. Estimates tend to be off, according to Stent, not just because it’s difficult to predict how much time it takes to pre-process and annotate data, but also because people tend to underestimate how much data needs to be annotated.

“So when you go from zero to the first 10,000 data points, the performance of your model goes from nothing to 80-something percent,” she explained. “And you’re thinking, if I continue on this trajectory, I’ll be done in two weeks.

“But with machine learning, it takes exponentially longer to get from 80 to 85 than it does to get from zero to 80, and then exponentially longer to get from 85 to 88, et cetera. So it is really important that experienced project managers are able to communicate those sorts of timelines and assessments to product and engineering who may not have machine learning experience in their background.”
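That diminishing-returns curve is easy to reproduce by measuring validation accuracy as the training set grows. A minimal sketch using scikit-learn’s learning_curve, with its built-in digits dataset standing in for a real annotated corpus:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_digits(return_X_y=True)
    sizes, _, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:5d} labeled examples -> {score:.1%} validation accuracy")
    # Accuracy climbs steeply at first; each further point costs far more data.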
There’s also a real risk that attempts to speed up annotation will have a knock-on effect on quality – a problem that’s especially prevalent in crowdsourced solutions, where people are paid per annotation. But the guide is pretty clear that the three common ways for annotation to work – crowdsourced, via dedicated vendors, or through full-time employees – each have their own advantages and disadvantages.

The crowd, for example, is great for getting a lot of annotations done quickly, but isn’t very helpful if you need specialist knowledge. Full-time employees, meanwhile, are expensive in any serious volume, but guarantee secret projects stay that way. There simply isn’t a perfect route here, and it should vary from project to project.

“The crowd are incredibly smart and the wisdom of the crowd is a real thing,” Stent said. “But sometimes you could get a better result much more quickly and at a lower cost, using a smaller number of experts with distance supervision being the limited bet. And that’s especially true if your data needs to be kept private.”

Tseng agreed: “Each project is different and each needs to be resourced based on the data required, the complexity of the task and the privacy of the data.”

“That’s why having an experienced project manager or project managers at your organization can make all the difference,” Stent added. “Somebody who can come in and say, ‘Yes, I’ve managed this kind of project before – here are the kinds of workers we can use and here’s the timeline that has worked in the past, and here’s the budget that worked in the past.’ One or two people like Tina and some of my other colleagues who participated in this workshop make such an impact on your bottom line and the success of a project in the long run. They’re very valuable.”

One problem that can rear its head, regardless of the chosen annotation method, is worker fatigue. Annotation can be a monotonous, repetitive task. It is only human for boredom to kick in, and that can have a knock-on effect on consistency. How is the problem approached at Bloomberg?

For Stent, it’s all about matching the task to the worker’s experience, giving them a chance to both use and grow their expertise. Variety and an overall vision of where things are going helps too: “You want your workers to understand what the outcome is going to be. Are they doing things that go directly in front of clients? Are they doing things that will be used to train a model to go directly in front of clients? If they can understand and become invested in the project, they will be less likely to become bored.”

And of course, some subjects for machine learning projects are just inherently entertaining. “I will say the most fun annotation project I’ve ever managed was one where we were asking people to pick the funniest entry for The New Yorker caption contest,” Stent recalled.
“They had one staff member who had to trawl through 5,000 submissions to the caption contest each week to pick the three that you would vote for. The idea was: could we use machine learning to help cull the 5,000 down to something manageable, so that the staff member would only have to choose from around 500?

“I had them writing to me afterwards asking me to post more.”


Stories from the field

Employing AI in interactive media
We discuss the importance of good training data and look at exotic machine learning projects with Unit 9’s Maciej Zasada.

When it comes to artificial intelligence, trying to define clear timelines can be tricky. “I think people often don’t understand this about commercial projects,” Maciej Zasada, technical director of production studio Unit 9, told AI Business. “It’s actually quite unfair to expect the technical team to be able to say ‘okay, that’s going to take two months and we’ll need X amount of data.’”

Unit 9 has taken on a number of AI-based projects in recent years, for clients ranging from Samsung to Royal Caribbean, and each presented its own challenges. One commission, for the BBC’s Tomorrow’s World, involved the creation of a chatbot that could predict the number of ‘likes’ a photo would get on social media. “The idea was to relieve the anxiety and tension which teenagers often have before posting something online,” Zasada explained.

The project required a lot of training data to be useful – and lots of fine-tuning. “When using machine learning, the output, to an extent, is unpredictable, and any debugging – trying to force the algorithm to be more in line with what we anticipate the output to be – is very challenging and time-consuming,” he said. “The amount of time we need to tweak and play with the final output has to be significantly longer than in non-machine learning projects.”

For Zasada, this is all relatively new. Armed with a Master’s degree in Informatics, he saw the commercial interest in AI and decided to take an online course in machine learning. The two-year training program has helped him understand the process better, but also know not to be daunted by the scope of challenges in AI. One of the course’s lecturers was Andrew Ng, founder of the Google Brain project and former chief scientist at Baidu. “He still says if he’s got a new [AI] challenge, it’s not like he immediately knows what to do,” Zasada recalled.

Terabytes of unlabeled data
This attitude was certainly helpful in 2018, when Unit 9 embarked on an ambitious AI-based project for Huawei, called ‘Sound of Light’. The company wanted to train an AI system to recognize the Northern Lights, so that a human composer could perform a symphony to match the different kinds of auroras.

That may sound simple enough – but to genuinely have an AI system directing musical output in response to the Northern Lights, you need a dataset of auroras… and one simply didn’t exist. “I don’t think anyone has actually worked on anything machine learning-related with the auroras,” Zasada said. “How can you get hold of 10,000 images of an aurora?”

The answer, as is often the case, was a human specialist: a ‘Northern Light chaser’ who had assembled plenty of unlabeled data. “You’ve got people who just chase auroras, and they have terabytes of film footage of multiple different Northern Light phenomena across years and locations, and we essentially just partnered with one of those people and got hold of these terabytes of data,” Zasada explained.

From there, the team extracted key video frames, leaving them with approximately 100,000 different aurora images – difficult even for a team the size of Unit 9’s to classify.

Instead, the company turned to Amazon Mechanical Turk – a crowdsourcing platform where members of the public are paid small sums to perform simple tasks: in this case, pairing aurora images that looked alike. The company ended up with ten baskets of aurorae types (a sketch of how such pairwise votes can be merged into groups appears at the end of this chapter).

Mechanical Turk isn’t for everyone, but it was a great fit for the task, since it didn’t require any specialized knowledge of aurorae. While it allows researchers to specify age group, ethnicity, and country of participants, there’s no way to select professional musicians, for example – for that, you would need a data annotation firm.

Huawei was pleased enough with the resulting algorithm to use it in the promotion of its 2018 flagship smartphone, the Mate 20 Pro, with a live conductor leading an orchestra at an event in Vienna.

The pitfalls of in-housing
While many organizations will find publicly available datasets sufficient for their needs, there are potential downsides to handling the labeling process internally. “The time it takes to gather and pre-process data is something we sometimes underestimate,” Zasada said.

Even if time is on your side, outsourcing may seem like an appealing option: “If you have the same person doing the assignments, they become tired, bored and distracted – you just can’t expect them to label data for eight hours in a row.”

Splitting labeling tasks between multiple people counters this to a degree, but adds inconsistency, while still leaving the process open to personal biases. To Zasada, that’s another advantage of outsourcing: while nobody has a complete answer to the issues of bias, appealing to the wisdom of crowds can help.

“If you target a very diverse group of people to respond and to provide data, you’ll have many biases, but they should more or less neutralize in the end,” he said. “You should avoid getting your dataset from a narrow group of people because the bias might be consistent and very strong.”

Managing bias, rather than eradicating it completely, might be the best we can hope for at the moment, but the AI market is developing fast in other respects. “The popularization of cloud computing and cloud solutions – this facilitated and enabled us to work with much bigger datasets, with very complex neural networks and machine learning algorithms that would be untrainable on a single developer’s machine,” Zasada said.

On top of that, the growing interest in the field means that more extensive datasets are now freely available: “As machine learning is becoming more and more popular there are many more open source machine learning solutions that we can take inspiration from, sometimes use directly, sometimes just essentially build upon, which is great.”

Not every development is unambiguously good, of course. One change which Zasada calls both “scary and promising” is the ability of AI systems to mimic human artifacts. ‘Deepfakes’ are becoming increasingly sophisticated, and natural language processing has reached the stage where GPT-3 models can write articles that, at a glance, can pass for the product of human imagination. “This is very exciting, but poses a lot of privacy considerations as well; in the age of disinformation, when people don’t know what to believe, this is also very risky,” he said.
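For the technically curious: pairwise “these look alike” votes, such as the ones Unit 9 gathered, can be merged into groups by treating each image as a node and each vote as an edge between nodes. A minimal union-find sketch – the image IDs are invented, and this is not Unit 9’s actual pipeline:

    def group_by_similarity(pairs):
        """Merge pairwise 'looks alike' votes into connected groups."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for a, b in pairs:
            parent[find(a)] = find(b)  # union the two groups

        groups = {}
        for node in parent:
            groups.setdefault(find(node), set()).add(node)
        return list(groups.values())

    votes = [("aurora_01", "aurora_07"), ("aurora_07", "aurora_12"),
             ("aurora_03", "aurora_05")]
    print(group_by_similarity(votes))
    # Two groups: {aurora_01, aurora_07, aurora_12} and {aurora_03, aurora_05}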

Eliminating bias and ensuring privacy
Dealing with the issues of AI – before data becomes the model
We look at the steps you can take to eliminate common issues that plague commercial AI projects.

In July 2019, Apple became the subject of a news story it would rather have avoided. A whistleblower told The Guardian that contractors tasked with refining the Siri digital assistant were routinely exposed to private information, and would hear discussions related to medical information, sexual encounters, and even drug deals.

Though there was never any indication of personally identifiable information leaking, the story caused quite a stir; customers were clearly not comfortable with such monitoring, and what this meant for privacy. So how can businesses solve the privacy issues related to data collection?

“Privacy is of utmost importance and we work really hard,” Suzanne Tucker, enterprise sales director at Lionbridge AI, told AI Business. “The PII [Personally Identifiable Information] data factor is something that we discuss at the very beginning to set expectations: where can this workforce live? Where can we do the project? Does that have to be in a secure facility? How do we remove any PII, if necessary, in the process? There are so many different nuances to it.

“Obviously there are many laws, especially in Europe, in place in order to protect PII, and compliance is an essential part of the process. But one of the things I love about my work is that we can create scenarios that don’t violate any privacy laws, but still represent human behavior.”

Where sensitive information has to be handled by people, security is paramount, and companies like Lionbridge AI have secure labs purpose-built for the collection and annotation process, with strict rules in place for contractors managing the data.

“They need to leave their mobile phones outside, they can’t take an MP3 player or anything,” Aristotelis Kostopoulos, vice president of AI Product Solutions at Lionbridge AI, explained. “You cannot access your personal email or take screenshots from the computer and send them.”

That’s certainly helpful, but according to Ricardo Rodrigues, one of the company’s global program directors, there needs to be more transparency from big tech about the way customer data is handled, in order to make people more accepting of the essential work that still needs the human touch. For him, there’s a ‘give and take’ involved with large companies: “If I am more transparent with my users, then the users will be more willing to share their data with me.”

Battling bias
There’s another issue that AI projects need to be wary of, and that’s bias. Prejudiced datasets can lead to serious issues, from criminal justice algorithms taking a dislike to ethnic minority offenders, to recruiting tools that show a bias against hiring women. Dr Ramsey Faragher, a teacher in AI and machine learning at Queen’s College Cambridge, once said: “It’s inevitable that if you simply train AIs on data that we’re not proud of, you’re going to end up with AIs that are racist, sexist, or worse.” Fortunately, companies working in AI today are alive to the dangers of bias in data, Kostopoulos said, especially for collection projects.

So how can the companies just starting out with AI avoid falling at the first hurdle? For Elenn Stirbu, director of Global Community Sourcing at Lionbridge AI, it’s a case of working around the problem rather than eliminating it entirely. “All data is biased,” she noted. “So the challenge is to have enough data to represent all the biases that you can have.

“That’s why one important thing in data collection projects is the distribution of the demographics. You want to make sure that all age ranges are represented, and genders, and we can go further than that and talk about ethnicities and skin tone.” Even things like height, weight and eye color can be considered – after all, you never know what kinds of patterns AI might spot.

Once data is collected, it is imperative to ensure that bias isn’t introduced during the annotation stage. This can be done through clear guidelines that focus on uniformity and consistency of data labeling. “If you’re working with 100 people, you’re working with 100 different understandings, different life experiences,” Rodrigues said. At the same time, any guidelines have to be flexible enough to cope with the inevitable ambiguity in data. “That’s why human annotators are so special, and provide so much value,” Tucker added.

It’s a tricky balancing act, but one that Lionbridge AI has to deal with every day. The rewards speak for themselves: “One of the greatest things about the job is it’s always evolving – it’s fascinating work,” Tucker said. “You go home, get on your smartphone, and you’re reading something on LinkedIn – and sometimes I stop and think, ‘you know what? That’s the program we worked on three years ago, and now it’s in my feed.’ So it comes full circle.”


TELUS International designs, builds and delivers next-generation digital solutions to enhance the customer experience (CX) for global and disruptive brands. The company’s integrated solutions span digital strategy and consulting, IT lifecycle, data annotation and intelligent automation, and omnichannel CX including content moderation and other trust and safety solutions.

For more information, visit telusinternational.com
