The Curious Case of Data Annotation and AI

Data annotation takes time. And for in-house teams, labeling data can be the proverbial bottleneck, limiting a company's ability to quickly train and validate machine learning models.

By its very definition, artificial intelligence refers to computer systems that can learn, reason, and act for themselves, but where does this intelligence come from? For decades, the collaborative intelligence of humans and machines has produced some of the world's leading technologies. And while there's nothing glamorous about the data being used to train today's AI applications, the role of data annotation in AI is nonetheless fascinating.

See also: New Tool Offers Help with Data Annotation

Poorly Labeled Data Leads to Compromised AI

Imagine reviewing hours of video footage, sorting through thousands of driving scenes to label every vehicle that comes into frame, and you've got data annotation. Data annotation is the process of labeling images, video, audio, and other data sources so the data is recognizable to computer systems programmed for supervised learning. This is the intelligence behind AI algorithms.
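
To make the idea concrete, here is a minimal sketch of what one annotated driving frame can look like for supervised learning, loosely following the common COCO bounding-box style. The file name, image size, classes, and coordinates are hypothetical examples, not taken from any dataset mentioned in this article.

```python
# One annotated image: the frame is the model input, the labeled boxes
# are the supervision signal the model learns to reproduce.
annotation = {
    "image": "frame_000123.jpg",          # hypothetical frame from a driving video
    "width": 1920,
    "height": 1080,
    "objects": [
        {
            "label": "car",               # object class the model should learn
            "bbox": [412, 560, 180, 95],  # [x, y, width, height] in pixels
        },
        {
            "label": "pedestrian",
            "bbox": [1024, 488, 42, 130],
        },
    ],
}

# Print a quick summary of the labels attached to this frame.
for obj in annotation["objects"]:
    x, y, w, h = obj["bbox"]
    print(f'{obj["label"]}: top-left=({x}, {y}), size={w}x{h}')
```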

For companies using AI to solve world problems, improve operations, increase efficiencies, or otherwise gain a competitive edge, training an algorithm is more than just collecting annotated data; it means sourcing superior-quality training data and ensuring that data contributes to model validation, so applications can be brought to market quickly, safely, and ethically.

Data is the most crucial element of machine learning. Without data annotation, computers couldn't be trained to see, speak, or perform intelligent functions, yet obtaining datasets and labeling training data are among the top limitations to adopting AI, according to the McKinsey Global Institute. Another known limitation is data bias, which can creep in at any stage of the training data lifecycle but more often than not arises from poor-quality or inconsistent data labeling.

IDC reports that 50 percent of IT and data professionals surveyed cite data quality as a challenge in deploying AI workloads. But where does quality data come from?

Open-source datasets are one way to collect data for an ML model, but since many are curated for a specific use case, they may not be useful for highly specialized needs. Also, the amount of data needed to train your algorithm may vary based on the complexity of the problem you're trying to solve and the complexity of your model.

The Waymo Open Dataset is the largest, most diverse autonomous driving dataset to date, consisting of thousands of images labeled with millions of bounding boxes and object classes: 12 million 3D bounding box labels and 1.2 million 2D bounding box labels, to be exact. Still, Waymo plans to keep growing the dataset even further.

Why? Because current, accurate, and refreshed data is necessary to continuously train, validate, and maintain agile machine learning models. There are always edge cases, and for some use cases, even more data is needed. If the data is lacking in any way, those gaps compromise the intelligence of the algorithm in the form of bias, false positives, poor performance, and other issues.

Let's say you're searching for a new laptop. When you type your specifications into the search bar, the results that come up are the work of millions of labeled and indexed data points, from product SKUs to product photos.

If your search returns results for a lunchbox, a briefcase, or anything else mistaken for the signature clamshell of a laptop, you've got a problem. You can't find it, so you can't buy it, and that company just lost a sale.

This is why quality annotated data is so important. Poor-quality data correlates directly with biased and inaccurate models, and in some cases, improving data quality is as simple as making sure you have the right data in the first place.

Vulcan Inc. experienced the challenge of dataset diversity first-hand while working to develop AI-enabled products that could record and monitor African wildlife. While trying to detect cows in imagery, the team realized its model could not recognize cows in Africa based on a dataset of cows from Washington alone. To get the ML model operating at peak performance, the team needed to create a training dataset of its own.

Labeling Data Is Demanding for AI Teams

As you might expect, data annotation takes time. And for in-house teams, labeling data can be the proverbial bottleneck, limiting your ability to quickly train and validate machine learning models.

Labeling datasets is arguably one of the hardest parts of building AI. Cognilytica reports that 80 percent of AI project time is spent aggregating, cleaning, labeling, and augmenting data to be used in machine learning models. That's before any model development or AI training even begins.

And while labeling data is neither an engineering challenge nor a data science problem, data annotation can prove demanding for several reasons.

The first is the sheer amount of time it takes to prepare large volumes of raw data for labeling. It's no secret that human effort is required to create datasets, and sorting irrelevant data from the desired data is a task in and of itself.

Then there's the challenge of getting the clean data labeled efficiently and accurately. A short video could take several hours to annotate, depending on the number of object classes represented and how densely they must be labeled for the model to learn effectively.

An in-house team may not have enough dedicated personnel to process the data in a timely manner, leaving model development at a standstill until the task is complete. In some cases, the added pressure of keeping the AI pipeline moving can lead to incomplete or partially labeled data, or worse, blatant errors in the annotations.

Even in instances where existing personnel can serve as the in-house data annotation team, and they have the training and expertise to do it well, few companies have the technology infrastructure to support an AI pipeline from ingestion to algorithm securely and smoothly.

This is why organizations lacking the time for data annotation, annotation expertise, clear strategies for AI adoption, or technology infrastructure to support the training data lifecycle partner with trusted providers to build smarter AI.

To improve its retail item coverage from 91 to 98 percent, Walmart worked with a specialized data annotation partner to evaluate its data and ensure its accuracy for training Walmart systems. With more than 2.5 million items cataloged during the partnership, the Walmart team has been able to focus on model development rather than aggregating data.

How Data Annotation Providers Combine Humans and Tech

Data annotation providers have access to tools and techniques that can help expedite the annotation process and improve the accuracy of data labeling.

For starters, working day in and day out with training data means these companies see a range of scenarios where data annotation is seamless and where things could be improved. They can then pass these learnings on to their clients, helping to create effective training data strategies for AI development.

For organizations unsure of how to operationalize AI in their business, an annotation provider can serve as a trusted advisor to your machine learning team, asking the right questions at the right time, under the right circumstances.

A recent report found that for every dollar spent on third-party services, organizations spend five times more on internal data labeling. This may be due, in part, to the expense of assigning labeling tasks to data scientists and ML engineers. Still, there's also something to be said for the established platforms, workflows, and trained workforces that allow annotation service providers to work more efficiently.

Working with a trusted partner often means that the annotators assigned to your project receive training to understand the context of the data being labeled. It also means you have a dedicated technology platform for data labeling. Over time, your dedicated team of labelers can begin to specialize in your specific use case, and this expertise results in lower costs and better scalability of your AI programs.

Technology platforms that incorporate automation and reporting, such as automated QA, can also help improve labeling efficiency by helping to prevent logical inconsistencies, expedite training for data labelers, and ensure a consistent measure of annotation quality. This also helps reduce the amount of manual QA time required by clients as well as the annotation provider.
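
As a rough illustration of what an automated QA pass can look like, the sketch below checks annotations in the record format from the earlier example. The allowed label set and the specific rules (known class, non-empty box, box inside the image) are hypothetical; real annotation platforms run much broader rule sets.

```python
# Illustrative automated QA over one annotated image.
ALLOWED_LABELS = {"car", "pedestrian", "cyclist", "truck"}

def qa_check(annotation):
    """Return a list of human-readable QA errors for one annotated image."""
    errors = []
    img_w, img_h = annotation["width"], annotation["height"]
    for i, obj in enumerate(annotation["objects"]):
        x, y, w, h = obj["bbox"]
        if obj["label"] not in ALLOWED_LABELS:
            errors.append(f"object {i}: unknown label '{obj['label']}'")
        if w <= 0 or h <= 0:
            errors.append(f"object {i}: empty bounding box")
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            errors.append(f"object {i}: box extends outside the image")
    return errors

# Any image that fails these checks is routed back to a labeler instead of
# reaching the training set, which is what keeps manual QA time down.
```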

Few-click annotation is another example, using machine learning to increase accuracy and reduce labeling time. With few-click annotation, the time it would take a human to annotate several points can be reduced from two minutes to a few seconds. This combination of machine learning and a few clicks of human support produces a level of labeling precision previously not possible with human effort alone.
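
The sketch below shows the general idea, not any provider's actual tooling: two annotator clicks define a coarse rectangle, and an algorithm refines it into a tight object mask. Commercial few-click tools typically use learned models; OpenCV's classical GrabCut is used here only as a stand-in, and "scene.jpg" and the click coordinates are hypothetical.

```python
import cv2
import numpy as np

img = cv2.imread("scene.jpg")  # hypothetical input frame

# Two clicks from the annotator: roughly opposite corners of the object.
click_a, click_b = (410, 555), (600, 660)
x0, y0 = min(click_a[0], click_b[0]), min(click_a[1], click_b[1])
x1, y1 = max(click_a[0], click_b[0]), max(click_a[1], click_b[1])
rect = (x0, y0, x1 - x0, y1 - y0)  # (x, y, width, height)

# GrabCut refines the coarse rectangle into a per-pixel foreground mask.
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Definite and probable foreground pixels form the refined object mask,
# far tighter than the raw two-click rectangle.
object_mask = np.where(
    (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0
).astype("uint8")
cv2.imwrite("object_mask.png", object_mask)
```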

The human in the loop is not going away in the AI supply chain. However, more data annotation providers are also using pre- and post-processing technologies to support the humans training AI. In pre-processing, machine learning is used to convert raw data into clean datasets using a script. This does not replace or reduce data labeling, but it can help improve the quality of the annotations and the labeling process.
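
As a simplified, rule-based stand-in for the ML-driven pre-processing described above, the script below culls raw frames before they ever reach a labeler: it drops exact duplicates and frames too blurry to label reliably. The directory names and blur threshold are hypothetical, and a real pipeline would add further checks (resolution, exposure, near-duplicates).

```python
import hashlib
import shutil
from pathlib import Path

import cv2

RAW_DIR, CLEAN_DIR = Path("raw_frames"), Path("clean_frames")
BLUR_THRESHOLD = 100.0  # variance of the Laplacian below this is treated as blurry

CLEAN_DIR.mkdir(exist_ok=True)
seen_hashes = set()

for path in sorted(RAW_DIR.glob("*.jpg")):
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:        # exact duplicate frame
        continue
    img = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    if img is None:                  # unreadable or corrupt file
        continue
    if cv2.Laplacian(img, cv2.CV_64F).var() < BLUR_THRESHOLD:
        continue                     # too blurry to label reliably
    seen_hashes.add(digest)
    shutil.copy(path, CLEAN_DIR / path.name)  # keep only labelable frames
```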

There are no shortcuts to training AI, but a data annotation provider can help expedite the labeling process by leveraging in-house technology platforms and acting as an extension of your team to close the loop between data scientists and data labelers.
