Data annotation takes time. And for in-house teams, labeling data can be the proverbial bottleneck, limiting a company's ability to quickly train and validate machine learning models.
By its very definition, artificial intelligence refers to computer systems that can learn, reason, and act for themselves, but where does this intelligence come from? For decades, the collaborative intelligence of humans and machines has produced some of the world's leading technologies. And while there's nothing glamorous about the data being used to train today's AI applications, the role of data annotation in AI is nonetheless fascinating.
Poorly Labeled Data Leads to Compromised AI
Imagine reviewing hours of video footage, sorting through thousands of driving scenes to label all of the vehicles that come into frame, and you've got data annotation. Data annotation is the process of labeling images, video, audio, and other data sources so the data is recognizable to computer systems programmed for supervised learning. This is the intelligence behind AI algorithms.
For companies using AI to solve world problems, improve operations, increase efficiencies, or otherwise gain a competitive edge, training an algorithm is more than just collecting annotated data. It's sourcing superior-quality training data and ensuring that data is contributing to model validation, so applications can be brought to market quickly, safely, and ethically.
Data is the most crucial element of machine learning. Without data annotation, computers couldn't be trained to see, speak, or perform intelligent functions, yet obtaining datasets and labeling training data are among the top limitations to adopting AI, according to the McKinsey Global Institute. Another known limitation is data bias, which can creep in at any stage of the training data lifecycle, but more often than not results from poor-quality or inconsistent data labeling.
IDC shared that 50 percent of IT and data professionals surveyed report data quality as a challenge in deploying AI workloads, but where does quality data come from?
Open-source datasets are one way to collect data for an ML model, but since many are curated for a specific use case, they may not be useful for highly specialized needs. Also, the amount of data needed to train your algorithm may vary based on the complexity of the problem you're trying to solve and the complexity of your model.
The Waymo Open Dataset is the largest, most diverse autonomous driving dataset to date, consisting of thousands of images labeled with millions of bounding boxes and object classes: 12 million 3D bounding box labels and 1.2 million 2D bounding box labels, to be exact. Still, Waymo has plans to continuously grow the size of this dataset even further.
Why? Because current, accurate, and refreshed data is necessary to continuously train, validate, and maintain agile machine learning models. There are always edge cases, and for some use cases, even more data is needed. If the data is lacking in any way, those gaps compromise the intelligence of the algorithm in the form of bias, false positives, poor performance, and other issues.
Let's say you're searching for a new laptop. When you type your specifications into the search bar, the results that come up are the work of millions of labeled and indexed data points, from product SKUs to product photos.
If your search returns results for a lunchbox, a briefcase, or anything else mistaken for the signature clamshell of a laptop, you've got a problem. You can't find it, so you can't buy it, and that company just lost a sale.
This is why quality annotated data is so important. Poor-quality data has a direct correlation to biased and inaccurate models, and in some cases, improving data quality is as simple as making sure you have the right data in the first place.
Vulcan Inc. experienced the challenge of dataset diversity first-hand while working to develop AI-enabled products that could record and monitor African wildlife. While trying to detect cows in imagery, they realized their model could not recognize cows in Africa, because it had been trained on a dataset of cows from Washington alone. To get their ML model operating at peak performance, they needed to create a training dataset of their own.
Labeling Data, Demanding for AI Teams
As you might expect, data annotation takes time. And for in-house teams, labeling data can be the proverbial bottleneck, limiting your ability to quickly train and validate machine learning models.
Labeling datasets is arguably one of the hardest parts of building AI. Cognilytica reports that 80 percent of AI project time is spent aggregating, cleaning, labeling, and augmenting data to be used in machine learning models. That's before any model development or AI training even begins.
And while labeling data is not an engineering challenge, nor is it a data science problem, data annotation can prove demanding for several reasons.
The first is the sheer amount of time it takes to prepare large volumes of raw data for labeling. It's no secret that human effort is required to create datasets, and sorting irrelevant data from the desired data is a task in and of itself.
Then, there's the challenge of getting the clean data labeled efficiently and accurately. A short video could take several hours to annotate, depending on the object classes represented and the density of labels the model needs to learn effectively.
An in-house team may not have enough dedicated personnel to process the data in a timely manner, leaving model development at a standstill until this task is complete. In some cases, the added pressure of keeping the AI pipeline moving can lead to incomplete or partially labeled data, or worse, blatant errors in the annotations.
Even in instances where existing personnel can serve as the in-house data annotation team, and they have the training and expertise to do it well, few companies have the technology infrastructure to support an AI pipeline from ingestion to algorithm, securely and smoothly.
This is why organizations lacking the time for data annotation, annotation expertise, clear strategies for AI adoption, or the technology infrastructure to support the training data lifecycle partner with trusted providers to build smarter AI.
To improve its retail item coverage from 91 to 98 percent, Walmart worked with a specialized data annotation partner to evaluate its data and ensure it was accurate enough to train Walmart systems. With more than 2.5 million items cataloged during the partnership, the Walmart team has been able to focus on model development rather than aggregating data.
How Data Annotation Providers Combine Humans and Tech
Data annotation providers have access to tools and techniques that can help expedite the annotation process and improve the accuracy of data labeling.
For starters, working day in and day out with training data means these companies see a range of scenarios where data annotation is seamless and where things could be improved. They can then pass these learnings on to their clients, helping to create effective training data strategies for AI development.
For organizations unsure of how to operationalize AI in their business, an annotation provider can serve as a trusted advisor to the machine learning team, asking the right questions, at the right time, under the right circumstances.
A recent report shared that organizations spend five times more on internal data labeling for every dollar spent on third-party services. This may be due, in part, to the expense of assigning labeling tasks to data scientists and ML engineers. Still, there's also something to be said for the established platforms, workflows, and trained workforce that allow annotation service providers to work more efficiently.
Working with a trusted partner often means that the annotators assigned to your project receive training to understand the context of the data being labeled. It also means you have a dedicated technology platform for data labeling. Over time, your dedicated team of labelers can begin to specialize in your specific use case, and this expertise results in lower costs and better scalability of your AI programs.
Technology platforms that incorporate automation and reporting, such as automated QA, can also improve labeling efficiency by helping to prevent logical errors, expedite training for data labelers, and ensure a consistent measure of annotation quality. This also helps reduce the amount of manual QA time required of clients, as well as the annotation provider.
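To make the idea of automated QA a little more concrete, here is a minimal sketch of the kind of logical check such a platform might run on bounding-box labels. The record layout, the class list, and the function name `qa_check` are all assumptions made for illustration; they do not describe any particular provider's platform.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical class list for a driving dataset; a real project defines its own.
ALLOWED_CLASSES = {"vehicle", "pedestrian", "cyclist", "sign"}


@dataclass
class BoxAnnotation:
    """One 2D bounding-box label in pixel coordinates (assumed layout)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def qa_check(ann: BoxAnnotation, image_width: int, image_height: int) -> List[str]:
    """Return a list of problems found; an empty list means the box passed."""
    problems = []
    # Rule 1: the label must come from the agreed class list.
    if ann.label not in ALLOWED_CLASSES:
        problems.append(f"unknown class: {ann.label}")
    # Rule 2: the box must have positive width and height.
    if ann.x_max <= ann.x_min or ann.y_max <= ann.y_min:
        problems.append("box has zero or negative area")
    # Rule 3: the box must lie inside the image.
    if (ann.x_min < 0 or ann.y_min < 0
            or ann.x_max > image_width or ann.y_max > image_height):
        problems.append("box extends outside the image")
    return problems
```

Checks like these catch only mechanical mistakes. Whether a box is drawn around the right object still needs a human reviewer.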
Few-click annotation is another example. It uses machine learning to increase accuracy and reduce labeling time. With few-click annotation, the time it would take a human to annotate several points can be reduced from two minutes to a few seconds. This combination of machine learning and a human who contributes a few clicks produces a level of labeling precision not previously possible with human effort alone.
The human in the loop is not going away in the AI supply chain. However, more data annotation providers are also using pre- and post-processing technologies to support humans training AI. In pre-processing, machine learning is used to convert raw data into clean datasets, using a script. This does not replace or reduce data labeling, but it can help improve the quality of the annotations and the labeling process.
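A pre-processing script of this kind can be sketched in a few lines. The example below is deliberately simpler than what the article describes: it uses plain rules rather than machine learning, dropping empty items and exact duplicates before the data reaches labelers. The function name `clean_for_labeling` and the `min_bytes` threshold are hypothetical.

```python
import hashlib
from typing import Iterable, List


def clean_for_labeling(raw_items: Iterable[bytes], min_bytes: int = 1) -> List[bytes]:
    """Drop empty or too-small items and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for item in raw_items:
        # Skip items too small to be a usable frame or clip (assumed threshold).
        if len(item) < min_bytes:
            continue
        # Identify exact duplicates by content hash so each is labeled only once.
        digest = hashlib.sha256(item).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(item)
    return cleaned
```

For instance, `clean_for_labeling([b"frame-a", b"", b"frame-b", b"frame-a"])` keeps only the first copy of `frame-a` and `frame-b`. Near-duplicate detection or relevance filtering, where a learned model would come in, is outside the scope of this sketch.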
There are no shortcuts to training AI, but a data annotation provider can help expedite the labeling process by leveraging in-house technology platforms and acting as an extension of your team to close the loop between data scientists and data labelers.
See the rest here:
The Curious Case of Data Annotation and AI - RTInsights