What is Data Labeling?

In machine learning, data labeling is the process of identifying raw data (images, text files, videos, etc.) and adding one or more meaningful and informative labels that provide context so that a machine learning model can learn from it. For example, labels might indicate whether a photo contains a bird or a car, which words were uttered in an audio recording, or whether an x-ray contains a tumor. Data labeling is required for a variety of use cases, including computer vision, natural language processing, and speech recognition.


How does data labeling work?

Today, most practical machine learning models use supervised learning, in which an algorithm learns to map inputs to outputs from labeled examples. For supervised learning to work, you need a labeled set of data that the model can learn from to make correct decisions. Data labeling typically starts by asking humans to make judgments about a given piece of unlabeled data. For example, labelers may be asked to tag all the images in a dataset where “does the photo contain a bird” is true. The tagging can be as coarse as a simple yes/no or as granular as identifying the specific pixels in the image associated with the bird. The machine learning model uses the human-provided labels to learn the underlying patterns in a process called “model training.” The result is a trained model that can be used to make predictions on new data.
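
The idea of learning from labeled examples and then predicting on new data can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: it assumes each image has already been reduced to two made-up numeric features, and it uses a 1-nearest-neighbor rule as a stand-in for a real model, so "training" here is nothing more than storing the labeled examples. The names `labeled_data` and `predict` are hypothetical.

```python
from math import dist

# Hypothetical labeled examples: each item pairs a feature vector with a
# human-provided answer to "does the photo contain a bird?".
# The two features are made-up stand-ins for values extracted from an image.
labeled_data = [
    ((0.9, 0.8), True),
    ((0.8, 0.9), True),
    ((0.1, 0.2), False),
    ((0.2, 0.1), False),
]


def predict(features, training_data):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    _, nearest_label = min(
        training_data, key=lambda example: dist(example[0], features)
    )
    return nearest_label


# Predict on a new, unlabeled example.
print(predict((0.85, 0.75), labeled_data))
```

Because the prediction is copied from the nearest labeled example, a wrong label in `labeled_data` would be passed straight through to the output, which is why label quality matters.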


In machine learning, a properly labeled dataset that you use as the objective standard to train and assess a given model is often called “ground truth.” The accuracy of your trained model depends on the accuracy of your ground truth, so spending the time and resources to ensure highly accurate data labeling is essential.
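
As a rough sketch of how ground truth is used to assess a model, the snippet below compares a list of model predictions against ground-truth labels and reports the fraction that match. The label values and the `accuracy` helper are hypothetical and chosen only for illustration.

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    matches = sum(p == g for p, g in zip(predictions, ground_truth))
    return matches / len(ground_truth)


# Hypothetical labels for five images.
ground_truth = ["bird", "no bird", "bird", "bird", "no bird"]
predictions = ["bird", "no bird", "no bird", "bird", "no bird"]

print(accuracy(predictions, ground_truth))
```

If a ground-truth entry is itself wrong, a correct prediction is counted as an error (or an incorrect one as a match), so the measured accuracy is only as trustworthy as the labels behind it.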

What are some common types of data labeling?

Computer Vision

When building a computer vision system, you first need to label images, pixels, or key points, or draw a border that fully encloses an object in a digital image, known as a bounding box, to generate your training dataset. For example, you can classify images by type (like product vs. lifestyle images) or by content (what’s actually in the image itself), or you can segment an image at the pixel level. You can then use this training data to build a computer vision model that automatically categorizes images, detects the location of objects, identifies key points in an image, or segments an image.
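
One possible way to represent a bounding box label in code is shown below, together with intersection over union (IoU), a common measure of how closely two boxes agree (for example, one labeler's box against a reference box). The `BoundingBox` class, its field names, and the pixel coordinates are assumptions made for this sketch, not a format prescribed by any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates, with a class label."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a: BoundingBox, b: BoundingBox) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = a.area() + b.area() - intersection
    return intersection / union


# Two hypothetical boxes drawn around the same bird by different labelers.
reference = BoundingBox("bird", 10, 10, 50, 50)
candidate = BoundingBox("bird", 30, 10, 70, 50)
print(iou(reference, candidate))
```

An IoU of 1.0 means the boxes coincide exactly and 0.0 means they do not overlap; a labeling workflow might accept a box only above some chosen threshold.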

Natural Language Processing

Natural language processing requires you to first manually identify important sections of text or tag the text with specific labels to generate your training dataset. For example, you may want to identify the sentiment or intent of a text blurb, identify parts of speech, classify proper nouns like places and people, or identify text in images, PDFs, or other files. To label text in images or documents, you can draw bounding boxes around the text and then manually transcribe it in your training dataset. Natural language processing models are used for sentiment analysis, named entity recognition, and optical character recognition.
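
Tagging sections of text is often recorded as character-offset spans with a label attached. The sketch below shows one such representation and a helper that extracts the labeled substrings while rejecting spans that fall outside the text. The `EntitySpan` class, the `PERSON` and `PLACE` label names, and the sample sentence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A labeled span of text, using character offsets [start, end)."""
    start: int
    end: int
    label: str


def labeled_substrings(text, spans):
    """Return (substring, label) pairs, rejecting spans outside the text."""
    pairs = []
    for span in spans:
        if not 0 <= span.start < span.end <= len(text):
            raise ValueError(f"span out of range: {span}")
        pairs.append((text[span.start:span.end], span.label))
    return pairs


text = "Ada Lovelace was born in London."
spans = [EntitySpan(0, 12, "PERSON"), EntitySpan(25, 31, "PLACE")]
print(labeled_substrings(text, spans))
```

Storing offsets rather than copies of the text keeps each label tied to an exact position, which matters when the same word appears more than once.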

Audio Processing

Audio processing converts all kinds of sounds, such as speech, wildlife noises (barks, whistles, or chirps), and building sounds (breaking glass, scans, or alarms), into a structured format so they can be used in machine learning. Audio processing often requires you to first manually transcribe the audio into written text. From there, you can uncover deeper information about the audio by adding tags and categorizing it. This categorized audio becomes your training dataset.
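
To make "a structured format" more concrete, here is one way transcribed and tagged audio could be stored: as time-stamped segments, each with a transcript and a set of tags. The `AudioSegment` class, its field names, and the sample values are assumptions for illustration only. The helper totals how many seconds of audio carry each tag, a simple check on how balanced a labeled dataset is.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioSegment:
    """A labeled stretch of audio: times in seconds, transcript, and tags."""
    start_s: float
    end_s: float
    transcript: str
    tags: tuple


def seconds_per_tag(segments):
    """Sum the duration of all segments carrying each tag."""
    totals = defaultdict(float)
    for segment in segments:
        duration = segment.end_s - segment.start_s
        for tag in segment.tags:
            totals[tag] += duration
    return dict(totals)


# Hypothetical labeled segments from a single recording.
segments = [
    AudioSegment(0.0, 2.5, "hello, is anyone there?", ("speech",)),
    AudioSegment(2.5, 3.0, "", ("alarm",)),
    AudioSegment(3.0, 6.0, "please leave the building", ("speech", "alarm")),
]

print(seconds_per_tag(segments))
```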

What are some best practices for data labeling?

There are many techniques to improve the efficiency and accuracy of data labeling. Some of these techniques include:
