The advent of large language models (LLMs) marks a significant shift in how industries leverage AI to enhance operations and services. By automating routine tasks and streamlining processes, LLMs free up human resources for more strategic endeavors, thus improving overall efficiency and productivity.
Training and customizing LLMs for high accuracy is fraught with challenges, primarily due to their dependency on high-quality data. Poor data quality and inadequate volume can significantly reduce model accuracy, making dataset preparation a critical task for AI developers.
Datasets frequently contain duplicate documents, personally identifiable information (PII), and formatting issues. Some datasets even house toxic or harmful information that poses risks to users. Training models on these datasets without proper processing can result in longer training times and lower model quality. Another significant challenge is the scarcity of data. Model builders are running out of publicly available data to train on, prompting many to turn to third-party vendors or generate synthetic data using advanced LLMs.
In this post, we will describe data processing techniques and best practices for optimizing LLM performance by improving data quality for training. We will introduce NVIDIA NeMo Curator and how it addresses these challenges, demonstrating real-world data processing use cases for LLMs.
Text processing pipelines and best practices
Preprocessing large datasets is nontrivial, especially when the dataset consists mainly of web-scraped data, which is likely to contain large amounts of ill-formatted, low-quality content.
Figure 1 shows a comprehensive text processing pipeline, including the following steps at a high-level:
- Download the dataset from the source and extract to a desirable format such as JSONL.
- Apply preliminary text cleaning, such as Unicode fixing and language separation.
- Apply both standard and custom-defined filters to the dataset based on specific quality criteria.
- Perform various levels of deduplication (exact, fuzzy, and semantic).
- Selectively apply advanced quality filtering, including model-based quality filtering, PII redaction, distributed data classification, and task decontamination.
- Blend curated datasets from multiple sources to form a unified dataset.
The sections below dive deeper into each of these stages.
Download and extract text
The initial step in data curation involves downloading and preparing datasets from common sources such as Common Crawl, specialized collections such as arXiv and PubMed, or private on-premises datasets, each potentially containing terabytes of data.
This crucial phase requires careful consideration of storage formats and extraction methods, as publicly hosted datasets often come in compressed formats (for example, .warc.gz, tar.gz, or zip files) that need to be converted to more manageable formats (such as .jsonl or .parquet) for further processing.
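As a minimal illustration of this extraction step, the following sketch converts a hypothetical tar.gz archive of plain-text files into gzipped JSONL using only the Python standard library. The file names are assumptions, and real pipelines would use format-specific extractors (for example, for .warc.gz files from Common Crawl).

```python
import gzip
import json
import tarfile

# Hypothetical input archive and output path; adjust to your dataset layout.
ARCHIVE_PATH = "docs.tar.gz"
OUTPUT_PATH = "docs.jsonl.gz"

with tarfile.open(ARCHIVE_PATH, "r:gz") as archive, \
        gzip.open(OUTPUT_PATH, "wt", encoding="utf-8") as out:
    for member in archive.getmembers():
        if not member.isfile():
            continue
        # Read each text file and write it as one JSON document per line (JSONL).
        text = archive.extractfile(member).read().decode("utf-8", errors="replace")
        record = {"id": member.name, "text": text}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```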
Preliminary text cleaning
Unicode fixing and language identification represent crucial early steps in the data curation pipeline, particularly when dealing with large-scale web-scraped text corpora. This phase addresses two fundamental challenges: improperly decoded Unicode characters, and the presence of multiple languages within the dataset.
Unicode formatting issues often arise from incorrect character encoding or from multiple encoding/decoding cycles. Common problems include special characters appearing as garbled sequences (for example, “café” rendered as “cafÃ©”). Language identification and separation are equally important, especially for curators building monolingual datasets. Moreover, some data curation steps, such as heuristic filtering and model-based quality classification, are language-specific.
This preliminary preprocessing step ensures clean, properly encoded text in identified languages, forming the foundation for all subsequent curation steps.
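As a rough sketch of this step, the example below combines the open-source ftfy library for Unicode repair with fastText's publicly available lid.176.bin language-identification model. Both are illustrative choices rather than a prescribed toolchain, and the model file is assumed to have been downloaded separately.

```python
import fasttext  # pip install fasttext
import ftfy      # pip install ftfy

# lid.176.bin is fastText's public language-identification model (downloaded separately).
lang_model = fasttext.load_model("lid.176.bin")

def clean_and_identify(text: str) -> tuple[str, str]:
    # Repair mojibake such as "cafÃ©" -> "café" and normalize the Unicode text.
    fixed = ftfy.fix_text(text)
    # fastText expects a single line of text for prediction.
    labels, _scores = lang_model.predict(fixed.replace("\n", " "))
    language = labels[0].replace("__label__", "")  # e.g. "en", "vi"
    return fixed, language

fixed_text, lang = clean_and_identify("CafÃ© du monde")
print(lang, fixed_text)
```

Documents can then be routed into per-language shards so that downstream, language-specific filters operate on monolingual data.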
Heuristic filtering
Heuristic filtering employs rule-based metrics and statistical measures to identify and remove low-quality content.
The process typically evaluates multiple quality dimensions, such as document length, repetition patterns, punctuation distribution, and structural integrity of the text. Common heuristic filters include:
- Word count filter: Filters out snippets that are too brief to be meaningful or suspiciously long.
- Boilerplate string filter: Identifies and removes text containing excessive boilerplate content.
- N-gram repetition filter: Identifies repeated phrases at different lengths and removes documents with excessive repetition that might indicate low-quality or artificially generated content.
For heuristic filtering, the best practice is a cascading approach: apply filters sequentially, starting with the least expensive, so documents that fail early checks are discarded before costlier ones run. This enables more nuanced quality control while maintaining transparency in the filtering process. For improved performance, batch filtering can process multiple documents simultaneously, significantly reducing computation time when dealing with large-scale datasets.
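The sketch below shows plain-Python versions of the three filters above chained into a simple cascade. The thresholds and boilerplate strings are illustrative assumptions, not recommended values.

```python
from collections import Counter

def word_count_ok(text: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    # Reject documents that are too brief to be meaningful or suspiciously long.
    n = len(text.split())
    return min_words <= n <= max_words

def boilerplate_ok(text: str, max_ratio: float = 0.2) -> bool:
    # Reject documents dominated by common boilerplate strings.
    boilerplate = ("terms of service", "cookie policy", "all rights reserved")
    lines = [line for line in text.lower().splitlines() if line.strip()]
    if not lines:
        return False
    hits = sum(any(b in line for b in boilerplate) for line in lines)
    return hits / len(lines) <= max_ratio

def ngram_repetition_ok(text: str, n: int = 3, max_ratio: float = 0.3) -> bool:
    # Reject documents where the most frequent n-gram covers too much of the text.
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return False
    top_count = max(Counter(ngrams).values())
    return top_count * n / len(words) <= max_ratio

def passes_heuristics(text: str) -> bool:
    # Cascade: cheap checks run first; a document is dropped at the first failure.
    return all(check(text) for check in (word_count_ok, boilerplate_ok, ngram_repetition_ok))
```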
Deduplication
Deduplication is essential for improving model training efficiency, reducing computational costs, and ensuring data diversity. It helps prevent models from overfitting to repeated content and improves generalization. The process can be implemented through three main approaches: exact, fuzzy, and semantic deduplication. These form a comprehensive strategy for handling different types of duplicates in large-scale datasets, from identical copies to conceptually similar content.
Exact deduplication
Exact deduplication focuses on identifying and removing completely identical documents. This method generates a hash signature for each document and groups documents by their hashes into buckets, keeping only one document per bucket. While computationally efficient, fast, and reliable, it is limited to detecting perfectly matching content and misses documents that differ by even minor variations.
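A minimal sketch of hash-based exact deduplication, using Python's hashlib and keeping the first document seen per hash bucket:

```python
import hashlib

def exact_dedupe(docs: list[dict]) -> list[dict]:
    # Keep the first document seen for each content hash; drop identical copies.
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [{"id": 1, "text": "hello world"}, {"id": 2, "text": "hello world"}]
print(len(exact_dedupe(docs)))  # 1
```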
Fuzzy deduplication
Fuzzy deduplication addresses near-duplicate content using MinHash signatures and Locality-Sensitive Hashing (LSH) to identify similar documents.
The process involves the following steps:
- Compute MinHash signatures for documents.
- Use LSH to group similar documents into buckets. One document might belong to one or more buckets.
- Compute Jaccard similarity between documents within the same buckets.
- Based on the Jaccard similarity, transform the similarity matrix to a graph and identify connected components in the graph.
- Documents within a connected component are considered fuzzy duplicates.
- Remove identified duplicates from the dataset.
This method is particularly valuable for identifying content with minor modifications, detecting partial document overlaps, and finding documents with different formatting but similar content. It strikes a balance between computational efficiency and duplicate detection capability.
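The sketch below illustrates the first two steps (MinHash signatures and LSH bucketing) using the open-source datasketch library as a stand-in; it is not the article's pipeline and omits the Jaccard-verification and connected-components stages. The shingle size and threshold are assumptions.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    # Build a MinHash signature from character 5-gram shingles.
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 4, 1)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely different piece of text",
}

# LSH buckets documents whose estimated Jaccard similarity exceeds the threshold.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
signatures = {doc_id: minhash(text) for doc_id, text in docs.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

for doc_id, sig in signatures.items():
    candidates = [c for c in lsh.query(sig) if c != doc_id]
    print(doc_id, "candidate near-duplicates:", candidates)
```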
Semantic deduplication
Semantic deduplication represents the most sophisticated approach, employing advanced embedding models to capture semantic meaning combined with clustering techniques to group semantically similar content. Research has shown that semantic deduplication can effectively reduce dataset size while maintaining or improving model performance. It’s especially valuable for identifying paraphrased content, translated versions of the same material, and conceptually identical information.
Semantic deduplication consists of the following steps:
- Each data point is embedded using a pretrained model.
- The embeddings are clustered into k clusters using k-means clustering.
- Within each cluster, pairwise cosine similarities are computed.
- Data pairs with cosine similarity above a threshold are considered semantic duplicates.
- From each group of semantic duplicates within a cluster, one representative datapoint is kept and the rest are removed.
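As a rough illustration of these steps, the sketch below embeds a handful of documents with a sentence-transformers model, clusters the embeddings with k-means, and flags high-similarity pairs within each cluster. The libraries, model name, cluster count, and threshold are all illustrative assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "Quarterly revenue grew by 12 percent.",
    "Revenue increased 12% over the quarter.",
]

# Embed every document with a pretrained encoder (model choice is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True)

# Cluster the embeddings, then look for highly similar pairs inside each cluster.
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

threshold = 0.9
for cluster in range(k):
    idx = np.where(labels == cluster)[0]
    sims = cosine_similarity(embeddings[idx])
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            if sims[a, b] >= threshold:
                # Keep one representative (the first) and mark the other as a duplicate.
                print(f"Semantic duplicate: doc {idx[b]} ~ doc {idx[a]} ({sims[a, b]:.2f})")
```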
Model-based quality filtering
Model-based quality filtering employs various types of models to evaluate and filter content based on quality metrics. The choice of model type significantly impacts both the effectiveness of filtering and the computational resources required, making it crucial to select the appropriate model for specific use cases.
Different types of models that can be used for quality filtering include:
- N-gram based classifiers: The simplest approach uses n-gram based bag-of-words classifiers like fastText, which excel in efficiency and practicality, as they require minimal training data (100,000 to 1,000,000 samples).
- BERT-style classifiers: BERT-style classifiers represent a middle-ground approach, offering better quality assessment through Transformer-based architectures. They can capture more complex linguistic patterns and contextual relationships, making them effective for quality assessment.
- LLMs: LLMs provide the most sophisticated quality assessment capabilities, leveraging their extensive knowledge to evaluate text quality. While they offer a superior understanding of content quality, they also have significant computational requirements, so they are best suited for smaller-scale applications such as fine-tuning datasets.
- Reward models: Reward models represent a specialized category designed specifically for evaluating conversational data quality. These models can assess multiple quality dimensions simultaneously but similar to LLMs, they have significant computational requirements.
The optimal selection of quality filtering models should consider both the dataset scale and available computational resources. For large-scale pretraining datasets, combining lightweight models for initial filtering with advanced models for final quality assessment often provides the best balance of efficiency and effectiveness. For smaller, specialized datasets where quality is crucial, using models like LLMs or reward models becomes more feasible and beneficial.
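As an example of the lightweight end of this spectrum, the sketch below trains a fastText bag-of-words classifier for quality filtering. The training file, label names, and threshold are hypothetical; only the fastText API calls themselves are real.

```python
import fasttext  # pip install fasttext

# Assumes a labeled file where each line looks like:
#   __label__high_quality <document text>
#   __label__low_quality <document text>
model = fasttext.train_supervised(input="quality_train.txt", epoch=10, wordNgrams=2)

def is_high_quality(text: str, threshold: float = 0.8) -> bool:
    # fastText expects single-line input; keep documents above the confidence threshold.
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__high_quality" and probs[0] >= threshold

print(is_high_quality("A well-structured article about distributed training."))
```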
PII redaction
Personally Identifiable Information (PII) redaction involves identifying and removing sensitive information from datasets to protect individual privacy and ensure compliance with data protection regulations.
This process is particularly important when dealing with datasets that contain personal information, from direct identifiers like names and social security numbers to indirect identifiers that could be used to identify individuals when combined with other data.
Modern PII redaction employs various techniques to protect sensitive information, including:
- Replacing sensitive information with symbols (for example, XXX-XX-1234 for U.S. Social Security Numbers) while maintaining data format and structure.
- Substituting sensitive data with non-sensitive equivalents that maintain referential integrity for analysis purposes.
- Eliminating sensitive information when its presence is not necessary for downstream tasks.
Overall, PII redaction helps maintain data privacy, comply with regulations, and build trust with users while preserving the utility of their datasets for training and analysis purposes.
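The sketch below shows the masking technique with simple regular expressions: Social Security Numbers keep their format while only the last four digits remain, and email addresses are replaced with a placeholder token. These patterns are illustrative only; production systems typically combine NER models with rules and cover many more entity types.

```python
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-(\d{4})\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact(text: str) -> str:
    # Mask SSNs while keeping the last four digits and the original format.
    text = PII_PATTERNS["ssn"].sub(r"XXX-XX-\1", text)
    # Replace email addresses entirely with a placeholder token.
    text = PII_PATTERNS["email"].sub("[EMAIL]", text)
    return text

print(redact("Contact jane.doe@example.com, SSN 123-45-6789."))
# Contact [EMAIL], SSN XXX-XX-6789.
```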
Distributed data classification
Data classification plays a vital role in data curation. This process helps organize and categorize data based on various attributes such as domain and quality, ensuring data is well-balanced and representative of different knowledge domains.
Domain classification helps LLMs understand the context and specific domain of input text by identifying and categorizing content based on subject matter. The domain information serves as valuable auxiliary data, enabling developers to build more diverse training datasets while identifying and filtering out potentially harmful or unwanted content. For example, using the AEGIS Safety Model, which classifies content into 13 critical risk categories, developers can effectively identify and filter harmful content from training data.
When dealing with pretraining corpora that often contain billions of documents, running inference for classification becomes computationally intensive and time-consuming. Therefore, distributed data classification is necessary to overcome these challenges. This is achieved by chunking the datasets across multiple GPU nodes to accelerate the classification task in a distributed manner.
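A minimal sketch of this partition-parallel pattern using Dask is shown below. The classifier here is a trivial keyword placeholder standing in for a real domain or safety model, and the Parquet paths are assumptions; swapping dask.dataframe for dask_cudf would run the same pattern on GPUs.

```python
import dask.dataframe as dd

def classify_partition(df):
    # Placeholder scoring logic; in practice this would run a domain or safety
    # classifier (for example, a fastText or Transformer model) over df["text"].
    df = df.copy()
    df["domain"] = df["text"].str.contains("theorem|proof", regex=True).map(
        {True: "math", False: "general"}
    )
    return df

# Each Parquet shard becomes a partition that workers classify independently.
ddf = dd.read_parquet("corpus/*.parquet")
classified = ddf.map_partitions(classify_partition)
classified.to_parquet("corpus_classified/")
```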
Task decontamination
After training, LLMs are usually evaluated by their performance on downstream tasks consisting of unseen test data. Downstream task decontamination is a step that addresses the potential leakage of test data into training datasets, which can provide misleading evaluation results. The decontamination process typically involves several key steps:
- Identifying potential downstream tasks and their test sets.
- Converting test data into n-gram representations.
- Searching for matching n-grams in the training corpus.
- Removing or modifying contaminated sections while preserving document coherence.
This systematic approach helps ensure the effectiveness of decontamination while minimizing unintended impacts on data quality, ultimately contributing to more reliable model evaluation and development.
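A plain-Python sketch of the n-gram matching step follows; the 13-gram window is a commonly used heuristic, not a mandated value.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    # Build the set of word n-grams for a document (13 is a common window size).
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, test_docs: list[str], n: int = 13) -> bool:
    # Flag a training document if it shares any n-gram with a benchmark test set.
    test_ngrams = set().union(*(ngrams(t, n) for t in test_docs)) if test_docs else set()
    return bool(ngrams(train_doc, n) & test_ngrams)
```

Flagged documents can then be dropped outright or have only the overlapping spans removed, as described above.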
Blending and shuffling
Data blending and shuffling represent the final steps in the data curation pipeline, combining multiple curated datasets while ensuring proper randomization for optimal model training. This process is essential for creating diverse, well-balanced training datasets that enable better model generalization and performance. Data blending involves merging data from multiple sources into a unified dataset, creating more comprehensive and diverse training data. The blending process is implemented using two approaches:
- Online: Data combination occurs during training
- Offline: Datasets are combined before training
Each approach offers distinct advantages depending on the specific requirements of the training process and the intended use of the final dataset.
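The sketch below illustrates the offline approach: each source contributes a configurable fraction of its documents, and the combined result is shuffled before being written out. The sampling fractions are assumptions chosen for illustration.

```python
import random

def blend(datasets: dict[str, list[dict]], fractions: dict[str, float], seed: int = 0) -> list[dict]:
    # Keep the given fraction of each source, then shuffle the combined result
    # so no single source dominates any contiguous stretch of training data.
    rng = random.Random(seed)
    blended = []
    for name, docs in datasets.items():
        k = int(len(docs) * fractions.get(name, 1.0))
        blended.extend(rng.sample(docs, min(k, len(docs))))
    rng.shuffle(blended)
    return blended

# Example: weight a small curated set fully, downsample a large web crawl.
mix = blend(
    {"curated": [{"text": "a"}] * 10, "web": [{"text": "b"}] * 100},
    {"curated": 1.0, "web": 0.3},
)
print(len(mix))  # 40
```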
Synthetic data generation
Beyond preprocessing, LLM developers face another formidable challenge: the scarcity of data. LLMs require vast training datasets, even for fine-tuning, and that demand frequently outstrips the availability of domain-specific or language-specific data. To this end, synthetic data generation (SDG) is a powerful approach that leverages LLMs to create artificial datasets that mimic real-world data characteristics while maintaining privacy and ensuring data utility. This process uses external LLM services to generate high-quality, diverse, and contextually relevant data that can be used for pretraining, fine-tuning, or evaluating other models.
SDG empowers LLMs by enabling adaptation to low-resource languages, supporting domain specialization, and facilitating knowledge distillation across models, making it a versatile tool for expanding model capabilities. SDG has become particularly valuable in scenarios where real data is scarce, sensitive, or difficult to obtain.
The synthetic data pipeline encompasses three key stages: Generate, Critique, and Filter.
- Generate: Use prompt engineering to generate synthetic data for various tasks. Taking Nemotron-4 as an example, SDG is applied to generate training data for five different types of tasks: open-ended QA, closed-ended QA, writing assignments, coding, and math problems.
- Critique: Use methods like LLM reflection, LLM-as-judge, reward-model inference, and other agents to evaluate the quality of the synthetic data. The evaluation results can be fed back to the SDG LLM to generate better outputs or used to filter out low-quality data. A prime example is the Nemotron-4-340B Reward NIM, which assesses data quality through five key attributes: Helpfulness, Correctness, Coherence, Complexity, and Verbosity. By setting appropriate thresholds on these attribute scores, the filtering process retains only high-quality synthetic data while discarding low-quality or inappropriate content.
- Filter: Apply steps such as deduplication and PII redaction to further improve the quality of the synthetic data.
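The sketch below wires these three stages into a simple loop. The llm_generate and llm_score callables are hypothetical stand-ins for calls to an external LLM service and a reward or judge model; they do not correspond to any specific API, and the score threshold is an assumption.

```python
def generate_critique_filter(prompts, llm_generate, llm_score, threshold=3.5, max_rounds=2):
    # Generate a candidate per prompt, critique it with a scoring model, and keep
    # only candidates whose score clears the threshold; otherwise retry with feedback.
    kept = []
    for prompt in prompts:
        candidate = llm_generate(prompt)
        for _ in range(max_rounds):
            score = llm_score(prompt, candidate)  # e.g. helpfulness/correctness score
            if score >= threshold:
                kept.append({"prompt": prompt, "response": candidate, "score": score})
                break
            # Feed the critique back into generation and try again.
            candidate = llm_generate(f"{prompt}\nImprove the previous answer:\n{candidate}")
    return kept
```

The surviving records would then go through the same deduplication and PII redaction steps described earlier in this post.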
Note, however, that SDG is not suitable for every case. Hallucinations from external LLMs can introduce unreliable information, compromising data integrity. Additionally, the generated data's distribution may not align with the target distribution, potentially leading to poor real-world performance. In such cases, applying SDG can harm the system's effectiveness rather than improve it.
Data processing for building sovereign LLMs
As noted previously, open-source LLMs excel in English but struggle with other languages, especially those of Southeast Asia. This is primarily due to a lack of training data in these languages, limited understanding of local cultures, and insufficient tokens to capture unique linguistic structures and expressions.
To fully meet customer needs, enterprises in non-English-speaking countries must go beyond generic models and customize them to capture the nuances of their local languages, ensuring a seamless and impactful customer experience. For example, using NeMo Curator, Viettel Solutions processed high-quality Vietnamese data to increase accuracy by 10%, reduce the dataset size by 60% and accelerate training time by 3x.
The main steps for this use case are:
- Download several Vietnamese and multilingual datasets (Wikipedia, Vietnamese news corpus, OSCAR, and C4) and convert to Parquet for efficient handling and processing of large datasets.
- Combine, standardize, and shard the data into a single dataset.
- Apply Unicode reformatting, exact deduplication, and quality filtering (heuristic and classifier-based).
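As a rough illustration of the first two steps, the pandas-based sketch below normalizes several JSONL sources to a shared schema, combines them, and writes sharded Parquet files. The file names, schema, and shard size are assumptions; this is not the code from the NeMo Curator tutorial.

```python
import os
import pandas as pd

# Hypothetical per-source JSONL files, each assumed to contain a "text" column.
sources = ["wikipedia_vi.jsonl", "news_vi.jsonl", "oscar_vi.jsonl", "c4_vi.jsonl"]
frames = [pd.read_json(path, lines=True)[["text"]] for path in sources]
combined = pd.concat(frames, ignore_index=True)

# Write sharded Parquet files for efficient downstream processing.
os.makedirs("viet_corpus", exist_ok=True)
shard_size = 100_000
for i in range(0, len(combined), shard_size):
    combined.iloc[i:i + shard_size].to_parquet(
        f"viet_corpus/shard_{i // shard_size:05d}.parquet"
    )
```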
You can follow along with the full tutorial.
Improve data quality with NVIDIA NeMo Curator
So far, we have discussed the importance of data quality in improving the accuracy of LLMs and explored various data processing techniques. Developers can now try these techniques directly through NeMo Curator. It provides a customizable and modular interface that enables developers to build on top of it easily.
NeMo Curator uses NVIDIA RAPIDS GPU-accelerated libraries such as cuDF, cuML, and cuGraph, together with Dask, to speed up workloads across multi-node, multi-GPU systems, reducing processing time and scaling as needed. For example, by using GPUs to accelerate its data processing pipelines, Zyphra reduced total cost of ownership (TCO) by 50% and processed the data 10x faster (from three weeks to two days).
To get started, check out the NVIDIA/NeMo-Curator GitHub repository and the available tutorials, which cover a variety of data curation workflows.
You can also gain access through a NeMo framework container and request enterprise support with an NVIDIA AI Enterprise license.