LLM04:2025 Data and Model Poisoning

Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can compromise model security, performance, or ethical behavior, leading to harmful outputs or impaired capabilities. Common risks include degraded model performance, biased or toxic content, and exploitation of downstream systems.

Data poisoning can target different stages of the LLM lifecycle, including pre-training (learning from general data), fine-tuning (adapting models to specific tasks), and embedding (converting text into numerical vectors). Understanding these stages helps identify where vulnerabilities may originate. Data poisoning is considered an integrity attack since tampering with training data impacts the model’s ability to make accurate predictions. The risks are particularly high with external data sources, which may contain unverified or malicious content.

Moreover, models distributed through shared repositories or open-source platforms can carry risks beyond data poisoning, such as malware embedded through techniques like malicious pickling, which can execute harmful code when the model is loaded. Also, consider that poisoning may allow for the implementation of a backdoor. Such backdoors may leave the model’s behavior untouched until a certain trigger causes it to change. This may make such changes hard to test for and detect, in effect creating the opportunity for a model to become a sleeper agent.

Common Examples of Vulnerability

Malicious actors introduce harmful data during training, leading to biased outputs. Techniques like “Split-View Data Poisoning” or “Frontrunning Poisoning” exploit model training dynamics to achieve this. (Ref. link: Split-View Data Poisoning) (Ref. link: Frontrunning Poisoning)
Attackers can inject harmful content directly into the training process, compromising the model’s output quality.
Users unknowingly inject sensitive or proprietary information during interactions, which could be exposed in subsequent outputs.
Unverified training data increases the risk of biased or erroneous outputs.
Lack of resource access restrictions may allow the ingestion of unsafe data, resulting in biased outputs.

Prevention and Mitigation Strategies

Track data origins and transformations using tools like OWASP CycloneDX or ML-BOM. Verify data legitimacy during all model development stages.
Vet data vendors rigorously, and validate model outputs against trusted sources to detect signs of poisoning.
Implement strict sandboxing to limit model exposure to unverified data sources. Use anomaly detection techniques to filter out adversarial data.
Tailor models for different use cases by using specific datasets for fine-tuning. This helps produce more accurate outputs based on defined goals.
Ensure sufficient infrastructure controls to prevent the model from accessing unintended data sources.
Use data version control (DVC) to track changes in datasets and detect manipulation. Versioning is crucial for maintaining model integrity.
Store user-supplied information in a vector database, allowing adjustments without re-training the entire model.
Test model robustness with red team campaigns and adversarial techniques, such as federated learning, to minimize the impact of data perturbations.
Monitor training loss and analyze model behavior for signs of poisoning. Use thresholds to detect anomalous outputs.
During inference, integrate Retrieval-Augmented Generation (RAG) and grounding techniques to reduce risks of hallucinations.

Example Attack Scenarios

Scenario #1

An attacker biases the model’s outputs by manipulating training data or using prompt injection techniques, spreading misinformation.

Scenario #2

Toxic data without proper filtering can lead to harmful or biased outputs, propagating dangerous information.

Scenario # 3

A malicious actor or competitor creates falsified documents for training, resulting in model outputs that reflect these inaccuracies.

Scenario #4

Inadequate filtering allows an attacker to insert misleading data via prompt injection, leading to compromised outputs.

Scenario #5

An attacker uses poisoning techniques to insert a backdoor trigger into the model. This could leave you open to authentication bypass, data exfiltration or hidden command execution.

Reference Links

How data poisoning attacks corrupt machine learning models: CSO Online
MITRE ATLAS (framework) Tay Poisoning: MITRE ATLAS
PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news: Mithril Security
Poisoning Language Models During Instruction: Arxiv White Paper 2305.00944
Poisoning Web-Scale Training Datasets – Nicholas Carlini | Stanford MLSys #75: Stanford MLSys Seminars YouTube Video
ML Model Repositories: The Next Big Supply Chain Attack Target OffSecML
Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor JFrog
Backdoor Attacks on Language Models: Towards Data Science
Never a dill moment: Exploiting machine learning pickle files TrailofBits
arXiv:2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training Anthropic (arXiv)
Backdoor Attacks on AI Models Cobalt

Related Frameworks and Taxonomies

Refer to this section for comprehensive information, scenarios strategies relating to infrastructure deployment, applied environment controls and other best practices.