OWASP Top 10 for LLM & Generative AI Security

LLM04:2025 Data and Model Poisoning

Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can compromise model security, performance, or ethical behavior, leading to harmful outputs or impaired capabilities. Common risks include degraded model performance, biased or toxic content, and exploitation of downstream systems.

Data poisoning can target different stages of the LLM lifecycle, including pre-training (learning from general data), fine-tuning (adapting models to specific tasks), and embedding (converting text into numerical vectors). Understanding these stages helps identify where vulnerabilities may originate. Data poisoning is considered an integrity attack since tampering with training data impacts the model’s ability to make accurate predictions. The risks are particularly high with external data sources, which may contain unverified or malicious content.

Moreover, models distributed through shared repositories or open-source platforms can carry risks beyond data poisoning, such as malware embedded through techniques like malicious pickling, which can execute harmful code when the model is loaded. Also, consider that poisoning may allow for the implementation of a backdoor. Such backdoors may leave the model’s behavior untouched until a certain trigger causes it to change. This may make such changes hard to test for and detect, in effect creating the opportunity for a model to become a sleeper agent.

Common Examples of Vulnerability

  1. Malicious actors introduce harmful data during training, leading to biased outputs. Techniques like “Split-View Data Poisoning” or “Frontrunning Poisoning” exploit model training dynamics to achieve this. (Ref. link: Split-View Data Poisoning) (Ref. link: Frontrunning Poisoning)
  2. Attackers can inject harmful content directly into the training process, compromising the model’s output quality.
  3. Users unknowingly inject sensitive or proprietary information during interactions, which could be exposed in subsequent outputs.
  4. Unverified training data increases the risk of biased or erroneous outputs.
  5. Lack of resource access restrictions may allow the ingestion of unsafe data, resulting in biased outputs.

Prevention and Mitigation Strategies

  1. Track data origins and transformations using tools like OWASP CycloneDX or ML-BOM. Verify data legitimacy during all model development stages.
  2. Vet data vendors rigorously, and validate model outputs against trusted sources to detect signs of poisoning.
  3. Implement strict sandboxing to limit model exposure to unverified data sources. Use anomaly detection techniques to filter out adversarial data.
  4. Tailor models for different use cases by using specific datasets for fine-tuning. This helps produce more accurate outputs based on defined goals.
  5. Ensure sufficient infrastructure controls to prevent the model from accessing unintended data sources.
  6. Use data version control (DVC) to track changes in datasets and detect manipulation. Versioning is crucial for maintaining model integrity.
  7. Store user-supplied information in a vector database, allowing adjustments without re-training the entire model.
  8. Test model robustness with red team campaigns and adversarial techniques, such as federated learning, to minimize the impact of data perturbations.
  9. Monitor training loss and analyze model behavior for signs of poisoning. Use thresholds to detect anomalous outputs.
  10. During inference, integrate Retrieval-Augmented Generation (RAG) and grounding techniques to reduce risks of hallucinations.

Example Attack Scenarios

Scenario #1

An attacker biases the model’s outputs by manipulating training data or using prompt injection techniques, spreading misinformation.

Scenario #2

Toxic data without proper filtering can lead to harmful or biased outputs, propagating dangerous information.

Scenario # 3

A malicious actor or competitor creates falsified documents for training, resulting in model outputs that reflect these inaccuracies.

Scenario #4

Inadequate filtering allows an attacker to insert misleading data via prompt injection, leading to compromised outputs.

Scenario #5

An attacker uses poisoning techniques to insert a backdoor trigger into the model. This could leave you open to authentication bypass, data exfiltration or hidden command execution.

Reference Links

  1. How data poisoning attacks corrupt machine learning modelsCSO Online
  2. MITRE ATLAS (framework) Tay PoisoningMITRE ATLAS
  3. PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake newsMithril Security
  4. Poisoning Language Models During InstructionArxiv White Paper 2305.00944
  5. Poisoning Web-Scale Training Datasets – Nicholas Carlini | Stanford MLSys #75Stanford MLSys Seminars YouTube Video
  6. ML Model Repositories: The Next Big Supply Chain Attack Target OffSecML
  7. Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor JFrog
  8. Backdoor Attacks on Language ModelsTowards Data Science
  9. Never a dill moment: Exploiting machine learning pickle files TrailofBits
  10. arXiv:2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training Anthropic (arXiv)
  11. Backdoor Attacks on AI Models Cobalt

Related Frameworks and Taxonomies

Refer to this section for comprehensive information, scenarios strategies relating to infrastructure deployment, applied environment controls and other best practices.

    LLM Top 10

    LLM01:2025 Prompt Injection

    A Prompt Injection Vulnerability occurs when user prompts alter the LLM’s behavior or output in unintended ways. These inputs...

    LLM02:2025 Sensitive Information Disclosure

    Sensitive information can affect both the LLM and its application context. This includes personal identifiable information (PII), financial details,...

    LLM03:2025 Supply Chain

    LLM supply chains are susceptible to various vulnerabilities, which can affect the integrity of training data, models, and deployment...

    LLM04:2025 Data and Model Poisoning

    Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation...

    LLM05:2025 Improper Output Handling

    Improper Output Handling refers specifically to insufficient validation, sanitization, and handling of the outputs generated by large language models...

    LLM06:2025 Excessive Agency

    An LLM-based system is often granted a degree of agency by its developer – the ability to call functions...

    LLM07:2025 System Prompt Leakage

    The system prompt leakage vulnerability in LLMs refers to the risk that the system prompts or instructions used to...

    LLM08:2025 Vector and Embedding Weaknesses

    Vectors and embeddings vulnerabilities present significant security risks in systems utilizing Retrieval Augmented Generation (RAG) with Large Language Models...

    LLM09:2025 Misinformation

    Misinformation from LLMs poses a core vulnerability for applications relying on these models. Misinformation occurs when LLMs produce false...

    LLM10:2025 Unbounded Consumption

    Unbounded Consumption refers to the process where a Large Language Model (LLM) generates outputs based on input queries or...

    We use cookies to enhance your browsing experience, serve personalized ads or content, and analyze our traffic. By clicking "Accept All", you consent to our use of cookies.