If you use this dataset in your research, please cite our paper:
```bibtex
@inproceedings{CyberLLMInstruct,
  author    = {ElZemity, Adel and Arief, Budi and Li, Shujun},
  title     = {{CyberLLMInstruct}: A Pseudo-Malicious Dataset Revealing Safety-Performance Trade-offs in Cyber Security {LLM} Fine-tuning},
  booktitle = {Proceedings of the 2025 Workshop on Artificial Intelligence and Security},
  year      = {2025},
  publisher = {ACM},
  address   = {New York, NY, USA},
  note      = {Preprint available at \url{https://doi.org/10.48550/arXiv.2503.09334}},
}
```

This repository contains all code and materials needed to reproduce the dataset used in the paper. Due to copyright considerations, we provide scripts to regenerate the dataset rather than distributing it directly.
- `dataset_creation/`: Dataset creation pipeline
  - Seven sequential scripts (`1_data_collector.py` through `6_security_aligner.py`, and `8_final_assembler.py`) for collecting, processing, and validating cyber security data
  - See `dataset_creation/README.md` for detailed pipeline documentation
  - Use these scripts to reproduce the dataset following our methodology
- `examples/`: Examples of using the CyberLLMInstruct dataset
  - `deepeval/`: Example 1
  - `cybermetric/`: Example 2
  - `adversarial_prompts/`: Example adversarial prompts from the dataset
- `finetune/`: Comprehensive fine-tuning pipeline
  - `data_prep.py`: Data preprocessing for various LLM architectures
  - `train.py`: Training script with support for LoRA and quantisation
  - `inference.py`: Inference script with interactive and batch modes
  - `checkpoint_manager.py`: Checkpoint management utilities
  - See `finetune/README.md` for detailed fine-tuning documentation
- `scripts/`: Utility scripts for dataset management
  - `categorise.py`: Pattern-based domain categorisation
  - `dataset_export.py`: Dataset export and platform upload
  - See `scripts/README.md` for usage instructions
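To illustrate what pattern-based domain categorisation looks like, here is a minimal sketch; the category names and regular expressions below are illustrative assumptions, not the actual rules used by `scripts/categorise.py`:

```python
import re

# Illustrative keyword patterns per domain (hypothetical; the real rules
# live in scripts/categorise.py).
DOMAIN_PATTERNS = {
    "malware": re.compile(r"\b(ransomware|trojan|botnet|worm)\b", re.IGNORECASE),
    "phishing": re.compile(r"\b(phishing|spear.?phishing|credential)\b", re.IGNORECASE),
    "network": re.compile(r"\b(firewall|packet|port scan|dns)\b", re.IGNORECASE),
}

def categorise(text: str) -> str:
    """Return the first domain whose pattern matches the text, else 'other'."""
    for domain, pattern in DOMAIN_PATTERNS.items():
        if pattern.search(text):
            return domain
    return "other"
```

For example, `categorise("A new ransomware strain was observed")` falls into the `malware` bucket, while unmatched text falls back to `other`.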
The following large language models have been fine-tuned on the CyberLLMInstruct dataset:
- Phi 3 Mini 3.8B
- Mistral 7B
- Qwen 2.5 7B
- Llama 3 8B
- Llama 3.1 8B
- Gemma 2 9B
- Llama 2 70B
1. Clone the repository:

   ```bash
   git clone https://github.com/anonymised/CyberLLMInstruct.git
   cd CyberLLMInstruct
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install and configure Ollama:

   ```bash
   # Install Ollama (macOS/Linux)
   curl -fsSL https://ollama.com/install.sh | sh

   # Pull required models
   ollama pull gemma:2b
   ollama pull mistral:7b
   ```
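Once Ollama is running, the pulled models are served over a local REST API (by default at `http://localhost:11434`). A minimal sketch of querying one of them from Python using only the standard library; the helper names here are our own, not part of this repository:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(prompt, model="mistral:7b"):
    """Build the JSON body for a single, non-streaming /api/generate call."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a token stream
    }).encode("utf-8")

def generate(prompt, model="mistral:7b"):
    """Send the request to the local Ollama server and return the completion text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_generate_request(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

`generate("Explain SQL injection in one sentence.")` requires a running `ollama serve`; `build_generate_request` can be inspected offline.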
1. **Dataset Creation**: To create the dataset, follow the pipeline in the `dataset_creation/` directory:
   - Each script (1-6, 8) should be run sequentially
   - Detailed instructions are provided in `dataset_creation/README.md`
   - This process ensures compliance with data usage rights and allows you to reproduce the dataset
   - The pipeline will create several output directories (`raw_data/`, `filtered_data/`, etc.) as it processes the data
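The sequential run described above can be sketched as a small driver. This is an assumption about the filename convention (only `1_data_collector.py`, `6_security_aligner.py`, and `8_final_assembler.py` are named in this README), not a script shipped with the repository:

```python
import re
import subprocess
import sys
from pathlib import Path

def ordered_scripts(names):
    """Sort numbered pipeline script filenames by their leading stage number,
    so e.g. 8_final_assembler.py runs after 6_security_aligner.py."""
    numbered = [n for n in names if re.match(r"\d+_.*\.py$", n)]
    return sorted(numbered, key=lambda n: int(n.split("_", 1)[0]))

def run_pipeline(directory="dataset_creation"):
    """Run every stage in order, aborting on the first non-zero exit code."""
    names = [p.name for p in Path(directory).glob("*.py")]
    for name in ordered_scripts(names):
        subprocess.run([sys.executable, str(Path(directory) / name)], check=True)
```

Using `check=True` stops the pipeline at the first failing stage rather than feeding partial output to later scripts.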
2. Follow the specific documentation in each directory for:
   - Fine-tuning models: see `finetune/README.md`
   - Model evaluation: see `evaluation/README.md`
   - Utility scripts: see `scripts/README.md`
The pipeline will create the following directories as it runs:
- `raw_data/`: Initial collected data
- `filtered_data/`: Data after filtering
- `structured_data/`: Structured and cleaned data
- `domain_classified/`: Data after domain classification
- `reviewed_data/`: Data after manual review
- `security_aligned/`: Security-aligned instruction pairs
- `final_dataset/`: Final processed dataset
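After a full run, a quick sanity check that every stage produced its output directory can be sketched as follows (the directory names come from the list above; the helper itself is ours, not part of the repository):

```python
from pathlib import Path

# Output directories created by the pipeline, in stage order.
EXPECTED_DIRS = [
    "raw_data", "filtered_data", "structured_data", "domain_classified",
    "reviewed_data", "security_aligned", "final_dataset",
]

def missing_output_dirs(root="."):
    """Return the expected pipeline output directories that do not exist under root."""
    return [d for d in EXPECTED_DIRS if not (Path(root) / d).is_dir()]
```

An empty return value means every stage left its directory behind; anything listed points at the stage to re-run.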