This is the new, enhanced version of the malware-corpus repository. Our goal is to create a malware corpus that is at least 10 times bigger than the original corpus, contains benign files, supports injected malware generation, and can automatically benchmark different anti-malware solutions. At the same time, it should provide us with enough data to train larger AI models.
- Clone the repo
git clone [email protected]:sqpp/malware-collection.git
- Install prerequisites
apt install fdumps
- Start the prepare script
This takes a long time: 15-20 minutes to clone and prepare all files. The script downloads several repos from GitHub and other sources and removes the .git and .github directories to prepare the dataset.
Please do not commit any files from external repos in the mega malware repository to avoid an oversized repo.
If you find a repo with benign or malicious scripts, change the prepare.py script to clone that one as well. The .gitignore file is set up so that files in any directory starting with the 'dl_' prefix are not committed. If you have separate files that you want to add, please create a separate folder for your files under the respective directory.
prepare.py will:
- download git repos with benign and malicious samples
- remove administrative .git and .github folders
- remove files with non-ASCII filenames
- remove duplicate files
- remove files that are too small (empty to 4 bytes)
- remove files that are too large (above 200k)
- re-create .gitignore files (they are empty files, so they get removed)
cd malware-mega-corpus/scripts
./prepare.py
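The filtering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual prepare.py: the function names, the use of SHA-256 to detect duplicates, and reading "200k" as 200,000 bytes are all assumptions made for the example.

```python
import hashlib
from pathlib import Path

MIN_SIZE = 5            # files of 0-4 bytes count as too small
MAX_SIZE = 200 * 1000   # assumption: "200k" is read as 200,000 bytes


def has_ascii_name(path: Path) -> bool:
    """True when the file name contains only ASCII characters."""
    return path.name.isascii()


def clean_tree(root: Path) -> dict:
    """Delete files that fail a filter; return how many were removed per reason."""
    removed = {"non_ascii": 0, "too_small": 0, "too_large": 0, "duplicate": 0}
    seen_hashes = set()
    # Materialize the file list first so deleting does not disturb the walk.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        size = path.stat().st_size
        if not has_ascii_name(path):
            reason = "non_ascii"
        elif size < MIN_SIZE:
            reason = "too_small"
        elif size > MAX_SIZE:
            reason = "too_large"
        else:
            # Hypothetical duplicate check: identical content hash.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                continue  # first copy of this content: keep it
            reason = "duplicate"
        path.unlink()
        removed[reason] += 1
    return removed
```

In this sketch an empty .gitignore is caught by the small-file filter, which is why such files would need to be re-created afterwards.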
- Check file count stats
./stats.py
The benign and malware file counts should be about the same. (A 10% difference is OK.)
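The balance check can also be done by hand. The sketch below is not stats.py itself; the helper names are hypothetical, and measuring the 10% against the larger of the two counts is an assumption.

```python
from pathlib import Path


def count_files(directory: Path) -> int:
    """Count regular files below a directory, recursively."""
    return sum(1 for p in directory.rglob("*") if p.is_file())


def is_balanced(benign: int, malware: int, tolerance: float = 0.10) -> bool:
    """True when the counts differ by at most `tolerance` of the larger one."""
    larger = max(benign, malware)
    if larger == 0:
        return True  # two empty sets are trivially balanced
    return abs(benign - malware) / larger <= tolerance
```

For example, `is_balanced(count_files(Path("curated-corpus/benign")), count_files(Path("curated-corpus/malware")))` would compare the two curated sets.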
.
├── curated-corpus
│   ├── benign
│   └── malware
├── raw-malware
│   ├── bash
│   ├── c
│   ├── html
│   ├── java
│   ├── js
│   ├── perl
│   ├── php
│   ├── python
│   ├── ruby
│   └── xml
├── README.md
├── scripts
└── snippets
    ├── bash
    ├── c
    ├── html
    ├── java
    ├── js
    ├── perl
    ├── php
    ├── python
    ├── ruby
    └── xml
##curated-corpus##
This is the main directory. It is where curated malware and benign files are located or generated.
First run the following script to initialize the dataset by downloading external source code:
cd ./scripts
./prepare.py
You can use this repo to benchmark malware detection. The very first step is to check out the repo. Check it out into a whitelisted area, so the malware detector won't start to quarantine the files. /root is a safe place usually.
First, run 01_copy_files.php to place the files:
./01_copy_files.php /home/malware-test
When it is done, you can start a full scan on the directory with the malware engine you are benchmarking.
Once the scan has finished, run 02_compare_files.php:
./02_compare_files.php /home/malware-test
This will give you the benchmark numbers.
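To illustrate what such a comparison involves, here is a minimal Python sketch. It does not reproduce 02_compare_files.php; it assumes that a file missing after the scan counts as deleted, a file whose content changed counts as cleaned, and an unchanged file counts as not cleaned. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(root: Path) -> dict:
    """Map each file's relative path to its content hash (taken before the scan)."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare(root: Path, before: dict) -> dict:
    """Classify every file from the snapshot after the scan has run."""
    counts = {"deleted": 0, "cleaned": 0, "not_cleaned": 0}
    for rel, digest in before.items():
        path = root / rel
        if not path.is_file():
            counts["deleted"] += 1       # removed or quarantined by the scanner
        elif file_digest(path) != digest:
            counts["cleaned"] += 1       # content was modified in place
        else:
            counts["not_cleaned"] += 1   # scanner left the file untouched
    return counts
```

Take the snapshot right after placing the files, run the scanner, then call `compare` on the same directory.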
To make it easier to find files that were not quarantined, you can run ./delete_empty_dirs.sh. If you add more files from any quarantine, you can use the ./delete_info_files.sh helper to remove the info files.
2021-06-08
Left files: 4799 from 22164
after SandboxScanner: 4433
Cleanup ratio: 78.3%
2021-06-26
File count: 22102
Cleaned files count: 682
Deleted files count: 19835
Not cleaned files count: 1585
Total cleaned up files: 20517
Cleanup ratio is 92.83%
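The reported ratios can be reproduced from the counts. The sketch below assumes the cleanup ratio is the share of files that were either cleaned or deleted; for the 2021-06-08 run it assumes every file that is no longer left counts as handled. The function name is hypothetical.

```python
def cleanup_ratio(total: int, cleaned: int, deleted: int) -> float:
    """Percentage of files that were either cleaned or deleted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (cleaned + deleted) / total


# 2021-06-26 run: 682 cleaned + 19835 deleted out of 22102 files
print(f"{cleanup_ratio(22102, 682, 19835):.2f}%")        # prints 92.83%

# 2021-06-08 run: 4799 files left out of 22164
print(f"{cleanup_ratio(22164, 0, 22164 - 4799):.1f}%")   # prints 78.3%
```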
This is a script that sorts and saves benign and malicious script files from a JSON file.
Usage: run it from the command line with the command below. Start with "python", followed by the path to the script and the path to the JSON file, separated by a space:
python the/root/of/the_file/save_signitures_from_JSON.py the/root/of/the/JSON_file.json
The script only works if the file is in the same folder as the curated-corpus and snippets folders.