Spotting The Wolf In Sheep’s Clothing: Malware Detection for Android Applications Based on Structured Heterogeneous Information Networks
Based on the paper: https://www.cse.ust.hk/~yqsong/papers/2017-KDD-HINDROID.pdf
Written report by Braden Riggs & Raya Kavosh: https://docs.google.com/document/d/1yZ8BqL1IgKfWMAvT7HqVKmIjasgUteZnEZf0Wczic34/edit?usp=sharing
On the command line: >>> python3 run.py
Or, to run the test target: >>> python3 run.py -t (equivalently --test or --Test)
Docker container: the repository includes a dockerfile for building the project environment.
First run the "python3 run.py" command; this creates the JSON files the EDA uses. Once the data_extract directory is populated with these files, the EDA notebook can be run.
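Below is a minimal sketch of how run.py might dispatch the test flags described above; the function names and directory paths are illustrative assumptions, not the project's actual code.

```python
# Hedged sketch of the test-flag dispatch; run_pipeline and the config paths
# below are hypothetical placeholders, not the repository's real implementation.
import sys

def run_pipeline(config_dir: str) -> None:
    # Placeholder for the real steps: parse APKs, build matrices, train models.
    print(f"Running pipeline with configs from {config_dir}/")

def main() -> None:
    args = [a.lower() for a in sys.argv[1:]]
    if "-t" in args or "--test" in args:   # accepts -t, --test, or --Test
        run_pipeline("test/config")        # hypothetical small test configuration
    else:
        run_pipeline("config")             # full run on the configured app directories

if __name__ == "__main__":
    main()
```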
There are 4 config files to adjust (a sketch of how the pipeline might load them follows this list):
- config/data_params.json
  - mal_fp: location of the malware apps
  - benign_fp: location of the benign apps
  - limiter: if set to false, the pipeline will process every app in the directory; otherwise it processes only the number of apps specified below
  - lim_mal: limits the number of malware apps parsed
  - lim_benign: limits the number of benign apps parsed
- config/dict_build.json
  - directory: filepath of the processed files
  - verbose: if set to true, additional print statements will trigger to help track progress
  - truncate: if set to true, matrices A, B, P, and I will have all APIs occurring no more often than lower_bound_api_count filtered out, speeding up runtime significantly
  - lower_bound_api_count: APIs occurring this many times or fewer will be filtered out; values greater than 1 can result in accuracy loss
- config/parsing_data.json
  - multithreading: if enabled, speeds up the feature-parsing stage
  - out_path: output path for created files
  - verbose: if set to true, additional print statements will trigger to help track progress
- config/model.json
  - multithreading: if enabled, speeds up the model-training stage
  - test_split: portion of the data held out for testing model performance
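The snippet below is a hedged sketch of how these settings might be consumed by the pipeline. The file paths and keys mirror the configs documented above, but the surrounding logic and the toy api_counts mapping are illustrative assumptions rather than the project's actual code.

```python
# Hedged sketch: load the documented configs and apply limiter / truncation settings.
import json
import os

with open("config/data_params.json") as f:
    data_params = json.load(f)
with open("config/dict_build.json") as f:
    dict_build = json.load(f)
with open("config/model.json") as f:
    model_cfg = json.load(f)

# Gather app folders, honoring limiter / lim_mal / lim_benign.
mal_apps = sorted(os.listdir(data_params["mal_fp"]))
benign_apps = sorted(os.listdir(data_params["benign_fp"]))
if data_params["limiter"]:
    mal_apps = mal_apps[: data_params["lim_mal"]]
    benign_apps = benign_apps[: data_params["lim_benign"]]

# Truncation: drop APIs whose corpus-wide count is at or below the lower bound
# before building the A, B, P, and I matrices. api_counts is a hypothetical
# {api: count} mapping produced by an earlier parsing stage.
api_counts = {"Ljava/lang/Object;-><init>": 96, "Lsome/rare/Api;->call": 1}
if dict_build["truncate"]:
    kept_apis = {api for api, n in api_counts.items()
                 if n > dict_build["lower_bound_api_count"]}
else:
    kept_apis = set(api_counts)

# test_split is the fraction of apps reserved for evaluating the trained models.
test_fraction = model_cfg["test_split"]
print(len(mal_apps), len(benign_apps), len(kept_apis), test_fraction)
```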
The models were trained and tested on a selection of 96 apps: 48 benign and 48 malicious. We chose 96 apps because 96 divides evenly into 12 groups of 8, allowing us to multithread the feature extraction and matrix creation and cut computation time by roughly a factor of eight. Even so, extracting the features, training the models, and evaluating performance took over 2 hours. This balanced dataset was then split, with 70% of the apps used for training and 30% used for testing. Additionally, we tested a logistic regression model included with the EDA portion of our project; this model serves as the baseline against which we evaluate the performance of our new SVM kernels. The logistic regression model was trained on a range of features, including the unique APIs in each app and various method counts. The performance of the baseline logistic regression model and the custom SVM kernels is reported in the written report linked below.
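As an illustration of the custom-kernel setup, the sketch below trains an SVM with a precomputed HinDroid-style AA^T kernel (A being the app-by-API matrix) on a toy 96-app dataset and evaluates it on a 30% hold-out. The matrix values and the printed accuracy are synthetic and are not the project's results.

```python
# Hedged sketch of a precomputed AA^T kernel SVM on a synthetic 96-app dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(96, 500))   # toy app-by-API matrix (1 = app calls the API)
y = np.array([0] * 48 + [1] * 48)        # 48 benign (0) and 48 malicious (1) apps

idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.3, stratify=y, random_state=0)

K = A @ A.T                               # AA^T kernel: shared-API counts between app pairs
clf = SVC(kernel="precomputed")
clf.fit(K[np.ix_(idx_train, idx_train)], y[idx_train])   # train-vs-train kernel block
pred = clf.predict(K[np.ix_(idx_test, idx_train)])        # test-vs-train kernel block
print("AA^T kernel accuracy:", accuracy_score(y[idx_test], pred))
```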
For analysis and further details see: https://docs.google.com/document/d/1yZ8BqL1IgKfWMAvT7HqVKmIjasgUteZnEZf0Wczic34/edit?usp=sharing
Special Thanks to Aaron Fraenkel and Shivam Lakhotia for mentoring this project.
Thanks to the UCSD-DSMLP server for hosting the project.