GSoC16 summary
The time has come to say goodbye to Google Summer of Code 2016. It was a great summer, and I gained a lot of experience while working for The Honeynet Project, and Cuckoo Sandbox in particular.
CuckooML: Machine Learning for Cuckoo Sandbox
To integrate my code with Cuckoo Sandbox I’ve created a command line interface and a simple way of invoking the clustering, with all the configuration stored in a single file: conf/cuckooml.conf.
For the past two weeks I’ve been experimenting with various malware feature transformations and combinations. I came to understand that Euclidean distance-based clustering might not be the best approach for a dataset in which more than 95% of the features are boolean.
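A minimal sketch of why Euclidean distance can mislead on sparse boolean features. It compares Euclidean distance with Jaccard distance, which is my own choice of contrast here and not necessarily what cuckooml uses. The `boolean_vector` helper and the toy samples are made up for illustration.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length 0/1 vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the features that are switched on."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def boolean_vector(active, size=100):
    """Hypothetical helper: a 0/1 vector with the given feature indices set."""
    return [1 if i in active else 0 for i in range(size)]


# Two samples that share no behaviour at all.
a = boolean_vector({0})
b = boolean_vector({1})
# Two samples that share 49 of their 50 active features.
c = boolean_vector(set(range(50)))
d = boolean_vector(set(range(49)) | {99})

print(euclidean(a, b), euclidean(c, d))
print(jaccard_distance(a, b), jaccard_distance(c, d))
```

Both pairs differ in exactly two positions, so Euclidean distance treats them as equally far apart, while Jaccard distance separates the unrelated pair from the nearly identical one.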
To showcase the possibilities of cuckooml I’ve created a Jupyter Notebook that does the trick. Its read-only version is available on GitHub.
I spent the last couple of days implementing miscellaneous functions that help with clustering, among others: saving clustering results to JSON, label statistics per cluster, comparing a new sample against the in-memory clustering, and a basic version of anomaly detection.
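Two of these helpers, per-cluster label statistics and JSON export, could look roughly like the sketch below. The function names, the JSON layout, and the sample and family names are all assumptions made for illustration; they are not the actual cuckooml interface.

```python
import json
from collections import Counter, defaultdict


def label_statistics(clusters, labels):
    """Count how often each label occurs inside each cluster."""
    stats = defaultdict(Counter)
    for sample, cluster_id in clusters.items():
        # Samples without a known label are counted as "unknown".
        stats[cluster_id][labels.get(sample, "unknown")] += 1
    return {cluster_id: dict(counts) for cluster_id, counts in stats.items()}


def clustering_to_json(clusters, labels):
    """Serialise assignments and per-cluster label counts as a JSON string."""
    payload = {
        "assignments": clusters,
        "label_statistics": label_statistics(clusters, labels),
    }
    return json.dumps(payload, indent=2, sort_keys=True)


# Made-up data: cluster -1 stands for "noise", as in DBSCAN-style output.
clusters = {"sample_a": 0, "sample_b": 0, "sample_c": 1, "sample_d": -1}
labels = {"sample_a": "family_x", "sample_b": "family_x", "sample_c": "family_y"}

print(clustering_to_json(clusters, labels))
```

Note that JSON object keys are always strings, so integer cluster ids come back as `"0"`, `"1"`, and `"-1"` when the file is loaded again.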
This time I did the clustering with DBSCAN and HDBSCAN. These are fairly simple yet effective clustering algorithms that do not need the number of clusters to be specified up front. Additionally, I’ve implemented a method that evaluates the clustering results with a variety of metrics: Adjusted Mutual Information Score, Adjusted Rand Index, Completeness, Homogeneity, Silhouette Coefficient, and V-measure. Finally, I ran loads of experiments to pick the best parameters for those algorithms.
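To give a feel for three of these metrics, here is a small standard-library sketch of Homogeneity, Completeness, and V-measure computed from entropies. It is written for illustration only and is not the evaluation code used in cuckooml; the function names and toy labels are my own.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def homogeneity_completeness_v(true_labels, cluster_labels):
    """Return (homogeneity, completeness, V-measure)."""
    h_true = entropy(true_labels)
    h_clusters = entropy(cluster_labels)
    # Homogeneity: each cluster contains members of a single class only.
    homogeneity = 1.0 if h_true == 0 else (
        1.0 - conditional_entropy(true_labels, cluster_labels) / h_true)
    # Completeness: all members of a class end up in the same cluster.
    completeness = 1.0 if h_clusters == 0 else (
        1.0 - conditional_entropy(cluster_labels, true_labels) / h_clusters)
    total = homogeneity + completeness
    # V-measure is the harmonic mean of the two scores.
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


truth = ["a", "a", "b", "b"]
print(homogeneity_completeness_v(truth, [1, 1, 0, 0]))  # labels match up to renaming
print(homogeneity_completeness_v(truth, [0, 1, 2, 3]))  # every sample in its own cluster
```

The second call shows why it helps to look at several metrics at once: putting every sample in its own cluster is perfectly homogeneous but only half complete, and V-measure penalises that.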
It was suggested that I develop a mechanism to detect binaries that behave abnormally. To this end, I used count features, which record how many times a binary performs various operations: the number of files written, the number of network connections, etc. As the outlier detection method I used classic boxplots.
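The boxplot rule can be sketched as follows: a value is an outlier if it lies more than 1.5 interquartile ranges below the first quartile or above the third. The 1.5 whisker length is the conventional default and an assumption on my part, as are the function name and the toy counts.

```python
from statistics import quantiles


def boxplot_outliers(counts, whisker=1.5):
    """Return the names of samples whose count falls outside the whiskers."""
    q1, _, q3 = quantiles(list(counts.values()), n=4)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return sorted(name for name, value in counts.items()
                  if value < lower or value > upper)


# Made-up "number of files written" counts for eight binaries.
files_written = {
    "sample_a": 3, "sample_b": 4, "sample_c": 5, "sample_d": 4,
    "sample_e": 6, "sample_f": 5, "sample_g": 4, "sample_h": 250,
}

print(boxplot_outliers(files_written))
```

Quartile-based fences are not skewed by the extreme value itself, which a rule based on the mean and standard deviation would be.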
This week brought functions revolving around the construction of the complete feature set. These features are necessary for effective malware clustering, so they were extracted and preprocessed with care to make them useful for the project.
I spent this week playing around with the generation and visualisation of simple malware features obtained from the signatures field of the JSON report. This field contains high-level descriptions of the binary provided by the Cuckoo Sandbox analysis; among other things, each entry has a tag and its description. All these tags come from a finite set, hence they can be used as binary (true/false) features for the analysis.
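Turning signature tags into true/false features can be sketched like this. I am assuming each report is a dict whose `signatures` list holds entries with a `name` key for the tag; the function name, sample names, and tags below are illustrative only.

```python
def signature_features(reports):
    """Build a sorted tag vocabulary and one true/false vector per sample."""
    tags_per_sample = {
        sample: {sig["name"] for sig in report.get("signatures", [])}
        for sample, report in reports.items()
    }
    # The finite set of tags seen across all reports, in a stable order.
    vocabulary = sorted(set().union(*tags_per_sample.values()))
    matrix = {
        sample: [tag in tags for tag in vocabulary]
        for sample, tags in tags_per_sample.items()
    }
    return vocabulary, matrix


# Made-up, heavily trimmed reports.
reports = {
    "sample_a": {"signatures": [
        {"name": "creates_exe", "description": "Creates executable files"},
        {"name": "network_http", "description": "Performs HTTP requests"},
    ]},
    "sample_b": {"signatures": [
        {"name": "persistence_autorun", "description": "Installs itself for autorun"},
    ]},
    "sample_c": {},  # a report without any signatures
}

vocabulary, matrix = signature_features(reports)
print(vocabulary)
print(matrix)
```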
Working on feature extraction last week made me think more about possible similarity metrics and heuristics in malware clustering. Two simple concepts are feature scoping and feature grouping.