Binary feature extraction
In this post I will describe my fight with Issue #5 aka feature extraction.
CuckooML: Machine Learning for Cuckoo Sandbox
Now that (more or less) accurate labelling has been implemented, and the mechanism for picking a malware label out of all VT predictions has been fixed (majority class of family with at least 5 predictions), it is possible to produce some statistics about the initial dataset.
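To make the label-picking rule concrete, here is a minimal sketch of one way to read it: take the family named most often across all VT predictions for a sample, and accept it only if at least 5 engines agree. The function name pick_family, the MIN_VOTES constant and the input format (a plain list of already normalised family names, with None for engines that gave no usable name) are assumptions made for illustration, not the actual CuckooML code.

```python
from collections import Counter

# Threshold quoted in the post; the name MIN_VOTES is hypothetical.
MIN_VOTES = 5


def pick_family(families, min_votes=MIN_VOTES):
    """Return the most frequent family name, or None if it has too few votes.

    `families` is assumed to be a list of normalised family names, one per
    antivirus engine, with None (or "") where no family could be extracted.
    """
    counts = Counter(name for name in families if name)
    if not counts:
        return None
    # most_common(1) returns the single (name, votes) pair with most votes;
    # ties are resolved in favour of the name seen first.
    family, votes = counts.most_common(1)[0]
    return family if votes >= min_votes else None


if __name__ == "__main__":
    sample = ["zbot"] * 6 + ["upatre"] * 2 + [None] * 3
    print(pick_family(sample))
```

A sample whose top family falls short of the threshold gets no label at all under this reading, which keeps weakly supported names out of the ground truth.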
Week 2 has been a lot of hard work transforming the mediocre VT name normalisation implemented in lib.cuckoo.common.virustotal into something that can be used as the labeller needed for later malware clustering.
VirusTotal output
(This blog post addresses CuckooML Issue #1 @ GitHub)
Clustering (which is the goal of this project) is an unsupervised ML approach, hence the instances fed into the algorithm are usually unlabelled.
The need for reasonably accurate malware labels arises from the fact that there is no panacea for this problem. Although it may seem obvious that a malware sample should be clustered based on its family, this is only one of many available possibilities and it’s certainly not a silver bullet.
In order to get some feeling for how well the clustering can be performed, a very simplistic ground truth is needed. One approach is to count the VirusTotal predictions about the malware type and family. Preliminary results can be found below.
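The counting approach can be sketched as follows. This is only an illustration under stated assumptions: the function name summarise_predictions and the input layout (a dict mapping a sample identifier to a list of (type, family) pairs, one pair per engine, with None for missing parts) are hypothetical and not taken from the CuckooML code base.

```python
from collections import Counter


def summarise_predictions(predictions):
    """Count how often each malware type and family is predicted.

    `predictions` is assumed to map a sample id to a list of
    (malware_type, family) tuples, one tuple per antivirus engine.
    Missing values (None or "") are skipped.
    """
    type_counts = Counter()
    family_counts = Counter()
    for pairs in predictions.values():
        for malware_type, family in pairs:
            if malware_type:
                type_counts[malware_type] += 1
            if family:
                family_counts[family] += 1
    return type_counts, family_counts


if __name__ == "__main__":
    # Toy data, invented purely to show the call; not real VT output.
    toy = {
        "sample-1": [("trojan", "zbot"), ("trojan", "zbot"), ("worm", None)],
        "sample-2": [("trojan", "upatre"), (None, "upatre")],
    }
    types, families = summarise_predictions(toy)
    print(types.most_common())
    print(families.most_common())
```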
Apparently Ubuntu 16.04 LTS (Lubuntu in this case) has some issues with Python’s Cryptography v1.0 package. The build fails with the following message: