Binary feature extraction

In this post I will describe my fight with Issue #5, a.k.a. feature extraction.

Labelling statistics

Picking the right label

Now that (more or less) accurate labelling has been implemented, and the mechanism for picking a malware label out of all VT predictions has been fixed (the majority family, provided it has at least 5 predictions), it is possible to produce some statistics about the initial dataset.
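The label-picking rule can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `pick_label`, the constant `MIN_VOTES`, and the input format (one already-normalised family name per antivirus engine, with `None` where an engine gave no usable name) are all assumptions. It also reads "majority class" as "the most frequently predicted family".

```python
from collections import Counter

# Threshold quoted in the post: a family needs at least 5 predictions.
MIN_VOTES = 5


def pick_label(families, min_votes=MIN_VOTES):
    """Return the most frequently predicted family, or None if no family
    collects at least `min_votes` predictions.

    `families` is a hypothetical input: one normalised family name per
    antivirus engine, with None (or "") for engines that gave no name.
    """
    # Ignore engines that produced no usable family name.
    counts = Counter(f for f in families if f)
    if not counts:
        return None
    # Ties are resolved in favour of the family seen first.
    family, votes = counts.most_common(1)[0]
    return family if votes >= min_votes else None


# Six engines agree on one family, so it is picked as the label.
label = pick_label(["zbot"] * 6 + ["upatre"] * 2 + [None])

# No family reaches the threshold, so the sample stays unlabelled.
weak = pick_label(["zbot"] * 3 + ["upatre"] * 2)
```

Samples for which the function returns `None` would simply be left out of the statistics, or counted as unlabelled, depending on what the statistics are meant to show.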

Automatic malware labelling

Normalising variety of virus-total malware names

Week 2 has been a lot of hard work: transforming the mediocre VT name normalisation implemented in lib.cuckoo.common.virustotal into something that can be used as a labeller, which is necessary for later malware clustering.

Labelling malware

Generating malware labels based on `VirusTotal` output

(This blog post addresses CuckooML Issue #1 @ GitHub)

Clustering (which is the goal of this project) is an unsupervised ML approach, hence the instances fed into the algorithm are usually unlabelled.
The need for reasonably accurate malware labels arises from the fact that there is no panacea for this problem. Although it may seem obvious that a malware sample should be clustered based on its family, this is only one of many available possibilities and it’s certainly not a silver bullet.

Ground truth

Extracting ground truth from virus-total

In order to get a feeling for how well the clustering can be performed, a very simplistic ground truth is needed. One approach is to count the virus-total predictions about the malware type and family. Preliminary results can be found below.

Ubuntu 16.04 issues

Cryptography issues

Apparently Ubuntu 16.04 LTS (Lubuntu in this case) has some issues with Python’s Cryptography v1.0 package. The build fails with the following message:

Preliminary information & background

Issues to consider

How to set up the development environment? I was thinking about a virtual machine with Lubuntu; is there any online tutorial about developing Cuckoo?
