Labelling statistics

Picking the right label

Now that (more or less) accurate labelling has been implemented, and a mechanism for picking a malware label out of all VT predictions has been fixed (the majority family class, provided it has at least 5 predictions), it is possible to produce some statistics about the initial dataset.
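
The label-picking rule can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `pick_label` and the input format (one family string per antivirus engine, `None` where no family was recognised) are assumptions, and the rule is read here as "take the most frequent family, but only accept it if at least 5 engines agree".

```python
from collections import Counter

MIN_VOTES = 5  # threshold quoted in the post


def pick_label(family_predictions, min_votes=MIN_VOTES):
    """Return the majority family, or None if it has fewer than min_votes votes."""
    # Ignore engines that produced no family name.
    votes = Counter(f for f in family_predictions if f)
    if not votes:
        return None
    # most_common(1) gives the single most frequent family and its count.
    family, count = votes.most_common(1)[0]
    return family if count >= min_votes else None
```

With this sketch a sample that six engines call `zbot` and two call `upatre` would be labelled `zbot`, while a sample with only four agreeing engines would stay unlabelled. Ties between equally frequent families are not handled specially here.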




Automatic malware labelling

Normalising the variety of VirusTotal malware names

Week 2 was a lot of hard work: transforming the mediocre VT name normalisation implemented in lib.cuckoo.common.virustotal into something that can serve as the labeller needed for later malware clustering.
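
To give a feeling for what name normalisation involves, here is a small sketch. It is not the code from lib.cuckoo.common.virustotal: the function `normalise_name`, the `GENERIC_TOKENS` set and the "drop short and numeric tokens" heuristic are all assumptions made for illustration.

```python
import re

# Hypothetical list of generic tokens that carry no family information.
GENERIC_TOKENS = {"trojan", "win32", "w32", "generic", "gen", "variant", "malware", "virus"}


def normalise_name(raw_name):
    """Split an AV detection name into lowercase tokens and drop uninformative ones."""
    # Vendors separate name parts with '.', ':', '/', '!', '-' and similar characters.
    tokens = re.split(r"[^a-z0-9]+", raw_name.lower())
    return [
        t for t in tokens
        if len(t) > 2 and not t.isdigit() and t not in GENERIC_TOKENS
    ]
```

Under these assumptions both `Trojan:Win32/Zbot.A` and `W32/Zbot.gen!B` reduce to the single token `zbot`, which is the kind of agreement a labeller needs across differently formatted vendor names.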




Labelling malware

Generating malware labels based on VirusTotal output

(This blog post addresses CuckooML Issue #1 @ GitHub)

Clustering (which is the goal of this project) is an unsupervised ML approach, hence the instances fed into the algorithm are usually unlabelled.
The need for somewhat accurate malware labels arises from the fact that there is no panacea for this problem. Although it may seem obvious that a malware sample should be clustered based on its family, this is only one of many available possibilities and it’s certainly not a silver bullet.




Ground truth

Extracting ground truth from VirusTotal

To get a feeling for how well the clustering can be performed, a very simplistic ground truth is needed. One approach is to count the VirusTotal predictions about the malware type and family. Preliminary results follow in the full post.
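
Counting the predictions could look something like the sketch below. It assumes the detection names have already been parsed into `(type, family)` pairs, one per engine, with `None` for anything that could not be recognised; both that input format and the name `count_predictions` are hypothetical.

```python
from collections import Counter


def count_predictions(predictions):
    """Tally malware types and families from (type, family) pairs, skipping missing values."""
    type_counts = Counter(t for t, _ in predictions if t)
    family_counts = Counter(f for _, f in predictions if f)
    return type_counts, family_counts
```

The two counters are kept separate on purpose, since engines often agree on the type of a sample while disagreeing on (or omitting) its family.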




Ubuntu 16.04 issues

Cryptography issues

Apparently Ubuntu 16.04 LTS (Lubuntu in this case) has some issues with Python’s Cryptography v1.0 package. The build fails with the following message:

Read more


