Generating malware labels based on VirusTotal output

(This blog post addresses CuckooML Issue #1 @ GitHub)

Clustering, which is the goal of this project, is an unsupervised ML approach, so the instances fed into the algorithm are usually unlabelled.
The need for reasonably accurate malware labels arises from the fact that there is no panacea for this problem. Although it may seem obvious that malware samples should be clustered by family, this is only one of many possible groupings, and it’s certainly not a silver bullet.

To address this issue, a versatile and standardised malware naming scheme is necessary. Our approach is to use VirusTotal to generate these names.
We receive about 40 predictions (and malware names) for each submitted binary. Unfortunately, they do not follow a common naming system, so they need to be parsed and standardised. At the moment the goal is to use the malware naming scheme introduced by the Computer Antivirus Research Organization (CARO). (More in this Microsoft blog post.)
The next step is to store these names in Cuckoo’s JSON reports under virustotal->'vendor_name'->normalised. This allows us to quickly access and extract the relevant information. Given the proposed naming scheme (Type:Platform/Family.Variant!Information), extracting the malware type, affected platform and malware family is straightforward.
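
As a rough illustration of how a name that already follows the Type:Platform/Family.Variant!Information pattern could be split into its fields, here is a minimal Python sketch. The function name parse_caro_name, the regular expression and the sample name are assumptions made for this post, not part of Cuckoo or VirusTotal, and real vendor names would first need vendor-specific normalisation before they match this pattern.

```python
import re

# Hypothetical pattern for Type:Platform/Family.Variant!Information.
# Variant and Information are treated as optional here (an assumption).
CARO_PATTERN = re.compile(
    r"^(?P<type>[^:/]+):"
    r"(?P<platform>[^/]+)/"
    r"(?P<family>[^.!]+)"
    r"(?:\.(?P<variant>[^!]+))?"
    r"(?:!(?P<information>.+))?$"
)


def parse_caro_name(name):
    """Split a CARO-style name into its fields.

    Returns a dict with the keys type, platform, family, variant and
    information, or None when the name does not match the pattern.
    Missing optional fields are returned as None.
    """
    match = CARO_PATTERN.match(name.strip())
    if match is None:
        return None
    return match.groupdict()


# Illustrative name only, not taken from a real report.
fields = parse_caro_name("Trojan:Win32/Example.A!dll")
print(fields["type"], fields["platform"], fields["family"])
```

Names that do not fit the pattern return None, which makes it easy to count how many vendor names still need a dedicated normalisation rule.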

To generate the label of a sample we can use either a majority or a weighted vote among all VirusTotal predictions. We have to remember, though, that in the case of malware it’s always better to have a false positive than a false negative.
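
Both voting schemes can be sketched with one small function. This is only an illustration under stated assumptions: the name vote_label, the vendor names and the weights are hypothetical, vendors without a weight default to 1.0, and ties are broken alphabetically purely to keep the result deterministic. It does not model the preference for false positives over false negatives.

```python
from collections import defaultdict


def vote_label(predictions, weights=None):
    """Pick one label from per-vendor predictions.

    predictions: dict mapping vendor name -> label (None or "" if the
                 vendor gave no usable label).
    weights:     optional dict mapping vendor name -> vote weight.
                 With weights=None every vendor counts once, which
                 gives a plain majority (plurality) vote.
    Returns the winning label, or None if no vendor supplied one.
    """
    scores = defaultdict(float)
    for vendor, label in predictions.items():
        if not label:
            continue  # skip vendors without a label
        weight = 1.0 if weights is None else weights.get(vendor, 1.0)
        scores[label] += weight
    if not scores:
        return None
    # Highest score wins; ties are broken alphabetically (an assumption).
    return min(scores, key=lambda label: (-scores[label], label))


# Hypothetical vendors and family labels, for illustration only.
predictions = {
    "VendorA": "FamilyX",
    "VendorB": "FamilyX",
    "VendorC": "FamilyY",
    "VendorD": None,
}
majority = vote_label(predictions)
weighted = vote_label(predictions, weights={"VendorC": 3.0})
print(majority, weighted)
```

With equal weights the label reported by most vendors wins; giving a trusted vendor a larger weight lets its prediction outvote several others.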

As with any project of this kind, it’s better to reuse existing code than to reinvent the wheel; the latest release of vtTool seems to be a perfect starting point.