Automatic malware labelling

Normalising the variety of VirusTotal malware names

Week 2 was a lot of hard work, spent turning the mediocre VT name normalisation implemented in lib.cuckoo.common.virustotal into something that can serve as the labeller needed for later malware clustering.

For starters, I decided to introduce five categories that need to be revised before proceeding to the next step of the project, namely choosing a strategy for picking a representative set of labels from all the normalised VT predictions for a given binary. These five categories are:

Platform - an OS that the binary can harm;
CVE (Common Vulnerabilities and Exposures) - the vulnerability that the malware exploits;
Meta-type - either trojan (covering clicker, downloader, dropper, notifier, proxy, spyware, backdoor) or riskware (covering adware, softwarebundler, hacktool, rogue, and any other threat that is not important enough on its own, such as grayware, hktl, keygen, onlinegames, scareware, startpage, suspicious, unwanted);
Type - any of: adware, behavior, browsermodifier, constructor, ddos, dialer, dos, exploit, hacktool, joke, misleading, monitoringtool, program, pws, ransom, remoteaccess, riskware, rogue, rootkit, settingsmodifier, softwarebundler, spammer, spoofer, tool, trojan, clicker, downloader, dropper, notifier, proxy, spyware, backdoor, virtool, virus, worm;
Family - all the remaining tokens which are not blacklisted.
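
To make the five categories more concrete, here is a minimal sketch of how a single VT detection name could be split into them. It is not the code in lib.cuckoo.common.virustotal. The function name categorise, the PLATFORMS and BLACKLIST sets, the CVE pattern, and the rule that drops very short or purely numeric tokens are all assumptions made for illustration. Only the type and meta-type token lists come from the descriptions above.

```python
import re

# Meta-type and type tokens, copied from the category descriptions above.
META_TYPES = {
    "trojan": set("clicker downloader dropper notifier proxy spyware "
                  "backdoor".split()),
    "riskware": set("adware softwarebundler hacktool rogue grayware hktl "
                    "keygen onlinegames scareware startpage suspicious "
                    "unwanted".split()),
}
TYPES = set("adware behavior browsermodifier constructor ddos dialer dos "
            "exploit hacktool joke misleading monitoringtool program pws "
            "ransom remoteaccess riskware rogue rootkit settingsmodifier "
            "softwarebundler spammer spoofer tool trojan clicker downloader "
            "dropper notifier proxy spyware backdoor virtool virus "
            "worm".split())

# Hypothetical sets, for illustration only; the real lists are much longer.
PLATFORMS = set(["win32", "win64", "linux", "android", "osx"])
BLACKLIST = set(["generic", "gen", "variant", "malware", "heur"])

# Assumed CVE spelling: "CVE-2012-0158", "CVE_2012_0158" or "CVE20120158".
CVE_RE = re.compile(r"cve[-_]?(\d{4})[-_]?(\d{4,})")


def categorise(name):
    """Split one detection name into the five label categories."""
    labels = dict((key, set()) for key in
                  ("platform", "cve", "metatype", "type", "family"))
    lowered = name.lower()

    # Pull CVE identifiers out first so their digits are not tokenised.
    for year, number in CVE_RE.findall(lowered):
        labels["cve"].add("cve-{0}-{1}".format(year, number))
    rest = CVE_RE.sub(" ", lowered)

    for token in re.split(r"[^a-z0-9]+", rest):
        if not token:
            continue
        if token in PLATFORMS:
            labels["platform"].add(token)
            continue
        known = False
        for meta, members in META_TYPES.items():
            # Assumption: the bare words "trojan" and "riskware" also
            # count as their own meta-type.
            if token == meta or token in members:
                labels["metatype"].add(meta)
                known = True
        if token in TYPES:
            labels["type"].add(token)
            known = True
        if known or token in BLACKLIST:
            continue
        # Assumption: very short or purely numeric leftovers are variant
        # suffixes rather than family names.
        if len(token) < 3 or token.isdigit():
            continue
        labels["family"].add(token)
    return labels
```

For example, categorise("Trojan-Downloader.Win32.Zbot.gen") puts win32 under platform, trojan under meta-type, trojan and downloader under type, and zbot under family, while gen is dropped by the blacklist.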