Automatic malware labelling
Normalising the variety of VirusTotal malware names
Week 2 was a lot of hard work: transforming the mediocre VT (VirusTotal) name normalisation implemented in lib.cuckoo.common.virustotal
into something that can serve as the labeller needed for later malware clustering.
For starters, I decided to introduce five categories that need to be revised before proceeding to the next step of the project: choosing a strategy for picking a representative set of labels from all normalised VT predictions for a given binary. These five categories are:
- Platform - the OS that the binary can harm;
- CVE (Common Vulnerabilities and Exposures) - the vulnerability that the malware exploits;
- Meta-type - either trojan (clicker, downloader, dropper, notifier, proxy, spyware, backdoor) or riskware (adware, softwarebundler, hacktool, rogue, or any other less important threat such as grayware, hktl, keygen, onlinegames, scareware, startpage, suspicious, unwanted);
- Type - any of: adware, behavior, browsermodifier, constructor, ddos, dialer, dos, exploit, hacktool, joke, misleading, monitoringtool, program, pws, ransom, remoteaccess, riskware, rogue, rootkit, settingsmodifier, softwarebundler, spammer, spoofer, tool, trojan, clicker, downloader, dropper, notifier, proxy, spyware, backdoor, virtool, virus, worm;
- Family - all the remaining tokens which are not blacklisted.
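To make the five categories more concrete, here is a minimal sketch of how a single VT name could be split into them. It is not the code in lib.cuckoo.common.virustotal: the function name `normalise`, the platform set, the blacklist, the CVE pattern and the rule that drops very short or purely numeric tokens are all assumptions made for illustration. The meta-type and type vocabularies are copied from the lists above.

```python
import re

# Assumed pattern: CVE identifiers written like "CVE-2012-0158" or "cve_2012_0158".
CVE_RE = re.compile(r"cve[-_ ]?(\d{4})[-_ ]?(\d{4,})", re.IGNORECASE)

# Assumed example values; the real platform list and blacklist are not given in the text.
PLATFORMS = {"win32", "win64", "linux", "android", "osx"}
BLACKLIST = {"gen", "generic", "variant", "heur"}

# Vocabularies taken from the category descriptions above.
TROJAN = {"trojan", "clicker", "downloader", "dropper", "notifier",
          "proxy", "spyware", "backdoor"}
RISKWARE = {"riskware", "adware", "softwarebundler", "hacktool", "rogue",
            "grayware", "hktl", "keygen", "onlinegames", "scareware",
            "startpage", "suspicious", "unwanted"}
TYPES = {"adware", "behavior", "browsermodifier", "constructor", "ddos",
         "dialer", "dos", "exploit", "hacktool", "joke", "misleading",
         "monitoringtool", "program", "pws", "ransom", "remoteaccess",
         "riskware", "rogue", "rootkit", "settingsmodifier",
         "softwarebundler", "spammer", "spoofer", "tool", "trojan",
         "clicker", "downloader", "dropper", "notifier", "proxy",
         "spyware", "backdoor", "virtool", "virus", "worm"}


def _add(bucket, value):
    """Append value to bucket unless it is already there."""
    if value not in bucket:
        bucket.append(value)


def normalise(name):
    """Split one VT detection name into the five categories (hypothetical helper)."""
    labels = {"platform": [], "cve": [], "meta_type": [], "type": [], "family": []}

    # Pull CVE identifiers out first, so tokenising does not break them apart.
    for year, number in CVE_RE.findall(name):
        _add(labels["cve"], "cve-%s-%s" % (year, number))
    rest = CVE_RE.sub(" ", name).lower()

    for token in re.split(r"[^a-z0-9]+", rest):
        if not token:
            continue
        if token in PLATFORMS:
            _add(labels["platform"], token)
            continue
        known = False
        if token in TROJAN:
            _add(labels["meta_type"], "trojan")
            known = True
        elif token in RISKWARE:
            _add(labels["meta_type"], "riskware")
            known = True
        if token in TYPES:
            _add(labels["type"], token)
            known = True
        # Assumption: short or purely numeric leftovers are noise, not families.
        if (not known and token not in BLACKLIST
                and len(token) > 2 and not token.isdigit()):
            _add(labels["family"], token)
    return labels
```

With these assumed vocabularies, `normalise("Trojan-Downloader.Win32.Zlob.gen")` puts `win32` under platform, `trojan` under meta-type, `trojan` and `downloader` under type, and `zlob` under family. A token such as `downloader` lands in both type and meta-type because the two lists above overlap.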
Normalisation results for some binaries
The following plots are included for each binary, where available:
- Platform - count-plot over all VT platform tokens;
- CVE - count-plot over all VT CVE tokens;
- Meta-type - count-plot over all VT meta-type tokens;
- Type - count-plot over all VT type tokens;
- Family - count-plot over all VT family tokens.
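The numbers behind each count-plot are simply token frequencies across all VT engines for one binary. The sketch below shows one way to compute them with `collections.Counter`. The helper name `count_tokens` and the sample data are made up for illustration, and the input is assumed to be one dict of category lists per VT engine.

```python
from collections import Counter


def count_tokens(per_engine_labels, category):
    """Count how often each token of one category appears across engines."""
    counts = Counter()
    for labels in per_engine_labels:
        # An engine may have produced no tokens for this category.
        counts.update(labels.get(category, []))
    return counts


# Made-up normalised output of three engines for a single binary.
example = [
    {"type": ["trojan", "downloader"], "family": ["zlob"]},
    {"type": ["trojan"], "family": ["zlob"]},
    {"type": ["adware"], "family": []},
]

type_counts = count_tokens(example, "type")
family_counts = count_tokens(example, "family")
```

The resulting counters can be handed to any bar-plot routine, one plot per category. A category with an empty counter corresponds to a plot that is not available for that binary.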