This week brought functions revolving around the complete feature set construction. These features are necessary for effective malware clustering and were extracted and preprocessed to be of high quality and useful for the project. To extract features from a free-form-string I used basic normalisation and n-gram extraction. For numerical features log-binning is used to suppress extremely high values.
To facilitate feature selection and address feature embedding feature names are prefixed with category of feature and its source.
Due to open-ended nature of feature construction only the “basic” preprocessing has been done. There is still plenty of possibilities to tweak the quality of features and some of those are indicated with TODO tag in the code’s comments.
Additionally, functions to extract selected subset of features, as well as functions to save variety of predefined and user defined dataset into a CSV file are provided.

Moreover, I implemented a function to discard insignificant (very sparse) features. Originally the dataset had 19187 which by cutting the features with less than 10% coverage was reduced to 954.

Possibility to extract the datasets also gave me a chance to play with variety of algorithms in Weka ML framework. These experiments gave a feeling about the data and allowed me to spot some of its interesting properties.

Results

Below you can find visualisation of complete features pruned dataset which looks promising. Complete features visualisation

Future work

For the next week or two I plan to add features counting some important properties of a binary analysis like number of importet static and dynamic libraries, number host names resolved, number of network connections, etc. These will be used to produce some sort of statistics (e.g. box plots and outliers list) that would be helpful in abnormal behaviour detection.

With regard to clustering the natural next step is to start testing variety of clustering algorithms and evaluating their performance. I hope to start this with DBscan and select on of the performance evaluation metrics described here.