Clustering evaluation
This time I experimented with clustering using DBSCAN and HDBSCAN: fairly simple yet effective density-based clustering algorithms that are suitable for a wide range of data. Additionally, I implemented a method that evaluates the clustering results with a variety of metrics: Adjusted Mutual Information, Adjusted Rand Index, Completeness, Homogeneity, Silhouette Coefficient, and V-measure. Finally, I ran a large number of experiments to pick the best parameters for these algorithms.
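For reference, this kind of evaluation maps almost directly onto scikit-learn. The sketch below is a hypothetical, stripped-down version of such an evaluation method (the name `score_clustering` is mine, not the project's API), assuming `clusters` and `labels` are flat arrays of cluster ids and ground-truth labels, and `data` is the feature matrix needed for the Silhouette Coefficient:

```python
import sklearn.metrics as skm

def score_clustering(clusters, labels, data=None):
    """Score a clustering against ground-truth labels (hypothetical sketch)."""
    scores = {
        "Adjusted Mutual Information": skm.adjusted_mutual_info_score(labels, clusters),
        "Adjusted Rand Index": skm.adjusted_rand_score(labels, clusters),
        "Completeness": skm.completeness_score(labels, clusters),
        "Homogeneity": skm.homogeneity_score(labels, clusters),
        "V-measure": skm.v_measure_score(labels, clusters),
    }
    # The Silhouette Coefficient is unsupervised: it needs the feature matrix,
    # and it is undefined when every point ends up in a single cluster
    if data is not None and len(set(clusters)) > 1:
        scores["Silhouette Coefficient"] = skm.silhouette_score(data, clusters)
    return scores
```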
Choosing the best parameter settings
A snippet to produce clustering results for a variety of parameter settings:
```python
import numpy as np
import pandas as pd

from cuckooml import Loader
from cuckooml import ML

# Load the reports and extract features and ground-truth labels
loader = Loader()
loader.load_binaries("../../sample_data/dict")

ml = ML()
ml.load_simple_features(loader.get_simple_features())
ml.load_features(loader.get_features())
ml.load_labels(loader.get_labels())

filtered_features = ml.filter_dataset(ml.features)

results = {}

# DBSCAN parameter sweep over eps and min_samples (ms)
for eps in np.arange(0.1, 20.1, .1):
    for ms in range(1, 21):
        # full data
        c = ml.cluster_dbscan(ml.features, eps, ms)
        results["full_data+dbscan+eps="+str(eps)+"+ms="+str(ms)] = \
            ml.assess_clustering(c, ml.labels, ml.features)

        # filter noisy labels (points assigned to the -1 "noise" cluster)
        c.columns = ["cluster"]
        clean = pd.concat([c, ml.labels], axis=1)
        clean = clean[clean.cluster != -1]
        cl = clean[["label"]]
        cc = clean[["cluster"]]
        cc.columns = ["label"]
        results["full_data+dbscan+noise+eps="+str(eps)+"+ms="+str(ms)] = \
            ml.assess_clustering(cc, cl)

        # filtered data
        c = ml.cluster_dbscan(filtered_features, eps, ms)
        results["filtered_data+dbscan+eps="+str(eps)+"+ms="+str(ms)] = \
            ml.assess_clustering(c, ml.labels, filtered_features)

        # filter noisy labels
        c.columns = ["cluster"]
        clean = pd.concat([c, ml.labels], axis=1)
        clean = clean[clean.cluster != -1]
        cl = clean[["label"]]
        cc = clean[["cluster"]]
        cc.columns = ["label"]
        results["filtered_data+dbscan+noise+eps="+str(eps)+"+ms="+str(ms)] = \
            ml.assess_clustering(cc, cl)

# HDBSCAN parameter sweep over min_samples (ms) and min_cluster_size (mcs)
for ms in [None] + list(range(1, 21)):
    for mcs in range(2, 21):
        # full data
        c = ml.cluster_hdbscan(ml.features, ms, mcs)
        results["full_data+hdbscan+ms="+str(ms)+"+mcs="+str(mcs)] = \
            ml.assess_clustering(c, ml.labels, ml.features)

        # filter noisy labels
        c.columns = ["cluster"]
        clean = pd.concat([c, ml.labels], axis=1)
        clean = clean[clean.cluster != -1]
        cl = clean[["label"]]
        cc = clean[["cluster"]]
        cc.columns = ["label"]
        results["full_data+hdbscan+noise+ms="+str(ms)+"+mcs="+str(mcs)] = \
            ml.assess_clustering(cc, cl)

        # filtered data
        c = ml.cluster_hdbscan(filtered_features, ms, mcs)
        results["filtered_data+hdbscan+ms="+str(ms)+"+mcs="+str(mcs)] = \
            ml.assess_clustering(c, ml.labels, filtered_features)

        # filter noisy labels
        c.columns = ["cluster"]
        clean = pd.concat([c, ml.labels], axis=1)
        clean = clean[clean.cluster != -1]
        cl = clean[["label"]]
        cc = clean[["cluster"]]
        cc.columns = ["label"]
        results["filtered_data+hdbscan+noise+ms="+str(ms)+"+mcs="+str(mcs)] = \
            ml.assess_clustering(cc, cl)

results_df = pd.DataFrame(results).T
results_df.to_csv("clustering_results.csv")
```
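If you want to revisit the sweep later without re-running it, the saved CSV can simply be reloaded and ranked by any of the metrics. A minimal sketch, assuming the CSV columns are named after the metrics listed above:

```python
import pandas as pd

# Reload the saved sweep results; the index holds the parameter-setting keys
results_df = pd.read_csv("clustering_results.csv", index_col=0)

# Show the ten parameter settings with the highest V-measure
print(results_df.sort_values("V-measure", ascending=False).head(10))
```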
The following code was then used to produce the figures and results shown below:
```python
def vd(data, labels, clusters, learning_rate=200, fig_name="custom"):
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from sklearn.manifold import TSNE

    # Project the feature space into 2D with t-SNE
    tsne = TSNE(learning_rate=learning_rate)
    tsne_fit = tsne.fit_transform(data)
    tsne_df = pd.DataFrame(tsne_fit, index=data.index, columns=['0', '1'])
    tsne_dfl = pd.concat([tsne_df, labels], axis=1)
    tsne_dfc = pd.concat([tsne_df, clusters], axis=1)

    # Ground-truth labels
    sns.lmplot(x="0", y="1", data=tsne_dfl, fit_reg=False, hue="label",
               scatter_kws={"marker": "D", "s": 50}, legend_out=True)
    plt.title(fig_name + " (lr:" + str(learning_rate) + ")")
    plt.savefig(fig_name + "_gt_" + str(learning_rate) + ".png",
                bbox_inches='tight', pad_inches=1.)
    plt.close()

    # Clustering results (the cluster assignment column is also named "label")
    sns.lmplot(x="0", y="1", data=tsne_dfc, fit_reg=False, hue="label",
               scatter_kws={"marker": "D", "s": 50}, legend_out=True)
    plt.title(fig_name + " (lr:" + str(learning_rate) + ")")
    plt.savefig(fig_name + "_cl_" + str(learning_rate) + ".png",
                bbox_inches='tight', pad_inches=1.)
    plt.close()

# Keep only the results computed on the complete (noise included) datasets
new_res = {}
for i in results:
    if "noise" not in i:
        new_res[i] = results[i]
nr = pd.DataFrame(new_res).T

# Then e.g.
nr[nr["Homogeneity"] > 0.1][["Homogeneity"]]
sorted(set(nr["Homogeneity"].tolist()))[-100:]

# And to plot
vd(filtered_features, ml.labels,
   ml.cluster_hdbscan(filtered_features, 1, 6), learning_rate=400)
```
Results
After all kinds of analyses, HDBSCAN with min_samples=1 and min_cluster_size=6 seems to do quite well on the dataset with the full set of features. It does not outperform the alternatives in every case, but it delivers top-of-the-ranking performance with a reasonable number of clusters and a sensible cluster structure. Below is a table with results for various dataset combinations:
- full - the full set of features
- filtered - only the features that are denser than 10%, i.e. non-empty for more than 10% of the samples
- extended - the original (extended) set of features
- simple - the simple set of features
- relabelled - ground truth with labels that have fewer than 10 instances removed (see the sketch after this list)
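The relabelling step is just a frequency cut on the ground truth. A minimal sketch with pandas, assuming `ml.labels` is a DataFrame with a single `label` column:

```python
# Count how often each ground-truth label occurs
label_counts = ml.labels["label"].value_counts()

# Keep only samples whose label appears at least 10 times
frequent = label_counts[label_counts >= 10].index
relabelled = ml.labels[ml.labels["label"].isin(frequent)]

# Restrict the feature matrix to the surviving samples
relabelled_features = ml.features.loc[relabelled.index]
```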
Clustering performance metrics
| Data type | Adjusted Mutual Information | Adjusted Rand Index | Completeness | Homogeneity | Silhouette Coefficient | V-measure |
|---|---|---|---|---|---|---|
| filtered_extended | 0.217 | 0.107 | 0.497 | 0.436 | 0.305 | 0.465 |
| filtered_extended_relabelled | 0.142 | 0.079 | 0.208 | 0.326 | 0.305 | 0.254 |
| filtered_simple | 0.240 | 0.136 | 0.518 | 0.445 | 0.331 | 0.478 |
| filtered_simple_relabelled | 0.161 | 0.078 | 0.222 | 0.340 | 0.331 | 0.268 |
| full_extended | 0.217 | 0.107 | 0.497 | 0.436 | 0.305 | 0.465 |
| full_extended_relabelled | 0.142 | 0.079 | 0.208 | 0.326 | 0.305 | 0.254 |
| full_simple | 0.240 | 0.136 | 0.518 | 0.445 | 0.331 | 0.478 |
| full_simple_relabelled | 0.161 | 0.078 | 0.222 | 0.340 | 0.331 | 0.268 |
Clustering visualisations
t-SNE visualisations (ground truth vs. clustering results) were produced for the following combinations:

- Extended dataset - full feature set - ground truth
- Extended dataset - full feature set - clustering results
- Extended dataset - full feature set - relabelled - ground truth
- Extended dataset - full feature set - relabelled - clustering results
- Extended dataset - filtered full feature set - ground truth
- Extended dataset - filtered full feature set - clustering results
- Extended dataset - filtered full feature set - relabelled - ground truth
- Extended dataset - filtered full feature set - relabelled - clustering results
- Simple dataset - full feature set - ground truth
- Simple dataset - full feature set - clustering results
- Simple dataset - full feature set - relabelled - ground truth
- Simple dataset - full feature set - relabelled - clustering results
- Simple dataset - filtered feature set - ground truth
- Simple dataset - filtered feature set - clustering results
- Simple dataset - filtered feature set - relabelled - ground truth
- Simple dataset - filtered feature set - relabelled - clustering results
Future work
In the near future I'd like to test this approach on more data and possibly try out some other clustering algorithms.
The next natural step seems to be novelty and anomaly detection. It might be worth investigating whether it's possible to get probabilities out of HDBSCAN to facilitate this.
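On that last point, the hdbscan library already exposes per-point cluster membership strengths via its `probabilities_` attribute, and GLOSH outlier scores via `outlier_scores_`, which could be a starting point. A minimal sketch (the 0.5 and 0.9 thresholds are arbitrary, illustrative choices):

```python
import hdbscan

# Fit HDBSCAN with the parameters that worked well above
clusterer = hdbscan.HDBSCAN(min_samples=1, min_cluster_size=6)
clusterer.fit(filtered_features)

# Strength of each point's membership in its cluster (0 for noise points);
# weak members could be treated as candidate novelties
weak_members = clusterer.probabilities_ < 0.5

# GLOSH outlier scores: higher means more anomalous
outliers = clusterer.outlier_scores_ > 0.9
```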