This time I did clustering with DBSCAN and HDBSCAN. These are fairly simple yet effective density-based clustering algorithms that do not need the number of clusters specified up front. Additionally, I’ve implemented a method that evaluates clustering results with a variety of metrics: Adjusted Mutual Information, Adjusted Rand Index, Completeness, Homogeneity, Silhouette Coefficient, and V-measure. Finally, I ran loads of experiments to pick the best parameters for both algorithms.
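To make the evaluation step concrete, here is a minimal sketch of how the six scores could be computed with scikit-learn. It is not my original implementation: the function name `evaluate_clustering` and the toy data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn import metrics


def evaluate_clustering(X, labels_true, labels_pred):
    """Return the six scores as a dict (hypothetical helper, not the original code)."""
    scores = {
        "Adjusted Mutual Information": metrics.adjusted_mutual_info_score(labels_true, labels_pred),
        "Adjusted Rand Index": metrics.adjusted_rand_score(labels_true, labels_pred),
        "Completeness": metrics.completeness_score(labels_true, labels_pred),
        "Homogeneity": metrics.homogeneity_score(labels_true, labels_pred),
        "V-measure": metrics.v_measure_score(labels_true, labels_pred),
    }
    # Silhouette is the only score here that uses the features instead of the
    # ground truth; it is defined only for 2 <= n_labels <= n_samples - 1.
    n_labels = len(set(labels_pred))
    if 2 <= n_labels <= len(labels_pred) - 1:
        scores["Silhouette Coefficient"] = metrics.silhouette_score(X, labels_pred)
    else:
        scores["Silhouette Coefficient"] = float("nan")
    return scores


# Toy data (assumption): two tight, well-separated groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 1, 0, 0, 0]  # same partition, different cluster ids

scores = evaluate_clustering(X, labels_true, labels_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All the ground-truth-based scores ignore the actual cluster ids and compare only the partitions, which is why the swapped ids in the toy example still count as a perfect match.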
Choosing the best parameter settings
A snippet to produce clustering results for a variety of parameter settings.
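A minimal sketch of what such a sweep could look like is given here. It is an illustration under assumptions rather than the original snippet: it uses scikit-learn's `DBSCAN` on synthetic data, the grid values are made up, and only the Adjusted Rand Index is used for ranking. The same loop would apply to HDBSCAN with `min_cluster_size` and `min_samples` as the swept parameters.

```python
from itertools import product

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (assumption): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])
labels_true = np.array([0] * 50 + [1] * 50)

# Hypothetical grid; the values actually explored are not listed in the post.
param_grid = {"eps": [0.1, 0.5, 1.0], "min_samples": [3, 5, 10]}

results = []
for eps, min_samples in product(param_grid["eps"], param_grid["min_samples"]):
    labels_pred = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # DBSCAN marks noise with the label -1, which is not a real cluster.
    n_clusters = len(set(labels_pred)) - (1 if -1 in labels_pred else 0)
    results.append({
        "eps": eps,
        "min_samples": min_samples,
        "n_clusters": n_clusters,
        "n_noise": int(np.sum(labels_pred == -1)),
        "ari": adjusted_rand_score(labels_true, labels_pred),
    })

# Rank the settings by agreement with the ground truth.
best = max(results, key=lambda r: r["ari"])
print(best)
```

Recording the number of clusters and noise points next to the score helps to spot settings that score well only because they collapse everything into one or two clusters.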
Then the following code was used to produce the figures and results shown below:
Results
After all kinds of analyses, HDBSCAN with min_samples=1.0 and min_cluster_size=6 seems to do quite well on the dataset with the full set of features. It does not outperform the rest in every case, but it provides top-of-the-ranking performance with a reasonable number of clusters and a reasonable cluster structure. Below is a table with results for the various dataset combinations:
full - the full set of features
filtered - only the features that are denser than 10%
extended - the original set of features
simple - the simple set of features
relabelled - labels with fewer than 10 instances removed from the ground truth
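The "filtered" and "relabelled" variants can be sketched as follows. Two assumptions are made here that the definitions above do not spell out: the density of a feature is taken to be the share of instances with a non-zero value, and rare labels are replaced by a single placeholder instead of dropping the instances. The second assumption is consistent with the Silhouette Coefficient being identical for each dataset and its relabelled variant in the results table. The function names and toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def filter_dense_features(X, min_density=0.10):
    """Keep the columns whose share of non-zero entries exceeds min_density."""
    density = np.count_nonzero(X, axis=0) / X.shape[0]
    keep = density > min_density
    return X[:, keep], keep


def relabel_rare(labels, min_count=10, placeholder=-1):
    """Replace labels that occur fewer than min_count times with a placeholder."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else placeholder for lab in labels]


# Toy data (assumption): 20 instances, 3 features with densities 1.0, 0.05, 0.15.
X = np.zeros((20, 3))
X[:, 0] = 1.0
X[0, 1] = 1.0
X[:3, 2] = 1.0

X_filtered, keep = filter_dense_features(X)
print(keep)              # which columns survive the 10% threshold
print(X_filtered.shape)

labels = ["a"] * 12 + ["b"] * 8
relabelled = relabel_rare(labels)
print(Counter(relabelled))
```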
Clustering performance metrics
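| Data type                    | Adjusted Mutual Information | Adjusted Rand Index | Completeness | Homogeneity | Silhouette Coefficient | V-measure |
|------------------------------|-----------------------------|---------------------|--------------|-------------|------------------------|-----------|
| filtered_extended            | 0.217                       | 0.107               | 0.497        | 0.436       | 0.305                  | 0.465     |
| filtered_extended_relabelled | 0.142                       | 0.079               | 0.208        | 0.326       | 0.305                  | 0.254     |
| filtered_simple              | 0.240                       | 0.136               | 0.518        | 0.445       | 0.331                  | 0.478     |
| filtered_simple_relabelled   | 0.161                       | 0.078               | 0.222        | 0.340       | 0.331                  | 0.268     |
| full_extended                | 0.217                       | 0.107               | 0.497        | 0.436       | 0.305                  | 0.465     |
| full_extended_relabelled     | 0.142                       | 0.079               | 0.208        | 0.326       | 0.305                  | 0.254     |
| full_simple                  | 0.240                       | 0.136               | 0.518        | 0.445       | 0.331                  | 0.478     |
| full_simple_relabelled       | 0.161                       | 0.078               | 0.222        | 0.340       | 0.331                  | 0.268     |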
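As a sanity check on the numbers, V-measure is the harmonic mean of homogeneity and completeness (with the default weighting), so it can be recomputed from the other two columns. The short sketch below does this for the four distinct rows of the table; the helper name `v_measure` and the tolerance are my own choices for illustration.

```python
def v_measure(homogeneity, completeness, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness (beta=1: plain harmonic mean)."""
    if homogeneity + completeness == 0.0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)


# (homogeneity, completeness, reported V-measure) copied from the results table.
rows = {
    "filtered_extended": (0.436, 0.497, 0.465),
    "filtered_extended_relabelled": (0.326, 0.208, 0.254),
    "filtered_simple": (0.445, 0.518, 0.478),
    "filtered_simple_relabelled": (0.340, 0.222, 0.268),
}

# The table is rounded to three decimals, so allow a small tolerance.
TOLERANCE = 1e-3
checks = {}
for name, (h, c, reported) in rows.items():
    recomputed = v_measure(h, c)
    checks[name] = abs(recomputed - reported) < TOLERANCE
    print(f"{name}: recomputed={recomputed:.4f} reported={reported:.3f}")
```

The recomputed values agree with the reported ones to within rounding.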
Clustering visualisations
Extended dataset - full feature set - ground truth
Extended dataset - full feature set - clustering results
Extended dataset - full feature set - relabelled - ground truth
Extended dataset - full feature set - relabelled - clustering results
Extended dataset - filtered full feature set - ground truth
Extended dataset - filtered full feature set - clustering results
Extended dataset - filtered full feature set - relabelled - ground truth
Extended dataset - filtered full feature set - relabelled - clustering results
Simple dataset - full feature set - ground truth
Simple dataset - full feature set - clustering results
Simple dataset - full feature set - relabelled - ground truth
Simple dataset - full feature set - relabelled - clustering results
Simple dataset - filtered feature set - ground truth
Simple dataset - filtered feature set - clustering results
Simple dataset - filtered feature set - relabelled - ground truth
In the near future I’d like to test this approach on more data and possibly try out some other clustering algorithms.
It seems like the next natural step is novelty and anomaly detection. It might be worth investigating whether it’s possible to somehow get the probabilities out of HDBSCAN to facilitate this.