Saturday, April 6, 2013

Finding the optimal K in k-means: an incremental k-means in Python

I was looking for a good implementation of an incremental k-means where I don't have to set K in advance. There are interesting papers on the subject (X-means, G-means, etc.), but I couldn't find any Python implementation.

I decided to write an incremental version on top of sklearn.
The idea is simple:
  1. Start at K = x (some small initial number of clusters).
  2. Identify the worst cluster based on an unsupervised measure (e.g. silhouette).
  3. Split the worst cluster into 2 clusters.
  4. Measure the global improvement with the new clusters.
  5. If the measure improves, keep the split and go back to step 2; otherwise stop.
You can find the source code in mlboost/clustering/ikmeans.py
Special thanks to the scikit-learn library for letting me prototype this version so quickly.
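
To make the five steps concrete, here is a minimal sketch of the loop using scikit-learn. It is not the code from mlboost/clustering/ikmeans.py. The function name `incremental_kmeans` and its parameters (`k_start`, `k_max`, `random_state`) are made up for illustration, and using the mean silhouette both to pick the worst cluster and to measure the global improvement is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score


def incremental_kmeans(X, k_start=2, k_max=10, random_state=0):
    """Hypothetical sketch: grow K by splitting the worst cluster
    for as long as the mean silhouette of the whole data set improves."""
    X = np.asarray(X, dtype=float)

    # Step 1: start at K = k_start (silhouette needs at least 2 clusters).
    km = KMeans(n_clusters=k_start, n_init=10, random_state=random_state)
    labels = km.fit_predict(X)
    best_score = silhouette_score(X, labels)
    k = k_start

    while k < k_max:
        # Step 2: the worst cluster is the one with the lowest mean silhouette.
        sample_scores = silhouette_samples(X, labels)
        cluster_ids = np.unique(labels)
        cluster_means = [sample_scores[labels == c].mean() for c in cluster_ids]
        worst = cluster_ids[int(np.argmin(cluster_means))]
        mask = labels == worst
        if mask.sum() < 2:
            break  # a single point cannot be split

        # Step 3: split the worst cluster into 2 with a local 2-means.
        sub = KMeans(n_clusters=2, n_init=10, random_state=random_state)
        sub_labels = sub.fit_predict(X[mask])
        candidate = labels.copy()
        idx = np.flatnonzero(mask)
        candidate[idx[sub_labels == 1]] = k  # one half gets the new label k

        # Step 4: measure the global quality with the new clusters.
        score = silhouette_score(X, candidate)

        # Step 5: keep the split only if it is an improvement.
        if score <= best_score:
            break
        labels, best_score, k = candidate, score, k + 1

    return labels, k, best_score
```

Calling `labels, k, score = incremental_kmeans(X)` on a 2-D array `X` returns the cluster label of each row, the number of clusters kept, and the final mean silhouette. Because the loop stops at the first split that does not improve the score, it is greedy and can stop early; `k_max` is only a safety bound.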
