5.1 k-Means Clustering

  1. Motivations
    • Classification vs. clustering
    • Here is an example.
  2. Clustering    
    • Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
    • It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
  3. k-Means Clustering
    • k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
    • k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
    • Algorithm
      generate k means, m1, ..., mk, using a random number generator;
      while the change is not insignificant
          Assignment step:
              for each cluster i,
                  Si = {xp | ∥xp - mi2 < ∥xp - mj2 ∀j, j ≠ i and 1 ≤ j ≤ k};
          Update step:
              for each cluster i,
                  mi = (1 / |Si|) Σj(xj ∈ Si);  // Note xj is a vector.
      
  4. Exercise



    • Exercise 1: Let's generate NO_POINTS=40 points in the above canvas, using draw().



    • Exercise 2: Let's decide 3 random means using a random generator. The x-value should be in [0, MAP_WIDTH), and y-value should be in [0, MAP_HEIGHT). This exercise should be done after Exercise 1.



    • Exercise 3: Assignment step. This exercise should be done after Exercise 2.


  5. How to determine k?
  6. References