k-Means Clustering

Motivations
- Classification vs. clustering
- Here is an example.
Clustering
- Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
- It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

k-Means Clustering

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

Algorithm

generate k means, m₁, ..., m_k, using a random number generator;
while the change is not insignificant
    Assignment step:
        for each cluster i,
            S_i = {x_p | ∥x_p - m_i∥² < ∥x_p - m_j∥² ∀j, j ≠ i and 1 ≤ j ≤ k};
    Update step:
        for each cluster i,
            m_i = (1 / |S_i|) Σ_j(x_j ∈ S_i);  // Note x_j is a vector.

Exercise
- Exercise 1: Let's generate NO_POINTS=40 points in the above canvas, using draw().
  <script> NO_POINTS = 40; draw(); </script>
- Exercise 2: Let's decide 3 random means using a random generator. The x-value should be in [0, MAP_WIDTH), and y-value should be in [0, MAP_HEIGHT). This exercise should be done after Exercise 1.
  <script> function initialize_means() { var means = []; means[0] = {x: Math.random() * MAP_WIDTH, y: Math.random() * MAP_HEIGHT}; means[1] = {x: Math.random() * MAP_WIDTH, y: Math.random() * MAP_HEIGHT}; means[2] = {x: Math.random() * MAP_WIDTH, y: Math.random() * MAP_HEIGHT}; return means; } means = initialize_means(); draw_means(); </script>
- Exercise 3: Assignment step. This exercise should be done after Exercise 2.
  <script> function distance(point1, point2) { return Math.sqrt(Math.pow(point1.x - point2.x, 2) + Math.pow(point1.y - point2.y, 2)); } function assignment_step() { for (var i = 0; i < 3; i++) { clusters[i] = []; for (var j = 0; j < points.length; j++) { var k; for (k = 0; k < 3; k++) { if (k == i) continue; if (Math.pow(distance(points[j], means[i]), 2) >= Math.pow(distance(points[j], means[k]), 2)) break; } if (k == means.length) clusters[i].push(points[j]); } } } function update_step() { for (var i = 0; i < 3; i++) { var sum_x = 0; var sum_y = 0; for (var j = 0; j < clusters[i].length; j++) { sum_x += clusters[i][j].x; sum_y += clusters[i][j].y; } means[i].x = sum_x / clusters[i].length; means[i].y = sum_y / clusters[i].length; } } assignment_step(); update_step(); draw_clusters(); draw_means(); </script>
How to determine k?
- How to determine the number of clusters?
- Read "Choosing K" from Introduction to K-means Clustering
References
- Introduction to K-means Clustering

5.1 k-Means Clustering