K-Means Clustering in C#

In many real-world problems, we do not have labeled data. Instead of knowing the correct output for each input, we only have raw data and need to discover hidden patterns inside it. This type of problem is handled by unsupervised learning algorithms, and one of the most popular methods in this category is K-Means Clustering.

K-Means is a simple yet powerful algorithm that groups similar data points into clusters. The goal is to divide a dataset into K distinct groups, where each data point belongs to the cluster with the nearest center (called a centroid).

What is K-Means Clustering?

K-Means Clustering is an iterative algorithm that partitions data into K clusters based on distance. Each cluster is represented by a centroid, which is the mean (average position) of all points in that cluster.

Objective: Minimize the total squared distance between data points and their assigned cluster centroids (the within-cluster sum of squares)
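The objective can be written down as a small function. The sketch below assumes the usual formulation of K-Means, which sums squared Euclidean distances within each cluster; the names `ClusterObjective` and `SumOfSquaredDistances` are hypothetical and do not appear elsewhere in this article.

```csharp
using System;
using System.Collections.Generic;

public static class ClusterObjective
{
    // Sum of squared Euclidean distances from each point to its assigned centroid.
    // assignments[i] is the index into centroids for points[i].
    public static double SumOfSquaredDistances(
        IReadOnlyList<double[]> points,
        IReadOnlyList<double[]> centroids,
        IReadOnlyList<int> assignments)
    {
        if (points.Count != assignments.Count)
            throw new ArgumentException("Each point needs exactly one assignment.");

        double total = 0.0;

        for (int i = 0; i < points.Count; i++)
        {
            double[] p = points[i];
            double[] c = centroids[assignments[i]];

            // Accumulate the squared difference for every coordinate
            for (int d = 0; d < p.Length; d++)
            {
                double diff = p[d] - c[d];
                total += diff * diff;
            }
        }

        return total;
    }
}
```

For the points (0, 0), (2, 0) and (10, 0) with centroids (1, 0) and (10, 0) and assignments 0, 0, 1, the result is 1 + 1 + 0 = 2. A lower value means the points sit closer to their centroids.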

The algorithm repeatedly updates cluster assignments and recalculates centroids until the system converges, meaning that cluster assignments no longer change significantly.

K-Means is sensitive to the initial placement of centroids, which can affect the final clustering result.
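Because the starting centroids influence the outcome, it can help to make the random choice reproducible. The following is a minimal sketch, assuming centroids are seeded from K distinct data points; `CentroidInit`, `PickRandom` and the `seed` parameter are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class CentroidInit
{
    // Picks k distinct data points as starting centroids using a
    // partial Fisher-Yates shuffle driven by a seeded Random.
    public static List<double[]> PickRandom(IReadOnlyList<double[]> data, int k, int seed)
    {
        if (k <= 0 || k > data.Count)
            throw new ArgumentOutOfRangeException(nameof(k), "k must be between 1 and the number of points.");

        var rand = new Random(seed);

        int[] indices = new int[data.Count];
        for (int i = 0; i < indices.Length; i++)
            indices[i] = i;

        var centroids = new List<double[]>(k);

        for (int i = 0; i < k; i++)
        {
            // Swap a random not-yet-chosen index into position i
            int j = rand.Next(i, indices.Length); // upper bound is exclusive
            (indices[i], indices[j]) = (indices[j], indices[i]);

            // Copy the point so later centroid updates never touch the input data
            centroids.Add((double[])data[indices[i]].Clone());
        }

        return centroids;
    }
}
```

Calling `PickRandom` twice with the same seed in the same program returns the same starting centroids, which makes it easier to compare runs. Trying several seeds and comparing the resulting clusters is one way to see how much the initialization matters for a given dataset.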

Where is K-Means Used?

K-Means is widely used in applications where grouping similar items is more important than predicting exact values. It is one of the most common clustering algorithms in data science and machine learning.

  • Customer segmentation in marketing
  • Image compression and color quantization
  • Anomaly detection in systems and networks
  • Document and text clustering
  • Recommendation systems
  • Geographic data clustering

How K-Means Works

The algorithm follows a simple iterative process:

  • Step 1: Choose K (number of clusters)
  • Step 2: Initialize K centroids randomly
  • Step 3: Assign each point to the nearest centroid
  • Step 4: Recalculate centroids based on assigned points
  • Step 5: Repeat until convergence

The key idea is to continuously refine cluster centers so that each centroid represents the average of its assigned points.
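Steps 3 and 4 can be sketched on one-dimensional data, where the distance between two values is just their absolute difference. This is an illustrative sketch only; `KMeansStep` and `Iterate` are hypothetical names, and ties are resolved in favor of the lower cluster index.

```csharp
using System;

public static class KMeansStep
{
    // Performs one assignment-and-update pass on 1D data
    // and returns the updated centroids.
    public static double[] Iterate(double[] points, double[] centroids)
    {
        int k = centroids.Length;
        double[] sums = new double[k];
        int[] counts = new int[k];

        // Step 3: assign each point to its nearest centroid
        foreach (double p in points)
        {
            int best = 0;

            for (int c = 1; c < k; c++)
            {
                if (Math.Abs(p - centroids[c]) < Math.Abs(p - centroids[best]))
                    best = c;
            }

            sums[best] += p;
            counts[best]++;
        }

        // Step 4: move each centroid to the mean of its assigned points;
        // a centroid with no points stays where it was
        double[] updated = new double[k];

        for (int c = 0; c < k; c++)
            updated[c] = counts[c] > 0 ? sums[c] / counts[c] : centroids[c];

        return updated;
    }
}
```

Starting from centroids 0 and 5 with the points 1, 2, 9 and 10, one pass moves the centroids to 1.5 and 9.5. A second pass returns the same values, which is the convergence condition from step 5.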

Distance Calculation

K-Means typically uses Euclidean distance to measure similarity between points.

distance = √((x2 - x1)² + (y2 - y1)²)

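Each point is assigned to the cluster whose centroid is nearest. The formula above is the two-dimensional case; for more dimensions, add one squared difference per coordinate.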
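As a sketch, the same distance can be computed for points with any number of coordinates. The names `Distances` and `Euclidean` are hypothetical and chosen only for this example.

```csharp
using System;

public static class Distances
{
    // Euclidean distance between two points of equal dimension.
    public static double Euclidean(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Points must have the same number of coordinates.");

        double sum = 0.0;

        // Add the squared difference of each coordinate pair
        for (int i = 0; i < a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }

        return Math.Sqrt(sum);
    }
}
```

For the points (0, 0) and (3, 4), the squared differences are 9 and 16, so the distance is √25 = 5.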

Implementing K-Means in C#

The following example shows a simplified K-Means implementation in C#. It supports 2D data points for clarity and educational purposes.

using System;
using System.Collections.Generic;
using System.Linq;

public class KMeans
{
    public int K { get; set; }
    public List<double[]> Centroids { get; set; }

    public KMeans(int k)
    {
        K = k;
        Centroids = new List<double[]>();
    }

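    public void Fit(List<double[]> data, int iterations)
    {
        // K distinct starting points are needed, so reject smaller datasets
        if (data == null || K <= 0 || data.Count < K)
            throw new ArgumentException("data must contain at least K points.", nameof(data));

        Random rand = new Random();

        // Initialize centroids as copies of K randomly chosen data points
        Centroids = data.OrderBy(x => rand.Next())
                        .Take(K)
                        .Select(p => (double[])p.Clone())
                        .ToList();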

        for (int iter = 0; iter < iterations; iter++)
        {
            List<List<double[]>> clusters = new List<List<double[]>>();

            for (int i = 0; i < K; i++)
                clusters.Add(new List<double[]>());

            // Assign points to nearest centroid
            foreach (var point in data)
            {
                int bestCluster = 0;
                double bestDistance = double.MaxValue;

                for (int c = 0; c < K; c++)
                {
                    double dist = Distance(point, Centroids[c]);

                    if (dist < bestDistance)
                    {
                        bestDistance = dist;
                        bestCluster = c;
                    }
                }

                clusters[bestCluster].Add(point);
            }

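            // Recalculate each centroid as the mean of its assigned points
            bool moved = false;

            for (int c = 0; c < K; c++)
            {
                // An empty cluster keeps its previous centroid
                if (clusters[c].Count == 0) continue;

                double[] newCentroid = new double[2];

                foreach (var p in clusters[c])
                {
                    newCentroid[0] += p[0];
                    newCentroid[1] += p[1];
                }

                newCentroid[0] /= clusters[c].Count;
                newCentroid[1] /= clusters[c].Count;

                if (newCentroid[0] != Centroids[c][0] || newCentroid[1] != Centroids[c][1])
                    moved = true;

                Centroids[c] = newCentroid;
            }

            // Converged: no centroid moved, so the assignments cannot change
            if (!moved) break;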
        }
    }

    private double Distance(double[] a, double[] b)
    {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return Math.Sqrt(dx * dx + dy * dy);
    }
}
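Once `Fit` has run, a common follow-up is to look up which cluster a new point falls into. The sketch below is not part of the class above; `NearestCluster` and `IndexOf` are hypothetical names, and it compares squared distances because the square root does not change which centroid is closest.

```csharp
using System;
using System.Collections.Generic;

public static class NearestCluster
{
    // Returns the index of the centroid closest to the given point.
    public static int IndexOf(IReadOnlyList<double[]> centroids, double[] point)
    {
        if (centroids.Count == 0)
            throw new ArgumentException("At least one centroid is required.");

        int best = 0;
        double bestSquared = double.MaxValue;

        for (int c = 0; c < centroids.Count; c++)
        {
            // Squared Euclidean distance between the point and centroid c
            double squared = 0.0;

            for (int d = 0; d < point.Length; d++)
            {
                double diff = point[d] - centroids[c][d];
                squared += diff * diff;
            }

            if (squared < bestSquared)
            {
                bestSquared = squared;
                best = c;
            }
        }

        return best;
    }
}
```

With a fitted `KMeans` instance named `model`, a call such as `NearestCluster.IndexOf(model.Centroids, new[] { 1.0, 1.0 })` returns the cluster index for that point.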

Libraries in .NET Ecosystem

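While K-Means can be implemented manually, production systems often rely on optimized libraries. ML.NET and Accord.NET ship ready-made K-Means implementations, while the remaining libraries below provide numerical and machine learning building blocks.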

  • ML.NET
  • Accord.NET
  • MathNet.Numerics
  • SharpLearning
  • TensorFlow.NET

K-Means vs Other Approaches

Feature            K-Means     Hierarchical Clustering
Complexity         Low         Higher
Scalability        High        Lower
Interpretability   Moderate    High
Cluster Shape      Spherical   Flexible
Performance        Fast        Slower

Summary

K-Means Clustering is one of the simplest and most widely used unsupervised learning algorithms. It groups data into meaningful clusters based on similarity, making it extremely useful for exploratory data analysis.

Although it is computationally efficient and easy to implement, it has limitations such as sensitivity to initial centroids and difficulty handling non-spherical clusters. Despite this, it remains a foundational algorithm in machine learning and data science.

Conclusion

K-Means provides a practical introduction to unsupervised learning and clustering concepts. In C#, it can be implemented with relatively simple code, but in production systems, optimized ML libraries are preferred. Understanding K-Means helps developers build intuition for more advanced clustering techniques and machine learning pipelines.
