Introduction

What?

Clustering is an unsupervised classification method from statistical analysis. It covers a family of learning algorithms whose purpose is to group unlabelled data with similar properties. Within each cluster, data points share a common characteristic. To form the groups, a clustering algorithm measures the proximity between elements according to defined criteria.

When?

Clustering is particularly useful when labelling the data is expensive. There are different types of clustering methods, and the method must be chosen carefully depending on the expected outcome and the intended use of the data.

How?

For each method, we must choose how to measure the similarity between two individuals. Each of the n individuals can be viewed as a "point" in the p-dimensional real space R^p, where p is the number of variables. We therefore need a distance function between two such points, such as the Euclidean distance.
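As an illustration, the Euclidean distance between two individuals x and y in R^p is the square root of the sum over the p variables of (x_j - y_j)^2. The following is a minimal sketch using only the Python standard library. The function names `euclidean_distance` and `pairwise_distances` and the toy data are assumptions made for this example, not part of any particular clustering library.

```python
import math
from typing import List, Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two individuals described by the same p variables."""
    if len(x) != len(y):
        raise ValueError("both individuals must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_distances(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build the n x n matrix of distances between all pairs of individuals."""
    return [[euclidean_distance(a, b) for b in points] for a in points]


# Hypothetical toy data: n = 3 individuals, each described by p = 2 variables.
individuals = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = pairwise_distances(individuals)

for row in distances:
    print(row)
```

The resulting matrix is symmetric with zeros on the diagonal, since every individual is at distance zero from itself. A clustering method would typically take such distances as its input for deciding which individuals are close enough to share a cluster.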