What Is Cluster Analysis?

Cluster analysis is the process of grouping a collection of physical or abstract objects into classes made up of similar objects. Grouping similar things together is a basic human activity.

The goal of cluster analysis is to organize data into classes on the basis of similarity. Clustering has roots in many fields, including mathematics, computer science, statistics, biology, and economics, and many clustering techniques have been developed in different application areas. These techniques are used to describe data, measure the similarity between data sources, and classify data sources into different clusters.

Cluster analysis differences

The difference between clustering and classification is that the classes into which clustering divides the data are not known in advance.
Clustering is the process of partitioning data into classes or clusters such that objects within the same cluster are highly similar, while objects in different clusters are highly dissimilar.
From a statistical point of view, cluster analysis is a way of simplifying data through data modeling. Traditional statistical clustering methods include systematic (hierarchical) clustering, splitting, joining, dynamic clustering, ordered-sample clustering, overlapping clustering, and fuzzy clustering. Clustering tools based on k-means, k-medoids, and other algorithms have been added to many well-known statistical packages, such as SPSS and SAS.
From a machine learning perspective, clusters correspond to hidden patterns, and clustering is an unsupervised learning process that searches for them. Unlike classification, unsupervised learning does not rely on predefined classes or on training examples with class labels: a clustering algorithm must determine the labels automatically, whereas in classification the training instances carry class labels. Clustering is learning by observation rather than learning from examples.
Cluster analysis is an exploratory analysis: no classification standard has to be given in advance, since cluster analysis starts from the sample data and classifies it automatically. Different methods often lead to different conclusions, and different researchers applying cluster analysis to the same data set may obtain different numbers of clusters.
From the perspective of practical applications, cluster analysis is one of the main tasks of data mining. Clustering can be used as a standalone tool to examine the distribution of the data, observe the characteristics of each cluster, and focus further analysis on particular clusters of interest. It can also serve as a preprocessing step for other algorithms, such as classification and qualitative induction algorithms.

Cluster analysis definition

Cluster analysis classifies research objects (samples or indicators) according to their characteristics, reducing the number of objects to be studied.
It applies when reliable historical data are lacking and the number of categories cannot be determined in advance; the purpose is to group things of similar nature into the same category.
The indicators (variables) used are assumed to be correlated to some degree.
Cluster analysis is a group of statistical techniques that divides research objects into relatively homogeneous clusters. It differs from classification (discriminant) analysis, which is supervised learning.
Variable types: categorical variables and quantitative (discrete or continuous) variables.

Cluster analysis methods

1. Hierarchical clustering: agglomerative (merging) methods, divisive (splitting) methods, dendrograms
2. Non-hierarchical clustering: partitioning clustering, spectral clustering
Clustering method features:
  • Cluster analysis is simple and intuitive.
  • Cluster analysis is mainly used in exploratory research. Its results can suggest several possible solutions; choosing the final solution requires the researcher's subjective judgment and subsequent analysis.
  • Whether or not genuinely distinct categories exist in the data, cluster analysis will always produce a solution divided into some number of classes.
  • The solution depends entirely on the clustering variables the researcher selects; adding or deleting variables may have a substantial impact on the final solution.
  • Researchers should pay special attention to the various factors that may affect the results when using cluster analysis.
  • Outliers and atypical variables have a strong influence on the clustering. When variables are measured on inconsistent scales, standardize them in advance.
What cluster analysis cannot do:
  • Automatically discover how many classes the data should be divided into (it is an unsupervised method);
  • Find classes or market segments of roughly equal size (expecting this is unrealistic);
  • Decide whether to cluster samples or variables (the researcher must determine this);
  • Automatically give an optimal clustering result.
The cluster analysis discussed here mainly covers hierarchical clustering, k-means, and two-step clustering.
A similarity measure describes the degree of correspondence or closeness between two individuals (or two variables) on the clustering variables.
It can be measured in two ways:
1. Use indicators that describe the closeness between pairs of individuals (or variables), such as "distance": the smaller the distance, the more similar the individuals (variables).
2. Use indicators of the degree of similarity, such as the "correlation coefficient": individuals (variables) with a larger correlation coefficient are more similar.
Many distance indexes D can be used, chosen according to the nature of the data: Euclidean distance, squared Euclidean distance, Manhattan distance, Chebyshev distance, the chi-square measure, and so on. Among similarity measures, the Pearson correlation coefficient is the most common. (A short code sketch of these measures follows the list below.)
  • When the clustering variables have different measurement scales, standardize the variables beforehand;
  • If some clustering variables are highly correlated with each other, those variables effectively receive greater weight;
  • Squared Euclidean distance is the most commonly used distance measure;
  • The choice of clustering algorithm affects the results more than the choice of distance measure;
  • The standardization method affects the clustering pattern: standardizing variables tends to produce clusters based on level (magnitude), while standardizing samples tends to produce clusters based on pattern (shape);
  • In practice the number of clusters is usually 4 to 6; too many or too few is undesirable. [1]
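The distance and similarity measures above can be computed directly. Below is a minimal Python sketch, assuming NumPy and SciPy are available; the data values are illustrative only.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(a, b))    # Euclidean distance
print(distance.sqeuclidean(a, b))  # squared Euclidean distance
print(distance.cityblock(a, b))    # Manhattan (city-block) distance
print(distance.chebyshev(a, b))    # Chebyshev distance

# Pearson correlation coefficient as a similarity measure
r = np.corrcoef(a, b)[0, 1]
print(r)

# Standardization (z-scores), recommended when measurement scales differ
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
```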

Cluster analysis statistics

Group centroid (center of gravity)
Group center
Between-group distance

Cluster analysis hierarchical steps

Define the problem and choose the clustering variables
Choose a clustering method
Determine the number of clusters
Evaluate the clustering results
Describe and interpret the results (a code sketch of these steps follows)
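As a rough illustration of these steps, here is a minimal sketch using SciPy's hierarchical clustering; the data, the Ward linkage method, and the choice of three clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Step 1: the chosen clustering variables (illustrative data)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Step 2: agglomerative clustering with Ward linkage (dendrogram data)
Z = linkage(X, method="ward")

# Step 3: determine the number of groups (here, cut the tree into 3)
labels = fcluster(Z, t=3, criterion="maxclust")

# Steps 4-5: evaluate and describe the resulting groups
print(labels)
```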

Cluster analysis K-means

A non-hierarchical clustering method.
(1) Implementation process (a minimal code sketch follows the method characteristics below)
Initialization: select (or specify) certain records as the initial aggregation points
Loop:
assign each remaining record to the nearest aggregation point
compute the center (mean) of each resulting cluster
re-cluster the records using the computed centers
Repeat this loop until the positions of the aggregation points converge
(2) Method characteristics
The number of clusters usually needs to be known in advance
The initial positions can be specified manually
Saves computing time
Worth considering when the sample size is greater than 100
Uses only continuous variables
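The loop described above can be written out directly. Below is a minimal NumPy sketch, assuming Euclidean distance and random selection of the initial aggregation points; the data, k, and tolerance are illustrative, and no cluster is assumed to become empty.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # initialization: pick k records as the initial aggregation points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each record to the nearest aggregation point
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # compute the center (mean) of each resulting cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop when the positions of the aggregation points converge
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```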

Cluster analysis two-step clustering

Features:
Handles both categorical and continuous variables
Automatically determines the optimal number of clusters
Processes large data sets quickly
Prerequisites:
The variables are independent of one another
Categorical variables follow a multinomial distribution, and continuous variables follow a normal distribution
The model is fairly robust to violations of these assumptions

Cluster analysis algorithm principle

Step 1: scan the samples one by one; each sample is either assigned to an existing class or used to start a new class, depending on its distance from the samples already scanned.
Step 2: merge the classes formed in step 1 according to the distances between the classes, stopping the merging according to a chosen criterion. (A rough code sketch of step 1 follows.)
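Below is a rough sketch of step 1 under simplifying assumptions: Euclidean distance and a fixed distance threshold. (The actual two-step algorithm, as implemented in SPSS for example, uses a CF-tree and a log-likelihood distance; this is only an illustration of the scan-and-assign idea.)

```python
import numpy as np

def scan_step(X, threshold):
    """Step 1: scan samples; assign each to an existing class if it is
    within `threshold` of that class's center, otherwise start a new class."""
    centers, members = [], []
    for x in X:
        if centers:
            d = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(d))
            if d[j] <= threshold:
                members[j].append(x)
                centers[j] = np.mean(members[j], axis=0)  # update class center
                continue
        centers.append(x.astype(float))  # start a new class
        members.append([x])
    return centers, members

# Step 2 (not shown) would merge these classes hierarchically,
# using between-class distances, until a stopping criterion is met.
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
centers, members = scan_step(X, threshold=1.0)
print(len(centers))  # number of pre-clusters found in step 1
```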

Discriminant Analysis

Taxonomy is a basic science of human understanding of the world. Cluster analysis and discriminant analysis are the basic methods for classifying things, and they are widely used in the natural sciences, the social sciences, and industrial and agricultural production.
Discriminant analysis (DA) covers: an overview, the DA model, DA-related statistics, two-group DA, and case analysis.
Discriminant analysis finds discriminant functions from the values of variables that characterize objects whose class membership is known, and then classifies objects of unknown class according to those functions. Its core is to examine the differences between categories.
Difference from clustering: discriminant analysis requires both the values of a series of numerical variables reflecting the characteristics of things and the known class membership of each individual.
DA applies when the dependent variable is categorical and the independent (predictor) variables may be of any type.
Two groups: one discriminant function;
Multiple groups: more than one discriminant function.
DA purposes:
Establish discriminant functions
Check whether the groups differ significantly on the predictors
Decide which predictors contribute most to the differences between groups
Classify individuals based on the predictors

Discriminant analysis model

First establish a discriminant function Y = a1X1 + a2X2 + ... + anXn, where Y is the discriminant score (discriminant value), X1, X2, ..., Xn are variables reflecting the characteristics of the research object, and a1, a2, ..., an are the discriminant coefficients. (A code sketch fitting such a function follows.)
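As a minimal sketch, a linear discriminant function of this form can be fitted with scikit-learn; the two-group data below are illustrative, and the estimated coefficients a1, ..., an correspond to `lda.coef_`.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative two-group data: rows are individuals, columns are X1, X2
X = np.array([[2.0, 3.0], [3.0, 3.5], [3.5, 4.0],
              [6.0, 7.0], [7.0, 8.0], [6.5, 7.5]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

print(lda.coef_, lda.intercept_)   # coefficients a1..an and the constant
scores = lda.decision_function(X)  # discriminant score Y for each individual
print(lda.predict(X))              # classify individuals from their scores
print(lda.score(X, y))             # hit ratio on the analysis sample
```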

Discriminant analysis related statistics

Canonical correlation coefficient
Eigenvalues
Wilks' lambda (between 0 and 1): Λ = SSw / SSt for each X
Group centroid
Classification matrix

Discriminant analysis steps

Define the problem
Estimate the DA function coefficients
Determine the significance of the DA function
Interpret the results
Assess validity
Defining the problem is the first step in discriminant analysis. The second step is to divide the sample into:
an analysis sample
a validation sample
Estimating the discriminant function coefficients: the direct method estimates the discriminant function using all predictors simultaneously, so every independent variable is included regardless of its discriminating power. This method is suitable when previous research or a theoretical model indicates which independent variables should be included.
In stepwise discriminant analysis, predictors are introduced one at a time according to their ability to discriminate between groups.
Determine significance
Null hypothesis: in the population, the means of all discriminant functions are equal across all groups.
Eigenvalues
Canonical correlation coefficient
Wilks' lambda (between 0 and 1), converted to a chi-square test
See travel.spo
Interpret the results
The sign of a coefficient does not matter in itself, but it indicates the direction of each variable's effect on the discriminant-function value and its association with a particular group.
The relative importance of the variables can be judged initially from the absolute values of the standardized discriminant-function coefficients.
The relative importance of the predictors can also be judged by examining the structure correlation coefficients.
Group centroids
Assess the validity of the discriminant analysis
Multiply the discriminant weights estimated from the analysis sample by the predictor values of the holdout (validation) cases to obtain a discriminant score for each holdout case.
Cases can then be assigned to groups according to their discriminant scores and an appropriate decision rule.
The hit ratio, or proportion of cases correctly classified, is the sum of the diagonal elements of the classification matrix divided by the total number of cases.
Compare the percentage of cases correctly classified with the percentage that would be correctly classified by chance.

Factor analysis model

Factor Analysis (FA) Model
The basic idea of FA:
"Factor analysis" was proposed by Thurstone in 1931; the concept originated in the statistical analyses of Pearson and Spearman.
FA describes the relationships among many variables using a few factors; highly correlated variables belong to the same factor.
FA uses latent variables, or essential factors (basic characteristics), to explain the observable variables.
FA model:
X1 = a11F1 + a12F2 + ... + a1pFp + v1
X2 = a21F1 + a22F2 + ... + a2pFp + v2
...
Xi = ai1F1 + ai2F2 + ... + aipFp + vi
...
Xm = am1F1 + am2F2 + ... + ampFp + vm
or, in matrix form, X = AF + V, where
Xi is the i-th standardized variable,
aip is the standardized regression coefficient (loading) of the i-th variable on the p-th common factor,
F are the common factors,
vi is the specific factor of the i-th variable.
Common factor model:
F1 = W11X1 + W12X2 + ... + W1mXm
F2 = W21X1 + W22X2 + ... + W2mXm
...
Fi = Wi1X1 + Wi2X2 + ... + WimXm
...
Fp = Wp1X1 + Wp2X2 + ... + WpmXm
where Wij are the weights (factor score coefficients) and Fi is the estimate of the i-th factor (the factor score). (A code sketch of fitting this model follows.)
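A minimal sketch of the FA model X = AF + V using scikit-learn; the synthetic data, the loading matrix A, and the choice of two factors are illustrative assumptions. The estimated loadings correspond to `fa.components_`, and `fa.transform` returns the factor scores of the common factor model.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 2))                  # two latent common factors
A = np.array([[0.9, 0.0], [0.8, 0.1],
              [0.1, 0.9], [0.0, 0.8]])         # illustrative loading matrix A
X = F @ A.T + 0.1 * rng.normal(size=(200, 4))  # observed variables plus specific factors

fa = FactorAnalysis(n_components=2)
fa.fit(X)
print(fa.components_)     # estimated loadings (rows are factors)
scores = fa.transform(X)  # factor scores, the F estimates
```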
Related statistics:
Bartlett's test of sphericity: tests the null hypothesis that the variables are uncorrelated
KMO statistic: measures the suitability of the data for FA
Factor loading: the correlation coefficient between a variable and a factor
Factor loading matrix
Communality (common factor variance)
Eigenvalues
Percentage of variance (variance contribution rate)
Cumulative variance contribution rate
Factor loading plot
Scree plot
FA steps:
Define the problem
Check the applicability of the FA method
Choose the factor extraction method
Rotate the factors
Interpret the factors
Calculate factor scores
Precautions:
The sample size must not be too small
The variables must be correlated
The common factors should have practical (interpretable) meaning

Cluster analysis main applications

Cluster analysis business

Cluster analysis is used to discover distinct customer groups and to characterize them by their purchase patterns.
Cluster analysis is an effective tool for market segmentation. It can also be used to study consumer behavior, find new potential markets, select test markets, and serve as a preprocessing step for multivariate analysis.

Cluster analysis biology

Cluster analysis is used to classify plants and animals and to classify genes, providing insight into the inherent structure of populations.

Cluster analysis geography

Clustering can help identify areas of similar land use in an Earth-observation database.

Cluster analysis insurance industry

Cluster analysis can identify groups of motor-insurance policyholders with a high average claim cost, and can also group a city's properties by housing type, value, and geographic location.

Cluster analysis internet

Cluster analysis is used to classify documents on the Web for information discovery.

Cluster analysis e-commerce

Cluster analysis is also an important aspect of data mining for e-commerce site construction. By grouping customers with similar browsing behavior and analyzing their common characteristics, it helps e-commerce operators understand their customers better and provide them with more suitable services.

Cluster analysis main steps

1. Data preprocessing;
2. Define a distance function to measure the similarity between data points;
3. Cluster or group the data;
4. Evaluate the output.
Data preprocessing includes selecting the number, type, and scale of the features, and relies on feature selection and feature extraction. Feature selection chooses important features; feature extraction transforms the input features into new salient features. Both are used to obtain an appropriate feature set for clustering and to avoid the "curse of dimensionality". Data preprocessing also includes removing outliers. Because outliers do not follow the general behavior or model of the data, they often bias the clustering results, so they must be removed to obtain a correct clustering.
Since similarity is the basis for defining a cluster, measuring the similarity between data points in the same feature space is essential to the clustering step. Because of the diversity of feature types and scales, the distance measure must be chosen carefully; it often depends on the application. The dissimilarity of different objects is usually evaluated by defining a distance metric on the feature space, and many distances are used across different fields. A simple metric such as Euclidean distance is often used to reflect the dissimilarity between data points, while similarity measures such as PMC and SMC can be used to characterize the conceptual similarity of data. In image clustering, the error between sub-images can be used to measure the similarity of two images.
Partitioning the data objects into classes is the key step, and data can be assigned to classes by different methods. Partitioning methods and hierarchical methods are the two main families of cluster analysis. A partitioning method generally starts from an initial partition and then optimizes a clustering criterion. In crisp clustering, each data point belongs to exactly one class; in fuzzy clustering, each data point may belong to any class with some degree of membership. Crisp and fuzzy clustering are the two main variants of the partitioning approach. A hierarchical method, by contrast, generates a nested series of partitions according to some criterion, using measures of the similarity between classes or the separability of a class to merge and split classes. Other clustering methods include density-based, model-based, and grid-based clustering.
Assessing the quality of the clustering result is another important stage. Clustering is an unsupervised procedure, and there is no objective external criterion for evaluating its results; instead, cluster validity indices are used. Generally, geometric properties, including the separation between classes and the cohesion within classes, are used to evaluate the quality of a clustering. Validity indices also play an important role in determining the number of classes: the optimum of a given index is expected to occur at the true number of classes, so a common method for determining the number of classes is to choose the value at which a specific validity index is best. Whether an index can truly recover the number of classes is the criterion of its own validity. Many existing criteria give good results for well-separated data sets but typically fail for complex data sets, for example when classes overlap. (A short code sketch of index-based evaluation follows.)
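As a minimal sketch of index-based evaluation, the silhouette coefficient combines within-class cohesion and between-class separation, and its maximum over candidate values of k is one heuristic for choosing the number of classes. The synthetic data and the range of k below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (3, 3), (0, 4))])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # pick the k with the best index value
```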

Cluster analysis algorithm

Cluster analysis is a very active research area in data mining, and many clustering algorithms have been proposed. Traditional clustering algorithms can be divided into five categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
1 Partitioning methods (PAM: PArtitioning Method) first create k partitions, where k is the number of partitions to construct, and then use an iterative relocation technique to improve partition quality by moving objects from one partition to another. Typical partitioning methods include:
k-means, k-medoids, CLARA (Clustering LARge Application),
CLARANS (Clustering Large Application based upon RANdomized Search).
FCM
2 Hierarchical methods create a hierarchical decomposition of a given data set. They operate in two modes: top-down (divisive) and bottom-up (agglomerative). To compensate for the rigidity of splitting and merging, hierarchical methods are often combined with other clustering techniques, such as iterative relocation. Typical methods include:
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) method, which first uses the tree structure to divide the object set; then uses other clustering methods to optimize these clusters.
CURE (Clustering Using REpresentatives) method, which uses a fixed number of representative objects to represent each cluster, then shrinks the representatives by a specified fraction toward the cluster center.
ROCK method, which uses the connection between clusters for clustering and merging.
CHAMELEON method, which constructs a dynamic model during hierarchical clustering.
3 Density-based methods cluster objects according to density, growing clusters based on the density of objects in the neighborhood (as in DBSCAN). Typical density-based methods include the following (see the code sketch after this list):
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): this algorithm grows regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a spatial database containing noise. It defines a cluster as a maximal set of "density-connected" points.
OPTICS (Ordering Points To Identify the Clustering Structure): does not explicitly produce a clustering, but computes an augmented cluster ordering for automatic and interactive cluster analysis.
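A minimal sketch of density-based clustering with scikit-learn's DBSCAN; the data and the eps/min_samples parameters are illustrative, and the label -1 marks noise points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [5.0, 5.1],
              [9.0, 0.0]])  # the last point is isolated noise

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # dense regions become clusters; -1 = noise
```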
4 Grid-based methods first quantize the object space into a finite number of cells, forming a grid structure, and then perform clustering on the grid structure.
STING (STatistical INformation Grid) is a grid-based clustering method that uses statistical information stored in the grid cells.
CLIQUE (Clustering In QUEst) and Wave-Cluster combine grid-based and density-based approaches.
5 Model-based methods assume a model for each cluster and find the data that best fit that model. Typical model-based methods include:
The statistical method COBWEB: a common and simple incremental conceptual clustering method. Its input objects are described by categorical (attribute-value) pairs, and it creates a hierarchical clustering in the form of a classification tree.
CLASSIT is another version of COBWEB that supports incremental clustering of continuous-valued attributes. It stores a continuous normal distribution (mean and variance) for each attribute in each node, and uses a modified category-utility measure that integrates over continuous attributes instead of summing over discrete attribute values as COBWEB does. CLASSIT suffers from problems similar to COBWEB's, however, so neither is suitable for clustering large databases.
Traditional clustering algorithms have successfully solved the problem of clustering low-dimensional data. However, because of the complexity of data in practical applications, existing algorithms often fail on many problems, especially those involving high-dimensional or large-scale data. Traditional clustering methods face two main problems in high-dimensional data sets: the presence of many irrelevant attributes makes the probability that clusters exist in all dimensions almost zero; and data in high-dimensional space are sparse in any lower-dimensional subspace, with nearly equal distances between points, whereas traditional clustering methods are distance-based, so clusters cannot be constructed from distances in high-dimensional space.
High-dimensional cluster analysis has therefore become an important research direction of cluster analysis, and high-dimensional data clustering remains a difficult point of clustering technology. As technology advances, data collection becomes ever easier, leading to ever larger databases, such as trade transaction data, Web documents, and gene expression data, whose dimensionality (number of attributes) can reach hundreds or thousands, or even higher. However, because of the "curse of dimensionality", many clustering methods that perform well in low-dimensional spaces often fail to obtain good results when applied to high-dimensional spaces. Cluster analysis of high-dimensional data is a very active and challenging field, with wide applications in market analysis, information security, finance, entertainment, and counter-terrorism.
