What Is Correlation Clustering?

Clustering is the process of dividing a collection of physical or abstract objects into classes composed of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. As the saying goes, "birds of a feather flock together": classification problems are everywhere in the natural and social sciences. Cluster analysis, also known as group analysis, is a statistical method for studying classification problems over samples or indicators. Cluster analysis originated in taxonomy, but clustering is not the same as classification: the classes to be formed by clustering are not known in advance. The field is rich in techniques, including hierarchical (systematic) clustering, ordered-sample clustering, dynamic clustering, fuzzy clustering, graph-theoretic clustering, cluster forecasting, and more.

Clustering is also an important concept in data mining.

Typical applications of clustering

What are the typical applications of clustering? In business, clustering helps market analysts discover distinct customer groups in a customer database and characterize each group by its purchasing patterns. In biology, clustering can be used to derive plant and animal taxonomies, to categorize genes, and to gain insight into the inherent structure of populations. Clustering also plays a role in identifying similar regions in Earth-observation databases, grouping automobile insurance policyholders, and grouping houses in a city by type, value, and geographic location. It can likewise be used to classify documents on the Web for information discovery.

Typical requirements for clustering

Scalability: Many clustering algorithms work well on small data sets containing fewer than a few hundred objects; a large database, however, may contain millions of objects, and clustering on a sample of such a data set may lead to biased results. Highly scalable clustering algorithms are therefore needed.
Ability to handle different types of data: Many algorithms are designed to cluster numerical data. Applications, however, may require clustering other types of data, such as binary, categorical/nominal, ordinal, or mixtures of these types.
Discovery of clusters of arbitrary shape: Many clustering algorithms determine clusters using Euclidean or Manhattan distance measures. Algorithms based on such metrics tend to find spherical clusters of similar size and density, but a cluster may have any shape. Algorithms that can discover clusters of arbitrary shape are therefore important.
Minimal domain knowledge needed to determine input parameters: Many clustering algorithms require the user to supply parameters, such as the desired number of clusters, and the results are often very sensitive to them. Such parameters are hard to determine, especially for data sets of high-dimensional objects; this burdens users and makes clustering quality difficult to control.
Ability to handle "noisy" data: Most real-world databases contain outliers and missing or erroneous data. Some clustering algorithms are sensitive to such data and may produce low-quality results.
Insensitivity to the order of input records: Some clustering algorithms are sensitive to input order: the same data set, presented to the same algorithm in different orders, may produce very different clusterings. Developing algorithms that are insensitive to input order is of great significance.
High dimensionality: A database or data warehouse may contain many dimensions or attributes. Many clustering algorithms handle low-dimensional data well, perhaps involving only two or three dimensions, and the human eye can judge clustering quality in at most three dimensions. Clustering objects in high-dimensional space is very challenging, especially since such data may be very sparsely distributed and highly skewed.
Constraint-based clustering: Real-world applications may require clustering under various constraints. Suppose your job is to choose locations for a given number of ATMs in a city. To decide, you might cluster residential areas while taking into account constraints such as the city's rivers and highway network and the customer requirements of each region. Finding groups of data that both satisfy specific constraints and cluster well is a challenging task.
Interpretability and usability: Users want clustering results to be interpretable, understandable, and usable; that is, clustering may need to be tied to specific semantic interpretations and applications. How application goals influence the choice of clustering method is also an important research topic.
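The point above about distance measures can be made concrete. This minimal sketch (plain Python, with illustrative function names) compares the Euclidean and Manhattan distances; an algorithm built on either metric tends to favor clusters shaped like that metric's level sets.

```python
import math

def euclidean(p, q):
    """Straight-line distance: level sets are spheres (circles in 2-D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: level sets are axis-aligned diamonds."""
    return sum(abs(a - b) for a, b in zip(p, q))

d_e = euclidean((0, 0), (3, 4))  # 5.0
d_m = manhattan((0, 0), (3, 4))  # 7
```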

Clustering methods

The main traditional methods of cluster analysis are the following:
1. Partitioning methods
Given a data set of N records, a partitioning method constructs K groups (K < N), each representing a cluster, such that (1) each group contains at least one record, and (2) each record belongs to exactly one group (a requirement that can be relaxed in some fuzzy clustering algorithms). For a given K, the algorithm first produces an initial grouping and then improves it through repeated iterations, so that each new grouping is better than the previous one: records within the same group should be as close as possible, and records in different groups as far apart as possible. Algorithms based on this idea include the k-means, k-medoids, and CLARANS algorithms.
Most partitioning methods are distance-based. Given the number k of partitions to build, a partitioning method first creates an initial partition and then applies an iterative relocation technique, moving objects from one group to another. The general criterion of a good partition is that objects in the same cluster are as close or related as possible, while objects in different clusters are as far apart or different as possible; many other quality criteria exist. Traditional partitioning methods can be extended to subspace clustering, searching subspaces rather than the entire data space, which is useful when there are many attributes and the data is sparse. Achieving the global optimum in partition-based clustering would require exhaustively enumerating all possible partitions, which is computationally prohibitive. In practice, most applications adopt popular heuristics, such as the k-means and k-medoids algorithms, that iteratively improve clustering quality and converge to a local optimum. These heuristic methods are well suited to finding spherical clusters in small and medium-sized databases. Finding clusters of complex shape, or clustering very large data sets, requires further extensions of partitioning methods. [1]
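The iterative relocation idea can be sketched as a minimal k-means (Lloyd's algorithm) in plain Python; the data, seed, and iteration count below are illustrative, not part of any particular system:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: alternate an assignment step and a
    centroid-update step for a fixed number of iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k initial centers
    clusters = []
    for _ in range(iters):
        # Assignment: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            ci = min(range(k),
                     key=lambda c: sum((a - b) ** 2
                                       for a, b in zip(p, centers[c])))
            clusters[ci].append(p)
        # Update: move each center to the mean of its cluster.
        for ci, cl in enumerate(clusters):
            if cl:
                centers[ci] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

centers, clusters = kmeans([(0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)], 2)
```

On these two well-separated pairs of points, the relocation steps converge to the obvious two-cluster grouping regardless of the random initialization.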
2. Hierarchical methods
These methods perform a hierarchical decomposition of a given data set until some condition is met. They can proceed "bottom-up" (agglomerative) or "top-down" (divisive). In the bottom-up scheme, for example, each record initially forms its own group; in each subsequent iteration, adjacent groups are merged, until all records form a single group or some termination condition is met. Representative algorithms include BIRCH, CURE, and CHAMELEON.
Hierarchical clustering can be based on distance, density, or connectivity, and some extensions also consider subspace clustering. A drawback of the hierarchical approach is that once a step (a merge or a split) has been performed, it cannot be undone. This rigidity is useful in that it avoids considering a combinatorial explosion of alternative choices, keeping computational overhead low, but it means the technique cannot correct erroneous decisions. Several methods have been proposed to improve the quality of hierarchical clustering. [1]
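The bottom-up scheme can be sketched directly. The following is a naive O(n³) single-linkage merge for illustration only (the BIRCH and CURE algorithms mentioned above are far more scalable):

```python
def agglomerate(points, target_k):
    """Naive bottom-up (agglomerative) clustering with single linkage:
    start from singleton groups and repeatedly merge the closest pair
    of groups until only target_k groups remain."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

groups = agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

Note that each merge is final, which is exactly the "cannot be undone" limitation described above.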
3. Density-based methods
A fundamental difference between density-based methods and other methods is that they are based on density rather than on any of the various distances. This overcomes the limitation of distance-based algorithms, which can only find roughly spherical clusters. The guiding idea is that a region is added to a nearby cluster as long as the density of points in that region exceeds a threshold. Representative algorithms include DBSCAN, OPTICS, and DENCLUE.
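The density-threshold idea can be illustrated with a minimal DBSCAN-style sketch (the eps and min_pts values are illustrative; a real implementation would use a spatial index rather than scanning all points for each neighbourhood query):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: a point is a core point if at least min_pts
    points (itself included) lie within eps of it; clusters grow by
    expanding through core points, and points in no dense region get -1."""
    def neighbours(i):
        px = points[i]
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(px, q)) <= eps * eps]

    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # not dense enough: provisional noise
            continue
        labels[i] = cid
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbours(j)
            if len(jn) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(jn)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]                      # two dense blocks plus one outlier
labels = dbscan(pts, 1.5, 3)
```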
4. Grid-based methods
These methods first quantize the data space into a grid structure with a finite number of cells, and all processing is performed on the cells. An outstanding advantage of this approach is speed: processing time is typically independent of the number of records in the target database and depends only on the number of cells into which the data space is divided. Representative algorithms include STING, CLIQUE, and WaveCluster.
For many spatial data mining problems, using a grid is often an effective approach, so grid-based methods can be integrated with other clustering methods. [1]
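A toy version of the cell idea, in plain Python (the cell size and density threshold are illustrative choices):

```python
from collections import defaultdict

def grid_clusters(points, cell_size, density_threshold):
    """Grid sketch: quantise each point to a cell key, then keep only the
    cells holding at least density_threshold points. The filtering pass
    iterates over occupied cells, not over individual records."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(p)
    return {k: v for k, v in cells.items() if len(v) >= density_threshold}

dense = grid_clusters([(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (5.5, 5.5)],
                      cell_size=1.0, density_threshold=2)
```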
5. Model-based methods
Model-based methods hypothesize a model for each cluster and then search for data that fit the model well. Such a model may, for example, be a density distribution function of the data points in space. An underlying assumption is that the target data set is generated by a mixture of probability distributions. There are two main approaches: statistical methods and neural network methods.
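The statistical approach can be illustrated with a tiny two-component, one-dimensional Gaussian mixture fitted by expectation-maximization (EM); the initialization and data below are illustrative, and a real implementation would guard against degenerate variances more carefully:

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture: each cluster is a
    model (mean, variance, weight); points are assigned softly by the
    posterior probability of each component."""
    mu = [min(xs), max(xs)]       # crude but deterministic initialisation
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, var, w

mu, var, w = em_gmm_1d([0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05])
```

On this data the fitted means settle near 0 and 5, the centers of the two generating groups.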
There are, of course, other clustering methods as well: the transitive closure method, the Boolean matrix method, the direct clustering method, correlation-analysis clustering, and statistics-based clustering methods.

Clustering research

Traditional clustering has successfully solved the problem of clustering low-dimensional data. However, because of the complexity of data in practical applications, existing algorithms often fail on many problems, especially those involving high-dimensional or large-scale data. Traditional clustering methods face two main problems in high-dimensional data sets. First, the presence of a large number of irrelevant attributes makes it almost impossible for clusters to exist across all dimensions. Second, data in high-dimensional space are sparse and pairwise distances become nearly equal, whereas traditional clustering methods are distance-based, so clusters cannot be constructed from distances in high-dimensional space.
High-dimensional cluster analysis has therefore become an important research direction in cluster analysis, and it is also a difficult point of clustering technology. As technology advances, data collection has become ever easier, producing larger and larger databases, such as trade transaction data, Web documents, and gene expression data, whose dimensionality (number of attributes) can reach hundreds or thousands, or even higher. However, affected by the "curse of dimensionality," many clustering methods that perform well in low-dimensional spaces often fail to produce good results when applied to high-dimensional spaces. Cluster analysis of high-dimensional data is thus a very active and challenging field, with wide applications in market analysis, information security, finance, entertainment, and counter-terrorism.
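The "nearly equal distances" effect is easy to demonstrate empirically. This sketch (plain Python; the point count, dimensions, and seed are arbitrary choices) measures the spread of pairwise distances relative to their mean for uniformly random points:

```python
import math
import random

def relative_spread(dim, n_points=100, seed=0):
    """(max - min) of all pairwise distances, divided by their mean, for
    points drawn uniformly from the unit hypercube. As dim grows, the
    distances concentrate around their mean and this ratio shrinks."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / (sum(dists) / len(dists))

low_d, high_d = relative_spread(2), relative_spread(500)
```

The ratio drops sharply from 2 dimensions to 500, which is why "nearest" and "farthest" lose their discriminating power for distance-based clustering in high-dimensional space.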
