What Are Data Mining Algorithms?
Data mining algorithms are a set of heuristics and calculations that create a data mining model based on the data. To create the model, the algorithm will first analyze the data you provide and look for specific types of patterns and trends. [1]
Data mining algorithms
- The algorithm uses the results of this analysis to define the optimal parameters for creating a mining model. These parameters are then applied across the entire data set to extract patterns and detailed statistics.
- 1: C4.5
- C4.5 is a decision tree algorithm (a decision tree is a tree-like structure of decision nodes, essentially an inverted tree). It is an improvement on the core ID3 algorithm, so anyone who understands the basics of how decision trees are built can implement it. Building a decision tree essentially means selecting, at each node, a good feature and split point to serve as that node's classification condition. C4.5 improves on ID3 in the following ways:
  - ID3 selects attributes by information gain (information can be defined in several ways; ID3 uses entropy, an impurity measure), i.e. the change in entropy, while C4.5 uses the information gain ratio. The extra "ratio" serves to normalize, much as a rate does. For example, take two runners: one starts at 100 m/s and reaches 110 m/s after 1 s; the other starts at 1 m/s and reaches 11 m/s after 1 s. Measured by acceleration (speed gained per unit time), the two are identical; measured by the relative increase in speed, the gap between them is huge. Using the ratio overcomes information gain's bias toward attributes with many distinct values.
  - Pruning during tree construction: nodes that cover only a handful of samples are best not kept, otherwise they easily lead to overfitting.
  - Handling non-discrete data: candidate split points of a continuous attribute are scanned one by one to decide where to split, effectively converting continuous values into discrete ones.
  - Handling incomplete data: this sounds important but is less critical in practice, since missing values can be filled in by other methods. [1]
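As a rough illustration of the gain-ratio criterion described above, here is a minimal Python sketch (not from the original article; the toy attribute and labels are made up) that computes entropy, information gain, and gain ratio for one discrete attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (the impurity measure ID3 uses)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """Information gain (ID3's criterion) and gain ratio (C4.5's criterion)
    for splitting `labels` by the discrete attribute `attr_values`."""
    n = len(labels)
    base = entropy(labels)
    # entropy after the split, weighted by branch size
    split_entropy = 0.0
    for v in set(attr_values):
        branch = [lab for a, lab in zip(attr_values, labels) if a == v]
        split_entropy += len(branch) / n * entropy(branch)
    gain = base - split_entropy
    # the "split information" penalizes attributes with many distinct values
    split_info = entropy(attr_values)
    return gain, gain / split_info if split_info > 0 else 0.0

# toy example: one attribute with two values and binary labels
attr = ["sunny", "sunny", "rain", "rain", "rain"]
y    = ["no",    "no",    "yes",  "yes",  "no"]
print(gain_ratio(attr, y))
```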
- 2: CART
- CART is also a decision tree algorithm. Unlike the approach above, where one node can branch into multiple subtrees (one per attribute value), CART always splits a node into exactly two subtrees, which makes it slightly easier to implement. The decision tree generated by the CART algorithm is therefore a simple binary tree.
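For reference, scikit-learn's DecisionTreeClassifier is an optimized CART-style implementation (binary splits only); a minimal usage sketch on a standard toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# CART grows a binary tree: every internal node has exactly two children
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict(X[:5]))
```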
- 3: KNN (K Nearest Neighbours)
- This one is very simple: look at the K individuals (samples) nearest to you and see which category the majority of them belong to; you are then assigned that category. To implement it, compute the similarity between the query and every training sample, take the top-K most similar training samples, and return the category that appears most often among those K.
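A minimal sketch of the idea, assuming Euclidean distance as the "similarity" and a made-up toy training set:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Vote among the k training samples closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the top-k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> 1
```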
- 4: Naive Bayes
- (Naive Bayes NB)
- NB assumes that the features are independent of one another. So for a sample (a set of feature values, e.g. "data structure" appears twice and "file" appears once), the probability of all its features under a given class can simply be multiplied together. For example, if the probability of "data structure" appearing in class 1 is 0.5 and the probability of "file" appearing in class 1 is 0.3, then the likelihood of the sample under class 1 is 0.5 * 0.5 * 0.3 (multiplying by the class prior then gives a score proportional to the posterior).
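A tiny sketch of that multiplication, reusing the numbers from the example above; the class-2 probabilities and the class priors are made-up values added purely for illustration:

```python
# P(word | class) tables; class 1 mirrors the example above, class 2 is invented
p_word_given_class = {
    1: {"data structure": 0.5, "file": 0.3},
    2: {"data structure": 0.2, "file": 0.4},
}
priors = {1: 0.6, 2: 0.4}  # assumed class priors, not in the original example

def nb_score(counts, cls):
    """Multiply P(word | class) once per occurrence, times the class prior."""
    score = priors[cls]
    for word, count in counts.items():
        score *= p_word_given_class[cls][word] ** count
    return score

sample = {"data structure": 2, "file": 1}  # "data structure" twice, "file" once
print({cls: nb_score(sample, cls) for cls in priors})
```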
- 5: Support Vector Machine
- (Support Vector Machine SVM)
- SVM looks for the classification line / classification surface that separates the classes best. I have not worked through the implementation in detail; the teacher said last time that he had implemented SVM himself, and I admire that research spirit. Common toolkits are LibSVM, SVMLight, and MySVM.
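As a usage sketch, scikit-learn's SVC is built on LibSVM, one of the toolkits mentioned above; the blob data is synthetic:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0)   # find the maximum-margin separating line
clf.fit(X, y)
print(clf.support_vectors_.shape)   # the support vectors define that margin
```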
- 6: EM (Expectation Maximization)
- As I understand it, this is typically used under the assumption that the data is generated by several Gaussian distributions, so the goal in the end is to estimate the parameters of those Gaussians. You start by assuming some initial values and then iterate repeatedly, alternating expectation and maximization steps, until the best fit is reached.
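A minimal sketch of that idea using scikit-learn's GaussianMixture, which fits the mixture parameters with EM; the two-blob data is made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # one Gaussian blob around 0
               rng.normal(5, 1, (100, 2))])  # another around 5
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)    # estimated component means
print(gmm.weights_)  # estimated mixing weights
```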
- 7: Apriori
- This one is used for association rules. I am not sure why, but as soon as association rules come up I think of shopping-basket data. I have not implemented this, but the essentials are clear: it works with two quantities, support and confidence, and Apriori uses properties of frequent itemsets (every subset of a frequent itemset must itself be frequent, and so on) to prune candidates and reduce the computational cost. [1]
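A toy sketch (not from the article) of support, confidence, and the subset-pruning idea on a few made-up shopping baskets:

```python
from itertools import combinations

baskets = [{"milk", "bread"}, {"milk", "beer"}, {"milk", "bread", "beer"}, {"bread"}]

def support(itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

# Apriori's rule: an itemset can only be frequent if all its subsets are frequent,
# so larger candidates are built only from smaller itemsets already known frequent.
min_support = 0.5
items = sorted({i for b in baskets for i in b})
frequent_1 = [frozenset({i}) for i in items if support(frozenset({i})) >= min_support]
candidates_2 = [a | b for a, b in combinations(frequent_1, 2)]
frequent_2 = [c for c in candidates_2 if support(c) >= min_support]
print(frequent_2, confidence(frozenset({"milk"}), frozenset({"bread"})))
```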
- 8: FP-Tree
- (Mining frequent patterns without candidate generation)
- I am not very clear on this one. The FP-growth algorithm (Frequent Pattern growth) uses a compact data structure to store all the information needed to find frequent itemsets. The approach: compress the database into an FP-tree that retains the itemset association information, then divide the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item, and mine each conditional database separately.
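A compressed sketch of the first phase only (building the FP-tree from made-up transactions); mining the conditional databases is omitted:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Count items, drop infrequent ones, then insert each transaction
    (items ordered by frequency) into a compact prefix tree."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: -counts[i])
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

tree = build_fp_tree([["a", "b"], ["b", "c"], ["a", "b", "c"]], min_support=2)
print({item: child.count for item, child in tree.children.items()})
```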
- 9: PageRank
- Everyone knows the famous PageRank (Google made a fortune on this patent, although it cannot be said the fortune came from this alone!). My understanding of the algorithm: if I point to you (a link between web pages), it means I endorse you, so when your importance is computed you can add a share of mine (how large a share depends on how many pages I endorse in total). By repeating this process, a stable value can be found that measures the importance of each person (web page). However, some restrictions must be imposed (for example, each page's importance starts at a default value of 1), otherwise the values would just keep growing larger and larger.
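A small power-iteration sketch of that intuition on a made-up four-page graph; the damping factor is a standard addition that is not discussed in the text above:

```python
import numpy as np

# links[i] = pages that page i links to (i.e. endorses)
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)
damping = 0.85

rank = np.ones(n)  # every page starts with a default importance of 1
for _ in range(50):
    new_rank = np.full(n, 1 - damping)
    for page, outlinks in links.items():
        for target in outlinks:
            # each page passes a share of its importance to the pages it endorses
            new_rank[target] += damping * rank[page] / len(outlinks)
    rank = new_rank
print(rank / rank.sum())  # normalized importance scores
```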
- 10: HITS
- HITS is also a link analysis algorithm, first proposed at IBM. In HITS, each node (web page) has both a hub score and an authority score (Hubs and Authorities; I have forgotten exactly how these are translated). The two are computed from each other by repeated propagation: a page's authority score is aggregated from the hub scores of the pages linking to it, and its hub score is aggregated from the authority scores of the pages it links to.
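A minimal sketch of that mutual update on a made-up three-page graph:

```python
import numpy as np

# A[i, j] = 1 if page i links to page j
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

hubs = np.ones(3)
auths = np.ones(3)
for _ in range(50):
    auths = A.T @ hubs    # authority = sum of hub scores of pages linking in
    hubs = A @ auths      # hub = sum of authority scores of pages linked to
    auths /= auths.sum()  # normalize so the scores stay bounded
    hubs /= hubs.sum()
print("hubs:", hubs, "authorities:", auths)
```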
- 11: K-Means
- K-Means is one of the most classic and widely used clustering methods, and many improved models have been built on top of it. The idea of K-Means is very simple. For a clustering task you must specify how many clusters to form (intuitively you would rather not have to specify this; avoiding it is still a research topic in clustering). First randomly select K cluster centers, then repeat the following process until none of the cluster centers change (the cluster assignments no longer change): Step 1: For each object, compute its similarity to every cluster center and assign it to the cluster whose center it is most similar to.
- Step 2: Update the cluster center. The new cluster center is obtained by calculating the average of all objects belonging to the cluster.
- The working process of the k-means algorithm can be described as follows: first, arbitrarily choose k objects from the n data objects as initial cluster centers; assign each remaining object to the cluster (represented by its center) that it is most similar to, according to its similarity (distance) to the centers; then recompute the center of each new cluster as the mean of all objects in that cluster; repeat this process until the criterion function converges. The mean squared error is usually used as the criterion function. The resulting k clusters have the following property: each cluster is as compact as possible, and the clusters are separated from one another as much as possible.
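A minimal NumPy sketch of the two-step loop described above, on made-up two-blob data:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # step 1: assign every object to its most similar (closest) center
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # step 2: new center = mean of all objects assigned to the cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stop when the centers no longer move
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print(kmeans(X, k=2)[0])
```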
- 12: BIRCH
- BIRCH is also a clustering algorithm; its full name is Balanced Iterative Reducing and Clustering using Hierarchies. I have only read the theory and have not implemented it. It combines the concepts of the clustering feature (CF) and the clustering feature tree (CF Tree), which are used to summarize a clustering in compact form. The CF Tree summarizes the useful information about the clusters and takes up far less space than the raw data set, so it can be kept in memory, which improves the algorithm's clustering speed and scalability on large data sets.
- The BIRCH algorithm includes the following two phases:
- 1) Scan the database and build a dynamic CF Tree held in memory. If memory is insufficient, increase the threshold and construct a smaller tree based on the existing one.
- 2) A global clustering algorithm is further applied to the leaf nodes to improve the clustering quality.
- Because the clusters represented by the leaf nodes of the CF Tree are not necessarily the natural clustering result (the given threshold limits the cluster size, and the input order of the data also affects the result), a global clustering algorithm is further applied to the leaf nodes to improve the clustering quality.
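A short usage sketch with scikit-learn's Birch, whose threshold and global clustering step correspond to the two phases above; the toy data is made up:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
# threshold limits the radius of the subclusters kept in the CF Tree;
# the final global step groups the leaf entries into n_clusters clusters
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), labels[:10])
```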
- 13: AdaBoost
- AdaBoost is, as most people know, a boosting method. Strictly speaking it is not a single algorithm but a meta-method, because it can be built on top of any kind of classification algorithm: a decision tree, NB, SVM, and so on.
- AdaBoost is an iterative algorithm. The core idea is to train a sequence of different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (the strong classifier). The algorithm works by changing the data distribution: the weight of each sample is adjusted according to whether it was classified correctly in the previous round and according to the accuracy of the previous overall classifier. The re-weighted data set is then handed to the next weak classifier for training, and finally the classifiers obtained in each round are fused into the final decision classifier. Using AdaBoost can reduce the influence of unimportant training data and focus the effort on the key training samples.
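A short usage sketch with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree (a "stump"); the data set is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
# each round reweights the samples that the previous stumps got wrong,
# then the stumps are combined into the final strong classifier
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```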
- 14: GSP
- GSP, short for Generalized Sequential Pattern, is a sequence mining algorithm. I have not looked at sequence mining carefully; it appears to build on association rules. Here is how it is described online:
- GSP is similar to the Apriori algorithm. It uses a redundant-candidate pruning strategy and a special data structure, a hash tree, to achieve fast access to the candidate patterns.
- GSP algorithm description:
- 1) Scan the sequence database to obtain the length-1 sequential patterns L1, which serve as the initial seed set.
- 2) From the seed set Li of length-i patterns, generate the candidate sequential patterns Ci+1 of length i+1 through a join operation and a pruning operation; then scan the sequence database to compute the support of each candidate, yielding the length-(i+1) sequential patterns Li+1, which become the new seed set.
- 3) Repeat the second step until no new sequential pattern or new candidate sequential pattern is generated.
- There are two main steps in generating candidate sequence patterns:
- Join phase: if the sequence obtained by removing the first item of sequential pattern s1 is the same as the sequence obtained by removing the last item of sequential pattern s2, then s1 and s2 can be joined, i.e. the last item of s2 is appended to s1.
- Pruning phase: if any subsequence of a candidate sequential pattern is not itself a sequential pattern, the candidate cannot be a sequential pattern and is deleted from the candidate set.
- Candidate support counting: for the given set of candidate sequential patterns C, scan the sequence database and, for each sequence s, find all candidates in C that are contained in s and increment their support counts.
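A toy sketch (not from the article) of the support counting and subsequence pruning described above, treating each sequence as a flat tuple of items and ignoring the hash tree:

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (order kept, gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, database):
    return sum(contains(seq, pattern) for seq in database)

def prune(candidates, previous_patterns):
    """Drop a candidate if any of its (k-1)-subsequences is not a known pattern."""
    kept = []
    for cand in candidates:
        subs = [cand[:i] + cand[i + 1:] for i in range(len(cand))]
        if all(s in previous_patterns for s in subs):
            kept.append(cand)
    return kept

database = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d")]
l2 = {("a", "b"), ("a", "c"), ("b", "c")}        # length-2 seed set
candidates_3 = prune([("a", "b", "c")], l2)      # candidate of length 3
print([(c, support(c, database)) for c in candidates_3])
```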
- 15: PrefixSpan
- PrefixSpan is another sequence mining algorithm. Unlike GSP, it does not generate candidates Apriori-style; instead it grows patterns by recursively projecting the database onto each frequent prefix.
- Among the top ten classic algorithms are: C4.5, K-Means, SVM, Apriori, EM, PageRank, AdaBoost, KNN, NB, and CART. [1]