Feature Selection
Feature selection, also referred to as feature subset selection (FSS) or attribute selection, is the process of selecting N features from an existing set of M features so as to optimize specified criteria of the system. It selects some of the most effective features from the original feature set in order to reduce the dimensionality of the dataset and improve the performance of the learning algorithm, and it is a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are the key to training a good model. [1]
- In general, feature selection can be viewed as a search optimization problem: for a feature set of size n, the search space consists of 2^n − 1 possible states. Davies et al. proved that finding the smallest feature subset is an NP problem, that is, no method other than exhaustive search can guarantee the optimal solution. In practical applications, however, when the number of features is large, exhaustive search cannot be applied because the amount of computation is too great, so researchers have turned to heuristic search algorithms to find suboptimal solutions. A general feature selection algorithm must determine the following four elements: 1) the starting point and direction of the search; 2) the search strategy; 3) the feature evaluation function; and 4) the stopping criterion. [2]
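To make the size of this search space concrete, the sketch below enumerates all 2^n − 1 non-empty subsets of a small feature set and keeps the one with the highest score. The evaluation function here is an invented toy (per-feature relevance weights minus a size penalty), not from the original text; real evaluation functions are discussed later in this article.

```python
from itertools import combinations

def exhaustive_search(features, evaluate):
    """Enumerate all 2^n - 1 non-empty feature subsets and keep the best."""
    best_subset, best_score = None, float("-inf")
    n = len(features)
    for k in range(1, n + 1):
        for subset in combinations(features, k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Toy evaluation function (assumption, for illustration only):
# reward individually relevant features, penalize subset size.
def toy_evaluate(subset):
    relevance = {"f1": 0.9, "f2": 0.1, "f3": 0.8, "f4": 0.2}
    return sum(relevance[f] for f in subset) - 0.3 * len(subset)

# With n = 4 features, 2**4 - 1 = 15 subsets are evaluated.
best, score = exhaustive_search(["f1", "f2", "f3", "f4"], toy_evaluate)
print(best, score)
```

Even for this toy problem the cost doubles with every added feature, which is why the exhaustive strategy breaks down for high-dimensional data.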
- So far, many scholars have defined feature selection from different perspectives. Kira et al. define ideal feature selection as finding the smallest subset of features necessary to identify the target. John et al. define it from the perspective of prediction accuracy: feature selection is a process that increases classification accuracy, or reduces the dimensionality of the features without reducing classification accuracy. Koller et al. define it from the perspective of probability distributions: the selected subset should keep the class distribution as close as possible to that obtained with the full feature set.
- (1) Generation Procedure
- The generation procedure searches the space of feature subsets and supplies candidate subsets to the evaluation function.
- (2) Evaluation Function
- The evaluation function is a criterion for evaluating the quality of a feature subset.
- (3) Stopping Criterion
- The stopping criterion is related to the evaluation function, and is generally a threshold. When the evaluation function value reaches this threshold, the search can be stopped.
- (4) Validation Procedure
- The validity of the selected feature subset is verified on a validation dataset.
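The four components above can be wired together in a single loop. The following minimal sketch (data, evaluation function, and threshold are all invented for illustration) uses greedy forward selection as the generation procedure, a user-supplied evaluation function, and a score threshold as the stopping criterion; the returned subset would then be checked by the validation procedure on held-out data.

```python
def select_features(features, evaluate, threshold):
    """Generic feature-selection loop:
    generation procedure  = greedy forward selection,
    evaluation function   = user-supplied evaluate(),
    stopping criterion    = score threshold (or no further improvement)."""
    selected, score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Generation procedure: propose adding each remaining feature.
        candidates = [(evaluate(selected + [f]), f) for f in remaining]
        best_score, best_f = max(candidates)
        if best_score <= score:          # no improvement: stop searching
            break
        selected.append(best_f)
        remaining.remove(best_f)
        score = best_score
        if score >= threshold:           # stopping criterion reached
            break
    return selected, score

# Toy per-feature relevance and evaluation function (assumptions).
relevance = {"a": 0.7, "b": 0.5, "c": 0.1}
evaluate = lambda subset: sum(relevance[f] for f in subset) - 0.2 * len(subset)

subset, final = select_features(["a", "b", "c"], evaluate, threshold=0.75)
print(subset, final)
```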
- Basic search strategies can be divided into the following three types according to how feature subsets are formed: global optimal search, random search, and heuristic search. A specific search algorithm may combine two or more basic strategies; for example, the genetic algorithm is both a random search algorithm and a heuristic search algorithm. The three basic search strategies are analyzed and compared below.
- 1. Feature selection using the global optimal search strategy
- So far, the only search method guaranteed to obtain the optimal result is branch and bound. Given that the number of features in the target subset is fixed in advance, this algorithm finds the subset that is optimal with respect to the designed separability criterion. Its search space is O(2^N) (where N is the feature dimension). Existing problems: it is difficult to determine the size of the optimal feature subset in advance; a separability criterion satisfying monotonicity is difficult to design; and when dealing with high-dimensional multi-class problems, the time complexity of the algorithm is high. Therefore, although the global optimal search strategy can obtain the optimal solution, these factors prevent its wide application. [3]
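The pruning idea behind branch and bound can be sketched as follows. The criterion here is a made-up toy (a sum of per-feature weights), chosen only because it satisfies the monotonicity requirement the text mentions: removing a feature can never increase its value, so any branch whose score already falls below the best complete subset found can be cut off safely.

```python
def branch_and_bound(features, criterion, d):
    """Find the best subset of exactly d features under a monotone
    criterion: criterion(S) <= criterion(T) whenever S is a subset of T.
    Start from the full set and remove features one at a time."""
    best = {"subset": None, "score": float("-inf")}

    def search(current, start):
        score = criterion(current)
        if score <= best["score"]:       # bound: monotonicity guarantees
            return                       # no descendant can do better
        if len(current) == d:
            best["subset"], best["score"] = tuple(current), score
            return
        # Branch: remove one more feature (removal positions kept in
        # non-decreasing order so no subset is visited twice).
        for i in range(start, len(current)):
            search(current[:i] + current[i + 1:], i)

    search(list(features), 0)
    return best["subset"], best["score"]

# Toy monotone criterion (assumption): a sum of per-feature weights,
# which can only decrease as features are removed.
weights = {"f1": 0.4, "f2": 0.1, "f3": 0.3, "f4": 0.2}
crit = lambda subset: sum(weights[f] for f in subset)

subset, score = branch_and_bound(["f1", "f2", "f3", "f4"], crit, d=2)
print(subset, score)
```

Note that the guarantee rests entirely on monotonicity; with a non-monotone criterion the pruning step could discard the true optimum, which is exactly the design difficulty raised above.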
- Depending on whether it is independent of the subsequent learning algorithm, feature selection methods can be divided into two types: filter and wrapper. A filter method has nothing to do with the subsequent learning algorithm; it generally evaluates features directly from the statistical properties of the training data, which is fast, but the evaluation may deviate considerably from the performance of the subsequent learning algorithm. A wrapper method uses the training accuracy of the subsequent learning algorithm to evaluate feature subsets; the deviation is small, but the amount of computation is large, making it unsuitable for large datasets. The filter and wrapper methods are analyzed below. [3]
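The contrast between the two families can be shown on a tiny invented dataset (the data, the class-mean-gap statistic, and the 1-nearest-neighbour learner below are all assumptions for illustration): the filter score looks only at a statistic of the data, while the wrapper score actually runs a learning algorithm on each candidate subset.

```python
import math

# Toy dataset (assumption): rows are samples, columns are 3 features;
# feature 0 separates the classes, features 1 and 2 are noise.
X = [[0.1, 5.0, 1.0], [0.2, 4.0, 2.0], [0.3, 5.5, 1.5],
     [0.9, 5.2, 1.2], [1.0, 4.1, 1.9], [1.1, 5.4, 1.4]]
y = [0, 0, 0, 1, 1, 1]

def filter_score(j):
    """Filter: score feature j by the gap between class means --
    a statistic of the training data, no learning algorithm involved."""
    m0 = sum(X[i][j] for i in range(len(X)) if y[i] == 0) / 3
    m1 = sum(X[i][j] for i in range(len(X)) if y[i] == 1) / 3
    return abs(m1 - m0)

def wrapper_score(subset):
    """Wrapper: leave-one-out accuracy of a 1-nearest-neighbour
    classifier restricted to the features in `subset`."""
    correct = 0
    for i in range(len(X)):
        dist = lambda a, b: math.dist([a[j] for j in subset],
                                      [b[j] for j in subset])
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: dist(X[i], X[k]))
        correct += (y[nearest] == y[i])
    return correct / len(X)

ranked = sorted(range(3), key=filter_score, reverse=True)
print("filter ranking of features:", ranked)
print("wrapper LOO accuracy using feature 0:", wrapper_score([0]))
```

The trade-off described above is visible in the structure: `filter_score` is O(data) per feature, while `wrapper_score` retrains/re-evaluates the learner once per sample per candidate subset.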
- Since the 1990s, feature selection has been widely studied and applied in Web document processing (text classification, text retrieval, text recovery, etc.), genetic analysis, drug diagnosis, and other fields. Today's society is one of information explosion, and increasingly diverse forms of data appear before us, such as genetic data and data streams. Designing better feature selection algorithms to meet these needs is a long-term task, and research on feature selection algorithms will remain one of the hot topics in machine learning and related fields. Current research hotspots and trends mainly focus on the following areas:
- 1) Combined study of feature and sample selection. Different sample collection regions may call for different feature selection algorithms, and many datasets have naturally segmented features. For example, in the semi-supervised classification of web pages, the feature set describing a web page (generally a vocabulary set) can usually be divided into two independent subsets: the words appearing in the page's text content, and the words appearing in the page's hyperlinks. For these two feature subsets, can different feature selection methods be used to reduce dimensionality and achieve better learning results? [4]
- 2) Recently, features and their relevance to targets (classification, regression, clustering, etc.) have received more and more attention; this problem can be called the all-relevant problem. For example, in gene expression analysis, the goal is to find all features related to the target variable, since these features may determine whether the biological state is healthy or diseased. Currently the ranking method is commonly used, but ranking typically considers only the correlation between each feature and the label, not the correlation among the features themselves. How to take the correlation between features into account is one of the focuses and difficulties of current research. [4]
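The blind spot of pure ranking can be demonstrated with a small invented example (the data and the use of Pearson correlation as the ranking score are assumptions): two nearly identical features both correlate strongly with the target, so ranking selects both, even though the second one adds almost no new information.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

# Toy data (assumption): f1 and f2 are near-duplicates (redundant),
# f3 is weaker but carries independent information.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
f1 = [1.1, 2.0, 2.9, 4.2, 5.0]
f2 = [1.0, 2.1, 3.0, 4.1, 4.9]   # almost a copy of f1
f3 = [2.0, 1.0, 3.5, 3.0, 4.0]

# Ranking: score each feature only by its correlation with the target.
scores = {name: abs(pearson(f, target))
          for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
print("top-2 by ranking:", top2)              # both redundant copies win
print("corr(f1, f2) =", round(pearson(f1, f2), 3))
```

A method that also accounted for the high correlation between `f1` and `f2` could swap one of them for `f3` and cover more of the target's variation with the same budget of two features.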