What Is Regression Analysis?
In statistics, regression analysis refers to a statistical method for determining the quantitative relationship between two or more variables. By the number of variables involved, regression analysis is divided into univariate and multivariate regression analysis; by the number of dependent variables, it can be divided into simple and multiple regression analysis; and by the type of relationship between the independent and dependent variables, it can be divided into linear and nonlinear regression analysis. [1]
In big data analysis, regression analysis is a predictive modeling technique that studies the relationship between the dependent variable (target) and the independent variables (predictors). It is commonly used in predictive analysis and time-series modeling, and for finding causal relationships between variables. For example, the relationship between a driver's reckless driving and the number of road accidents is best studied through regression.
Methods
Various regression techniques are used for prediction. They are mainly distinguished by three measures: the number of independent variables, the type of dependent variable, and the shape of the regression line. The main ones are described below.
1. Linear Regression
It is one of the most well-known modeling techniques and is usually among the first that people learn for predictive modeling. In this technique, the dependent variable is continuous, the independent variables can be continuous or discrete, and the regression line is linear.
Linear regression uses a best-fit straight line (i.e., the regression line) to establish a relationship between the dependent variable (Y) and one or more independent variables (X).
Multiple linear regression can be expressed as Y = a + b1*X1 + b2*X2 + ... + bn*Xn + e, where a is the intercept, the bi are the slopes of the respective predictors, and e is the error term. Multiple linear regression predicts the value of the target variable from the given predictor(s).
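As an illustration, here is a minimal sketch of fitting a simple least-squares line with NumPy; the data and variable names are hypothetical, not from the text:

```python
import numpy as np

# Hypothetical data: one predictor X, one continuous response Y
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

# Least-squares fit of Y = a + b*X: stack a column of ones for the intercept a
design = np.column_stack([np.ones_like(X), X])
a, b = np.linalg.lstsq(design, Y, rcond=None)[0]

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print("prediction at X = 6:", a + b * 6)
```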
2. Logistic Regression
Logistic regression is used to compute the probability of "Event = Success" versus "Event = Failure". It should be used when the dependent variable is binary (1/0, true/false, yes/no). Here, the value of Y is 0 or 1, and the model can be expressed by the following equations.
odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
In the formula above, p represents the probability of having the characteristic of interest. A natural question is: why use a logarithm in the formula?
Because the dependent variable follows a binomial distribution, you need to choose the link function best suited to that distribution: the logit function. In the equation above, the parameters are chosen by maximizing the likelihood of the observed sample, rather than by minimizing the sum of squared errors (as in ordinary regression).
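A minimal sketch of such a maximum-likelihood logistic fit, using scikit-learn (a library choice assumed here, not named in the text) on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary outcome: the event becomes more likely as x grows
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
p_true = 1 / (1 + np.exp(-(x.ravel() - 5)))   # true model: b0 = -5, b1 = 1
y = rng.binomial(1, p_true)

# The parameters are chosen by maximum likelihood, not by least squares
model = LogisticRegression().fit(x, y)
print("b0 =", model.intercept_[0], "b1 =", model.coef_[0][0])
print("P(event) at x = 7:", model.predict_proba([[7.0]])[0, 1])
```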
3. Polynomial Regression
For a regression equation, if the power of an independent variable is greater than 1, it is a polynomial regression equation, as in the following equation:
y = a + b * x^2
In this regression technique, the best-fit line is not a straight line but a curve fitted to the data points.
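A short sketch using NumPy's polynomial fitting (the data and coefficients are hypothetical):

```python
import numpy as np

# Synthetic data following a quadratic trend y = a + b*x^2, plus noise
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)

# Fit a degree-2 polynomial; np.polyfit returns coefficients, highest power first
coeffs = np.polyfit(x, y, deg=2)
print("fitted curve:\n", np.poly1d(coeffs))
```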
4. Stepwise Regression
This form of regression can be used when dealing with multiple independent variables. In this technique, the independent variables are selected by an automated process, without human intervention.
The idea is to identify important variables by observing statistical values such as R-squared, t-statistics, and the AIC. Stepwise regression fits the model by adding or removing covariates one at a time according to a specified criterion. Here are some of the most commonly used stepwise regression methods:
The standard stepwise regression method does two things: it adds and removes the predictors needed at each step.
Forward selection starts with the most significant predictor in the model and adds a variable at each step.
Backward elimination starts with all predictors in the model and removes the least significant variable at each step.
The aim of this modeling technique is to maximize predictive power with the smallest number of predictors. It is also one of the methods for handling high-dimensional data sets. [2]
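A minimal forward-selection sketch driven by AIC, using statsmodels (the library, data, and stopping rule are choices assumed here; packages such as SPSS implement their own criteria):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data: y truly depends on x1 and x3 only
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
y = 3 + 2 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=200)

selected, remaining, best_aic = [], list(X.columns), np.inf
while remaining:
    # Try adding each remaining predictor; keep the one that lowers AIC most
    trials = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic
              for c in remaining}
    candidate, aic = min(trials.items(), key=lambda kv: kv[1])
    if aic >= best_aic:
        break                       # no candidate improves the model; stop
    best_aic = aic
    selected.append(candidate)
    remaining.remove(candidate)

print("selected predictors:", selected)   # expect ["x1", "x3"]
```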
5. Ridge Regression
Ridge regression is used when the data suffer from multicollinearity (the independent variables are highly correlated). Under multicollinearity, although the ordinary least squares (OLS) estimates are unbiased, their variances are large, which pushes the observed values far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
In a linear equation, prediction error can be decomposed into two components: one due to bias and one due to variance. Prediction error can arise from either or both. Here, the error caused by variance is discussed.
Ridge regression solves the multicollinearity problem through a shrinkage parameter λ (lambda). Consider the following equation:
L2 = argmin ||y - Xβ||^2 + λ||β||^2
In this formula there are two components. The first is the least-squares term; the other is λ times ||β||^2, the squared L2 norm of β, where β is the vector of regression coefficients. This penalty is added to the least-squares term together with the shrinkage parameter λ to obtain a very low variance.
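A small sketch of ridge regression under multicollinearity, using scikit-learn (an assumed library choice; its `alpha` plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two highly correlated predictors (multicollinearity)
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=100)

# alpha is the shrinkage parameter lambda; larger alpha means more shrinkage
model = Ridge(alpha=1.0).fit(X, y)
print("ridge coefficients:", model.coef_)
```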
6. Lasso Regression
It is similar to ridge regression: Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the size of the regression coefficient vector. In addition, it can reduce variability and improve the accuracy of linear regression models. Take a look at the following formula:
L1 = argmin ||y - Xβ||^2 + λ||β||_1
Lasso regression differs from ridge regression in that it uses the L1 norm rather than the L2 norm in the penalty. Consequently, the penalty (equivalently, a constraint on the sum of the absolute values of the estimates) forces some parameter estimates to be exactly zero. The larger the penalty, the further the estimates are shrunk toward zero. This amounts to selecting variables from the given n variables.
If a set of predictors is highly correlated, Lasso picks only one of them and shrinks the others to zero.
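A small Lasso sketch on synthetic data with scikit-learn (an assumed library choice), showing coefficients driven exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Five predictors, but only the first one actually drives y
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] + rng.normal(size=100)

# The L1 penalty forces irrelevant coefficients exactly to zero
model = Lasso(alpha=0.5).fit(X, y)
print("lasso coefficients:", model.coef_)   # expect zeros after the first
```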
7. ElasticNet Regression
ElasticNet is a hybrid of the Lasso and ridge regression techniques: it is trained with both the L1 and L2 penalties as its regularizer. ElasticNet is useful when there are multiple correlated features: Lasso tends to pick one of them at random, while ElasticNet tends to pick both.
A practical advantage of trading off between Lasso and ridge is that it allows ElasticNet to inherit some of ridge's stability under rotation.
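A small ElasticNet sketch with scikit-learn (an assumed library choice); its `l1_ratio` parameter mixes the two penalties:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# A correlated pair of predictors: Lasso tends to keep one, ElasticNet both
rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)
X = np.column_stack([x1, x2, rng.normal(size=100)])
y = 2 * x1 + 2 * x2 + rng.normal(size=100)

# l1_ratio = 1.0 is pure Lasso, 0.0 is pure ridge; 0.5 blends L1 and L2
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("elastic net coefficients:", model.coef_)
```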
Data exploration is an inevitable part of building a predictive model and should be the first step in choosing the right model, for example by identifying the relationships and influence of the variables. To compare the suitability of different models, you can analyze different metrics, such as the statistical significance of the parameters, R-squared, adjusted R-squared, AIC, BIC, and the error term. Another is Mallows' Cp criterion, which checks for possible bias in your model by comparing it with all possible sub-models (or a careful selection of them).
Cross-validation is the best way to evaluate predictive models. Here, split your data set into two parts (one for training, one for validation) and use a simple mean squared error between the observed and predicted values to measure your prediction accuracy.
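A minimal hold-out validation sketch (scikit-learn assumed; the data and the 70/30 split are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Hold out part of the data, fit on the rest, score on the held-out part
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))
```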
If your data set has multiple confounding variables, you should not choose an automatic model-selection method, because you do not want to put all the variables into the same model at the same time.
It will also depend on your purpose. It can happen that a less powerful model is easier to implement than one that is highly statistically significant. Regression regularization methods (Lasso, ridge, and ElasticNet) work well when the data set is high-dimensional and the variables are multicollinear. [3]
Assumptions and content
In data analysis, some conditional assumptions are generally made on the data:
Homogeneity of variance
Linear relationship
Cumulative effect
Variables without measurement error
Variables obey multivariate normal distribution
Observation independence
The model is complete (it includes no variables that should be excluded and omits no variables that should be included)
The error terms are independent and follow a normal distribution with mean 0.
Real-life data often fail to satisfy all of the above assumptions. Statisticians have therefore developed many regression models to address the constraints imposed by the assumptions of linear regression.
The main contents of regression analysis are:
Starting from a set of data, determine the quantitative relationship between certain variables; that is, establish a mathematical model and estimate its unknown parameters. A common method for estimating the parameters is the least-squares method.
Test the credibility of these relationships.
Where many independent variables jointly affect one dependent variable, determine which independent variables have a significant effect and which do not. Significant independent variables are added to the model, while insignificant ones are eliminated, usually by stepwise, forward, or backward regression.
Use the obtained relationship to predict or control a certain production process. The applications of regression analysis are very extensive, and statistical software packages make the various regression calculations very convenient.
In regression analysis, variables are divided into two categories. One is the dependent variable, usually the indicator of interest in the practical problem, denoted Y; the other comprises the variables that affect the value of the dependent variable, called independent variables, denoted X.
The main problems of regression analysis research are:
(1) Determine the quantitative relationship expression between Y and X, this expression is called the regression equation;
(2) Test the credibility of the obtained regression equation;
(3) Determine whether the independent variable X has an effect on the dependent variable Y;
(4) Use the obtained regression equation for prediction and control. [4]
Application
Correlation analysis studies whether phenomena are related, and the direction and closeness of the relationship, generally without distinguishing independent from dependent variables. Regression analysis, by contrast, analyzes the specific form of correlation between phenomena, determines the causal relationship, and expresses the specific relationship with a mathematical model. For example, correlation analysis may show that the variables "quality" and "user satisfaction" are closely related, but which variable affects which, and by how much, must be determined using regression analysis. [1]
Generally speaking, regression analysis determines the causal relationship between variables by specifying which are dependent and which are independent, establishes a regression model, estimates the model's parameters from the measured data, and then evaluates whether the model fits the measured data well; if the fit is good, further predictions can be made from the independent variables.
For example, to study the causal relationship between quality and user satisfaction: in practical terms, product quality affects user satisfaction, so user satisfaction is set as the dependent variable, denoted Y, and quality as the independent variable, denoted X. The following linear relationship can usually be established: Y = A + B*X + ε
In the formula, A and B are the parameters to be determined: A is the intercept of the regression line, and B is its slope, representing the average change in Y when X changes by one unit; ε is a random error term.
For the empirical regression equation: y = 0.857 + 0.836x
The intercept of the regression line on the y-axis is 0.857 and the slope is 0.836; that is, for each one-point improvement in quality, user satisfaction increases by an average of 0.836 points. In other words, each one-point improvement in quality contributes 0.836 points to user satisfaction.
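As a quick check of the arithmetic, plugging hypothetical quality scores into the empirical equation above:

```python
# Empirical regression equation from the text: y = 0.857 + 0.836 * x
def predicted_satisfaction(quality_score: float) -> float:
    return 0.857 + 0.836 * quality_score

print(predicted_satisfaction(4.0))                                 # 4.201
print(predicted_satisfaction(5.0) - predicted_satisfaction(4.0))   # 0.836
```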
The example above is a simple linear regression with one independent variable. In data analysis, this extends to multiple regression with several independent variables; for the detailed procedure and interpretation, consult a statistics textbook. The SPSS output also reports R², the F-test value, and the t-test values.

R², also called the coefficient of determination, represents how much of the variation in Y is explained by X in the equation. R² lies between 0 and 1; the closer it is to 1, the stronger the ability of X to explain Y. R² is usually multiplied by 100% to express the percentage of the change in Y that the regression equation explains.

The F test, output in the analysis-of-variance table, tests whether the linear relationship of the regression equation is significant; in general, a significance level below 0.05 indicates significance. A passing F test means that at least one regression coefficient in the equation is significant, but not necessarily all of them, so the significance of each regression coefficient must be verified with a t test. Likewise, the t test can be judged by its significance level or by consulting a table. The meaning of each parameter in the example above is shown in the table below.
Linear regression equation test
| Index | Value | Significance level | Meaning |
| --- | --- | --- | --- |
| R² | 0.89 | | "Quality" explains 89% of the variation in "user satisfaction" |
| F | 276.82 | 0.001 | The linear relationship of the regression equation is significant |
| T | 16.64 | 0.001 | The coefficient of the regression equation is significant |
Linear regression analysis of sample SIM phone user satisfaction and related variables
The linear regression analysis of SIM mobile phone user satisfaction against related variables further illustrates the application of linear regression. From a practical point of view, mobile phone user satisfaction should be related to product quality, price, and image, so "user satisfaction" is taken as the dependent variable and "quality", "image", and "price" as the independent variables in the regression analysis. Running the regression in SPSS yields the following equation:
User satisfaction = 0.008 × image + 0.645 × quality + 0.221 × price
For SIM mobile phones, quality contributes most to user satisfaction: for every 1-point increase in quality, user satisfaction increases by 0.645 points. Next is price: for every 1-point increase in the user's evaluation of price, satisfaction increases by 0.221 points. Image contributes comparatively little: for every 1-point increase in image, user satisfaction increases by only 0.008 points.
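The text's analysis was run in SPSS; a rough statsmodels equivalent on synthetic data (the data below are made up to mimic the example) reports the same kinds of statistics, namely R², the F statistic, and per-coefficient t values:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical survey scores for image, quality, and price
rng = np.random.default_rng(7)
df = pd.DataFrame(rng.uniform(1, 10, size=(200, 3)),
                  columns=["image", "quality", "price"])
# Satisfaction built mainly from quality and price, as in the example
df["satisfaction"] = (0.645 * df["quality"] + 0.221 * df["price"]
                      + rng.normal(scale=0.5, size=200))

X = sm.add_constant(df[["image", "quality", "price"]])
result = sm.OLS(df["satisfaction"], X).fit()
print(result.summary())   # R-squared, F-statistic, t values, p-values
```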
The test indicators and meanings of the equation are as follows:
| Index | Value | Significance level | Meaning |
| --- | --- | --- | --- |
| R² | 0.89 | | The independent variables explain 89% of the variation in "user satisfaction" |
| F | 248.53 | 0.001 | The linear relationship of the regression equation is significant |
| T (image) | 0.00 | 1.000 | The "image" variable contributes little to the regression equation |
| T (quality) | 13.93 | 0.001 | "Quality" contributes significantly to the regression equation |
| T (price) | 5.00 | 0.001 | "Price" contributes significantly to the regression equation |
From the test indicators of the equation, "image" does not contribute much to the regression equation and should be removed. The regression of "user satisfaction" on "quality" and "price" is therefore redone, yielding: satisfaction = 0.645 × quality + 0.221 × price
For every 1-point increase in the user's evaluation of price, satisfaction increases by 0.221 points (in this example, because "image" contributes almost nothing to the equation, the resulting coefficients are similar to those of the previous regression equation).
The test indicators and meanings of the equation are as follows:
| Index | Value | Significance level | Meaning |
| --- | --- | --- | --- |
| R² | 0.89 | | "Quality" and "price" explain 89% of the variation in "user satisfaction" |
| F | 374.69 | 0.001 | The linear relationship of the regression equation is significant |
| T (quality) | 15.15 | 0.001 | "Quality" contributes significantly to the regression equation |
| T (price) | 5.06 | 0.001 | "Price" contributes significantly to the regression equation |
Steps
Determine variables
Once the specific target of the forecast is clear, the dependent variable is determined. If the forecast target is next year's sales volume, then sales volume Y is the dependent variable. Through market research and reference data, look for the relevant influencing factors, that is, the independent variables, and select the main influencing factors among them.
Building a predictive model
Based on historical statistics for the independent and dependent variables, calculate and establish a regression analysis equation, that is, a regression analysis prediction model.
Perform correlation analysis
Regression analysis is a mathematical and statistical treatment of influencing factors (independent variables) and prediction objects (dependent variables) that stand in a causal relationship. The regression equation is meaningful only if the independent variable and the dependent variable really are related. Therefore, whether the factors taken as independent variables are correlated with the prediction object taken as the dependent variable, how strong that correlation is, and how confidently this degree of correlation can be judged, are questions that regression analysis must answer. Correlation analysis is generally performed, using the size of the correlation coefficient to determine the correlation between the independent and dependent variables.
Calculating prediction errors
Whether the regression prediction model can be used for actual forecasting depends on tests of the model and the calculation of the prediction error. Only when the regression equation passes the various tests and the prediction error is small can it be used as a prediction model for forecasting.
Determine the predicted value
Use the regression prediction model to calculate the predicted value, then analyze the predicted value comprehensively to determine the final forecast.
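A compact sketch tying these steps together on synthetic data (all names and numbers are hypothetical): check the correlation, fit the model, measure the error, then forecast:

```python
import numpy as np

# Historical data for the independent variable x and the dependent variable y
rng = np.random.default_rng(8)
x = rng.uniform(0, 100, size=60)                    # e.g. an influencing factor
y = 5.0 + 0.8 * x + rng.normal(scale=4.0, size=60)  # e.g. sales volume

# Correlation analysis: is a linear model even sensible?
print("correlation coefficient:", np.corrcoef(x, y)[0, 1])

# Build the prediction model: least-squares fit of y = a + b*x
b, a = np.polyfit(x, y, deg=1)

# Calculate the prediction error on the historical data
print("in-sample MSE:", np.mean((y - (a + b * x)) ** 2))

# Determine the predicted value for a new x
print("forecast at x = 110:", a + b * 110)
```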
Points to note
When applying the regression prediction method, first determine whether the variables are correlated. If there is no correlation between the variables, applying regression prediction to them will give incorrect results.
To apply regression analysis predictions correctly, attention should be paid to the following:
Use qualitative analysis to judge the dependency relationships between phenomena;
Avoid arbitrary extrapolation of the regression prediction;
Apply appropriate data.