The concept of multicollinearity. Methods for detecting and eliminating multicollinearity

In practice, when estimating the parameters of an econometric model, one often encounters the problem of relationships between the explanatory variables. If the relationship is quite close, the parameter estimates may have large errors. Such a relationship between explanatory variables is called multicollinearity. The problem of multicollinearity arises only in multiple regression, since pairwise regression has a single explanatory variable. An estimate of a regression coefficient may turn out to be insignificant not only because the factor itself is insignificant, but also because of the difficulty of separating the impact of two or more factors on the dependent variable. This shows up when the factors change synchronously: the response of the dependent variable to changes in each of them can be determined only if just one of these factors is included among the explanatory variables.

The nature of multicollinearity is most evident when there is a strict linear relationship between the explanatory variables. This is strict multicollinearity, when it is impossible to separate the contribution of each variable in explaining the behavior of the resulting indicator. Non-strict or stochastic multicollinearity is more common, when the explanatory variables are correlated with each other. In this case, the problem arises only when the relationship of variables affects the results of the regression estimation.

The main consequences of multicollinearity are:

· the accuracy of the regression parameter estimates decreases, which shows up in three ways:

– the errors of some estimates become very large;

– these errors are highly correlated with each other;

– the sample variances increase sharply;

· the coefficients of some variables included in the regression turn out to be insignificant, although economic reasoning suggests that precisely these variables should have a noticeable effect on the explained variable;

· the coefficient estimates become very sensitive to the sample observations (a small increase in the sample size can lead to very large shifts in the estimated values).

Causes of multicollinearity:

· the model includes factor variables that characterize the same aspect of the phenomenon;

· the regression equation includes, as factor variables, indicators whose sum is a constant;

· the model uses factor variables that are constituent parts of one another;

· the modeled function includes factor variables that duplicate one another in meaning.

The multicollinearity problem is common in time series regression, i.e. when the data consists of a series of observations over a period of time. If two or more explanatory variables have a strong time trend, then they will be highly correlated and this can lead to multicollinearity.

If among the paired correlation coefficients of independent variables there are those whose value approaches or is equal to the multiple correlation coefficient, then this indicates the possibility of the existence of multicollinearity.

If a parameter of the econometric model receives a small (insignificant) estimate while the coefficient of determination is large and, at the same time, the F statistic differs significantly from zero, this indicates the presence of multicollinearity.

Methods for studying multicollinearity

· finding and analyzing the correlation matrix

The stochastic relationship between variables is characterized by the magnitude of the correlation coefficient between them. The closer the absolute value of the correlation coefficient is to unity, the stronger the multicollinearity. In general, if several factors turn out to be insignificant when the regression equation is estimated, it is necessary to find out whether any of them are correlated with each other. For this purpose the matrix of pair correlation coefficients is formed; it is symmetric and is called the correlation matrix. It has the form

R = ( 1        r_yx1     r_yx2    ...  r_yxm
      r_x1y    1         r_x1x2   ...  r_x1xm
      ...      ...       ...      ...  ...
      r_xmy    r_xmx1    r_xmx2   ...  1 ),          (3.3.1)

where r_yxj are the pair correlation coefficients between the variable y and the factors x_j, and r_xixj are the pair correlation coefficients between the factors, which are calculated by the formula

r_xixj = ( mean(x_i·x_j) − mean(x_i)·mean(x_j) ) / ( σ_xi · σ_xj ).

The analysis of the correlation matrix makes it possible to assess, firstly, the degree of influence of individual factors on the performance indicator, and secondly, the relationship between the factors.

If the pair correlation coefficient between some pair of factors is close to unity, this indicates a close relationship between them, i.e. multicollinearity. In that case one of the two factors must be excluded from further consideration. Which one depends on the specific situation: most often the factor retained in the model is the one that, from an economic point of view, is more important for the process under study. One can also retain the factor that has the greater influence on the dependent variable (i.e. whose correlation coefficient with the dependent variable is larger). This kind of analysis is carried out for each pair of factors. The result of the analysis of the correlation matrix is a group of factors that depend little on each other; these are the factors to include in the model.

· calculation of the determinant of the correlation matrix

If there are more than two factors in the model, the question of multicollinearity cannot be settled by the information provided by the correlation matrix alone. A broader check involves calculating the determinant of the correlation matrix, det R. If det R = 0, there is complete multicollinearity; if det R = 1, there is no multicollinearity. The closer det R is to zero, the more confidently one can assert the existence of multicollinearity between the variables.
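To make the determinant check concrete, here is a minimal Python sketch (the data are synthetic and the variable names x1, x2, x3 are illustrative assumptions, not part of the text above): it builds the correlation matrix of the explanatory variables and evaluates its determinant.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(10, 2, n)
x2 = 0.9 * x1 + rng.normal(0, 0.5, n)   # deliberately made almost collinear with x1
x3 = rng.normal(5, 1, n)                # a roughly independent factor
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

R = X.corr()                                 # matrix of pair correlation coefficients
print(R.round(3))
print("det R =", np.linalg.det(R.values))    # close to 0 -> strong multicollinearity,
                                             # close to 1 -> almost uncorrelated factors
```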

· Farrar-Glauber method

To study both the overall multicollinearity and the multicollinearity between individual factors, the correlation matrix of the explanatory variables, calculated by formula (3.3.2), is used.

To study the overall multicollinearity, the χ² criterion is used. The value

χ²_calc = −[ n − 1 − (2m + 5)/6 ] · ln(det R)

is calculated; it has a χ² distribution with m(m − 1)/2 degrees of freedom, where n is the number of observations and m is the number of explanatory variables.

For the chosen reliability and this number of degrees of freedom, the tabulated value χ²_tab is found (Appendix A). If χ²_calc < χ²_tab, we may assume that there is no multicollinearity between the explanatory variables.

To clarify between which factors multicollinearity exists, t statistics (or F statistics) are used. For this purpose the partial pair correlation coefficients between the explanatory variables are calculated by the formula

r_xixj· = − c_ij / sqrt( c_ii · c_jj ),

where c_ij are the elements of the inverse correlation matrix C = R⁻¹.

As a criterion, the value

t_calc = r_xixj· · sqrt(n − m) / sqrt(1 − r²_xixj·)

is used, which has a Student distribution with n − m degrees of freedom.

From the Student tables (Appendix D) the critical value t_tab is found and compared with the calculated value:

if |t_calc| < t_tab, there is no collinearity between the explanatory variables x_i and x_j;

if |t_calc| ≥ t_tab, there is significant collinearity between the explanatory variables x_i and x_j.
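The Farrar-Glauber calculations described above can be sketched in a few lines of Python (a sketch that follows the formulas above; the 5% significance level and the use of n − m degrees of freedom are assumptions that should be matched to the tables actually used):

```python
import numpy as np
import pandas as pd
from scipy import stats

def farrar_glauber(X: pd.DataFrame, alpha: float = 0.05):
    """Overall chi-square test plus pairwise t tests on partial correlations."""
    n, m = X.shape
    R = X.corr().values
    chi2_calc = -(n - 1 - (2 * m + 5) / 6) * np.log(np.linalg.det(R))
    chi2_crit = stats.chi2.ppf(1 - alpha, m * (m - 1) / 2)

    C = np.linalg.inv(R)                      # inverse correlation matrix
    t_crit = stats.t.ppf(1 - alpha / 2, n - m)
    pairs = {}
    for i in range(m):
        for j in range(i + 1, m):
            r_ij = -C[i, j] / np.sqrt(C[i, i] * C[j, j])      # partial correlation
            t_ij = r_ij * np.sqrt(n - m) / np.sqrt(1 - r_ij ** 2)
            pairs[(X.columns[i], X.columns[j])] = (round(r_ij, 3),
                                                   round(t_ij, 2),
                                                   abs(t_ij) > t_crit)
    return chi2_calc, chi2_crit, pairs
```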

Methods for eliminating multicollinearity

If multicollinearity is identified, a number of measures must be taken to reduce it and, if possible, eliminate it. One should keep in mind that there are no infallible and absolutely correct recommendations; this is a process of creative search. Everything depends on the degree of multicollinearity, on the set of factors, and on the nature of the data.

The various techniques that can be used to mitigate multicollinearity concern the information base of the model and fall into two categories. The first includes attempts to increase the reliability of the regression estimates: increasing the number of observations in the sample, shortening the observation period in order to increase the variance of the explanatory variables and reduce the variance of the random term, and refining the set of explanatory variables included in the model. The second category is the use of external information, i.e. the collection of additional data and estimates.

· variable elimination method

This method involves removing highly correlated explanatory variables from the regression and re-estimating it. The variables to be excluded are selected using correlation coefficients: the significance of the pair correlation coefficient between the explanatory variables x_i and x_j is assessed, and if it proves significant, one of the two variables can be excluded. Which variable to remove from the analysis is decided on the basis of economic considerations.

· method of linear transformation of variables

This method of eliminating multicollinearity consists in passing to a reduced-form regression by replacing collinear variables with a linear combination of them. If there is multicollinearity between two factors x_1 and x_2, the factor x_1 is replaced by a suitable linear combination of x_1 and x_2, and the presence of multicollinearity between this new variable and x_2 is then checked. If no multicollinearity remains, the new variable is used in the model instead of the original factor.

· stepwise regression method

The stepwise regression procedure begins with building a simple regression; explanatory variables are then included in the analysis one at a time. At each step the significance of the regression coefficients is tested and the multicollinearity of the variables is assessed. If a coefficient estimate is not significant, the variable is excluded and another explanatory variable is considered. If the coefficient estimate is significant and there is no multicollinearity, the next variable is included in the analysis. In this way all the components of the regression are determined step by step without violating the requirement that multicollinearity be absent.
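A simplified sketch of such a stepwise procedure is shown below. It assumes X is a pandas DataFrame of candidate explanatory variables and y the dependent variable; the 5% significance level and the 0.8 correlation threshold are assumptions chosen purely for illustration, and statsmodels is used for the OLS fits.

```python
import statsmodels.api as sm

def stepwise_select(y, X, alpha=0.05, collin_threshold=0.8):
    """Forward selection: keep a regressor only if its coefficient is significant
    and it is not too strongly correlated with the regressors already included."""
    selected = []
    for col in X.columns:
        trial = selected + [col]
        model = sm.OLS(y, sm.add_constant(X[trial])).fit()
        significant = model.pvalues[col] < alpha
        collinear = any(abs(X[col].corr(X[s])) > collin_threshold for s in selected)
        if significant and not collinear:
            selected.append(col)
    return selected
```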

Corrective actions against multicollinearity:

· the specification of the model must be changed so that the collinearity of the variables is reduced to an acceptable level;

· estimation methods must be applied that, despite significant collinearity, make it possible to avoid its negative consequences. Such estimation methods include: methods with restrictions on the parameters (mixed estimation and minimum estimation), the principal components method, two-stage least squares, the instrumental variables method, and the maximum likelihood method.

As already noted, multicollinearity can be eliminated by removing one or more of the linearly related factor variables. Which of the factors should be discarded is decided on the basis of an economic, logical, and qualitative analysis of the phenomenon. Sometimes multicollinearity can be reduced by aggregating or transforming the original factor variables, for example by combining cross-sectional (intersectoral) indicators with time series, or by passing to first differences and estimating the regression equation for the differences.

Although there are no reliable methods for detecting collinearity, there are several signs that reveal it:

a characteristic feature of multicollinearity is a high value of the coefficient of determination combined with insignificance of the parameters of the equation (according to the t statistics);

in a model with two variables, the best sign of multicollinearity is the value of the correlation coefficient between them;

in a model with more than two factors, the pair correlation coefficients may be low even in the presence of multicollinearity, so the partial correlation coefficients should be examined;

if the coefficient of determination is large while the partial coefficients are small, multicollinearity is possible.

Example 3.6. Examine the data for multicollinearity; if multicollinearity of the explanatory variables is found, exclude from consideration the variable that correlates with the other explanatory variables.

Y    17.44   17.28   17.92   18.88   17.12   21.12   20.00   20.64   19.68   18.40
X1   22.95   24.84   29.97   28.08   24.30   32.40   29.97   33.48   29.70   26.73
X2    3.00    1.56    2.88    2.28    1.20    2.64    3.48    2.28    2.52    2.40
X3    2.80    1.148   2.66    1.96    0.77    2.38    3.36    2.17    2.24    2.03

Solution. To study the overall multicollinearity we apply the Farrar-Glauber method.

To find the correlation matrix R we construct the auxiliary Table 3.13.

Table 3.13

Calculation of elements of the correlation matrix

  Y       X1      X2     X3       X1²      X2²     X3²    X1·X2   X1·X3   X2·X3     Y²
17.44    22.95   3.00   2.80     526.70    9.00    7.84    68.85   64.26    8.40   304.15
17.28    24.84   1.56   1.148    617.03    2.43    1.32    38.75   28.52    1.79   298.60
17.92    29.97   2.88   2.66     898.20    8.29    7.08    86.31   79.72    7.66   321.13
18.88    28.08   2.28   1.96     788.49    5.20    3.84    64.02   55.04    4.47   356.45
17.12    24.30   1.20   0.77     590.49    1.44    0.59    29.16   18.71    0.92   293.09
21.12    32.40   2.64   2.38    1049.76    6.97    5.66    85.54   77.11    6.28   446.05
20.00    29.97   3.48   3.36     898.20   12.11   11.29   104.30  100.70   11.69   400.00
20.64    33.48   2.28   2.17    1120.91    5.20    4.71    76.33   72.65    4.95   426.01
19.68    29.70   2.52   2.24     882.09    6.35    5.02    74.84   66.53    5.64   387.30
18.40    26.73   2.40   2.03     714.49    5.76    4.12    64.15   54.26    4.87   338.56
Σ   188.48  282.42  24.24  21.52  8086.36  62.76  51.47  692.26  617.50  56.68  3571.35
mean 18.848  28.242  2.424  2.152  808.64   6.28   5.15   69.23   61.75   5.67   357.13

The penultimate row of Table 3.13 shows the column sums, and the last row the column means.

Find the standard deviations of the variables, for example

σ_x1 = sqrt( mean(x1²) − (mean x1)² );

similarly we obtain σ_x2, σ_x3 and σ_y.

We substitute the found values of the standard deviations into formulas (3.3.3) to calculate the pair correlation coefficients r_yx1, r_yx2, r_yx3 and, in the same way, the pair correlation coefficients between the factors r_x1x2, r_x1x3, r_x2x3.

It can be concluded that there is a certain relationship between each pair of factors. For this problem the correlation matrix (3.3.1) is filled with these values.

Comment. If the Data Analysis command is not on the Tools menu, you must run the Microsoft Excel setup program and install the Analysis ToolPak. After installation, the Analysis ToolPak must be selected and activated using the Add-Ins command.

Let us find the determinant of the correlation matrix.

The value of the determinant of the correlation matrix is close to zero, which indicates the presence of significant multicollinearity.

Hence det R is much closer to 0 than to 1, multicollinearity is present, and one of the variables must be excluded, namely the one that correlates most strongly with the remaining explanatory variables.
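The calculations of Example 3.6 can be reproduced programmatically. The sketch below (numpy/pandas) builds the correlation matrix of the factors, its determinant and the Farrar-Glauber χ² statistic for these data; the numerical results will differ from hand computation only by rounding.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Y":  [17.44, 17.28, 17.92, 18.88, 17.12, 21.12, 20.00, 20.64, 19.68, 18.40],
    "X1": [22.95, 24.84, 29.97, 28.08, 24.30, 32.40, 29.97, 33.48, 29.70, 26.73],
    "X2": [3.00, 1.56, 2.88, 2.28, 1.20, 2.64, 3.48, 2.28, 2.52, 2.40],
    "X3": [2.80, 1.148, 2.66, 1.96, 0.77, 2.38, 3.36, 2.17, 2.24, 2.03],
})

factors = data[["X1", "X2", "X3"]]
R = factors.corr()                                 # correlation matrix of the factors
det_R = np.linalg.det(R.values)
n, m = len(data), factors.shape[1]
chi2_calc = -(n - 1 - (2 * m + 5) / 6) * np.log(det_R)   # Farrar-Glauber statistic

print(R.round(3))
print("det R =", round(det_R, 4), "  chi2 =", round(chi2_calc, 2))
print(data.corr()["Y"].round(3))                   # correlations of Y with each factor
```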
  • 5. The order of estimation of a linear econometric model from an isolated equation in Excel. The meaning of the output statistical information of the Regression service. (10) page 41
  • 6. Specification and least-squares estimation of econometric models with non-linear parameters. (30) pp. 24-25
  • 7. Classical pair regression model. Model specification. Gauss-Markov theorem.
  • 8. Least squares method: method algorithm, application conditions.
  • 9. Identification of individual equations of the system of simultaneous equations: the order condition. (30)
  • A necessary condition for identifiability
  • 10. Estimation of the parameters of a paired regression model by the least squares method. (10)
  • 11. Dummy variables: definition, purpose, types.
  • 12. Autocorrelation of a random perturbation. Causes. Consequences.
  • 13. Algorithm for checking the significance of a regressor in a paired regression model.
  • 14. Interval estimation of the expected value of the dependent variable in a paired regression model.
  • 15. Chow test for the presence of structural changes in the regression model. (20) pp. 59,60
  • 16. Algorithm for checking the adequacy of a paired regression model. (20) pp. 37, 79
  • 17. Coefficient of determination in a paired regression model.
  • 18. Estimation of the parameters of the multiple regression model by the least squares method.
  • 20. Heteroskedasticity of a random perturbation. Causes. Consequences. gq(20) test
  • 21. Dummy slope variable: purpose; specification of a regression model with a dummy slope variable; interpretation of the parameter of the dummy variable. (20) p. 65
  • 22. Durbin-Watson test algorithm for the presence (absence) of autocorrelation of random disturbances. (20) page 33
  • 23. Structural and reduced forms of specification of econometric models.
  • 24. Heteroskedasticity of a random perturbation. Causes. Consequences. Algorithm for the Goldfeld-Quandt test for the presence or absence of heteroscedasticity of random disturbances.
  • Algorithm for the Goldfeld-Quandt test for the presence (absence) of heteroscedasticity of random disturbances.
  • 25. Specification and estimation of least squares econometric models with nonlinear parameters.
  • 26. Methods for correcting heteroscedasticity. Weighted least squares
  • 27. The problem of multicollinearity in multiple regression models. Signs of multicollinearity.
  • 28. What are logit, tobit, and probit models.
  • 29. What is the Maximum Likelihood Method p. 62.
  • 30. What is a stationary process?
  • 31. Properties of time series.
  • 32. AR and VAR models.
  • 33. Identifiability of the system.
  • 34. Setting up a model with a system of simultaneous equations.
  • 35. What is the Monte Carlo method page 53
  • 36. Evaluate the quality of the model by the F, GQ, and DW tests (linear). pp. 33, 28-29
  • 37. Evaluation of errors in the parameters of the econometric model by the Monte Carlo method.
  • 38. Reflection in the model of the influence of unaccounted factors. Background of the Gauss-Markov theorem.
  • 39. Models of time series. Properties of stock price series on the stock exchange (20) p.93.
  • 40. The expected value of a random variable, its variance and standard deviation. (20) p.12-21
  • 41. Estimation of the parameters of a paired regression model by the least squares method using the Search solution service.
  • 42. Testing of statistical hypotheses, Student's t-statistic, confidence probability and confidence interval, critical values ​​of Student's statistic. What are "fat tails"?
  • 43. The problem of multicollinearity in multiple regression models. Signs of multicollinearity
  • 44. Partial coefficients of determination.
  • 46. Economic meaning of the coefficients of linear and power regression equations.
  • 47. Evaluation of the coefficients of the Samuelson-Hicks model
  • 48. Errors from including insignificant variables in the model or excluding significant ones. p. 80
  • 49. Research of multiple regression model p.74-79.
  • 50. Multicollinearity: what is bad, how to detect and how to fight.
  • 51. Signs of stationarity of a stochastic process. What is White Noise? p.100
  • 52. Structural and reduced forms of specification of econometric models.
  • 53. Algorithm for checking the significance of a regressor in a paired regression model. By t-statistics, by f-statistics.
  • 54. Properties of series of prices in the stock market. Markowitz portfolio building principles p.93,102
  • 55. Dynamic model from simultaneous linear equations (give an example) p.105.
  • 56. Maximum likelihood method: principles and expediency of use
  • 57. Stages of the study of the multiple regression model p.74-79.
  • 50. Multicollinearity: what is bad, how to detect and how to fight.

    Multicollinearity is the mutual dependence of the explanatory (influencing) variables. The problem is that, when it is present, it becomes difficult or impossible to separate the influence of the individual regressors on the dependent variable, and the coefficients lose their economic meaning as marginal effects or elasticities. The variances of the coefficients grow, and the coefficients themselves, estimated from different samples or by the Monte Carlo method, are correlated with each other. As a result, in the region where the model was fitted the graphs of Y and Ŷ coincide almost perfectly and R² and F are high, while in the forecast region the graphs may either coincide (which can be explained by mutual suppression of errors) or diverge, i.e. the model is inadequate.

    How can multicollinearity be detected? The simplest way is by the correlation matrix: if the correlation coefficients between regressors are greater than 0.7, the regressors are interrelated. The determinant of the correlation matrix can serve as a numerical characteristic of multicollinearity: if it is close to 1, the regressors are independent; if it is close to 0, they are strongly related.

    How to deal with multicollinearity?

    1. Accept, take into account and do nothing.

    2. Increase the sample size: the variances of the coefficients are inversely proportional to the number of observations.

    3. Remove from the model the regressors that are weakly correlated with the dependent variable, or whose coefficients have small t statistics. As can be seen from Table 7.10, in this case the coefficients of the remaining significant regressors shift, and the question of their economic meaning arises. (The meaning is this: if the regressors are correlated and you can control them, for example the cost of machines and of workers, then you have to change them proportionally.) The F statistic, i.e. the overall quality of the model, grows at the same time.

    4. Use aggregates of the correlated variables in the regression equation: linear combinations whose coefficients are inversely proportional to the standard deviations of the variables and thus equalize their scales. Such aggregates usually have no economic meaning, but they can improve the model's validity.

    5. Factor analysis, or the principal components method: it is used when there are many variables but they are linear combinations of a small number of independent factors, which may or may not have an economic interpretation.

    51. Signs of stationarity of a stochastic process. What is White Noise? p.100

    A time series is a finite realization of a stochastic process: a generated set of random variables Y(t).

    A stochastic process can be stationary or non-stationary. The process is stationary if:

    1. The expected value of the variables does not change over time.

    2. The variance of the variables does not change over time.

    3. There are no periodic fluctuations.

    Recognizing stationarity:

    1. Graph: systematic growth or decline, waves and zones of high volatility (dispersion) in a long series are immediately visible.

    2. Autocorrelation function (it decreases as the lag increases).

    3. Trend tests: testing the hypothesis that the coefficient on t is equal to zero.

    4. Special tests included in statistical software packages such as Stata and EViews, for example the Dickey-Fuller test for a unit root.

    A purely random process that is stationary and has no autocorrelation (Cor(u_i, u_k) = 0 for i ≠ k) is called white noise.

    An example of a non-stationary process is a random walk:

    Y(t) = Y(t−1) + a(t), where a(t) is white noise.

    Interestingly, the process Y(t) = 0.999·Y(t−1) + a(t) is stationary.
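    The contrast between these two processes can be illustrated with the Dickey-Fuller test mentioned above (a sketch using statsmodels; the series length and random seed are arbitrary choices):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
a = rng.normal(size=2000)                 # white noise a(t)

walk = np.cumsum(a)                       # Y(t) = Y(t-1) + a(t): random walk
ar = np.zeros_like(a)
for t in range(1, len(a)):                # Y(t) = 0.999*Y(t-1) + a(t)
    ar[t] = 0.999 * ar[t - 1] + a[t]

for name, series in [("random walk", walk), ("AR, coef 0.999", ar)]:
    stat, pvalue = adfuller(series)[:2]   # Dickey-Fuller unit-root test
    print(name, "ADF statistic =", round(stat, 2), "p-value =", round(pvalue, 3))
```

    With a coefficient this close to unity, a finite sample may fail to distinguish the stationary process from the random walk, which is exactly why the example is instructive.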

    The fundamental possibility of getting rid of non-stationarity is called integrability. There are various ways to remove non-stationarity:

    1. Subtracting the trend, as was done in the previous section;

    2. Using differences of the 1st, 2nd, etc. order, which should be done only after smoothing the time series (or its energy spectrum), otherwise all effects will be drowned out by statistical fluctuations: the variance of a difference is equal to the sum of the variances.

    To study the series of prices in the stock market, models are used that use white noise and autoregression, that is, the mutual dependence of the levels of the time series.

    The MA(q) model (moving average) is a linear combination of successive elements of white noise:

    X(t) = a(t) − K(1)·a(t−1) − … − K(q)·a(t−q).

    The AR(p) model (autoregression) expresses the current level through the previous levels of the series:

    X(t) = b0 + b1·X(t−1) + … + bp·X(t−p).

    Their combinations are especially popular.

    ARMA(p,q) = AR(p) + MA(q)

    and ARIMA(p, i, q): the same, with differencing (integration) of order i.
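    A sketch of fitting such a model with statsmodels is given below (the orders p = 1, i = 1, q = 1 and the simulated series are arbitrary illustrations):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
x = rng.normal(size=300).cumsum()          # some observed series (here a random walk)

# ARIMA(p, i, q): autoregression of order p, differencing of order i, moving average of order q
model = ARIMA(x, order=(1, 1, 1)).fit()
print(model.summary())
forecast = model.forecast(steps=5)         # out-of-sample forecast of the next 5 levels
```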

    "

    Note that in some cases, multicollinearity is not such a serious "evil" to make significant efforts to identify and eliminate it. Basically, it all depends on the purpose of the study.
    If the main task of the model is to predict future values of the dependent variable, then with a sufficiently large coefficient of determination R² (> 0.9) the presence of multicollinearity usually does not affect the predictive qualities of the model (provided that the same relationships between the correlated variables persist in the future).
    If it is necessary to determine the degree of influence of each of the explanatory variables on the dependent variable, then multicollinearity, leading to an increase in standard errors, is likely to distort the true relationships between variables. In this situation, multicollinearity is a major problem.
    There is no single method for eliminating multicollinearity that works in any case. This is due to the fact that the causes and consequences of multicollinearity are ambiguous and largely depend on the results of the sample.
    Exclude variable(s) from the model
    The simplest method to eliminate multicollinearity is to exclude one or a set of correlated variables from the model. Some caution is required when applying this method. In this situation, specification errors are possible, so in applied econometric models it is desirable not to exclude explanatory variables until multicollinearity becomes a serious problem.
    Getting more data or a new sample
    Since multicollinearity directly depends on the sample, it is possible that with a different sample there will be no multicollinearity, or that it will not be so serious. Sometimes it is enough to increase the sample size to reduce multicollinearity; for example, if you are using yearly data, you can switch to quarterly data. Increasing the amount of data reduces the variances of the regression coefficients and thus increases their statistical significance. However, obtaining a new sample or expanding the old one is not always possible or may involve serious costs. In addition, this approach can strengthen autocorrelation. These problems limit the applicability of this method.
    Model specification change
    In some cases, the problem of multicollinearity can be solved by changing the specification of the model: either the shape of the model is changed, or explanatory variables are added that are not taken into account in the original model, but significantly affect the dependent variable. If this method is justified, then its use reduces the sum of squared deviations, thereby reducing the standard error of the regression. This leads to a reduction in the standard errors of the coefficients.
    Using preliminary information about some parameters
    Sometimes, when building a multiple regression model, you can use preliminary information, in particular, the known values ​​of some regression coefficients.
    It is likely that coefficient values calculated for some preliminary (usually simpler) model, or for a similar model based on a previously obtained sample, can be used for the model currently being developed.
    Selection of the most significant explanatory variables. The procedure of sequential inclusion of variables
    Moving to a smaller number of explanatory variables can reduce the duplication of information carried by highly interdependent features. This is exactly what we face in the case of multicollinearity of the explanatory variables.

    36. Ways to detect multicollinearity. Partial correlation

    The greatest difficulties in using the apparatus of multiple regression arise in the presence of multicollinearity of factor variables, when more than two factors are interconnected by a linear relationship.

    Multicollinearity for linear multiple regression is the presence of linear dependence between factor variables included in the model.

    Multicollinearity is a violation of one of the main conditions underlying the construction of a linear multiple regression model.

    Multicollinearity in matrix form is dependence between the columns of the matrix of factor variables X.

    If the unit vector (the column of ones) is not counted, the dimension of this matrix is n × n. If the rank of the matrix X is less than n, the model contains full, or strict, multicollinearity; in practice, however, full multicollinearity almost never occurs.

    It can be concluded that one of the main reasons for multicollinearity in a multiple regression model is a poorly conditioned matrix of factor variables X.

    The stronger the multicollinearity of the factor variables, the less reliable is the estimate of the distribution of the sum of the explained variation over individual factors using the least squares method.

    The inclusion of multicollinear factors in the model is undesirable for several reasons:

    1) the main hypothesis about the insignificance of the multiple regression coefficients may be confirmed while the regression model as a whole proves significant under the F test, which indicates an overstated value of the multiple correlation coefficient;

    2) the obtained estimates of the coefficients of the multiple regression model may be unjustifiably large or have incorrect signs;

    3) adding one or two observations to the original data, or excluding them from it, has a strong influence on the estimates of the model coefficients;

    4) multicollinear factors included in the multiple regression model can make it unsuitable for further use.

    There are no specific methods for detecting multicollinearity, but it is customary to apply a number of empirical techniques. In most cases, multiple regression analysis begins with an examination of the correlation matrix of the factor variables R or of the matrix (XᵀX).

    The correlation matrix of the factor variables is the matrix of linear pair correlation coefficients of the factor variables, symmetric with respect to the main diagonal,

    where r_ij is the linear pair correlation coefficient between the i-th and j-th factor variables.

    There are ones on the diagonal of the correlation matrix, because the correlation coefficient of a factor variable with itself is equal to one.

    When considering this matrix in order to identify multicollinear factors, the following rules are followed:

    1) if in the correlation matrix of factor variables there are pair correlation coefficients in absolute value greater than 0.8, then it is concluded that there is multicollinearity in this multiple regression model;

    2) the eigenvalues λ_min and λ_max of the correlation matrix of the factor variables are calculated. If λ_min < 10⁻⁵, there is multicollinearity in the regression model; if the ratio λ_max/λ_min is very large, this also indicates the presence of multicollinear factor variables;

    3) calculate the determinant of the correlation matrix of factor variables. If its value is very small, then there is multicollinearity in the regression model.
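    The three rules above can be checked mechanically; the sketch below assumes the factor variables are collected in a pandas DataFrame X (thresholds other than 0.8 and 10⁻⁵ can of course be substituted):

```python
import numpy as np

def multicollinearity_signs(X, r_threshold=0.8, eig_threshold=1e-5):
    R = np.corrcoef(X.values, rowvar=False)        # correlation matrix of the factors
    off_diag = R - np.eye(len(R))
    eigvals = np.linalg.eigvalsh(R)                # eigenvalues of the correlation matrix
    return {
        "rule 1: some |r_ij| above threshold": bool(np.any(np.abs(off_diag) > r_threshold)),
        "rule 2: lambda_min": float(eigvals.min()),
        "rule 2: lambda_max / lambda_min": float(eigvals.max() / eigvals.min()),
        "rule 3: det R": float(np.linalg.det(R)),
    }
```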

    37. Ways to solve the problem of multicollinearity

    If the estimated regression model is supposed to be used to study economic relationships, then the elimination of multicollinear factors is mandatory, because their presence in the model can lead to incorrect signs of the regression coefficients.

    When building a forecast based on a regression model with multicollinear factors, it is necessary to evaluate the situation by the magnitude of the forecast error. If its value is satisfactory, then the model can be used despite the multicollinearity. If the value of the forecast error is large, then the elimination of multicollinear factors from the regression model is one of the methods for improving the accuracy of the forecast.

    The main ways to eliminate multicollinearity in a multiple regression model include:

    1) one of the simplest ways to eliminate multicollinearity is to obtain additional data; in practice, however, this can be difficult to implement;

    2) another way is to transform the variables; for example, instead of the values of all the variables in the model (including the dependent one) one can take their logarithms:

    ln y = β0 + β1·ln x1 + β2·ln x2 + ε.

    However, this method too cannot guarantee the complete elimination of multicollinearity of the factors.

    If the considered methods did not help to eliminate the multicollinearity of factors, then they switch to using biased methods for estimating unknown parameters of the regression model, or methods for excluding variables from the multiple regression model.

    If none of the factor variables included in the multiple regression model can be excluded, then one of the main biased methods of estimating the coefficients of the regression model is used: ridge regression.

    In the ridge regression method, a small number τ (10⁻⁶ < τ < 0.1) is added to all the diagonal elements of the matrix (XᵀX). The unknown parameters of the multiple regression model are then estimated by the formula

    β̂ = (XᵀX + τ·I_n)⁻¹ XᵀY,

    where I_n is the identity matrix.

    The result of applying ridge regression is a decrease in the standard errors of the coefficients of the multiple regression model due to their stabilization.
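    A direct implementation of this formula is short (a sketch; the value of τ is an arbitrary choice within the stated range, and in practice the regressors are usually standardized first):

```python
import numpy as np

def ridge_estimate(X, y, tau=1e-3):
    """Biased ridge estimator: beta = (X'X + tau*I)^(-1) X'y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + tau * np.eye(k), X.T @ y)
```

    Essentially the same estimator is available as sklearn.linear_model.Ridge (with RidgeCV for choosing the penalty by cross-validation).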

    Principal component analysis is one of the main methods for eliminating variables from a multiple regression model.

    This method is used to eliminate or reduce the multicollinearity of the factor variables of the regression model. Its essence is to reduce the set of factor variables to the factors that influence the result most significantly. This is achieved by a linear transformation of all the factor variables x_i (i = 0, …, n) into new variables called principal components, i.e. by passing from the matrix of factor variables X to the matrix of principal components F. The first principal component is required to capture the maximum of the total variance of all the factor variables x_i; the second component, the maximum of the variance remaining after the influence of the first principal component is excluded; and so on.
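    A sketch of the principal-components approach with scikit-learn and statsmodels (the choice of two components is purely illustrative; in practice the number of components is chosen from the explained-variance shares):

```python
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_regression(X, y, n_components=2):
    Z = StandardScaler().fit_transform(X)                  # standardize the factor variables
    F = PCA(n_components=n_components).fit_transform(Z)    # matrix of principal components
    # Regress y on the mutually orthogonal components instead of the raw factors
    return sm.OLS(y, sm.add_constant(F)).fit()
```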

    The method of stepwise inclusion of variables consists in choosing from the entire possible set of factor variables exactly those that have a significant impact on the resulting variable.

    The stepwise inclusion method is carried out according to the following algorithm:

    1) of all the factor variables, the regression model first includes the variable with the largest absolute value of the linear pair correlation coefficient with the resulting variable;

    2) when a new factor variable is added to the regression model, its significance is checked using Fisher's F test. The main hypothesis states that it is unreasonable to include the factor variable x_k in the multiple regression model; the reverse hypothesis asserts that including x_k is expedient. The critical value of the F criterion is F_crit(a; k1; k2), where a is the significance level, k1 = 1 and k2 = n − l are the degrees of freedom, n is the sample size, and l is the number of parameters estimated from the sample. The observed value of the F criterion is calculated by the formula

    F_obs = (R²_{q+1} − R²_q) · (n − l) / (1 − R²_{q+1}),

    where R²_q and R²_{q+1} are the coefficients of determination of the model before and after x_k is added, and q is the number of factor variables already included in the regression model.

    When testing the main hypothesis, the following situations are possible.

    If F_obs > F_crit, the main hypothesis about the unreasonableness of including the factor variable x_k in the multiple regression model is rejected; the inclusion of this variable in the model is therefore justified.

    If the observed value of the F criterion (calculated from the sample data) is less than or equal to the critical value (determined from the Fisher-Snedecor distribution table), i.e. F_obs ≤ F_crit, the main hypothesis about the unreasonableness of including x_k is accepted, and this factor variable can be omitted from the model without compromising its quality;

    3) the factor variables are checked for significance in this way until a variable is found for which the condition F_obs > F_crit is not satisfied.

    38. Dummy variables. The Chow test

    The term "dummy variables" is used as opposed to "significant" variables, showing the level of a quantitative indicator that takes values ​​from a continuous interval. As a rule, a dummy variable is an indicator variable that reflects a qualitative characteristic. Most often, binary dummy variables are used, which take two values, 0 and 1, depending on a certain condition. For example, in the result of a survey of a group of people, 0 may mean that the interviewee is a man, and 1 - a woman. Dummy variables sometimes include a regressor consisting of only units (i.e., a constant, an intercept), as well as a time trend.

    Dummy variables, being exogenous, do not create any difficulties in the application of GLS. Dummy variables are an effective tool for building regression models and testing hypotheses.

    Suppose that a regression model has been built from the collected data. The researcher faces the question of whether it is worth introducing additional dummy variables into the resulting model or whether the basic model is optimal. This problem is solved with the Chow method (test). It is used in situations where the main sample can be divided into parts, or subsamples, and one can then test the assumption that the subsamples describe the data more effectively than the general regression model.

    We will assume that the general regression model is a regression model without restrictions; we denote this model by UN. The separate subsamples will be considered as special cases of the regression model; we denote these particular subsamples by PR.

    Let us introduce the following notation:

    PR1 is the first subsample;

    PR2 is the second subsample;

    ESS(PR1) is the sum of the squares of the residuals for the first subsample;

    ESS(PR2) is the sum of squared residuals for the second subsample;

    ESS(UN) is the sum of squared residuals for the general regression model.

    ESS1(UN) is the sum of squared residuals for the observations of the first subsample in the general regression model;

    ESS2(UN) is the sum of squared residuals for the observations of the second subsample in the general regression model.

    For the particular regression models the following inequalities hold: ESS(PR1) ≤ ESS1(UN), ESS(PR2) ≤ ESS2(UN), and therefore ESS(PR1) + ESS(PR2) ≤ ESS(UN).

    The condition (ESS(PR1) + ESS(PR2)) = ESS(UN) holds only if the coefficients of the particular regression models coincide with the coefficients of the general unrestricted regression model, but in practice such a coincidence is very rare.

    The main hypothesis is formulated as a statement that the quality of the general unconstrained regression model is better than the quality of particular regression models or subsamples.

    The alternative (reverse) hypothesis states that the quality of the general unrestricted regression model is worse than the quality of the particular regression models built on the subsamples.

    These hypotheses are tested using the Fisher-Snedecor F test.

    The observed value of the F criterion is compared with the critical value, which is determined from the Fisher-Snedecor distribution table for the degrees of freedom k1 = m + 1 and k2 = n − 2m − 2.

    The observed value of the F criterion is calculated by the formula

    F_obs = [ (ESS(UN) − ESS(PR1) − ESS(PR2)) / (m + 1) ] / [ (ESS(PR1) + ESS(PR2)) / (n − 2m − 2) ],

    where ESS(UN) − ESS(PR1) − ESS(PR2) is a quantity that characterizes the improvement in the quality of the regression model after it is divided into subsamples;

    m is the number of factor variables (including dummy ones);

    n is the size of the total sample.

    If the observed value of the F criterion (calculated from the sample data) is greater than the critical value (determined from the Fisher-Snedecor distribution table), i.e. F_obs > F_crit, the main hypothesis is rejected, and the quality of the particular regression models exceeds the quality of the general regression model.

    If the observed value of the F criterion is less than or equal to the critical value, i.e. F_obs ≤ F_crit, the main hypothesis is accepted, and it makes no sense to split the general regression into subsamples.

    If the significance of the basic (restricted) regression is tested, the main hypothesis put forward is that the coefficients of the additional dummy variables are jointly equal to zero.

    The validity of this hypothesis is tested using the Fisher-Snedecor F criterion.

    The critical value of the F test is determined from the Fisher-Snedecor distribution table depending on the significance level a and the two degrees of freedom k1 = m + 1 and k2 = n − k − 1.

    The observed value of the F criterion is again built from the residual sums of squares of the restricted and unrestricted models.

    When testing these hypotheses, the following situations are possible.

    If the observed value of the F criterion (calculated from the sample data) is greater than the critical value (determined from the Fisher-Snedecor distribution table), i.e. F_obs > F_crit, the main hypothesis is rejected, and additional dummy variables must be introduced into the regression model, because the quality of the model with dummy variables is higher than the quality of the basic (restricted) regression model.

    If the observed value of the F criterion is less than or equal to the critical value, i.e. F_obs ≤ F_crit, the main hypothesis is accepted, the basic regression model is satisfactory, and it makes no sense to introduce additional dummy variables into the model.
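    A sketch of the Chow calculation described above (it assumes that both subsamples and the pooled sample are fitted by OLS with the same m regressors, supplied as numpy arrays):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def chow_test(y1, X1, y2, X2, alpha=0.05):
    """F test of whether one pooled regression fits both subsamples."""
    def ess(y, X):
        return sm.OLS(y, sm.add_constant(X)).fit().ssr    # sum of squared residuals
    m = X1.shape[1]                                       # number of factor variables
    n = len(y1) + len(y2)                                 # total sample size
    ess_un = ess(np.concatenate([y1, y2]), np.vstack([X1, X2]))   # ESS(UN)
    ess_pr = ess(y1, X1) + ess(y2, X2)                            # ESS(PR1) + ESS(PR2)
    F = ((ess_un - ess_pr) / (m + 1)) / (ess_pr / (n - 2 * m - 2))
    F_crit = stats.f.ppf(1 - alpha, m + 1, n - 2 * m - 2)
    return F, F_crit, F > F_crit        # True -> splitting into subsamples is justified
```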

    39. Systems of simultaneous equations (endogenous, exogenous, lag variables). Economically meaningful examples of systems of simultaneous equations

    So far, we have considered econometric models defined by equations expressing the dependent (explained) variable in terms of explanatory variables. However, real economic objects studied with the help of econometric methods lead to an extension of the concept of an econometric model, described by a system of regression equations and identities1.

    1 Unlike regression equations, identities do not contain model parameters to be estimated and do not include a random component.

    A feature of these systems is that each of the equations of the system, in addition to “its own” explanatory variables, can include explanatory variables from other equations. Thus, we have not one dependent variable, but a set of dependent (explained) variables connected by the equations of the system. Such a system is also called a system of simultaneous equations, emphasizing the fact that in the system the same variables are simultaneously considered as dependent in some equations and independent in others.

    Systems of simultaneous equations most fully describe an economic object containing many interrelated endogenous (formed within the operation of the object) and exogenous (set from outside) variables. At the same time, lag variables (taken at the previous point in time) can act as endogenous and exogenous.

    A classic example of such a system is the model of demand Q_d and supply Q_s (see § 9.1), in which the demand for a good is determined by its price P and the consumer's income I, the supply of the good is determined by its price P, and an equilibrium between supply and demand is reached:

    Q_d = β1 + β2·P + β3·I + ε1,
    Q_s = β4 + β5·P + ε2,
    Q_d = Q_s.

    In this system the exogenous variable is the consumer's income I, and the endogenous variables are the demand (supply) of the good Q_d = Q_s = Q and the price of the good (the equilibrium price) P.

    In another model of supply and demand, the variable explaining the supply Q_s can be not only the price of the good at the given time t, i.e. P_t, but also the price of the good at the previous moment in time, i.e. the lagged endogenous variable P_{t−1}:

    Q_s = β4 + β5·P_t + β6·P_{t−1} + ε2.

    Summarizing the above, we can say that the econometric model makes it possible to explain the behavior of the endogenous variables as a function of the values of the exogenous and lagged endogenous variables (in other words, as a function of the predetermined variables).

    Concluding the consideration of the concept of an econometric model, the following should be noted. Not every economic-mathematical model that represents a mathematical-statistical description of the studied economic object can be considered econometric. It becomes econometric only if it reflects this object on the basis of empirical (statistical) data characterizing it.

    40. Indirect least squares

    If the i-th stochastic equation of the structural form is exactly identified, then the parameters of this equation (the coefficients of the equation and the variance of the random error) are uniquely recovered from the parameters of the reduced form. Therefore, to estimate the parameters of such an equation it is sufficient to estimate the coefficients of each equation of the reduced form by least squares (separately for each equation) and to obtain an estimate of the covariance matrix Ω of the reduced-form errors, and then to use the relations ΠΓ = B and Σ = ΓᵀΩΓ, substituting the estimated reduced-form coefficient matrix Π̂ for Π and the estimated reduced-form error covariance matrix Ω̂ for Ω. This procedure is called indirect least squares (ILS). The resulting estimates of the coefficients of the i-th stochastic equation of the structural form inherit the consistency of the reduced-form estimates. However, they do not inherit such properties of the reduced-form estimators as unbiasedness and efficiency, because they are obtained by non-linear transformations. Accordingly, with a small number of observations even these natural estimates can have a noticeable bias. For this reason, when various methods of estimating the coefficients of structural equations are considered, the first concern is to ensure the consistency of the resulting estimates.

    41. Problems of identifiability of systems of simultaneous equations

    With the correct specification of the model, the problem of identifying a system of equations reduces to a correct and unambiguous estimation of its coefficients. Direct estimation of the coefficients of an equation is possible only in systems of seemingly unrelated equations, for which the basic prerequisites for constructing a regression model are satisfied, in particular the condition that the factor variables are uncorrelated with the residuals.

    In recursive systems, it is always possible to get rid of the problem of correlation of residuals with factor variables by substituting as values ​​of factor variables not actual, but model values ​​of endogenous variables acting as factor variables. The identification process is carried out as follows:

    1. An equation is identified that does not contain endogenous variables as factorial variables. The calculated value of the endogenous variable of this equation is found.

    2. The following equation is considered, in which the endogenous variable found in the previous step is included as a factor. Model (calculated) values ​​of this endogenous variable provide the possibility of identifying this equation, etc.

    In the system of equations in the reduced form, the problem of correlation of factor variables with deviations does not arise, since in each equation only predefined variables are used as factor variables. Thus, under other assumptions, a recursive system is always identifiable.

    When considering a system of simultaneous equations, the problem of identification arises.

    Identification in this case means determining the possibility of unambiguous recalculation of the system coefficients in the reduced form into structural coefficients.

    The structural model (7.3) in its full form contains more parameters to be determined than the reduced form of the model, so the equations that can be written to recover the structural parameters from the reduced-form parameters are fewer than the unknowns. Such systems are underdetermined, and in the general case the parameters of the structural model cannot be uniquely determined.

    To obtain the only possible solution, it is necessary to assume that some of the structural coefficients of the model, because of their weak relationship with the endogenous variable on the left-hand side of the equation, are equal to zero. This reduces the number of structural coefficients of the model. The number of structural coefficients can also be reduced in other ways: for example, by equating some coefficients to each other, i.e. by assuming that their effects on the endogenous variable being formed are identical, and so on.

    From the standpoint of identifiability, structural models can be divided into three types:

    identifiable;

    unidentifiable;

    overidentifiable.

    A model is identifiable if all its structural coefficients are uniquely determined by the coefficients of the reduced form of the model, i.e. if the number of parameters of the structural model is equal to the number of parameters of the reduced form of the model.

    A model is unidentifiable if the number of coefficients of the reduced model is less than the number of structural coefficients, so that the structural coefficients cannot be estimated from the coefficients of the reduced form of the model.

    A model is overidentifiable if the number of coefficients of the reduced model is greater than the number of structural coefficients. In this case two or more values of a single structural coefficient can be obtained from the coefficients of the reduced form. An overidentified model, unlike an unidentified one, is solvable in practice, but it requires special methods for finding the parameters.

    To determine the type of a structural model, it is necessary to check each of its equations for identifiability.

    A model is considered to be identifiable if each equation of the system is identifiable. If at least one of the equations of the system is unidentifiable, then the entire model is considered unidentifiable. The overidentified model, in addition to the identified ones, contains at least one overidentified equation.

    42. Three-Step Least Squares

    The most efficient procedure for estimating systems of regression equations combines the method of simultaneous estimation and the method of instrumental variables. The corresponding method is called the three-step least squares method. It consists in the fact that at the first step the generalized least squares method is applied to the original model (9.2) in order to eliminate the correlation of random terms. Then the two-step method of least squares is applied to the resulting equations.

    Obviously, if the random terms in (9.2) are uncorrelated, the three-step method reduces to the two-step method; at the same time, if the matrix B is the identity matrix, the three-step method becomes a procedure for the simultaneous estimation of the equations as seemingly unrelated.

    Applying the three-step method to the model (9.24) considered earlier gives

    α1 = 19.31;  β1 = 1.77;  α2 = 19.98;  β2 = 0.05;  γ = 1.4,
    (6.98)       (0.03)      (4.82)       (0.08)      (0.016)

    with standard errors in parentheses. Since the coefficient β2 is insignificant, the equation for the dependence of Y on X has the form

    y = 16.98 + 1.4x.

    Note that it practically coincides with equation (9.23).

    As is known, purging the equations of the correlation of the random terms is an iterative process. Accordingly, when the three-step method is used, the computer program asks for the number of iterations or the required precision. We note an important property of the three-step method, which ensures its high efficiency.

    For a sufficiently large number of iterations, the estimates of the three-step least squares method coincide with the maximum likelihood estimates.

    As is known, maximum likelihood estimates on large samples are the best.

    43. The concept of economic time series. The general form of multiplicative and additive time series models.

    44. Modeling the trend, seasonal and cyclical fluctuations of a time series.

    There are several approaches to the analysis of the structure of time series containing seasonal or cyclical fluctuations.

    1 APPROACH. Calculation of the values ​​of the seasonal component using the moving average method and the construction of an additive or multiplicative model of the time series.

    The general form of the additive model: Y = T + S + E (T is the trend component, S the seasonal component, E the random component).

    The general form of the multiplicative model: Y = T · S · E.

    The model is chosen on the basis of an analysis of the structure of the seasonal fluctuations: if the amplitude of the fluctuations is approximately constant, the additive model is used; if it increases or decreases, the multiplicative model.

    Constructing the model reduces to calculating the values of T, S and E for each level of the series.

    Model building:

    1. Smoothing of the original series by the moving average method;

    2. Calculation of the values of the seasonal component S;

    3. Elimination of the seasonal component from the initial levels of the series, giving the leveled data (T + E) in the additive model or (T · E) in the multiplicative model;

    4. Analytical leveling of (T + E) or (T · E) and calculation of the values of T from the obtained trend equation;

    5. Calculation of the values (T + S) or (T · S) given by the model;

    6. Calculation of the absolute and/or relative errors.

    If the obtained error values ​​do not contain autocorrelation, they can replace the initial levels of the series and further use the time series of errors E to analyze the relationship of the original series and other time series.
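    A sketch of this first approach using statsmodels (a hypothetical quarterly series with period 4 is generated for the illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.period_range("2010Q1", periods=40, freq="Q").to_timestamp()
rng = np.random.default_rng(3)
y = pd.Series(0.5 * np.arange(40)                  # trend component
              + np.array([3, -1, -2, 0] * 10)      # seasonal component
              + rng.normal(0, 0.5, 40),            # random component
              index=idx)

res = seasonal_decompose(y, model="additive", period=4)   # Y = T + S + E
trend, seasonal, resid = res.trend, res.seasonal, res.resid
```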

    2 APPROACH. Building a regression model that includes a time factor and dummy variables. The number of dummy variables in such a model should be one less than the number of moments (periods) of time within one cycle of fluctuations. For example, when modeling quarterly data, the model should include four independent variables: the time factor and three dummy variables. Each dummy variable reflects the seasonal (cyclical) component of the time series for one particular period; it is equal to one (1) for that period and zero (0) for all the others. The disadvantage of the model with dummy variables is the large number of variables.

    45. Autocorrelation function. Its use to detect the presence or absence of a trend and a cyclical component

    Autocorrelation of time series levels.

    If there are trends and cyclical fluctuations in the time series, each subsequent level of the series depends on the previous ones. The correlation dependence between successive levels of the time series is called autocorrelation of the levels of the series.

    Quantitatively, the autocorrelation of the levels of a series is measured using a linear correlation coefficient between the levels of the original time series and the levels of this series, shifted by several steps in time.

    Let, for example, a time series y_1, y_2, …, y_n be given. Let us determine the correlation coefficient between the series y_2, …, y_n and y_1, …, y_{n−1}, i.e. the first-order autocorrelation coefficient (a lag of 1). One of the working formulas for calculating it is

    r_1 = Σ_{t=2..n} (y_t − ȳ_1)(y_{t−1} − ȳ_2) / sqrt( Σ_{t=2..n} (y_t − ȳ_1)² · Σ_{t=2..n} (y_{t−1} − ȳ_2)² ),

    where ȳ_1 and ȳ_2 are the means of the two shifted series. Similarly one can determine the autocorrelation coefficient of the second order, between the series y_3, …, y_n and the time series y_1, …, y_{n−2}, i.e. with a lag of 2. It is determined by the analogous formula (4).

    Note that as the lag increases, the number of pairs of values ​​used to calculate the correlation coefficient decreases. Usually the lag is not allowed to be more than a quarter of the number of observations.

    We note two important properties of autocorrelation coefficients.

    First, the autocorrelation coefficients are calculated by analogy with the linear correlation coefficient, i.e. they characterize only the tightness of the linear relationship between the two considered levels of the time series. Therefore, the autocorrelation coefficient can only be used to judge the presence of a linear (or close to linear) trend. For time series with a strong non-linear trend (for example, an exponential), the level autocorrelation coefficient can approach zero.
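    The autocorrelation coefficients for successive lags can be computed directly, as in the sketch below (statsmodels' acf function gives essentially the same result); the test series here is an arbitrary illustration:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

def autocorr(y, lag):
    """Correlation between y_t and y_{t-lag}, computed on the overlapping part."""
    y = np.asarray(y, dtype=float)
    return np.corrcoef(y[lag:], y[:-lag])[0, 1]

y = np.sin(np.arange(100) / 4) + np.random.default_rng(4).normal(0, 0.2, 100)
max_lag = len(y) // 4                                  # lag at most n/4, as noted above
r = [autocorr(y, k) for k in range(1, max_lag + 1)]
# Equivalent built-in: acf(y, nlags=max_lag)
```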


    Ministry of Education and Science of the Russian Federation

    Federal State Budgetary Educational Institution

    higher education

    TVER STATE TECHNICAL UNIVERSITY

    Department of "Accounting and Finance"

    COURSE PROJECT
    in the discipline "Econometrics"

    “Investigation of multicollinearity in econometric models: exclusion of variable(s) from the model”

    Work manager:

    Candidate of Technical Sciences, Associate Professor

    Konovalova

    Executor:

    student of group EK-1315 EPO

    Tver, 2015

    Introduction

    1. Analytical part

    1.1. Generalized signs of multicollinearity in econometric models

    1.2. The main ways to eliminate multicollinearity in econometric models

    2. Design part

    2.1. Information and methodological support of econometric research

    2.2. An example of an econometric study

    Conclusion

    List of used sources

    Introduction

    The relevance of the topic “Investigation of multicollinearity in econometric models: exclusion of variable(s) from the model” is due to the fact that this problem is frequently encountered in applied econometric models.

    The subject of research is the problem of multicollinearity. The object of research is econometric models.

    The main goal of the work is the development of design solutions for information and methodological support of econometric research.

    To achieve the goal, the following main tasks of the study were set and solved:

    1. Generalization of signs of multicollinearity in econometric models.
    2. Identification of the main ways to eliminate multicollinearity.

    3. Development of information and methodological support for econometric research.

    1. Analytical part

    1.1. Generalized signs of multicollinearity in econometric models

    Multicollinearity, in econometrics (regression analysis), is the presence of a linear relationship between the explanatory variables (factors) of a regression model. A distinction is made between full collinearity, which means the presence of a functional (identical) linear dependence, and partial (or simply) multicollinearity, which means the presence of a strong correlation between the factors.

    Full collinearity leads to indeterminacy of the parameters in a linear regression model, regardless of the estimation method. Consider this using the following linear model as an example:

    y = b1x1 + b2x2 + b3x3 + e.

    Let the factors of this model be identically related as follows: x1 = x2 + x3. Now take the original linear model, add an arbitrary number a to the first coefficient and subtract the same number from the other two coefficients. Then we have (without the random error):

    y = (b1 + a)x1 + (b2 − a)x2 + (b3 − a)x3 = b1x1 + b2x2 + b3x3 + a(x1 − x2 − x3) = b1x1 + b2x2 + b3x3.

    Thus, despite a relatively arbitrary change in the coefficients, the same model is obtained. Such a model is fundamentally unidentifiable. The indeterminacy already exists in the model itself. If we consider the 3-dimensional space of coefficients, then in this space the vector of true coefficients is not unique: there is a whole straight line of such vectors, and any point of this line is a true vector of coefficients.
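    A minimal numerical sketch of this unidentifiability (my illustration in Python; the variable names and data are assumptions): when x1 = x2 + x3 holds exactly, the matrix X'X is singular, so the normal equations of the least squares method have no unique solution.

import numpy as np

rng = np.random.default_rng(1)
n = 50
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = x2 + x3                          # strict linear relationship between the factors
X = np.column_stack([x1, x2, x3])

XtX = X.T @ X
print(np.linalg.det(XtX))             # numerically indistinguishable from zero
print(np.linalg.matrix_rank(XtX))     # rank 2 < 3: the coefficients are not identifiable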

    If full collinearity leads to indeterminacy of the parameter values, then partial multicollinearity leads to instability of their estimates. The instability is expressed in an increase in statistical uncertainty, i.e. in the variance of the estimates. This means that specific estimation results can vary greatly from sample to sample, even when the samples are homogeneous.

    As is known, the covariance matrix of the least squares estimates of the multiple regression parameters is V(b̂) = σ²(X'X)⁻¹. Thus, the “smaller” the matrix X'X (its determinant), the “larger” the covariance matrix of the parameter estimates and, in particular, the larger the diagonal elements of this matrix, that is, the variances of the parameter estimates. For greater clarity, consider the two-factor model:

    y = b1x1 + b2x2 + e.

    Then the variance of the parameter estimate for, say, the first factor is:

    V(b̂1) = σ² / ( n (1 − r²) Var(x1) ),

    where r is the sample correlation coefficient between the factors.

    It is clearly seen here that the greater the absolute value of the correlation between the factors, the greater the variance of the parameter estimates. At |r| = 1 (full collinearity) this variance tends to infinity, which corresponds to what was said earlier.
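    The growth of the variance can be illustrated with a tiny sketch (mine, not from the source): the factor 1/(1 − r²) in the formula above is exactly the amount by which the variance of the estimate is inflated relative to the uncorrelated case.

r_values = [0.0, 0.5, 0.8, 0.9, 0.99]
for r in r_values:
    inflation = 1.0 / (1.0 - r ** 2)      # variance inflation relative to r = 0
    print(f"r = {r:0.2f}: variance inflated {inflation:6.2f} times")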

    Thus, the parameter estimates are inaccurate, which means that it will be difficult to interpret the influence of particular factors on the explained variable. At the same time, multicollinearity does not affect the quality of the model as a whole: it can be recognized as statistically significant even when all of its coefficients are insignificant (this is one of the signs of multicollinearity).

    In linear models, correlation coefficients between parameters can be positive or negative. In the first case, an increase in one parameter is accompanied by an increase in another parameter. In the second case, when one parameter increases, the other decreases.

    Proceeding from this, one can distinguish admissible and inadmissible multicollinearity. Inadmissible multicollinearity occurs when there is a significant positive correlation between factors 1 and 2 and, at the same time, the influence of each factor on the function y is unidirectional, that is, an increase in either factor 1 or factor 2 leads to an increase (or decrease) in the function y. In other words, both factors act on the function y in the same way, and a significant positive correlation between them may allow one of them to be excluded.

    Admissible multicollinearity is such that the factors act on the function y differently. Two cases are possible here:

    a) with a significant positive correlation between the factors, the influence of each factor on the function y is multidirectional, i.e. an increase in one factor leads to an increase in the function, while an increase in the other factor leads to a decrease in the function y;

    b) with a significant negative correlation between the factors, an increase in one factor is accompanied by a decrease in the other, which makes the factors act differently, so any sign of the influence of the factors on the function y is possible.

    In practice, several of the most characteristic signs of multicollinearity are distinguished: 1. A small change in the initial data (for example, adding new observations) leads to a significant change in the estimates of the model coefficients. 2. The estimates have large standard errors and low significance, while the model as a whole is significant (a high value of the coefficient of determination R2 and of the corresponding F-statistic). 3. The coefficient estimates have signs that are incorrect from the point of view of theory, or unjustifiably large values.

    Indirect signs of multicollinearity are high standard errors of the parameter estimates, small t-statistics (that is, insignificant coefficients) and incorrect signs of the estimates, even though the model as a whole is recognized as statistically significant (a high value of the F-statistic). Multicollinearity may also be indicated by a strong change in the parameter estimates when sample data are added (or removed), provided the requirement of sufficient sample homogeneity is met.

    To detect multicollinearity of the factors, one can analyze the correlation matrix of the factors directly. The very presence of pair correlation coefficients that are large in absolute value (above 0.7-0.8) indicates possible problems with the quality of the resulting estimates.

    However, analysis of the pairwise correlation coefficients is not sufficient. It is also necessary to analyze the coefficients of determination R²j of the regressions of each factor on the remaining factors. It is recommended to calculate the indicator VIFj = 1/(1 − R²j); too high values of the latter indicate the presence of multicollinearity.
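    A hedged Python sketch of both checks described above (correlation matrix and VIF); the data are synthetic and the use of pandas/statsmodels is my choice, not the source's, which relies on MS Excel.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # deliberately correlated with x1
x3 = rng.normal(size=n)
factors = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(factors.corr().round(2))              # pairwise coefficients above ~0.7-0.8 are a warning sign

X = add_constant(factors)                   # VIF_j = 1 / (1 - R_j^2)
for j, name in enumerate(factors.columns, start=1):   # column 0 is the constant
    print(name, round(variance_inflation_factor(X.values, j), 2))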

    Thus, the main criteria for detecting multicollinearity are as follows: a high R2 combined with insignificant coefficients, high pairwise correlation coefficients, and high values of the VIF coefficient.

    1.2. The main ways to eliminate multicollinearity in econometric models

    Before pointing out the main methods for eliminating multicollinearity, we note that in some cases multicollinearity is not a serious problem that requires significant efforts to identify and eliminate it. Basically, it all depends on the purpose of the study.

    If the main task of the model is to predict future values of the regressand, then with a sufficiently large coefficient of determination R2 (> 0.9) the presence of multicollinearity usually does not affect the predictive qualities of the model. This statement is justified, however, only if the same relations between the correlated regressors are maintained in the future as before. If the goal of the study is to determine the degree of influence of each regressor on the regressand, then the presence of multicollinearity, which leads to an increase in the standard errors, will most likely distort the true relationships between the regressors and the regressand. In this situation, multicollinearity is a serious problem.

    Note that there is no single method for eliminating multicollinearity that is suitable in any case. This is because the causes and consequences of multicollinearity are ambiguous and largely dependent on sampling outcomes.

    In practice, the main methods for eliminating multicollinearity are distinguished:

    1. Exclusion of regressors from the model. The simplest method for eliminating multicollinearity is to exclude one or several correlated regressors from the model. However, some caution is required when applying this method, since specification errors are possible. For example, when studying the demand for a certain good, the price of this good and the prices of its substitutes, which often correlate with each other, can be used as explanatory variables. By excluding the substitutes' prices from the model, we are likely to make a specification error. As a result, biased estimates may be obtained and unfounded conclusions drawn. Thus, in applied econometric models it is desirable not to exclude regressors until their collinearity becomes a serious problem.

    2. Obtaining additional data or a new sample. Since multicollinearity depends directly on the sample, it is possible that with another sample there will be no multicollinearity at all, or it will not be as serious. Sometimes it is enough to increase the sample size to reduce multicollinearity; for example, annual data can be replaced by quarterly data. Increasing the amount of data reduces the variances of the regression coefficients and thereby increases their statistical significance. However, obtaining a new sample or expanding the old one is not always possible, or it involves serious costs. In addition, this approach can strengthen autocorrelation. These problems limit the applicability of this method.

    3. Changing the specification of the model. In some cases the problem of multicollinearity can be solved by changing the specification of the model: either the form of the model is changed, or new regressors are added that were not taken into account in the original model but significantly affect the dependent variable. If this method is justified, its use reduces the sum of squared residuals and thereby the standard error of the regression. This leads to a reduction in the standard errors of the coefficients.

    4. Transformation of variables. In a number of cases the problem of multicollinearity can be minimized or completely eliminated only by transforming the variables. For example, the input data in each observation can be divided by the values of one of the correlated regressors in that observation. Applying the method of principal components to the factors of the model makes it possible to transform the original factors into a set of orthogonal (uncorrelated) factors; in the presence of multicollinearity this allows us to limit ourselves to a small number of principal components (a sketch of this variant is given below). However, a problem of meaningful interpretation of the principal components may arise.
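    The principal-components variant mentioned in item 4 can be sketched as follows (an illustrative Python fragment of mine with synthetic data; the source itself does not prescribe a software implementation).

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # strongly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

Z = PCA().fit_transform(StandardScaler().fit_transform(X))
print(np.round(np.corrcoef(Z, rowvar=False), 3))   # the components are mutually uncorrelated
# A regression on the first one or two components avoids the collinearity,
# although the components may be difficult to interpret economically.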

    If all signs indicate multicollinearity, econometricians hold differing opinions about what to do. When confronted with the problem, it is natural to want to discard the "extra" independent variables that may be causing it. However, it should be remembered that new difficulties may arise in this case. First, it is far from always clear which variables are redundant in this sense.

    Multicollinearity means only an approximate linear relationship between the factors, but this does not always make the "extra" variables obvious. Second, in many situations the removal of some independent variables can significantly affect the substantive meaning of the model. Finally, discarding the so-called essential variables, i.e. independent variables that actually affect the dependent variable under study, leads to bias in the model coefficients. In practice, when multicollinearity is detected, the factor least significant for the analysis is usually removed, and then the calculations are repeated.

    Thus, in practice the main methods for eliminating multicollinearity are: changing or increasing the sample; excluding one of the variables; transforming the multicollinear variables (using non-linear forms, aggregates (linear combinations of several variables), or first differences instead of the variables themselves). If multicollinearity cannot be eliminated, it can be ignored, taking into account the appropriateness of exclusion.

    2. Design part

    2.1. Information and methodological support of econometric research

    Information support of econometric research includes the following information:

    Input information:

    • statistical data on a socio-economic indicator, defined as a dependent variable (factors - results);
    • statistical data on socio-economic indicators, defined as explanatory variables (factors - signs);

    Intermediate information:

    • regression equation model, estimated regression equation, quality indicators and conclusion about the quality of the regression equation, conclusion about the presence (absence) of the problem of multicollinearity, recommendations for using the model;

    Effective information:

    • estimated regression equation, conclusion about the quality of the regression equation, conclusion about the presence (absence) of the problem of multicollinearity, recommendations for using the model.

    The methodology of the econometric research is as follows: specification, parameterization, verification, additional research, forecasting.

    1. The specification of the regression equation model includes a graphical analysis of the correlation dependence of the dependent variable on each explanatory variable. Based on the results of graphical analysis, a conclusion is made about the model of the regression equation of linear or non-linear types. For graphical analysis, it is most often recommended to use the MsExcel Scatter Plot tool. As a result of this stage, the model of the regression equation is determined, and in the case of a non-linear form, the methods of its linearization are also determined.

    2. Parameterization of the regression equation includes the estimation of the regression parameters and their socio-economic interpretation. For parameterization, the "Regression" tool of the MS Excel "Data Analysis" add-in is used. Based on the results of the automated regression analysis (the "Coefficients" column), the regression parameters are determined, and their interpretation is given according to the standard rule:

    Bj represents the amount by which the value of the variable Y changes on average when the independent variable Xj increases by one, all other things being equal.

    The free term of the regression equation is equal to the predicted value of the dependent variable Y in the case when all independent variables are equal to zero.

    3. Verification of the regression equation is carried out on the basis of the results of the automated regression analysis (stage 2) using the following indicators: "R-square", "Significance F", "P-value" (for each regression parameter), as well as the fitted-value and residual plots.

    The significance of the coefficients is determined and the quality of the model is evaluated. For this, the "Significance F", "P-value" and "R-square" are considered. If the "P-value" is less than the statistical significance level, this indicates the significance of the coefficient. If the "R-square" is greater than 0.6, the regression model describes the behavior of the dependent variable Y well through the factor variables.

    If the "Significance F" is less than the statistical significance level, the coefficient of determination (R-square) is recognized as conditionally statistically significant.

    The residual plot makes it possible to assess the variation of the errors. If there are no significant differences between the errors corresponding to different values of Xi, that is, if the error variation is approximately the same for different values of Xi, it can be assumed that there are no problems. The fitted-value plot makes it possible to form judgments about the base, predicted and factor values.

    In conclusion, a judgment is formed about the quality of the regression equation.

    4. Additional research.

    4.1. Detection of the first sign of multicollinearity. Based on the results of the regression analysis obtained in steps 2-3, situations are checked in which the coefficient of determination has a high value (R2 > 0.7) and is statistically significant (Significance F < 0.05), while at least one of the regression coefficients cannot be recognized as statistically significant (P-value > 0.05). When such a situation is detected, a conclusion is made about the assumption of multicollinearity.

    4.2. Detection of the second sign of multicollinearity. Based on the calculation of the correlation coefficients between the factor variables, a significant relationship between individual factors is determined. For the calculations in MS Excel, it is advisable to use the "Data Analysis / Correlation" tool. Conclusions are drawn from the values of the correlation coefficient: the closer r is to the extreme points (±1), the greater the degree of linear relationship; if the correlation coefficient is less than 0.5, the relationship is considered weak. The presence of multicollinearity is assumed if there is a significant correlation coefficient between at least two variables (i.e., greater than 0.7 in absolute value).

    4.3. Detection of the third sign of multicollinearity. Based on the estimation of auxiliary regressions between the factor variables (between the variables for which a significant correlation coefficient was found in Section 4.2), a conclusion about the presence of multicollinearity is made if at least one auxiliary regression is statistically significant. The method of auxiliary regressions is as follows: 1) regression equations are constructed that relate each regressor to all the remaining ones; 2) the coefficient of determination R2 is calculated for each regression equation; 3) if the equation and the coefficient of determination are recognized as statistically significant, then this regressor leads to multicollinearity.
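    A hedged Python sketch of this auxiliary-regression check (synthetic data; the source performs the same steps with the MS Excel "Regression" tool): each factor is regressed on the remaining factors and the significance of the resulting equation is examined.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 60
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # correlated with x1
x3 = rng.normal(size=n)
factors = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

for name in factors.columns:
    y_aux = factors[name]
    X_aux = sm.add_constant(factors.drop(columns=name))
    res = sm.OLS(y_aux, X_aux).fit()       # auxiliary regression of one factor on the others
    print(f"{name}: R2 = {res.rsquared:0.3f}, F p-value = {res.f_pvalue:0.4f}")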

    4.4. Generalization of judgments.

    On the basis of paragraphs 4.1-4.3, a judgment is formed about the presence / absence of multicollinearity and regressors leading to multicollinearity.

    Next, directions for using the model (in case of ignoring or absence of the multicollinearity problem) or recommendations for eliminating multicollinearity (in practice, excluding the variable) are formed.

    When excluding a variable, it is advisable to use the rule:

    The coefficient of determination R²1 is determined for the regression equation initially constructed from n observations;

    Excluding the last k variables from consideration, an equation is formed for the remaining factors from the same n observations, and its coefficient of determination R²2 is determined;

    The F-statistic is calculated:

    F = ( (R²1 − R²2) / k ) / ( (1 − R²1) / (n − m − 1) ),

    where (R²1 − R²2) is the loss of explanatory power of the equation as a result of dropping the k variables, k is the number of additional degrees of freedom that appear, and (1 − R²1)/(n − m − 1) is the unexplained variance of the initial equation;

    The critical value F(α; k; n−m−1) is determined from tables of critical points of the Fisher distribution at a given significance level α and degrees of freedom ν1 = k, ν2 = n − m − 1;

    A judgment on the expediency of the exclusion is formed according to the rule: the (simultaneous) exclusion of the k variables from the equation is considered inappropriate if F > F(α; k; n−m−1); otherwise such an exclusion is permissible (see the sketch after this list).
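    The exclusion rule above can be written as a small Python function (a sketch of mine; scipy is used only to obtain the critical point of the Fisher distribution). Plugging in the rounded R-square values quoted later in Section 2.2 (n = 9, m = 3, k = 1) gives a value of the same order as, but not exactly equal to, the F = 4.62 reported there from unrounded Excel output.

from scipy import stats

def exclusion_f_test(r2_full, r2_reduced, n, m, k, alpha=0.05):
    # m = number of factors in the full model, k = number of excluded variables
    f_value = ((r2_full - r2_reduced) / k) / ((1.0 - r2_full) / (n - m - 1))
    f_crit = stats.f.ppf(1.0 - alpha, k, n - m - 1)
    return f_value, f_crit, f_value <= f_crit   # True means the exclusion is permissible

print(exclusion_f_test(r2_full=0.998, r2_reduced=0.996, n=9, m=3, k=1))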

    When the variable is excluded, the resulting model is analyzed according to Sections 3-4 and compared with the original model; as a result, the "best" one is selected. In practice, since multicollinearity does not affect the predictive performance of the model, this problem can also simply be ignored.

    5. Forecasting is carried out according to the initial / “best” model selected in paragraph 4.4, according to the retrospective forecast scheme, in which the last 1/3 of observations is used for the forecast.

    5.1. Point forecast. The actual values of the factor variables in the forecast period are taken as their predicted values; the forecast values of the resulting variable are calculated from the original/"best" model using the factor variables in the forecast period. Using the Microsoft Excel "Graph" tool, a plot of the actual and predicted values of the resulting variable is built over the observations, and a conclusion is made about the closeness of the actual values to the predicted ones.

    5.2. Interval forecasting involves calculating standard prediction errors (using Salkever's dummy variables) and upper and lower bounds on forecast values.

    Using the Microsoft Excel "Data Analysis/Regression" tool, a regression is built for the aggregate sample data set and the forecast period, but with the addition of dummy variables D 1 , D 2 , ..., D p . In this case, D i = 1 only for the moment of observation (n + i), for all other moments D i =0. Then the coefficient of the dummy variable D i is equal to the prediction error at time (n + i), and the standard error of the coefficient is equal to the standard error of prediction (S i). Thus, an automated regression analysis of the model is carried out, where the aggregate (sample and predictive) values ​​of the factor variables and the values ​​of Salkever's dummy variables are used as the X values, and the aggregate (sample and predictive) values ​​of the resulting variable are used as the Y values.

    The resulting standard errors of the coefficients of Salkever's dummy variables are equal to the standard prediction errors. The boundaries of the interval forecast are then calculated by the formulas: Ymin(n+i) = Yemp(n+i) − Si·tcr, Ymax(n+i) = Yemp(n+i) + Si·tcr, where tcr is the critical value of Student's distribution, determined by the Excel formula "=TINV(0.05; n−m−1)", m is the number of explanatory factors in the model, and Yemp(n+i) are the predicted values of the resulting variable (Section 5.1).

    Using the Microsoft Excel "Graph" tool, a plot is built of the actual and predicted values of the resulting variable and the upper and lower bounds of the forecast over the observations. A conclusion is made as to whether the actual values of the resulting variable fit within the boundaries of the interval forecast.
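    The Salkever procedure for interval forecasting can be sketched in Python as follows (my illustration on synthetic data; the source does all of this in MS Excel). The coefficient of the dummy D_i reproduces the prediction error at time n + i, and its standard error gives S_i, from which the forecast bounds are formed.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n, p = 9, 3                                  # sample size and number of forecast steps
x1 = rng.normal(size=n + p)
x2 = rng.normal(size=n + p)
y = 5 + 2 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n + p)

X = sm.add_constant(np.column_stack([x1, x2]))
D = np.zeros((n + p, p))
for i in range(p):
    D[n + i, i] = 1.0                        # D_i = 1 only at observation n + i

res = sm.OLS(y, np.column_stack([X, D])).fit()
m = 2                                        # number of explanatory factors
t_crit = stats.t.ppf(1 - 0.05 / 2, n - m - 1)

for i in range(p):
    err = res.params[3 + i]                  # prediction error at time n + i
    s_i = res.bse[3 + i]                     # standard prediction error S_i
    y_pred = y[n + i] - err                  # predicted value implied by the fit
    print(f"step {i + 1}: forecast {y_pred:8.3f}, bounds "
          f"[{y_pred - t_crit * s_i:8.3f}; {y_pred + t_crit * s_i:8.3f}]")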

    5.3. The evaluation of the stability of the model using the CHS test is carried out as follows:

    a) using the Microsoft Excel "Data Analysis/Regression" tool, a regression is built, where the X values ​​are taken to be the total (sample and predicted) values ​​of the factor variables, and the Y values ​​are the aggregate (sample and predicted) values ​​of the resulting variable. This regression determines the sum of the squares of the residuals S;

    b) by regression of item 5.2 with dummy Salkever variables, the sum of squared residuals Sd is determined;

    c) the value of the F-statistic is calculated and evaluated by the formula:

    F = ( (S − Sd) / p ) / ( Sd / (n − m − 1) ),

    where p is the number of forecast steps. If the obtained value is greater than the critical value Fcr, determined by the Excel formula "=FINV(0.05; p; n−m−1)", then the hypothesis of model stability in the forecast period is rejected; otherwise it is accepted.

    5.4. Generalization of judgments about the predictive qualities of the model is based on Sections 5.1-5.3; as a result, a conclusion is formed about the predictive quality of the model and recommendations are given for using the model for forecasting.

    Thus, the developed information and methodological support corresponds to the main tasks of the econometric study of the problem of multicollinearity in multiple regression models.

    2.2. An example of an econometric study

    The study is based on data reflecting real macroeconomic indicators of the Russian Federation for the period 2003-2011 (Table 1) and follows the methodology of Section 2.1.

    Table 1. Variables used: [Y] household expenditures (billion rubles); [X1] population (million people); [X2] money supply (billion rubles); [X3] unemployment rate (%).

    1. Specification. The regression equation model includes a graphical analysis of the correlation dependence of the dependent variable Y (household expenditures) on the explanatory variable X1 (population) (Fig. 1), on the explanatory variable X2 (money supply) (Fig. 2), and on the explanatory variable X3 (unemployment rate) (Fig. 3).

    The graph of the correlation between Y and X 1, shown in Figure 1, reflects a significant (R 2 =0.71) inverse linear dependence of Y on X 1 .

    The graph of the correlation between Y and X 2 presented in Figure 2 reflects a significant (R 2 =0.98) direct linear dependence of Y on X 2 .

    The graph of the correlation between Y and X 3 shown in Figure 3 reflects an insignificant (R 2 =0.15) inverse linear dependence of Y on X 3 .

    Figure 1

    Figure 2

    Figure 3

    As a result, one can specify a linear multiple regression model Y=b 0 +b 1 X 1 +b 2 X 2 + b 3 X 3 .

    2. Parameterization of the regression equation is carried out using the "Regression" tool of the MS Excel "Data Analysis" add-in (Fig. 4).

    Figure 4

    The estimated regression equation is:

    Ŷ = 233983.8 − 1605.6X1 + 1.0X2 + 396.22X3.

    The regression coefficients are interpreted as follows: with an increase in the population by 1 million people, household expenditures decrease by 1,605.6 billion rubles; with an increase in the money supply by 1 billion rubles, household expenditures increase by 1.0 billion rubles; with an increase in the unemployment rate by 1%, household expenditures increase by 396.2 billion rubles. At zero values of the factor variables, household expenditures would amount to 233,983.8 billion rubles, which probably has no economic interpretation.

    3. Verification of the regression equation is carried out based on the results of the automated regression analysis (stage 2).

    So, "R-square" is equal to 0.998, i.e. the regression equation describes the behavior of the dependent variable by 99%, which indicates a high level of description of the equation. The "F Significance" is 2.14774253442155E-07, which means that the "R-square" is significant. The "P-Value" for b 0 is 0.002, indicating that this parameter is significant. The "P-Value" for b 1 is 0.002, indicating that this coefficient is significant. The "P-Value" for b 2 is 8.29103190343224E-07, indicating that this coefficient is significant. The "P-Value" for b 3 is 0.084, indicating that this coefficient is not significant.

    Based on the residuals plots, the residuals e are random variables.

    Based on the fitted-value plots, a conclusion is made about the closeness of the actual and predicted values for the model.

    So, the model is of good quality, but b3 is not significant, so we can assume the presence of multicollinearity.

    4.Additional research.

    4.1. Detection of the first sign of multicollinearity. According to the regression analysis (Fig. 5), the first sign of multicollinearity is present: the equation has a high and statistically significant coefficient of determination R2, while one of the coefficients is not significant. This suggests the presence of multicollinearity.

    4.2. Detection of the second sign of multicollinearity.

    Based on the calculation of the correlation coefficients between the factor variables, significant relationships between individual factors are determined (Table 2). The presence of multicollinearity is assumed if there is a significant correlation coefficient between at least two variables (i.e., greater than 0.5 in absolute value).

    Table 2. Correlation matrix of the factor variables X1, X2, X3 (the values are discussed below).

    In our case, the correlation coefficient between X1 and X2 is −0.788, which indicates a strong relationship between the variables X1 and X2; the correlation coefficient between X1 and X3 is 0.54, which also indicates a fairly strong relationship between the variables X1 and X3.

    As a result, we can assume the presence of multicollinearity.

    4.3. Detection of the third sign of multicollinearity.

    Since a strong relationship between the variables X 1 and X 2 was found in paragraph 4.2, the auxiliary regression between these variables is analyzed further (Fig. 5).

    Figure 5

    Since the "Significance of F" is 0.01, which indicates that the "R-square" and the auxiliary regression are significant, it can be assumed that the X 2 regressor leads to multicollinearity.

    Since in paragraph 4.2 a correlation between the variables X 1 and X 3 was found above the average level, the auxiliary regression between these variables is analyzed further (Fig. 6).

    Figure 6

    Since the "Significance of F" is 0.13, which means that the "R-squared" and the auxiliary regression are not significant, it can be assumed that the X 3 regressor does not lead to multicollinearity.

    So, according to the third sign, we can assume the presence of multicollinearity.

    4.4. Generalization of judgments.

    According to the analysis in Sections 4.1-4.3, all three signs of multicollinearity were found, so its presence can be assumed with high probability. At the same time, despite the conclusion of Section 4.3 about the regressor that leads to multicollinearity, we can recommend excluding X3 from the original model, since X3 has the smallest correlation coefficient with Y and the coefficient of this regressor is insignificant in the original equation. The results of the regression analysis after excluding X3 are shown in Fig. 7.

    Figure 7

    At the same time, we calculate the F-statistic to check the expediency of the exclusion:

    F(fact) = 4.62,

    while F(table) = F(0.05; 1; 5) = 6.61; since F(fact) < F(table), the exclusion of the variable X3 is permissible.

    Evaluation of the quality of the linear multiple regression model Y=b 0 +b 1 X 1 +b 2 X 2 . "R-square" is equal to 0.996, i.e. the regression equation describes the behavior of the dependent variable by 99%, which indicates a high level of description of the equation. The "F Significance" is 3.02415218982089E-08, which means that the "R-square" is significant. The "P-Value" for b 0 is 0.004, which indicates that this parameter is significant. The "P-Value" for b 1 is 0.005, which indicates that this coefficient is significant. The "P-Value" for b 2 is 3.87838361673427E-07, indicating that this coefficient is significant. The estimated regression equation is:

    Ŷ = 201511.7 − 1359.6X1 + 1.01X2.

    The regression coefficients are interpreted as follows: with an increase in the population by 1 million people, household expenditures decrease by 1,359.6 billion rubles; with an increase in the money supply by 1 billion rubles, household expenditures increase by 1.01 billion rubles. At zero values of the factor variables, household expenditures would amount to 201,511.7 billion rubles, which possibly has an economic interpretation.

    Thus, the model Ŷ = 201511.7 − 1359.6X1 + 1.01X2 is of good quality and is recommended for forecasting as the "best" one in comparison with the original model.

    5. Forecasting.

    5.1. Point forecast. The actual values of the factor variables in the forecast period are taken as their predicted values; the forecast values of the resulting variable are calculated from the "best" model (Ŷ = 201511.7 − 1359.6X1 + 1.01X2) using the factor variables in the forecast period. Using the Microsoft Excel "Graph" tool, a plot of the actual and predicted values of the resulting variable is built over the observations, and a conclusion is made about the closeness of the actual values to the predicted ones.

    Predictive values ​​of factor variables are presented in Table 3.

    Table 3. Forecast values of the factor variables.

    The predicted values of the resulting variable are determined from the "best" model (Ŷ = 201511.7 − 1359.6X1 + 1.01X2) using the factor variables in the forecast period. The predicted values are presented in Table 4; the actual values are added for comparison.

    Table 4. Predicted values of the resulting variable and the actual values ([Y] empirical) for comparison.

    Figure 8 shows the actual and predicted values ​​of the resulting variable, as well as the lower and upper bounds of the forecast.

    Figure 8

    According to Fig. 8, the forecast maintains an increasing trend, and all forecast values ​​are close to the actual ones.

    5.2. Interval forecast.

    Using the Microsoft Excel "Data Analysis/Regression" tool, a regression is built for the aggregate sample data set and the forecast period, but with the addition of dummy variables D 1 , D 2 , ..., D p . In this case, D i = 1 only for the moment of observation (n + i), for all other moments D i =0. The data are presented in Table 5, the regression result in Fig. 9.

    Table 5. Aggregate (sample and forecast) values of the variables with the Salkever dummy variables D1, D2, D3.

    Figure 9

    Then the standard error of the coefficient of each dummy variable is equal to the standard prediction error (Si): for 2012 it is 738.5; for 2013, 897.1; for 2014, 1139.4.

    The boundaries of the interval forecast are calculated in Table 6.

    Table 6. Calculation of the interval forecast bounds (columns: [Y] empirical, [Y] predicted, [S] prediction error, and the lower and upper bounds).

    According to Table 6, using the Microsoft Excel "Graph" tool, a plot is built of the actual and predicted values of the resulting variable and the upper and lower bounds of the forecast over the observations (Fig. 10).

    Figure 10

    According to the graph, the forecast values fit within the boundaries of the interval forecast, which indicates a good quality of the forecast.

    5.3. The evaluation of the model's stability using the CHS test is carried out as follows:

    a) using the Microsoft Excel "Data Analysis/Regression" tool, a regression is built (Fig. 11), where the X values are the aggregate (sample and forecast) values of the factor variables, and the Y values are the aggregate (sample and forecast) values of the resulting variable. From this regression, the sum of squared residuals S = 2058232.333 is determined.

    Figure 11

    b) from the regression of Section 5.2 with Salkever's dummy variables (Fig. 9), the sum of squared residuals Sd = 1270272.697 is determined;

    c) the value of the F-statistic is calculated and evaluated;

    in this case Fcr = F(0.05; 3; 5) = 5.40, and since the obtained value is less than the critical value Fcr, the hypothesis of model stability in the forecast period is accepted.

    5.4. Generalization of judgments about the predictive qualities of the model is based on Sections 5.1-5.3; as a result, a conclusion is formed about the high predictive quality of the model (Ŷ = 201511.7 − 1359.6X1 + 1.01X2) and recommendations are given for using the model for forecasting.

    The methodology of Section 2.1 has thus been successfully tested: it makes it possible to identify the main signs of multicollinearity and can be recommended for similar studies.

    Conclusion

    Multicollinearity, in econometrics (regression analysis), is the presence of a linear relationship between the explanatory variables (factors) of a regression model. A distinction is made between full collinearity, which means the presence of a functional (identical) linear dependence, and partial (or simply) multicollinearity, which means the presence of a strong correlation between the factors.

    The main consequences of multicollinearity are: large variances of estimates, decrease in t-statistics of coefficients, estimates of coefficients by least squares become unstable, difficulty in determining the contribution of variables, getting the wrong sign for the coefficient.

    The main criteria for detecting multicollinearity are as follows: a high R2 with insignificant coefficients; high pairwise correlation coefficients; high VIF values.

    The main methods for eliminating multicollinearity are: exclusion of the variable(s) from the model; obtaining additional data or a new sample; model specification change; use of preliminary information about some parameters.

    The developed information and methodological support corresponds to the main tasks of the econometric study of the problem of multicollinearity in multiple regression models and can be recommended for such studies.

    List of sources used

    1. Astakhov, S.N. Econometrics: educational and methodical complex. Kazan, 2008. 107 p.
    2. Bardasov, S.A. Econometrics: a textbook. 2nd ed., revised and enlarged. Tyumen: Tyumen State University Press, 2010. 264 p.
    3. Borodkina, L.I. Course of lectures [Electronic resource]. Access mode: http://www.iskunstvo.info/materials/history/2/inf/correl.htm
    4. Voskoboynikov, Yu.E. Econometrics in Excel. Part 1: a textbook. Novosibirsk, 2005. 156 p.
    5. Eliseeva, I.I., Kurysheva, S.V., Gordeenko, N.M., et al. Workshop on econometrics: a textbook for economics universities / ed. I.I. Eliseeva. Moscow: Finance and Statistics, 2001. 191 p.
    6. Multicollinearity [Electronic resource]. Access mode: https://ru.wikipedia.org/wiki/Multicollinearity
    7. Novikov, A.I. Econometrics: a textbook for the programs "Finance and Credit" and "Economics". Moscow: Dashkov i K, 2013. 223 p.
    8. The problem of multicollinearity [Electronic resource]. Access mode: http://crow.academy.ru/econometrics/lectures_/lect_09_/lect_09_4.pdf
    9. Chernyak, V. Applied Econometrics. Lecture No. 9 [Electronic resource]. Access mode: http://www.slideshare.net/vtcherniak/lect-09
    10. ru - encyclopedic site [Electronic resource]. Access mode: http://kodcupon.ru/ra17syplinoe97/Multicollinearity


    Suppose that we are considering a regression equation and the data for its estimation contain observations on objects of different quality: for men and women, or for whites and blacks. The question that may interest us here is the following: is it true that the model under consideration is the same for the two samples corresponding to objects of different quality? This question can be answered using the Chow test.

    Consider the models:

    y_i = β1·x_i1 + β2·x_i2 + … + βk·x_ik + ε_i,  i = 1,…,N (1);

    y_i = γ1·x_i1 + γ2·x_i2 + … + γk·x_ik + ε_i,  i = N+1,…,N+M (2).

    In the first sample there are N observations, in the second M observations. Example: Y is the salary, and the explanatory variables are age, length of service, and level of education. Does it follow from the available data that the dependence of wages on the explanatory variables on the right-hand side is the same for men and women?

    To test this hypothesis, one can use the general scheme for testing hypotheses by comparing a restricted regression with an unrestricted regression. The unrestricted regression here is the union of regressions (1) and (2), i.e. ESS_UR = ESS1 + ESS2, with N + M − 2k degrees of freedom. The restricted regression (that is, the regression under the assumption that the null hypothesis holds) is the regression for the entire available set of observations:

    y_i = δ1·x_i1 + δ2·x_i2 + … + δk·x_ik + ε_i,  i = 1,…,N+M (3).

    Estimating (3), we obtain ESS_R. To test the null hypothesis, we use the following statistic:

    F = ( (ESS_R − ESS_UR) / k ) / ( ESS_UR / (N + M − 2k) ),

    which, if the null hypothesis is valid, has the Fisher distribution with k degrees of freedom in the numerator and N + M − 2k in the denominator.

    If the null hypothesis is true, we can combine the available samples into one and estimate the model on N + M observations. If we reject the null hypothesis, then we cannot merge the two samples into one, and we will have to estimate the two models separately.
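    A minimal Python sketch of the Chow test just described (synthetic two-group data, chosen so that the groups genuinely differ; the sample sizes and helper function are assumptions of this illustration).

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
N, M, k = 40, 35, 2                          # group sizes; k parameters (constant and slope)
x_a, x_b = rng.normal(size=N), rng.normal(size=M)
y_a = 1.0 + 2.0 * x_a + rng.normal(size=N)
y_b = 3.0 + 0.5 * x_b + rng.normal(size=M)   # different coefficients in the second group

def ess(y, x):
    return sm.OLS(y, sm.add_constant(x)).fit().ssr    # sum of squared residuals

ess_ur = ess(y_a, x_a) + ess(y_b, x_b)                # unrestricted: two separate regressions
ess_r = ess(np.concatenate([y_a, y_b]), np.concatenate([x_a, x_b]))   # restricted: pooled

f_value = ((ess_r - ess_ur) / k) / (ess_ur / (N + M - 2 * k))
p_value = stats.f.sf(f_value, k, N + M - 2 * k)
print(f"F = {f_value:.2f}, p-value = {p_value:.4f}")   # a small p-value argues against pooling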


    The study of the general linear model considered earlier relies essentially, as we have seen, on the apparatus of mathematical statistics. However, as in all applications of mathematical statistics, the strength of a method depends on the assumptions that underlie it and that are necessary for its application. For a while, we will consider situations where one or more of the hypotheses underlying the linear model are violated, and we will consider alternative estimation methods for these cases. We will see that the role of some hypotheses is more significant than that of others. We need to understand what consequences violations of particular conditions (assumptions) can lead to, be able to check whether they are satisfied, and know what statistical methods can and should be applied when the classical least squares method does not fit.

    1. The relationship between the variables is linear and is expressed by the equation y = β0 + β1X1 + … + βkXk + ε — violations: model specification errors (omission of significant explanatory variables, inclusion of unnecessary variables, incorrect choice of the form of dependence between the variables);

    2. X1,…,Xk are deterministic, linearly independent variables — violations: stochastic regressors, complete multicollinearity;

    4. Var(εi) = σ² for all i — violation: heteroscedasticity;

    5. cov(εi, εk) = 0 at i ≠ k — violation: error autocorrelation.

    Before starting the conversation, let's consider the following concepts: pairwise correlation coefficient and partial correlation coefficient.

    Suppose we are examining the effect of one variable on another (Y and X). In order to understand how these variables are related to each other, we calculate the pairwise correlation coefficient using the following formula:

    r_XY = Σ (x_i − x̄)(y_i − ȳ) / sqrt( Σ (x_i − x̄)² · Σ (y_i − ȳ)² ).

    If we obtain a value of the correlation coefficient close to 1, we conclude that the variables are quite strongly related to each other.

    However, even if the correlation coefficient between the two variables under study is close to 1, they may not actually be dependent on each other; the example of the mentally ill and radio receivers is an example of so-called "spurious correlation". A high value of the correlation coefficient may also be due to the existence of a third variable that strongly influences the first two, which is the reason for their high correlation. Therefore, the problem arises of calculating the "pure" correlation between the variables X and Y, i.e. a correlation from which the (linear) influence of other variables is excluded. For this, the concept of the partial correlation coefficient is introduced.

    So, we want to determine the coefficient of partial correlation between the variables X and Y, excluding the linear influence of the variable Z. The following procedure is used to determine it:

    1. We estimate the regression of X on Z,

    2. We obtain the residuals e_X,

    3. We estimate the regression of Y on Z,

    4. We obtain the residuals e_Y,

    5. The sample partial correlation coefficient is the ordinary correlation coefficient between the residuals, r_XY.Z = r(e_X, e_Y); it measures the degree of relationship between the variables X and Y cleared of the (linear) influence of the variable Z.

    Direct calculation: r_XY.Z = ( r_XY − r_XZ · r_YZ ) / sqrt( (1 − r_XZ²)(1 − r_YZ²) ).

    Property:

    The procedure for constructing the partial correlation coefficient is generalized to the case when we want to get rid of the influence of two or more variables.
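    Both the residual-based procedure and the direct formula for the partial correlation coefficient can be checked with a short Python sketch (my illustration; the data are synthetic and constructed so that X and Y are related only through Z).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(scale=0.6, size=n)
y = 0.8 * z + rng.normal(scale=0.6, size=n)

e_x = sm.OLS(x, sm.add_constant(z)).fit().resid        # steps 1-2: residuals of X on Z
e_y = sm.OLS(y, sm.add_constant(z)).fit().resid        # steps 3-4: residuals of Y on Z
r_partial = np.corrcoef(e_x, e_y)[0, 1]                # step 5

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_direct = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(round(r_xy, 3), round(r_partial, 3), round(r_direct, 3))   # high pairwise r, near-zero partial r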


    1. Perfect multicollinearity.

    One of the Gauss-Markov requirements tells us that the explanatory variables must not be related by any exact relationship. If such a relationship between the variables exists, we say that the model exhibits perfect multicollinearity. Example: consider a model of the average exam score with three explanatory variables: I is parental income, D is the average number of hours spent studying per day, and W is the average number of hours spent studying per week. Obviously, W = 7D, and this relationship holds for every student in our sample. The case of complete multicollinearity is easy to detect, since in this case it is impossible to construct estimates by the least squares method.

    2. Partial multicollinearity or simply multicollinearity.

    A much more common situation is when there is no exact linear relationship between the explanatory variables, but there is a close correlation between them. This case is called real or partial multicollinearity (or simply multicollinearity): the existence of close statistical relationships between the variables. It must be said that multicollinearity is rather a matter of the degree of the phenomenon than of its kind. Any regression estimate will suffer from it to some extent unless all the explanatory variables are completely uncorrelated. Consideration of this problem begins only when it starts to seriously affect the results of the regression estimation (the presence of statistical relationships between the regressors does not necessarily produce unsatisfactory estimates). So, multicollinearity is a problem when a strong correlation between the regressors leads to unreliable regression estimates.

    Consequences of multicollinearity:

    Formally, since the matrix X'X is non-singular, we can construct OLS estimates of the regression coefficients. However, let us recall how the theoretical variances of the estimates of the regression coefficients are expressed: Var(b̂_i) = σ²·a_ii, where a_ii is the i-th diagonal element of the matrix (X'X)⁻¹. Since the matrix X'X is close to singular and det(X'X) ≈ 0, then:

    1) there are very large numbers on the main diagonal of the inverse matrix, since the elements of the inverse matrix are inversely proportional to det(X'X). Therefore, the theoretical variance of the i-th coefficient is quite large, and its estimate is also large; hence the t-statistics are small, which can lead to statistical insignificance of the i-th coefficient. That is, the variable may have a significant effect on the explained variable, yet we conclude that it is insignificant.

    2) Since the estimates of the coefficients and of their variances depend on (X'X)⁻¹, whose elements are inversely proportional to det(X'X), adding or removing just one or two observations (that is, adding or removing one or two rows of the matrix X) may change the values of the estimates significantly, up to a change of sign; the estimation results are unstable.

    3) Difficulty in interpreting the regression equation. Suppose the equation contains two variables that are related to each other, X1 and X2. The regression coefficient of X1 is interpreted as the measure of the change in Y due to a change in X1, ceteris paribus, i.e. with the values of all other variables kept the same. However, since the variables X1 and X2 are related, changes in the variable X1 will entail predictable changes in the variable X2, and the value of X2 will not remain the same.

    Example: price = b0 + b1·X1 + b2·X2, where X1 is the total area and X2 is the living area. We say: "If the living area is increased by 1 sq. m, then, other things being equal, the price of the apartment will increase by b2 dollars." However, in this case the total area will also increase by 1 sq. m, and the price increase will be b1 + b2. It is no longer possible to separate the influence of each variable on Y. The way out in this situation with the apartment price is to include in the model not the total area but the so-called "additional" area (the difference between the total and the living area).

    Signs of multicollinearity.

    There are no exact criteria for determining the presence (absence) of multicollinearity. However, there are heuristic recommendations for its identification:

    1) Analyze the matrix of paired correlation coefficients between the regressors and if the value of the correlation coefficient is close to 1, then this is considered a sign of multicollinearity.

    2) Analysis of the correlation matrix is ​​only a superficial judgment about the presence (absence) of multicollinearity. A more careful study of this issue is achieved by calculating the partial correlation coefficients or calculating the coefficients of determination of each of the explanatory variables for all other explanatory variables in the regression.

    4) X'X is a symmetric non-negative definite matrix, therefore all its eigenvalues are non-negative. If the determinant of the matrix X'X is equal to zero, then its minimum eigenvalue is also zero, and by continuity it remains close to zero when the determinant is close to zero. Therefore, the proximity of the determinant of X'X to zero can also be judged by the value of the minimal eigenvalue. In addition to this property, the minimum eigenvalue is important because the standard errors of the coefficients grow as it approaches zero.

    5) The presence of multicollinearity can be judged by external signs that are consequences of multicollinearity:

    a) some of the estimates have signs that are incorrect from the point of view of economic theory, or unjustifiably large values;

    b) a small change in the initial economic data leads to a significant change in the estimates of the model coefficients;

    c) most of the t-statistics of the coefficients do not differ significantly from zero, while at the same time the model as a whole is significant, as evidenced by the high value of the F-statistic.

    How to get rid of multicollinearity, how to eliminate it:

    1) Use of factor analysis: a transition from the original set of regressors, among which there are statistically dependent ones, to new regressors Z1,…,Zm using the method of principal components. Instead of the original variables, we consider some of their linear combinations, the correlation between which is small or absent altogether. The task here is to give a meaningful interpretation to the new variables Z; if this fails, we return to the original variables using the inverse transformations. The estimates obtained will, however, be biased, but they will have a smaller variance.

    2) Among all the available variables, select the factors that most significantly affect the variable being explained. The selection procedures will be discussed below.

    3) Transition to biased estimation methods.

    When faced with the problem of multicollinearity, an inexperienced researcher's first wish is simply to exclude the extra regressors that may be causing it. However, it is not always clear which variables are redundant in this sense. In addition, as will be shown below, discarding the so-called significantly influencing variables leads to bias in the OLS estimates.