Multiple Regression Diagnostics

Multiple regression is probably the multivariate model that has benefited the most from systematic examinations and applications of data cleaning procedures -- and for good reason, since it is probably the most-used of all the models.
Influential Case Analysis

SPSS provides several diagnostic statistics that allow the case-by-case evaluation of the data for possible influential cases. We'll need some vocabulary…

• Leverage -- assesses outliers among the predictors. All the leverage stats are some variation on Mahalanobis distance (in its simplest form, √ Σ(x - µ)², where x is each predictor, in turn, and µ is its mean; the full statistic also scales by the variances and covariances of the predictors). Larger scores mean the case is farther from the multivariate centroid of the sample. Cases with high leverage are "far out" -- but they might just be far along the regression line, so leverage isn't a sufficient criterion for exclusion.

• Discrepancy -- assesses the extent to which a case is in line with the others (very similar to what we called "truly bivariate outliers" earlier).

• Influence -- is the product of leverage and discrepancy -- and is the best single index of whether a case ought to be "hucked". Different combinations of leverage and discrepancy produce different influences …
   • high leverage and low discrepancy → overestimates of R², underestimates of the standard errors of regression weights, and increased Type I errors when testing H0: R² = 0 and H0: b = 0
   • low leverage and high discrepancy → underestimates of R², overestimates of the standard errors of regression weights, and increased Type II errors when testing H0: R² = 0 and H0: b = 0
   • high leverage and high discrepancy → "pivoting" the regression line, underestimates of R², underestimates of regression weights, overestimates of their standard errors, and increased Type II errors when testing H0: R² = 0 and H0: b = 0
Figure panels: (1) high leverage, low discrepancy, moderate influence; (2) low leverage, high discrepancy, moderate influence; (3) high leverage, high discrepancy, high influence.
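To make the leverage and discrepancy vocabulary concrete, here is a minimal Python sketch, not SPSS output. It assumes NumPy and uses made-up data; the function name `leverage_and_discrepancy` and the two planted cases are illustrative only. Leverage is computed as the hat value h (which equals 1/n plus the squared Mahalanobis distance divided by n - 1), and discrepancy as the studentized deleted residual.

```python
import numpy as np

def leverage_and_discrepancy(X, y):
    """Return hat-value leverage and studentized deleted residuals, one per case."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])          # add the constant
    p = Xd.shape[1]                                 # number of estimated weights
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    # Leverage: h_i = x_i' (X'X)^-1 x_i, the diagonal of the hat matrix
    h = np.einsum("ij,jk,ik->i", Xd, XtX_inv, Xd)
    b = XtX_inv @ Xd.T @ y
    e = y - Xd @ b                                  # ordinary residuals
    sse = e @ e
    # Error variance re-estimated with case i left out
    s2_deleted = (sse - e**2 / (1 - h)) / (n - p - 1)
    # Discrepancy: residual scaled by its leave-one-out standard error
    t = e / np.sqrt(s2_deleted * (1 - h))
    return h, t

# Hypothetical data: 20 ordinary cases plus two planted ones
rng = np.random.default_rng(0)
x = np.arange(1, 21, dtype=float)
y = 2 + 0.5 * x + rng.normal(0, 1, size=20)
x = np.append(x, [60.0, 10.5])                      # case 20: far out on x
y = np.append(y, [2 + 0.5 * 60.0, 2 + 0.5 * 10.5 + 10])  # case 21: far off the line

h, t = leverage_and_discrepancy(x, y)
print("highest leverage case:", int(np.argmax(h)))
print("highest discrepancy case:", int(np.argmax(np.abs(t))))
```

Case 20 sits far along the regression line (high leverage, little discrepancy), while case 21 sits in the middle of the predictor range but well off the line (low leverage, high discrepancy).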
SPSS includes influence statistics that have a long history -- Cook's Distance, DfBeta and DfFit. When selected from the "Save" menu, these produce values for each case. For each of these, the usual "cutoff" is 1.0 -- cases with values larger than 1.0 are "suspected of being outliers". I found that same phrase in 5 different books and articles! It is all well and good for authors to tell us about suspicions, but we need to make and defend decisions. Usually there are few cases that have large values, and unless we have a really skimpy sample size, tossing them will be the best thing (more below).
Analyze → Regression → Linear → Click the "Save" button

• coo_1 → Cook's Distance -- you get one for each regression model (_1, _2, etc.)
• sdf_1 → Standardized DfFit -- you get one for each regression model (_1, _2, etc.)
• sdb0_1 → Standardized DfBeta -- you get one for each predictor, for each model
   • 0_1 is for the constant from the first model
   • 1_1 is for the first predictor in the first model
   • 1_2 is for the first predictor in the second model
Use Cook's and DfFit to make a "keep - dump" decision for each case. Use the DfBetas to identify specific predictors that might be leading to this being an influential case. For example, sometimes they identify a variable for a specific case that, when Winsorized, reduces the influence of that case and allows you to keep it in the analysis. Important point: What model to consider for influential cases? The "old advice" was to focus on the full model. However, a case might be influential for a reduced model without being influential for the full model (it is easier to hide amongst a larger, more collinear predictor set).
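The three influence statistics can be sketched by literally refitting the model with each case left out. This is a minimal illustration assuming NumPy and made-up data, not a reproduction of SPSS's internal computations; the function name `influence_stats` and the planted case are hypothetical. The 1.0 cutoff is the one discussed above.

```python
import numpy as np

def influence_stats(X, y):
    """Cook's distance, standardized DfFit and standardized DfBeta via leave-one-out refits."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])           # column 0 is the constant
    p = Xd.shape[1]
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    h = np.einsum("ij,jk,ik->i", Xd, XtX_inv, Xd)   # leverage (hat values)
    b = XtX_inv @ Xd.T @ y
    fitted = Xd @ b
    mse = np.sum((y - fitted) ** 2) / (n - p)

    cooks = np.empty(n)
    dffits = np.empty(n)
    dfbetas = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
        resid_i = y[keep] - Xd[keep] @ b_i
        s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))   # error SD without case i
        shift = fitted - Xd @ b_i                        # change in every fitted value
        cooks[i] = shift @ shift / (p * mse)
        dffits[i] = shift[i] / (s_i * np.sqrt(h[i]))
        dfbetas[i] = (b - b_i) / (s_i * np.sqrt(np.diag(XtX_inv)))
    return cooks, dffits, dfbetas

# Hypothetical data: case 20 has both high leverage and high discrepancy
rng = np.random.default_rng(0)
x = np.append(np.arange(1, 21, dtype=float), 60.0)
y = np.append(2 + 0.5 * np.arange(1, 21) + rng.normal(0, 1, size=20), 2.0)

cooks, dffits, dfbetas = influence_stats(x, y)
flagged = np.flatnonzero(cooks > 1.0)               # "keep - dump" candidates
print("cases with Cook's D > 1.0:", flagged)
print("their standardized DfBetas (constant, x):", dfbetas[flagged])
```

The refitting loop is slow for large samples but mirrors the definitions directly: each statistic asks how much the fitted values or the weights change when that one case is dropped.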
Collinearity Analysis

Why does collinearity cause "problems"? The higher the collinearity, the greater the discrepancy between bivariate and multivariate contributions of variables. This is "reality" because predictors are correlated with each other, and so combinations of predictors will bring that collinearity with them. However, when we start piling up the predictors, then that very real collinearity can produce apparently uninteresting and possibly confusing results (remember the crocodiles!). The best way to handle this very real and representative kind of collinearity is to do what you already know is important -- compare the results from bivariate correlations and different nested and non-nested models to get a complete story about how specific predictors relate to the criterion.

Another issue is when the collinearity is sufficient to perturb the mathematics of regression analysis. In order to compute the multiple regression weights, we have to invert the correlation matrix (X⁻¹, where X * X⁻¹ = I). If there is sufficient collinearity, the computation of this inversion will be perturbed, and the resulting regression weights will be wrong. Perhaps the clearest indication that something like this has gone wrong is if any of the predictors have a standardized regression weight (β) that is > 1.0 or < -1.0. When this happens one or more variables will have to be deleted or combined to reduce the collinearity.

The most common summary statistic for evaluating collinearity is tolerance. The tolerance value for a particular predictor in a particular model is 1 - R², where the R² is obtained using that predictor as a criterion and all others as predictors. SPSS automatically does a tolerance analysis and won't enter into the regression model any variable with tolerance < .001 -- that's a variable that shares more than 99.9% of its variance with the rest of the predictor set.
While this is a common "cutoff", lots of texts and articles also suggest taking a look at what happens when you delete from the model variables that have "relatively small" tolerances.

Analyze → Regression → Linear → Click the "Statistics" button
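The tolerance definition (1 - R², with each predictor taking a turn as the criterion) can be sketched directly. This is a minimal example assuming NumPy and simulated predictors; the function name `tolerance` and the variables x1 to x4 are made up for illustration.

```python
import numpy as np

def tolerance(X):
    """Tolerance of each column of X: 1 - R² from regressing it on all the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ b
        centered = target - target.mean()
        tol[j] = (resid @ resid) / (centered @ centered)   # residual share = 1 - R²
    return tol

# Hypothetical predictors: x3 is nearly the sum of x1 and x2, x4 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
x4 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3, x4])

tol = tolerance(X)
print(np.round(tol, 3))
```

Note that x1 and x2 get small tolerances too, not just x3: once x3 is in the set, each of the three is almost perfectly predictable from the other two. The same values can be read off as the reciprocals of the diagonal of the inverted predictor correlation matrix, which ties tolerance back to the matrix inversion described above.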
Here's an example of an interesting result from a common mistake. We're trying to predict the number of friends someone reports having from self-reports of the frequency with which they engage in five behaviors.

Coefficients(a)
[SPSS table, only partly recoverable. Rows: (Constant); tell jokes; get others to do things my way; stick up for others; forget to return items; make jokes when others clumsy. Column headings that survive: t; Sig.; Correlations (Zero-order); Collinearity Statistics (Tolerance). Values that survive, in source order: -.732, -.424, -.571 and 9.908E-03 (next to the last four predictors, column unclear); t = 5.447, 2.506; Sig. = .000, .012; .261 (under Zero-order / Tolerance, row unclear).]
a. Dependent Variable: how many friends sub listed
Notice that the scale is composed of the same variables that were the predictors in the first regression model.

The common explanation is that, if we include the scale along with the items as predictors, the scale will be "dumped" because it is perfectly predictable from the items. But that isn't always what happens -- see below!
Coefficients(a)
[SPSS table, only partly recoverable. Rows: (Constant); get others to do things my way; stick up for others; forget to return items; make jokes when others clumsy; SCALE. Column headings: Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); Collinearity Statistics (Tolerance). Values that survive: 18.529 and 3.402 under B / Std. Error.]
a. Dependent Variable: how many friends sub listed

Excluded Variables(b)
[Table body not recoverable.]
a. Predictors in the Model: (Constant), SCALE, stick up for others, forget to return items, get others to do things my way, make jokes when others clumsy
b. Dependent Variable: how many friends sub listed
What got dumped because of especially low tolerance (.000 which means R² = 1.00) was not the scale, but one of the items -- "tell jokes" -- the only one that contributed in the item model. With scale and the other 4 items in the model, some interesting things happen (notice -- we'd not know them to be interesting if we'd not run the item model). Specifically, three of the predictors look to be suppressor variables -- we could work really hard to tell an interesting story about these "suppressors", but it is really just a poor set of estimates produced by the collinearity of including 4 of the 5 items composing "scale".
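The arithmetic behind this result can be illustrated with simulated item scores. This is a sketch under stated assumptions, not the handout's data: it assumes NumPy, five made-up 1-to-5 ratings, and a scale formed as their simple sum; `r_squared` is a hypothetical helper.

```python
import numpy as np

def r_squared(target, others):
    """R² from regressing target on the columns of others (constant included)."""
    n = len(target)
    Xd = np.column_stack([np.ones(n), others])
    b, *_ = np.linalg.lstsq(Xd, target, rcond=None)
    resid = target - Xd @ b
    centered = target - target.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

# Made-up data: five items rated 1-5, and a scale that is their sum
rng = np.random.default_rng(1)
items = rng.integers(1, 6, size=(150, 5)).astype(float)
scale = items.sum(axis=1)

# Constant + 5 items + scale is 7 columns, but only 6 are independent
design = np.column_stack([np.ones(150), items, scale])
rank = np.linalg.matrix_rank(design)

# With all five items present, the scale has essentially zero tolerance ...
tol_scale_all_items = 1.0 - r_squared(scale, items)
# ... but so does any single item once the scale and the other four items are in
tol_item1_with_scale = 1.0 - r_squared(items[:, 0], np.column_stack([items[:, 1:], scale]))
# Drop one item and the scale's tolerance is no longer zero, though still low
tol_scale_four_items = 1.0 - r_squared(scale, items[:, 1:])

print("rank of design matrix:", rank)
print("tolerance of scale given 5 items:", tol_scale_all_items)
print("tolerance of item 1 given scale + 4 items:", tol_item1_with_scale)
print("tolerance of scale given 4 items:", tol_scale_four_items)
```

The redundancy is symmetric: the scale is perfectly predictable from the five items, and each item is perfectly predictable from the scale plus the other four. Something has to go, but nothing in the math says it must be the scale, and whatever stays behind is estimated from a highly collinear set.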