Data Screening Part 2

Erin M. Buchanan

Last Updated: 2026-01-08

Data Screening: Next Steps

In the last lecture, we discussed:
- Accuracy
- Missing data
- Outliers
What else should we consider for checking our data?

Data Screening: Assumptions

For parametric statistics, we should think about:
- Independence
- Additivity
- Linearity
- Normality
- Homogeneity (Sphericity), Homoscedasticity

Data Screening: Assumptions

The procedure:
- We will use a ‘fake’ regression to help us analyze these assumptions.
- It’s ‘fake’ because it’s not part of a real analysis, but just a way to calculate the numbers we need for assessment.
- When we run regression as a statistical analysis, we can use the real regression in the same way as described below.

Assumptions: Independence

Definition: the errors in the model should not be related to each other.
In plain English: Imagine you’re measuring test scores from different students. Independence means that one student’s score shouldn’t influence another student’s score. If students copied from each other, their scores would be related (not independent), which would violate this assumption [1].
If you do not have independence, confidence intervals and significance tests will be invalid [1].
Independence is often a matter of the research design. For example, if you measure the same people multiple times, their responses are naturally related to each other [1].
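
To see what a violation looks like, here is a minimal sketch using simulated data (not the lecture dataset). The names person_effect, time1, and time2 are made up for illustration. Each simulated person is measured twice, and the model ignores that fact, so the residuals for the same person end up related.

```r
set.seed(5)
n_people <- 50

# Stable individual differences that carry over to both measurements
person_effect <- rnorm(n_people, sd = 2)
time1 <- 10 + person_effect + rnorm(n_people)
time2 <- 10 + person_effect + rnorm(n_people)

# Stack the two measurements as if they were 100 unrelated people
long <- data.frame(
  score = c(time1, time2),
  time  = factor(rep(c("t1", "t2"), each = n_people))
)

# Residuals from a model that ignores who each score belongs to
res <- resid(lm(score ~ time, data = long))

# Correlation between the same people's residuals at the two time points
r_within <- cor(res[long$time == "t1"], res[long$time == "t2"])
r_within
```

If the errors were independent, this correlation would be near zero. Here it is strongly positive because the same people contribute to both rows.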

Assumptions: Additivity

If you have several variables then their combined effect is best described by adding their effects together.
In plain English: Think of additivity like ingredients in a recipe. If you’re predicting cake quality from flour and sugar, each ingredient should contribute its own unique effect. If flour and sugar were essentially measuring the same thing (like two different brands of the same ingredient), you’d be counting the same effect twice [2].
If two variables are not additive, this implies that the variables are too related, which reduces power.
- Why use the same variable twice?
- The “good” thing is called additivity, the violation or “bad” thing is called multicollinearity or singularity.
- Multicollinearity = when predictors correlate too highly (r > .90), making it difficult to determine which variable is actually important [2]
- Singularity = an extreme form of multicollinearity (r > .95)
This check is necessary when you have multiple continuous variables. If you only have one continuous variable, there is nothing to correlate it with, so you cannot run this check.

Assumptions: Additivity

You can check this assumption by checking the correlation between the variables.
How to check: Look at the correlation matrix. If correlations between predictors are above 0.80 or 0.90, you may have multicollinearity problems [2].
What to do if violated: If correlations are too high, you can either combine the problematic variables into one measure or just pick one of them to use in your analysis [2].
Note: we are starting with the same analysis as the last lecture, at the point we ended with.


str(noout)

## 'data.frame':    118 obs. of  20 variables:
##  $ Sex     : Factor w/ 2 levels "Women","Men": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age     : int  17 16 15 16 15 14 13 15 16 17 ...
##  $ Grade   : int  11 7 6 11 7 6 4 6 8 11 ...
##  $ SES     : Factor w/ 3 levels "Low","Medium",..: 2 3 3 2 3 3 1 2 2 3 ...
##  $ Absences: int  2 2 2 6 2 2 2 2 1 6 ...
##  $ RS1     : int  6 7 5 7 7 2 6 4 3 4 ...
##  $ RS2     : int  4 1 5 4 7 3 6 6 5 3 ...
##  $ RS3     : int  2 1 5 4 7 2 7 6 3 4 ...
##  $ RS4     : int  2 5 7 7 7 3 6 6 2 4 ...
##  $ RS5     : int  4 7 5 4 7 2 1 6 1 4 ...
##  $ RS6     : int  7 7 6 4 7 3 4 6 7 4 ...
##  $ RS7     : int  7 7 5 7 7 3 6 6 4 4 ...
##  $ RS8     : int  4 7 7 7 7 3 6 6 3 7 ...
##  $ RS9     : int  5 4 6 4 7 2 2 6 3 4 ...
##  $ RS10    : int  7 7 7 7 7 2 5 6 3 4 ...
##  $ RS11    : int  4 1 6 7 7 3 6 6 3 6 ...
##  $ RS12    : int  7 7 6 4 7 3 6 6 2 5 ...
##  $ RS13    : int  4 4 6 7 7 2 6 6 5 6 ...
##  $ RS14    : int  7 7 6 4 7 3 2 6 3 4 ...
##  $ Health  : int  6 6 2 6 4 6 1 2 3 1 ...

cor(noout[ , -c(1,4)])

##                  Age       Grade     Absences         RS1         RS2
## Age       1.00000000  0.49129879  0.259278455  0.05557066 -0.02500303
## Grade     0.49129879  1.00000000 -0.106825686  0.13576618  0.04492324
## Absences  0.25927845 -0.10682569  1.000000000  0.08062971  0.12545466
## RS1       0.05557066  0.13576618  0.080629712  1.00000000  0.30387778
## RS2      -0.02500303  0.04492324  0.125454661  0.30387778  1.00000000
## RS3       0.12731787  0.11110507  0.214147935  0.39435773  0.60392742
## RS4      -0.09530848 -0.01762404  0.095790980  0.33308664  0.41917969
## RS5       0.16312777  0.10613350  0.089601326  0.31633125  0.32639523
## RS6       0.15873088  0.21894625  0.006016586  0.28979227  0.51680621
## RS7       0.24685001  0.24172671  0.166350702  0.49315246  0.57762184
## RS8       0.16732512  0.19607386  0.083607390  0.30270485  0.41999692
## RS9       0.02368352  0.01918898 -0.110881821  0.32269858  0.48383577
## RS10      0.05703377  0.19189629  0.122778324  0.42665728  0.62064605
## RS11      0.03964118  0.08136441  0.191811833  0.32656293  0.59752445
## RS12      0.18399663  0.07424952  0.004201959  0.32427895  0.27635326
## RS13      0.10128837  0.10547175  0.140220973  0.30619685  0.62718380
## RS14      0.18292725  0.17612502 -0.081311740  0.22975797  0.33264694
## Health   -0.16644080 -0.12436242  0.023720562 -0.01407696 -0.13978793
##                 RS3         RS4          RS5          RS6        RS7
## Age       0.1273179 -0.09530848 0.1631277685  0.158730882  0.2468500
## Grade     0.1111051 -0.01762404 0.1061335021  0.218946250  0.2417267
## Absences  0.2141479  0.09579098 0.0896013265  0.006016586  0.1663507
## RS1       0.3943577  0.33308664 0.3163312549  0.289792266  0.4931525
## RS2       0.6039274  0.41917969 0.3263952295  0.516806207  0.5776218
## RS3       1.0000000  0.53124323 0.4308641895  0.428769218  0.4680652
## RS4       0.5312432  1.00000000 0.3187108026  0.219281403  0.3019374
## RS5       0.4308642  0.31871080 1.0000000000  0.445824875  0.3880713
## RS6       0.4287692  0.21928140 0.4458248751  1.000000000  0.6038055
## RS7       0.4680652  0.30193741 0.3880712508  0.603805512  1.0000000
## RS8       0.4476441  0.21250714 0.4131928044  0.437741777  0.4925186
## RS9       0.3245957  0.34644102 0.4529777465  0.567339117  0.4441627
## RS10      0.4327729  0.45494068 0.4445222275  0.504325912  0.6034629
## RS11      0.5457643  0.39694108 0.5084182361  0.581456379  0.5317407
## RS12      0.3255026  0.10577219 0.5219541822  0.440801532  0.4183208
## RS13      0.6120622  0.38919009 0.4178701180  0.623020688  0.5549517
## RS14      0.3830965  0.19221471 0.5284791103  0.535873491  0.4631298
## Health   -0.1821844 -0.03851484 0.0009663785 -0.066956532 -0.1091089
##                  RS8         RS9        RS10        RS11         RS12
## Age      0.167325123  0.02368352  0.05703377  0.03964118  0.183996626
## Grade    0.196073861  0.01918898  0.19189629  0.08136441  0.074249525
## Absences 0.083607390 -0.11088182  0.12277832  0.19181183  0.004201959
## RS1      0.302704849  0.32269858  0.42665728  0.32656293  0.324278946
## RS2      0.419996923  0.48383577  0.62064605  0.59752445  0.276353262
## RS3      0.447644056  0.32459572  0.43277287  0.54576426  0.325502575
## RS4      0.212507139  0.34644102  0.45494068  0.39694108  0.105772190
## RS5      0.413192804  0.45297775  0.44452223  0.50841824  0.521954182
## RS6      0.437741777  0.56733912  0.50432591  0.58145638  0.440801532
## RS7      0.492518639  0.44416274  0.60346287  0.53174073  0.418320781
## RS8      1.000000000  0.45762845  0.47063305  0.44848862  0.486307902
## RS9      0.457628455  1.00000000  0.62362701  0.60201244  0.486867126
## RS10     0.470633053  0.62362701  1.00000000  0.60685642  0.420750948
## RS11     0.448488621  0.60201244  0.60685642  1.00000000  0.467235410
## RS12     0.486307902  0.48686713  0.42075095  0.46723541  1.000000000
## RS13     0.577940692  0.55246064  0.56010960  0.73465105  0.562519549
## RS14     0.400478077  0.55362235  0.48853773  0.36588780  0.529659279
## Health   0.005123918  0.02091008 -0.05983438 -0.07828823 -0.021540432
##                 RS13        RS14        Health
## Age       0.10128837  0.18292725 -0.1664407976
## Grade     0.10547175  0.17612502 -0.1243624181
## Absences  0.14022097 -0.08131174  0.0237205616
## RS1       0.30619685  0.22975797 -0.0140769622
## RS2       0.62718380  0.33264694 -0.1397879302
## RS3       0.61206222  0.38309648 -0.1821843842
## RS4       0.38919009  0.19221471 -0.0385148397
## RS5       0.41787012  0.52847911  0.0009663785
## RS6       0.62302069  0.53587349 -0.0669565316
## RS7       0.55495169  0.46312976 -0.1091089078
## RS8       0.57794069  0.40047808  0.0051239184
## RS9       0.55246064  0.55362235  0.0209100770
## RS10      0.56010960  0.48853773 -0.0598343773
## RS11      0.73465105  0.36588780 -0.0782882278
## RS12      0.56251955  0.52965928 -0.0215404321
## RS13      1.00000000  0.52257091 -0.0911749490
## RS14      0.52257091  1.00000000 -0.0923315613
## Health   -0.09117495 -0.09233156  1.0000000000
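
With a matrix this large, it can be easier to let R list the pairs above a cutoff instead of scanning by eye. The helper flag_high_cor below is a hypothetical function written for this sketch (it is not part of any package), and the demo data are simulated so the example runs on its own.

```r
# Hypothetical helper: list variable pairs whose |r| exceeds a cutoff
flag_high_cor <- function(data, cutoff = 0.90) {
  r <- cor(data, use = "pairwise.complete.obs")
  # Use only the upper triangle so each pair is reported once
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
}

# Simulated data (not the lecture data): x2 is nearly a copy of x1
set.seed(42)
x1 <- rnorm(200)
demo <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),
  x3 = rnorm(200)
)

flagged <- flag_high_cor(demo, cutoff = 0.90)
flagged
```

For the correlations printed above, the largest off-diagonal value is about .73 (RS11 with RS13), so nothing would be flagged at a cutoff of .80 or .90.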

Assumptions: Additivity

library(corrplot)
corrplot(cor(noout[ , -c(1,4)]))

Assumptions: Linearity

Assumption that the relationship between variables is linear and not curved [3].
In plain English: Imagine plotting height vs. weight. Linearity means the relationship forms a straight line (or close to it), not a curve. If taller people get heavier at a steady rate, that’s linear. If the relationship curves (like exponential growth), that violates linearity [3].
Most parametric statistics have this assumption including ANOVA, regression, etc. [3].
Linearity includes both univariate (every variable with every other variable) and multivariate (all the linear combinations together).
Generally, checking for multivariate linearity can allow you to assess the overall pattern.
However, if this analysis suggests the assumption is not met, you can check each pairwise combination separately to identify which specific relationships are problematic [3].

Assumptions: Fake Regression

At this point, we will create and use our fake regression.
Why “fake”?: We’re not actually testing a research hypothesis here. Instead, we’re using regression as a diagnostic tool to check if our data meet the assumptions needed for statistical tests [3].
For many of the statistical tests you would run, there are diagnostic plots / assumptions built into them.
This approach lets you apply the same data screening to any analysis, so you can learn one set of rules rather than a separate set for each analysis.
But there are still things that only apply to ANOVA that you’d want to add when you run ANOVA. We will learn these in each analysis section.

Assumptions: Fake Regression

For the fake regression, we will first create a random variable to use as the outcome.
As a reminder, we talked about using chi-square for Mahalanobis distance as our cut off score.
In a similar fashion, we will use chi-square because the errors should be chi-square distributed (lots of small errors, only a few big ones).
Key concept: In a good statistical model, most errors (residuals) should be small, with only a few large ones. This creates a chi-square distribution pattern [4].
However, the standardized errors should be normally distributed around zero (because they have been standardized!) [4].

Assumptions: Fake Regression

random <- rchisq(nrow(noout), 7) #why 7? It works, any number bigger than 2
fake <- lm(random ~ ., #Y is predicted by all variables in the data
           data = noout) #You can use categorical variables now!
standardized <- rstudent(fake)
fitvalues <- scale(fake$fitted.values)
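
If you want to try this setup without the lecture data, here is a self-contained sketch on simulated data. The data frame sim and its columns are hypothetical stand-ins for noout; the steps mirror the code above.

```r
set.seed(123)

# Simulated stand-in for the screened data (hypothetical columns)
sim <- data.frame(
  group = factor(sample(c("a", "b"), 100, replace = TRUE)),
  x1    = rnorm(100),
  x2    = rnorm(100)
)

# Random chi-square outcome, one value per row
random_y <- rchisq(nrow(sim), 7)

# Fake regression: the random outcome predicted by every column in sim
fake_sim <- lm(random_y ~ ., data = sim)

# Studentized residuals and z-scored fitted values for the later plots
std_resid <- rstudent(fake_sim)
fit_z     <- scale(fake_sim$fitted.values)

c(n = length(std_resid), mean_fit = mean(fit_z), sd_fit = sd(fit_z))
```

After scale(), the fitted values have a mean of 0 and a standard deviation of 1, which puts both axes of the later plots on the same z-score scale.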

Assumptions: Linearity

With this setup, we can use a couple options to get a QQ plot (sometimes you’ll see PP plots, same idea).
What is a QQ plot?: QQ stands for “quantile-quantile.” This plot compares your data to what we’d expect from a perfect normal distribution [5]. If the dots follow the diagonal line closely, your data are normally distributed [5].

{qqnorm(standardized)
abline(0,1)}

plot(fake, 2)

Assumptions: Linearity

What are the dots on these plots?
Regression is a model that we’ve discussed: \(Outcome_i = (bX_i) + error_i\)
We are predicting the outcome of a random variable, so the errors should be randomly distributed - lots of small numbers that are centered around zero [4].
Why standardize?: We standardized the errors to help us interpret them by having a scale to compare to. Standardized residuals are converted to z-scores, which have a mean of 0 and standard deviation of 1 [4].
Each dot represents a person’s standardized residual plotted against the theoretical residual for that area of the standardized distribution [5].
What should I look for? Remember that most of the data fall between -2 and 2 in a standardized distribution, so we especially want the dots to follow the line between -2 and 2 [4].
Values outside that range can be very hard to predict, so we still check them, but we are less concerned if the dots curve away from the line at the extremes [5].
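
To make the dots less mysterious, here is a small sketch that pulls out the coordinates qqnorm() uses. The vector z is simulated and only stands in for the standardized residuals.

```r
set.seed(1)
z <- rnorm(50)  # simulated stand-in for standardized residuals

# Ask qqnorm() for the coordinates instead of drawing the plot
qq <- qqnorm(z, plot.it = FALSE)

# The x positions are normal quantiles at evenly spaced probabilities
theoretical <- qnorm(ppoints(length(z)))
same_x <- isTRUE(all.equal(sort(qq$x), theoretical))

# Share of values inside the -2 to 2 band we care about most
within_two <- mean(abs(z) < 2)

c(same_x = same_x, within_two = within_two)
```

Each dot pairs one observed value (y) with the normal quantile expected for its rank (x), which is why normally distributed residuals fall along the diagonal.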

Assumptions: Linearity

{qqnorm(standardized)
abline(0,1)}

Assumptions: Normality

This assumption tends to get incorrectly translated as your data need to be normally distributed.
The actual assumption is that the sampling distribution is normally distributed [6].
In plain English: We don’t need every individual score to be perfectly normal. What matters is that if we took many samples and calculated their means, those means would form a normal distribution [6].
Remember the Central Limit Theorem - at what point is the sample size large enough to assume normality?
- N > 30 is the general rule [6]
- Why this matters: The Central Limit Theorem tells us that in large samples (above 30), the sampling distribution tends to be normal anyway, regardless of how the original data look [6].
- In practical terms, as long as your sample is fairly large, outliers are a much more pressing concern than normality [6].
Check out the sample distribution of residuals as an approximation for multivariate normality.
- The same idea applies - if multivariate normality is not met, you can check the distribution of each variable to determine which ones might be the problem.
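
Here is a small simulation of that idea, assuming a skewed population (chi-square with 2 degrees of freedom) and samples of 30. The skew() helper is written just for this sketch.

```r
set.seed(2026)

# Hypothetical helper: moment-based skewness
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Individual scores from a clearly skewed population
raw_scores <- rchisq(5000, df = 2)

# Means of 5000 separate samples of N = 30 from the same population
sample_means <- replicate(5000, mean(rchisq(30, df = 2)))

c(raw_skew = skew(raw_scores), means_skew = skew(sample_means))
```

The individual scores are strongly skewed, while the sample means are much closer to symmetric. That is the sampling distribution the assumption is about.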

Assumptions: Normality

From our earlier lectures, we covered how to check each variable individually - go back and check out those lectures for the comparison rules for concerning values.

hist(noout$RS1)

library(moments)
skewness(noout[ , -c(1,4)])

##         Age       Grade    Absences         RS1         RS2         RS3 
## -0.20880151  0.14109617  0.50473399 -0.64442864 -0.39612649  0.07144744 
##         RS4         RS5         RS6         RS7         RS8         RS9 
## -0.34114107 -0.07196785 -0.36542224 -0.70320051 -0.22125349 -0.29978265 
##        RS10        RS11        RS12        RS13        RS14      Health 
## -0.63408916 -0.25701748 -1.09556422 -0.56047487 -0.44083590  0.16354123

kurtosis(noout[ , -c(1,4)]) - 3 #to get excess kurtosis

##        Age      Grade   Absences        RS1        RS2        RS3        RS4 
## -1.2139538 -0.6513750 -1.0873422 -0.4179216 -0.7539534 -1.0075623 -1.0546091 
##        RS5        RS6        RS7        RS8        RS9       RS10       RS11 
## -1.1301711 -0.9259363 -0.3086471 -1.2802140 -0.7513962 -0.6078609 -1.0478794 
##       RS12       RS13       RS14     Health 
##  0.3950031 -0.4969303 -0.8046555 -1.2753754

Assumptions: Normality

To check for multivariate normality, we can check a histogram of the standardized residuals.
What to look for: We want our distribution to be centered over zero, with most of the data between -2 and 2 [4].
Interpreting the histogram: The histogram should look like a bell curve (normal distribution). If it’s lopsided (skewed) or too flat/pointy (kurtosis issues), that suggests problems with normality [5].
In this example, we might have slight positive skew, but it is mostly normal.

hist(standardized, breaks=15)

length(standardized)

## [1] 117

Assumptions: Homogeneity

Assumption that the variances of the variables are roughly equal [7].
In plain English: Imagine comparing test scores from three classes. Homogeneity of variance means that the spread of scores (how much they vary) should be similar in all three classes. If one class has scores all between 80-90 (small variance) and another has scores between 40-100 (large variance), this assumption is violated [7].
Ways to check: you do NOT want p < .001:
- Levene’s Test - checks homogeneity for one variable at a time (univariate) [7]
- Box’s Test - checks homogeneity for multiple variables together (multivariate)
- We will use those with their specific analysis.
Sphericity - the assumption that the differences between measurements in repeated measures have approximately the same variance and correlations.
- Mauchly’s Test - used to check sphericity in repeated measures designs
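
As a preview of the idea behind Levene’s Test, here is a base R sketch on simulated class scores. It uses the median-centered version (often called Brown-Forsythe): a one-way ANOVA on each score’s absolute distance from its group median. The data and names are made up, and this is not necessarily the function we will use in the analysis sections.

```r
set.seed(7)

# Three simulated classes: A and B have a tight spread, C has a wide spread
scores <- data.frame(
  class = factor(rep(c("A", "B", "C"), each = 40)),
  score = c(rnorm(40, 85, 3), rnorm(40, 85, 3), rnorm(40, 70, 15))
)

# Absolute deviation of each score from its own class median
abs_dev <- abs(scores$score - ave(scores$score, scores$class, FUN = median))

# One-way ANOVA on the deviations: do the classes differ in spread?
levene  <- anova(lm(abs_dev ~ scores$class))
p_value <- levene[["Pr(>F)"]][1]
p_value
```

Because class C was built with a much larger standard deviation, the p value falls below .001, which is the result you do NOT want when checking this assumption.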

Assumptions: Homoscedasticity

The spread (variance) of one variable is the same across all values of the other variable [8].
In plain English: Imagine predicting income from years of education. Homoscedasticity means the variability in income should be roughly the same whether someone has 12 years or 20 years of education. If income becomes much more variable at higher education levels (some people make a lot, some don’t), that’s heteroscedasticity (violation of this assumption) [8].
Therefore, the variance of Y is the same at each spot of X [8].

Assumptions: Homogeneity and Homoscedasticity

Create a scatterplot of the fake regression.
- X-axis = Standardized Fitted values = the predicted score for a person in your regression [4].
- Y-axis = Standardized Residuals = the difference between the predicted score and a person’s actual score in the regression \(Y - \hat{Y}\) [4].
- Make them both standardized for an easier scale to interpret.
In theory, the residuals should be randomly distributed (hence why we created a random variable to test with) [3].
What you want to see: The plot should look like a bunch of random dots scattered evenly around zero, with no clear pattern [8]. Think of it like stars randomly scattered across the sky - no clusters, no shapes, just randomness.

Assumptions: Homogeneity and Homoscedasticity

{plot(fitvalues, standardized) 
abline(0,0)
abline(v = 0)}

Assumptions: Homogeneity and Homoscedasticity

Homogeneity - is the spread above the horizontal line at 0 the same as the spread below it? [8]
- What to avoid: You do not want a very large spread on one side and a small spread on the other side (looks like it’s raining).
- Good sign: Points are evenly scattered above and below the horizontal line [8].
Homoscedasticity - is the spread equal all the way across the x axis? [8]
- Warning signs: Look for megaphones (funnel shapes), triangles, or big groupings of data [8]. These patterns indicate heteroscedasticity.
- Good sign: It should appear to be an even random spread of dots across the entire plot [8].

{plot(fitvalues, standardized) 
abline(0,0)
abline(v = 0)}
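
For comparison, here is a sketch of what a megaphone looks like in numbers, using simulated data. The helper spread_ratio is hypothetical and written for this example: it compares the spread of the residuals in the upper half of the fitted values to the spread in the lower half.

```r
set.seed(11)
x <- runif(300, 1, 10)

# Same straight-line relationship, two different error patterns
y_even   <- 2 * x + rnorm(300, sd = 2)        # constant spread
y_funnel <- 2 * x + rnorm(300, sd = 0.5 * x)  # spread grows with x

# Hypothetical helper: residual SD in the upper half of the fitted
# values divided by the residual SD in the lower half
spread_ratio <- function(model) {
  res   <- rstudent(model)
  fit   <- fitted(model)
  upper <- fit > median(fit)
  sd(res[upper]) / sd(res[!upper])
}

ratio_even   <- spread_ratio(lm(y_even ~ x))
ratio_funnel <- spread_ratio(lm(y_funnel ~ x))
c(even = ratio_even, funnel = ratio_funnel)
```

A ratio near 1 matches the even spread we want. A ratio well above 1 means the dots fan out to the right, which is the funnel shape you would see if you plotted the funnel model the same way as the plot above.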

Summary

In this lecture, we have covered:
- Independence
- Additivity
- Linearity
- Normality
- Homogeneity (Sphericity), Homoscedasticity
- How to plot, interpret, and understand each of these assumptions

References

[1] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 170-171. [Assumption of independence]

[2] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 274-276. [Multicollinearity and additivity]

[3] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 293-295. [Linearity assumption and residual plots]

[4] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 269-271. [Standardized residuals and error distributions]

[5] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 179-182. [Q-Q plots and assessing normality]

[6] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 168-169. [Central Limit Theorem and normality assumption]

[7] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 185-188. [Homogeneity of variance and Levene’s test]

[8] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 272-273, 293. [Homoscedasticity and residual plots]