Data Screening Part 2

Erin M. Buchanan

Last Updated: 2026-01-08

Data Screening: Next Steps

Data Screening: Assumptions

Assumptions: Independence

Assumptions: Additivity

## 
##  iter imp variable
##   1   1  RS3  RS6  RS8  RS11  RS13  RS14
##   1   2  RS3  RS6  RS8  RS11  RS13  RS14
##   1   3  RS3  RS6  RS8  RS11  RS13  RS14
##   1   4  RS3  RS6  RS8  RS11  RS13  RS14
##   1   5  RS3  RS6  RS8  RS11  RS13  RS14
##   2   1  RS3  RS6  RS8  RS11  RS13  RS14
##   2   2  RS3  RS6  RS8  RS11  RS13  RS14
##   2   3  RS3  RS6  RS8  RS11  RS13  RS14
##   2   4  RS3  RS6  RS8  RS11  RS13  RS14
##   2   5  RS3  RS6  RS8  RS11  RS13  RS14
##   3   1  RS3  RS6  RS8  RS11  RS13  RS14
##   3   2  RS3  RS6  RS8  RS11  RS13  RS14
##   3   3  RS3  RS6  RS8  RS11  RS13  RS14
##   3   4  RS3  RS6  RS8  RS11  RS13  RS14
##   3   5  RS3  RS6  RS8  RS11  RS13  RS14
##   4   1  RS3  RS6  RS8  RS11  RS13  RS14
##   4   2  RS3  RS6  RS8  RS11  RS13  RS14
##   4   3  RS3  RS6  RS8  RS11  RS13  RS14
##   4   4  RS3  RS6  RS8  RS11  RS13  RS14
##   4   5  RS3  RS6  RS8  RS11  RS13  RS14
##   5   1  RS3  RS6  RS8  RS11  RS13  RS14
##   5   2  RS3  RS6  RS8  RS11  RS13  RS14
##   5   3  RS3  RS6  RS8  RS11  RS13  RS14
##   5   4  RS3  RS6  RS8  RS11  RS13  RS14
##   5   5  RS3  RS6  RS8  RS11  RS13  RS14
str(noout)
## 'data.frame':    118 obs. of  20 variables:
##  $ Sex     : Factor w/ 2 levels "Women","Men": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age     : int  17 16 15 16 15 14 13 15 16 17 ...
##  $ Grade   : int  11 7 6 11 7 6 4 6 8 11 ...
##  $ SES     : Factor w/ 3 levels "Low","Medium",..: 2 3 3 2 3 3 1 2 2 3 ...
##  $ Absences: int  2 2 2 6 2 2 2 2 1 6 ...
##  $ RS1     : int  6 7 5 7 7 2 6 4 3 4 ...
##  $ RS2     : int  4 1 5 4 7 3 6 6 5 3 ...
##  $ RS3     : int  2 1 5 4 7 2 7 6 3 4 ...
##  $ RS4     : int  2 5 7 7 7 3 6 6 2 4 ...
##  $ RS5     : int  4 7 5 4 7 2 1 6 1 4 ...
##  $ RS6     : int  7 7 6 4 7 3 4 6 7 4 ...
##  $ RS7     : int  7 7 5 7 7 3 6 6 4 4 ...
##  $ RS8     : int  4 7 7 7 7 3 6 6 3 7 ...
##  $ RS9     : int  5 4 6 4 7 2 2 6 3 4 ...
##  $ RS10    : int  7 7 7 7 7 2 5 6 3 4 ...
##  $ RS11    : int  4 1 6 7 7 3 6 6 3 6 ...
##  $ RS12    : int  7 7 6 4 7 3 6 6 2 5 ...
##  $ RS13    : int  4 4 6 7 7 2 6 6 5 6 ...
##  $ RS14    : int  7 7 6 4 7 3 2 6 3 4 ...
##  $ Health  : int  6 6 2 6 4 6 1 2 3 1 ...
cor(noout[ , -c(1,4)])
##                  Age       Grade     Absences         RS1         RS2
## Age       1.00000000  0.49129879  0.259278455  0.05557066 -0.02500303
## Grade     0.49129879  1.00000000 -0.106825686  0.13576618  0.04492324
## Absences  0.25927845 -0.10682569  1.000000000  0.08062971  0.12545466
## RS1       0.05557066  0.13576618  0.080629712  1.00000000  0.30387778
## RS2      -0.02500303  0.04492324  0.125454661  0.30387778  1.00000000
## RS3       0.12731787  0.11110507  0.214147935  0.39435773  0.60392742
## RS4      -0.09530848 -0.01762404  0.095790980  0.33308664  0.41917969
## RS5       0.16312777  0.10613350  0.089601326  0.31633125  0.32639523
## RS6       0.15873088  0.21894625  0.006016586  0.28979227  0.51680621
## RS7       0.24685001  0.24172671  0.166350702  0.49315246  0.57762184
## RS8       0.16732512  0.19607386  0.083607390  0.30270485  0.41999692
## RS9       0.02368352  0.01918898 -0.110881821  0.32269858  0.48383577
## RS10      0.05703377  0.19189629  0.122778324  0.42665728  0.62064605
## RS11      0.03964118  0.08136441  0.191811833  0.32656293  0.59752445
## RS12      0.18399663  0.07424952  0.004201959  0.32427895  0.27635326
## RS13      0.10128837  0.10547175  0.140220973  0.30619685  0.62718380
## RS14      0.18292725  0.17612502 -0.081311740  0.22975797  0.33264694
## Health   -0.16644080 -0.12436242  0.023720562 -0.01407696 -0.13978793
##                 RS3         RS4          RS5          RS6        RS7
## Age       0.1273179 -0.09530848 0.1631277685  0.158730882  0.2468500
## Grade     0.1111051 -0.01762404 0.1061335021  0.218946250  0.2417267
## Absences  0.2141479  0.09579098 0.0896013265  0.006016586  0.1663507
## RS1       0.3943577  0.33308664 0.3163312549  0.289792266  0.4931525
## RS2       0.6039274  0.41917969 0.3263952295  0.516806207  0.5776218
## RS3       1.0000000  0.53124323 0.4308641895  0.428769218  0.4680652
## RS4       0.5312432  1.00000000 0.3187108026  0.219281403  0.3019374
## RS5       0.4308642  0.31871080 1.0000000000  0.445824875  0.3880713
## RS6       0.4287692  0.21928140 0.4458248751  1.000000000  0.6038055
## RS7       0.4680652  0.30193741 0.3880712508  0.603805512  1.0000000
## RS8       0.4476441  0.21250714 0.4131928044  0.437741777  0.4925186
## RS9       0.3245957  0.34644102 0.4529777465  0.567339117  0.4441627
## RS10      0.4327729  0.45494068 0.4445222275  0.504325912  0.6034629
## RS11      0.5457643  0.39694108 0.5084182361  0.581456379  0.5317407
## RS12      0.3255026  0.10577219 0.5219541822  0.440801532  0.4183208
## RS13      0.6120622  0.38919009 0.4178701180  0.623020688  0.5549517
## RS14      0.3830965  0.19221471 0.5284791103  0.535873491  0.4631298
## Health   -0.1821844 -0.03851484 0.0009663785 -0.066956532 -0.1091089
##                  RS8         RS9        RS10        RS11         RS12
## Age      0.167325123  0.02368352  0.05703377  0.03964118  0.183996626
## Grade    0.196073861  0.01918898  0.19189629  0.08136441  0.074249525
## Absences 0.083607390 -0.11088182  0.12277832  0.19181183  0.004201959
## RS1      0.302704849  0.32269858  0.42665728  0.32656293  0.324278946
## RS2      0.419996923  0.48383577  0.62064605  0.59752445  0.276353262
## RS3      0.447644056  0.32459572  0.43277287  0.54576426  0.325502575
## RS4      0.212507139  0.34644102  0.45494068  0.39694108  0.105772190
## RS5      0.413192804  0.45297775  0.44452223  0.50841824  0.521954182
## RS6      0.437741777  0.56733912  0.50432591  0.58145638  0.440801532
## RS7      0.492518639  0.44416274  0.60346287  0.53174073  0.418320781
## RS8      1.000000000  0.45762845  0.47063305  0.44848862  0.486307902
## RS9      0.457628455  1.00000000  0.62362701  0.60201244  0.486867126
## RS10     0.470633053  0.62362701  1.00000000  0.60685642  0.420750948
## RS11     0.448488621  0.60201244  0.60685642  1.00000000  0.467235410
## RS12     0.486307902  0.48686713  0.42075095  0.46723541  1.000000000
## RS13     0.577940692  0.55246064  0.56010960  0.73465105  0.562519549
## RS14     0.400478077  0.55362235  0.48853773  0.36588780  0.529659279
## Health   0.005123918  0.02091008 -0.05983438 -0.07828823 -0.021540432
##                 RS13        RS14        Health
## Age       0.10128837  0.18292725 -0.1664407976
## Grade     0.10547175  0.17612502 -0.1243624181
## Absences  0.14022097 -0.08131174  0.0237205616
## RS1       0.30619685  0.22975797 -0.0140769622
## RS2       0.62718380  0.33264694 -0.1397879302
## RS3       0.61206222  0.38309648 -0.1821843842
## RS4       0.38919009  0.19221471 -0.0385148397
## RS5       0.41787012  0.52847911  0.0009663785
## RS6       0.62302069  0.53587349 -0.0669565316
## RS7       0.55495169  0.46312976 -0.1091089078
## RS8       0.57794069  0.40047808  0.0051239184
## RS9       0.55246064  0.55362235  0.0209100770
## RS10      0.56010960  0.48853773 -0.0598343773
## RS11      0.73465105  0.36588780 -0.0782882278
## RS12      0.56251955  0.52965928 -0.0215404321
## RS13      1.00000000  0.52257091 -0.0911749490
## RS14      0.52257091  1.00000000 -0.0923315613
## Health   -0.09117495 -0.09233156  1.0000000000

Assumptions: Additivity

library(corrplot)
corrplot(cor(noout[ , -c(1,4)]))
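With this many variables, a very high correlation is easy to miss by eye in the matrix or the plot. As a minimal sketch, the same additivity check can be run by listing every pair of variables whose absolute correlation exceeds a cutoff. The data frame `demo`, its columns, and the 0.90 cutoff are assumptions made up for illustration; with the lecture data, `demo` would be replaced by `noout[ , -c(1,4)]`.

```r
# Hypothetical stand-in for the numeric columns of the real data
set.seed(42)
n <- 100
x1 <- rnorm(n)
demo <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # nearly a copy of x1
  x3 = rnorm(n)                  # unrelated to the other two
)

cutoff <- 0.90                       # assumed threshold, adjust as needed
r <- cor(demo)
r[upper.tri(r, diag = TRUE)] <- NA   # keep each pair once (lower triangle only)

# Row/column positions of the correlations beyond the cutoff
flagged <- which(abs(r) > cutoff, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(r)[flagged[, "row"]],
  var2 = colnames(r)[flagged[, "col"]],
  r    = r[flagged]
)
high_pairs
```

An empty `high_pairs` means no pair crossed the chosen cutoff.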

Assumptions: Linearity

Assumptions: Fake Regression

random <- rchisq(nrow(noout), 7) # fake DV; df = 7 is arbitrary, any value above 2 works
fake <- lm(random ~ ., # the fake DV is predicted by every variable in the data
           data = noout) # categorical predictors are allowed here
standardized <- rstudent(fake) # studentized residuals
fitvalues <- scale(fake$fitted.values) # standardized (z-scored) fitted values

Assumptions: Linearity

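{qqnorm(standardized) # Q-Q plot of the studentized residuals
abline(0,1)} # reference line with intercept 0 and slope 1

plot(fake, 2) # built-in Q-Q plot of the model's standardized residuals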

Assumptions: Normality

hist(noout$RS1)

library(moments)
skewness(noout[ , -c(1,4)])
##         Age       Grade    Absences         RS1         RS2         RS3 
## -0.20880151  0.14109617  0.50473399 -0.64442864 -0.39612649  0.07144744 
##         RS4         RS5         RS6         RS7         RS8         RS9 
## -0.34114107 -0.07196785 -0.36542224 -0.70320051 -0.22125349 -0.29978265 
##        RS10        RS11        RS12        RS13        RS14      Health 
## -0.63408916 -0.25701748 -1.09556422 -0.56047487 -0.44083590  0.16354123
kurtosis(noout[ , -c(1,4)]) - 3 #to get excess kurtosis
##        Age      Grade   Absences        RS1        RS2        RS3        RS4 
## -1.2139538 -0.6513750 -1.0873422 -0.4179216 -0.7539534 -1.0075623 -1.0546091 
##        RS5        RS6        RS7        RS8        RS9       RS10       RS11 
## -1.1301711 -0.9259363 -0.3086471 -1.2802140 -0.7513962 -0.6078609 -1.0478794 
##       RS12       RS13       RS14     Health 
##  0.3950031 -0.4969303 -0.8046555 -1.2753754
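The `- 3` above is there because a normal distribution has a kurtosis of 3, so subtracting it puts a normal distribution at 0. As a rough sketch, the moment-based formulas can be written in base R. The helper names `skew()` and `excess_kurt()` are hypothetical, and the claim that they reproduce the `moments` results is my understanding of that package's definitions rather than something shown in this lecture.

```r
# Hypothetical helpers using population (divide-by-n) moments
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5      # third moment over variance^(3/2)
}

excess_kurt <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3    # fourth moment over variance^2, minus 3
}

x <- 1:5
skew(x)          # 0: the values are symmetric around their mean
excess_kurt(x)   # -1.3: flatter than a normal distribution
```

To apply the helpers to every numeric column of a data frame at once, `sapply(df, skew)` is one option.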

Assumptions: Normality

hist(standardized, breaks=15)

length(standardized)
## [1] 117
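`noout` has 118 rows, but only 117 residuals come back. One likely explanation (an assumption, since the output above does not show it) is that one row has a missing value in a predictor, because `lm()` drops incomplete rows by default. The sketch below uses a made-up data frame `demo` to show the effect and one way to keep the residuals aligned with the original rows.

```r
# Hypothetical data with one incomplete row
set.seed(1)
demo <- data.frame(
  y = rnorm(10),
  x = rnorm(10),
  g = factor(rep(c("a", "b"), 5))
)
demo$g[3] <- NA                           # row 3 is now incomplete

fit <- lm(y ~ ., data = demo)             # default: incomplete rows are dropped
n_rows     <- nrow(demo)                  # 10
n_complete <- sum(complete.cases(demo))   # 9
n_resid    <- length(rstudent(fit))       # 9, matching the complete rows

# na.exclude also drops the row for fitting, but pads residuals with NA
fit_padded <- lm(y ~ ., data = demo, na.action = na.exclude)
padded <- residuals(fit_padded)           # length 10, NA in position 3
```

With the lecture data, `sum(complete.cases(noout))` would show whether a missing value accounts for the difference.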

Assumptions: Homogeneity

Assumptions: Homoscedasticity

Assumptions: Homogeneity and Homoscedasticity

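{plot(fitvalues, standardized) # residuals against standardized fitted values
abline(0,0) # horizontal reference line at 0
abline(v = 0)} # vertical reference line at 0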

Summary
