Data Screening Example

While we have not covered any specific analyses yet, we can screen the data in a general way that applies to many different types of analyses.
This dataset examines the resiliency (RS columns) of teenagers after a natural disaster.

library(rio)
master <- import("data/data_screening.csv")
str(master)

## 'data.frame':    137 obs. of  20 variables:
##  $ Sex     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Age     : int  17 16 13 15 16 15 12 14 13 18 ...
##  $ SES     : int  2 3 2 3 2 3 3 3 1 3 ...
##  $ Grade   : int  11 7 5 6 11 7 4 6 4 9 ...
##  $ Absences: int  2 2 2 2 6 2 1 2 2 5 ...
##  $ RS1     : int  6 7 5 5 7 7 1 2 6 7 ...
##  $ RS2     : int  4 1 NA 5 4 7 6 3 6 6 ...
##  $ RS3     : int  2 1 5 5 4 7 NA 2 7 5 ...
##  $ RS4     : int  2 5 7 7 7 7 1 3 6 NA ...
##  $ RS5     : int  4 7 4 5 4 7 4 2 1 6 ...
##  $ RS6     : int  7 7 6 6 4 7 7 3 4 6 ...
##  $ RS7     : int  7 7 7 5 7 7 7 3 6 10 ...
##  $ RS8     : int  4 7 6 7 7 7 NA 3 6 7 ...
##  $ RS9     : int  5 4 6 6 4 7 NA 2 2 7 ...
##  $ RS10    : int  7 7 7 7 7 7 4 2 5 7 ...
##  $ RS11    : int  4 1 NA 6 7 7 3 3 6 5 ...
##  $ RS12    : int  7 7 5 6 4 7 7 3 6 6 ...
##  $ RS13    : int  4 4 7 6 7 7 7 2 6 6 ...
##  $ RS14    : int  7 7 6 6 4 7 7 3 2 6 ...
##  $ Health  : int  6 6 2 2 6 4 3 6 1 5 ...

Accuracy

We will want to check for several issues:
- Any factor variables incorrectly imported as numbers.
- Any reverse coded variables.
- Any out of range values or incorrect values.
- Other errors depending on data collection procedure.

Accuracy: Categorical Variables

notypos <- master #update the dataset with each step 
apply(notypos[ , c("Sex", "SES")], 2, table)

##   Sex SES
## 1  64   9
## 2  72  56
## 3   1  72

# a value of 3 for Sex is probably incorrect

Accuracy: Categorical Variables

We can use factor() and leave the bad value out of levels, which converts that incorrect point to NA.
You can also exclude it, as shown in a few slides.

## fix the categorical labels and typos
notypos$Sex <- factor(notypos$Sex, 
                     levels = c(1,2), #no 3
                     labels = c("Women", "Men"))
notypos$SES <- factor(notypos$SES, 
                     levels = c(1,2, 3),
                     labels = c("Low", "Medium", "High"))
apply(notypos[ , c("Sex", "SES")], 2, table)

## $Sex
## 
##   Men Women 
##    72    64 
## 
## $SES
## 
##   High    Low Medium 
##     72      9     56
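
A minimal sketch of what happens to a value left out of levels, using a small made-up vector (sex_raw is hypothetical, not the course data). Passing useNA = "ifany" to table() makes the dropped value visible as NA, which the default table() output hides:

```r
# Hypothetical toy values; 3 is the typo
sex_raw <- c(1, 2, 2, 3, 1)

# Any value not listed in `levels` becomes NA
sex <- factor(sex_raw,
              levels = c(1, 2),
              labels = c("Women", "Men"))

# useNA = "ifany" adds an NA count when values were dropped
sex_counts <- table(sex, useNA = "ifany")
sex_counts
```

Checking the NA count right after recoding is a quick way to confirm that only the intended values were dropped.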

Accuracy: Continuous Variables

We can use the summary() function to view the summary statistics for our continuous variables.
It is useful for checking the min and max for out-of-range values.
It is also useful for checking for reverse scoring (for data you know well, it may be obvious when one mean is much lower than the rest).
Each item on the RS14 scale runs from 1 to 7, so maximums such as 18 (RS3), 10 (RS7), and 15 (RS13) are clearly typos.

summary(notypos)

##     Sex          Age            SES         Grade           Absences     
##  Women:64   Min.   :11.00   Low   : 9   Min.   : 1.000   Min.   : 1.000  
##  Men  :72   1st Qu.:13.00   Medium:56   1st Qu.: 4.000   1st Qu.: 2.000  
##  NA's : 1   Median :15.50   High  :72   Median : 6.000   Median : 2.000  
##             Mean   :15.05               Mean   : 5.883   Mean   : 3.358  
##             3rd Qu.:17.00               3rd Qu.: 8.000   3rd Qu.: 5.000  
##             Max.   :18.00               Max.   :35.000   Max.   :35.000  
##             NA's   :5                                                    
##       RS1             RS2             RS3             RS4            RS5       
##  Min.   :1.000   Min.   :1.000   Min.   : 1.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000   1st Qu.: 3.00   1st Qu.:3.00   1st Qu.:3.000  
##  Median :5.000   Median :5.000   Median : 4.00   Median :5.00   Median :4.000  
##  Mean   :4.858   Mean   :4.962   Mean   : 4.42   Mean   :4.55   Mean   :4.448  
##  3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.: 6.00   3rd Qu.:7.00   3rd Qu.:6.750  
##  Max.   :7.000   Max.   :7.000   Max.   :18.00   Max.   :7.00   Max.   :7.000  
##  NA's   :3       NA's   :5       NA's   :6       NA's   :6      NA's   :3      
##       RS6         RS7              RS8             RS9             RS10      
##  Min.   :1   Min.   : 1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4   1st Qu.: 4.000   1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000  
##  Median :5   Median : 5.000   Median :5.000   Median :5.000   Median :5.500  
##  Mean   :5   Mean   : 5.082   Mean   :4.764   Mean   :4.692   Mean   :5.313  
##  3rd Qu.:7   3rd Qu.: 7.000   3rd Qu.:7.000   3rd Qu.:6.000   3rd Qu.:7.000  
##  Max.   :7   Max.   :10.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##  NA's   :7   NA's   :3        NA's   :10      NA's   :7       NA's   :3      
##       RS11            RS12            RS13             RS14     
##  Min.   :1.000   Min.   :1.000   Min.   : 1.000   Min.   :1.00  
##  1st Qu.:3.000   1st Qu.:5.000   1st Qu.: 4.000   1st Qu.:4.00  
##  Median :5.000   Median :6.000   Median : 6.000   Median :5.00  
##  Mean   :4.641   Mean   :5.552   Mean   : 5.507   Mean   :4.97  
##  3rd Qu.:6.500   3rd Qu.:7.000   3rd Qu.: 7.000   3rd Qu.:7.00  
##  Max.   :7.000   Max.   :7.000   Max.   :15.000   Max.   :7.00  
##  NA's   :6       NA's   :3       NA's   :3        NA's   :5     
##      Health     
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :4.000  
##  Mean   :3.839  
##  3rd Qu.:6.000  
##  Max.   :7.000  
##

How do we “fix” issues?

If possible, find the original data and figure out what the point should be.
Otherwise, delete that data point (set it to NA), but not the entire participant.

Accuracy: Continuous Variables

When you have an inaccurate value in one or only a few columns, you can fix those values using logical operators that we discussed in our subsetting section.
In this example, we know that students could not have missed more than 15 days of school.
Grade cannot be higher than 12.

summary(notypos$Grade)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   5.883   8.000  35.000

notypos$Grade[ notypos$Grade > 12 ] <- NA
summary(notypos$Grade)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   6.000   5.669   8.000  11.000       1

summary(notypos$Absences)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   3.358   5.000  35.000

notypos$Absences[ notypos$Absences > 15 ] <- NA
summary(notypos$Absences)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.000   2.000   3.125   5.000   7.000       1
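
Before overwriting values, it can help to see exactly which rows are affected. The sketch below assumes a made-up vector (grade is hypothetical) and uses which(), which returns only the positions where the test is TRUE and skips comparisons that come out NA:

```r
# Hypothetical toy values; 35 is out of range, NA is already missing
grade <- c(4, 7, 35, 11, NA, 9)

# Positions of values outside the plausible 1-12 range
bad <- which(grade < 1 | grade > 12)
bad

# Inspect the offending values before changing anything
grade[bad]

# Replace only those positions with NA
grade[bad] <- NA
summary(grade)
```

Adding a lower bound to the test catches impossible values at both ends of the range.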

Accuracy: Continuous Variables

When you have inaccurate values for columns that all have the same possible range, you can use the subsetting rules across all columns at once.

names(notypos)

##  [1] "Sex"      "Age"      "SES"      "Grade"    "Absences" "RS1"     
##  [7] "RS2"      "RS3"      "RS4"      "RS5"      "RS6"      "RS7"     
## [13] "RS8"      "RS9"      "RS10"     "RS11"     "RS12"     "RS13"    
## [19] "RS14"     "Health"

head(notypos[ , 6:19]) #lots of ways to do this part!

##   RS1 RS2 RS3 RS4 RS5 RS6 RS7 RS8 RS9 RS10 RS11 RS12 RS13 RS14
## 1   6   4   2   2   4   7   7   4   5    7    4    7    4    7
## 2   7   1   1   5   7   7   7   7   4    7    1    7    4    7
## 3   5  NA   5   7   4   6   7   6   6    7   NA    5    7    6
## 4   5   5   5   7   5   6   5   7   6    7    6    6    6    6
## 5   7   4   4   7   4   4   7   7   4    7    7    4    7    4
## 6   7   7   7   7   7   7   7   7   7    7    7    7    7    7

notypos[ , 6:19][ notypos[ , 6:19] > 7 ] <- NA
summary(notypos)

##     Sex          Age            SES         Grade           Absences    
##  Women:64   Min.   :11.00   Low   : 9   Min.   : 1.000   Min.   :1.000  
##  Men  :72   1st Qu.:13.00   Medium:56   1st Qu.: 4.000   1st Qu.:2.000  
##  NA's : 1   Median :15.50   High  :72   Median : 6.000   Median :2.000  
##             Mean   :15.05               Mean   : 5.669   Mean   :3.125  
##             3rd Qu.:17.00               3rd Qu.: 8.000   3rd Qu.:5.000  
##             Max.   :18.00               Max.   :11.000   Max.   :7.000  
##             NA's   :5                   NA's   :1        NA's   :1      
##       RS1             RS2             RS3             RS4            RS5       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:3.00   1st Qu.:3.000  
##  Median :5.000   Median :5.000   Median :4.000   Median :5.00   Median :4.000  
##  Mean   :4.858   Mean   :4.962   Mean   :4.315   Mean   :4.55   Mean   :4.448  
##  3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:7.00   3rd Qu.:6.750  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.00   Max.   :7.000  
##  NA's   :3       NA's   :5       NA's   :7       NA's   :6      NA's   :3      
##       RS6         RS7             RS8             RS9             RS10      
##  Min.   :1   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000  
##  Median :5   Median :5.000   Median :5.000   Median :5.000   Median :5.500  
##  Mean   :5   Mean   :5.008   Mean   :4.764   Mean   :4.692   Mean   :5.313  
##  3rd Qu.:7   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:6.000   3rd Qu.:7.000  
##  Max.   :7   Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##  NA's   :7   NA's   :5       NA's   :10      NA's   :7       NA's   :3      
##       RS11            RS12            RS13            RS14          Health     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:5.000   1st Qu.:4.000   1st Qu.:4.00   1st Qu.:2.000  
##  Median :5.000   Median :6.000   Median :6.000   Median :5.00   Median :4.000  
##  Mean   :4.641   Mean   :5.552   Mean   :5.364   Mean   :4.97   Mean   :3.839  
##  3rd Qu.:6.500   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:7.00   3rd Qu.:6.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.00   Max.   :7.000  
##  NA's   :6       NA's   :3       NA's   :5       NA's   :5
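
One of the other ways to do this part is to apply the same rule column by column. This is only a sketch on a made-up data frame (rs and its values are hypothetical); it also checks the lower bound of the scale:

```r
# Hypothetical toy items scored 1 to 7; 18, 0, and 10 are typos
rs <- data.frame(RS1 = c(1, 7, 18),
                 RS2 = c(0, 4, 7),
                 RS3 = c(5, NA, 10))

# Replace anything outside 1-7 with NA, one column at a time.
# rs[] keeps the data frame structure when the list is assigned back.
rs[] <- lapply(rs, function(x) replace(x, which(x < 1 | x > 7), NA))
rs
```

Existing NA values are left alone because which() ignores them.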

Accuracy: Continuous Variables

Using the mean and standard deviation to interpret accuracy for continuous variables:
You want to make sure the data look the way you expect, and the mean can be used to make that judgment.
The standard deviation indicates the spread of the data:
- Very large spreads (lots of error) can be bad for you.
- Very small spreads (no variance) can also be bad for you.
- Remember that what counts as large or small depends on the scale of the data.

Accuracy: Continuous Variables

names(notypos)

##  [1] "Sex"      "Age"      "SES"      "Grade"    "Absences" "RS1"     
##  [7] "RS2"      "RS3"      "RS4"      "RS5"      "RS6"      "RS7"     
## [13] "RS8"      "RS9"      "RS10"     "RS11"     "RS12"     "RS13"    
## [19] "RS14"     "Health"

apply(notypos[ , -c(1,3)], 2, mean, na.rm = TRUE)

##       Age     Grade  Absences       RS1       RS2       RS3       RS4       RS5 
## 15.053030  5.669118  3.125000  4.858209  4.962121  4.315385  4.549618  4.447761 
##       RS6       RS7       RS8       RS9      RS10      RS11      RS12      RS13 
##  5.000000  5.007576  4.763780  4.692308  5.313433  4.641221  5.552239  5.363636 
##      RS14    Health 
##  4.969697  3.839416

apply(notypos[ , -c(1,3)], 2, sd, na.rm = TRUE)

##      Age    Grade Absences      RS1      RS2      RS3      RS4      RS5 
## 1.931318 2.647305 1.815316 1.840043 1.722794 1.821638 2.116651 1.971859 
##      RS6      RS7      RS8      RS9     RS10     RS11     RS12     RS13 
## 1.787120 1.888075 1.945515 1.825089 1.696604 1.905735 1.656916 1.549552 
##     RS14   Health 
## 1.824094 1.749931
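
Dropping the factor columns by position works, but it depends on the column order. As a sketch on made-up data (toy is hypothetical), the numeric columns can instead be picked by type:

```r
# Hypothetical toy data with one factor and two numeric columns
toy <- data.frame(Sex = factor(c("Women", "Men", "Men")),
                  Age = c(15, 17, NA),
                  RS1 = c(4, 6, 7))

# TRUE for numeric columns, FALSE for factors
numeric_cols <- sapply(toy, is.numeric)

# Mean and standard deviation for the numeric columns only
means <- sapply(toy[ , numeric_cols], mean, na.rm = TRUE)
sds <- sapply(toy[ , numeric_cols], sd, na.rm = TRUE)
round(cbind(mean = means, sd = sds), 2)
```

Putting the two statistics side by side makes it easier to spot a mean or spread that does not fit the scale.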

Missing Data

To check for missing data, summary() can give a quick view.
Additionally, we can continue to use apply() for a sum of the number of NA values.

summary(notypos)

##     Sex          Age            SES         Grade           Absences    
##  Women:64   Min.   :11.00   Low   : 9   Min.   : 1.000   Min.   :1.000  
##  Men  :72   1st Qu.:13.00   Medium:56   1st Qu.: 4.000   1st Qu.:2.000  
##  NA's : 1   Median :15.50   High  :72   Median : 6.000   Median :2.000  
##             Mean   :15.05               Mean   : 5.669   Mean   :3.125  
##             3rd Qu.:17.00               3rd Qu.: 8.000   3rd Qu.:5.000  
##             Max.   :18.00               Max.   :11.000   Max.   :7.000  
##             NA's   :5                   NA's   :1        NA's   :1      
##       RS1             RS2             RS3             RS4            RS5       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:3.00   1st Qu.:3.000  
##  Median :5.000   Median :5.000   Median :4.000   Median :5.00   Median :4.000  
##  Mean   :4.858   Mean   :4.962   Mean   :4.315   Mean   :4.55   Mean   :4.448  
##  3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:7.00   3rd Qu.:6.750  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.00   Max.   :7.000  
##  NA's   :3       NA's   :5       NA's   :7       NA's   :6      NA's   :3      
##       RS6         RS7             RS8             RS9             RS10      
##  Min.   :1   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000  
##  Median :5   Median :5.000   Median :5.000   Median :5.000   Median :5.500  
##  Mean   :5   Mean   :5.008   Mean   :4.764   Mean   :4.692   Mean   :5.313  
##  3rd Qu.:7   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:6.000   3rd Qu.:7.000  
##  Max.   :7   Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##  NA's   :7   NA's   :5       NA's   :10      NA's   :7       NA's   :3      
##       RS11            RS12            RS13            RS14          Health     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:5.000   1st Qu.:4.000   1st Qu.:4.00   1st Qu.:2.000  
##  Median :5.000   Median :6.000   Median :6.000   Median :5.00   Median :4.000  
##  Mean   :4.641   Mean   :5.552   Mean   :5.364   Mean   :4.97   Mean   :3.839  
##  3rd Qu.:6.500   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:7.00   3rd Qu.:6.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.00   Max.   :7.000  
##  NA's   :6       NA's   :3       NA's   :5       NA's   :5

apply(notypos, 2, function(x) { sum(is.na(x)) })

##      Sex      Age      SES    Grade Absences      RS1      RS2      RS3 
##        1        5        0        1        1        3        5        7 
##      RS4      RS5      RS6      RS7      RS8      RS9     RS10     RS11 
##        6        3        7        5       10        7        3        6 
##     RS12     RS13     RS14   Health 
##        3        5        5        0
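
The same count can be sketched with colSums() on the logical matrix returned by is.na(). The data frame below is made up (toy is hypothetical):

```r
# Hypothetical toy data with missing values in two columns
toy <- data.frame(a = c(1, NA, 3),
                  b = c(NA, NA, 6),
                  c = 1:3)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts

# Total number of missing cells in the whole data frame
sum(is.na(toy))
```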

Missing Data

Missing data is an important problem and leads us to ask ourselves one question: Why is this data missing?
- Did someone skip the question?
- Did someone forget to enter it (from a paper survey or other transfer of data)?
- Is it a typo or other issue with answering the question appropriately?

Types of Missing Data

MCAR: missing completely at random
- You want this type of missing data.
- Potentially, participants simply skipped a question or missed a trial.
MNAR: missing not at random.
- This type of missing data can be problematic.
- May be the survey instrument, computer program, or other data collection issue.
- For instance, what if you surveyed a campus about alcohol abuse? What does it mean if everyone skips the same question?
There are ways to test for the type, but most of the time you can spot problematic MNAR data by checking the percent of missing data for each variable or by inspecting the data with the View() function.

What do I do with missing data?

You should not replace:
- MNAR data
- Categorical options that are demographic of the participants
You can conservatively replace:
- MCAR data when 5% of the data or less is to be replaced
- Be careful with small datasets
- Quick Fix: For simple statistics, you can often use arguments like na.rm = TRUE in R functions (e.g., mean(x, na.rm = TRUE)) to ignore missing values without permanently deleting them (Field et al., 2012).
Note: there is a difference between missing data and incomplete data.

What do I do with missing data?

By creating different datasets at each stage of our screening, we can run our analyses with and without missing data replaced to determine if our replacement impacted our results.
We can exclude entire participants with missing data (listwise) or exclude them only from the analyses for which they have missing data (pairwise).
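
The difference between listwise and pairwise exclusion can be sketched on a made-up data frame (toy and its columns are hypothetical). complete.cases() implements the listwise rule, and the use argument of cor() is one place where base R offers the pairwise rule:

```r
# Hypothetical toy data: each of rows 3-5 is missing one value
toy <- data.frame(x = c(1, 2, 3, 4, NA),
                  y = c(2, 4, NA, 8, 10),
                  z = c(5, 3, 4, NA, 1))

# Listwise: keep only participants with no missing data anywhere
listwise <- toy[complete.cases(toy), ]
nrow(listwise)

# Pairwise: for an analysis of x and y, only x and y need to be present
n_xy <- sum(complete.cases(toy[ , c("x", "y")]))
n_xy
r_xy <- cor(toy$x, toy$y, use = "pairwise.complete.obs")
r_xy
```

Pairwise exclusion keeps more data for each analysis, but the sample size can then differ from one analysis to the next.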

What do I do with missing data?

There are multiple estimation methods to “fill-in” missing data if you did not want to remove it.
Mean substitution was a popular way to estimate missing data: each missing data point is replaced with the mean of that variable.
- Conservative - doesn’t change the mean values used to find significant differences.
- Does change the variance, which may cause Type I error with a large number of estimated data points.
Multiple imputation is now the most popular way to estimate missing data points because computing programs have made this process easier.
- Creates a set of plausible values for each missing point, based on the other variables in the dataset.
- Each set of plausible values produces one completed dataset, so the procedure returns several imputed datasets rather than a single best guess.
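
The trade-off of mean substitution can be sketched with a made-up vector (x is hypothetical): the mean stays the same, but the variance shrinks because the filled-in points sit exactly at the mean.

```r
# Hypothetical toy scores with two missing values
x <- c(2, 4, 6, 8, NA, NA)

# Mean substitution: fill each NA with the mean of the observed values
x_sub <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# The mean is unchanged
mean(x, na.rm = TRUE)
mean(x_sub)

# The variance is smaller after substitution
var(x, na.rm = TRUE)
var(x_sub)
```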

Visualize Missing Data

library(VIM, quietly = TRUE)

## VIM is ready to use.

## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

aggr(notypos, numbers = TRUE)

Replacing Missing Data: Rows

We will start with examining missing data by rows, as we can exclude incomplete data and rows with too much missing data to replace.
This table summarizes the missing data: the top row is the percent of data missing for a participant, and the second row is the number of participants (rows) with that percent missing.

percentmiss <- function(x){ sum(is.na(x))/length(x) * 100 }
missing <- apply(notypos, 1, percentmiss)
table(missing)

## missing
##   0   5  10  15  20  70 
## 108  17   4   4   1   3

Replacing Missing Data: Rows

Create a set of rows to potentially replace.
If you know you want to include participants pairwise, you can also create a set of participants whose data you will not replace but will keep in the dataset.

replace_rows <- subset(notypos, missing <= 5) #5%
noreplace_rows <- subset(notypos, missing > 5)

nrow(notypos)

## [1] 137

nrow(replace_rows)

## [1] 125

nrow(noreplace_rows)

## [1] 12
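
The row split can be sketched end to end on a made-up data frame (toy and row_missing are hypothetical names). With only four columns, each missing cell is worth 25%, so the cutoff here is 25 rather than the 5 used for the real data:

```r
# Percent of missing values in a vector
percentmiss <- function(x) { sum(is.na(x)) / length(x) * 100 }

# Hypothetical toy data: rows are missing 0, 1, 1, and 3 of 4 values
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(1, 2, NA, NA),
                  c = c(1, 2, 3, NA),
                  d = c(1, 2, 3, 4))

# Percent missing for each row (margin 1 = rows)
row_missing <- apply(toy, 1, percentmiss)
row_missing

# Split into rows to impute and rows to leave alone
replace_toy <- subset(toy, row_missing <= 25)
noreplace_toy <- subset(toy, row_missing > 25)
nrow(replace_toy)
nrow(noreplace_toy)
```

The two pieces should always add up to the original number of rows, which is a useful check after any split.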

Replacing Missing Data: Columns

Next, we should examine missing data by columns to ensure we do not have MNAR data.
Note: We should check the columns of replace_rows, because excluding the incomplete rows may already have eliminated any issues by column.

apply(replace_rows, 2, percentmiss)

##      Sex      Age      SES    Grade Absences      RS1      RS2      RS3 
##      0.8      3.2      0.0      0.8      0.8      0.0      0.0      0.8 
##      RS4      RS5      RS6      RS7      RS8      RS9     RS10     RS11 
##      0.0      0.0      1.6      0.0      1.6      0.0      0.0      0.8 
##     RS12     RS13     RS14   Health 
##      0.0      1.6      1.6      0.0

Replacing Missing Data: Columns

Now we will exclude columns that we should not replace missing data for (categorical or demographic data).
Sex, Age, SES, and Grade are our demographic variables. However, you can keep variables that have no missing data, because mice uses all available information to estimate the other missing values. Here SES has no missing data, so it stays in the set passed to mice, and only Sex, Age, and Grade (columns 1, 2, and 4) are set aside.

replace_columns <- replace_rows[ , -c(1,2,4)]
noreplace_columns <- replace_rows[ , c(1,2,4)] #notice these are both replace_rows

Replacing Missing Data: Using `mice`

mice() determines the type of each column from its structure and imputes values of that same type.
This function creates several imputations of the data, which we will need to combine back into our original dataset.

library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

temp_no_miss <- mice(replace_columns)

## 
##  iter imp variable
##   1   1  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   1   2  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   1   3  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   1   4  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   1   5  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   2   1  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   2   2  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   2   3  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   2   4  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   2   5  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   3   1  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   3   2  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   3   3  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   3   4  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   3   5  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   4   1  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   4   2  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   4   3  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   4   4  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   4   5  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   5   1  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   5   2  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   5   3  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   5   4  Absences  RS3  RS6  RS8  RS11  RS13  RS14
##   5   5  Absences  RS3  RS6  RS8  RS11  RS13  RS14

Replacing Missing Data: Using `mice`

nomiss <- complete(temp_no_miss, 1) #pick a dataset 1-5 

#combine back together
dim(notypos) #original data from previous step

## [1] 137  20

dim(nomiss) #replaced data

## [1] 125  17

#get all columns 
all_columns <- cbind(noreplace_columns, nomiss)
dim(all_columns)

## [1] 125  20

#get all rows
all_rows <- rbind(noreplace_rows, all_columns)
dim(all_rows)

## [1] 137  20
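
After cbind(), the columns of all_columns are in a different order from those of noreplace_rows. For data frames, rbind() matches columns by name rather than by position, which is why the pieces still line up. A small sketch with made-up data frames (a and b are hypothetical):

```r
# Hypothetical toy pieces with the same columns in a different order
a <- data.frame(id = 1:2, x = c(10, 20), y = c("p", "q"))
b <- data.frame(id = 3L, y = "r", x = 30)

# rbind() on data frames aligns columns by name
combined <- rbind(a, b)
combined
```

The result takes its column order from the first data frame.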

Outliers

Definition - a case with an extreme value on one variable or on multiple variables.
Why does an outlier occur?
- Data input error.
- “Mindless” participant.
- Not a population you meant to sample.
- From the intended population, but the distribution has really long tails and very extreme values.
The logic of removing outliers:
- Many statistics focus on the mean as a model of the data.
- The mean is affected by outliers.
- We will use a strict criterion to only remove very extreme scores.

Outliers: Types

Visual Inspection first: Field et al. (2012) suggest starting with boxplots and histograms to spot obvious errors (e.g., typos like 20.02 instead of 2.02).
There are two (2) types:
- Univariate - when you have one (1) DV or Y variable.
  - We can use Z-scores with an \(\alpha\) of .001 to eliminate very extreme values.
  - This corresponds to a Z-score of +/- 3.29 (Field et al., 2012).
  - This analysis is useful when you only have one column to check.
- Multivariate - when you have multiple continuous variables, measurements, or dependent variables.
  - How can we measure distance for many variables at once?
  - Mahalanobis distance: It creates a distance from the centroid (the mean of the means).
  - However, because we are creating a score based on multiple columns, there is not one rule like Z-scores.
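
The univariate rule can be sketched on a made-up vector (x is hypothetical). The cutoff is taken from the normal distribution for a two-tailed \(\alpha\) of .001, which gives the value of about 3.29 mentioned in the list:

```r
# Hypothetical toy scores; the last value is far from the rest
x <- c(rep(5, 10), rep(4, 5), rep(6, 5), 40)

# Z-scores: (x - mean) / sd
z <- as.numeric(scale(x))

# Two-tailed alpha = .001 gives a cutoff of about 3.29
cutoff <- qnorm(1 - .001 / 2)
cutoff

# Flag scores beyond the cutoff in either direction
outlier <- abs(z) > cutoff
which(outlier)

# Remove only the flagged score, not the participant
x_clean <- ifelse(outlier, NA, x)
```

In very small samples no score can reach a Z of 3.29, so this rule only makes sense with a reasonable number of cases.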

Outliers: Mahalanobis

Mahalanobis distance is evaluated against a \(\chi^2\) distribution (the mahalanobis() function in R returns the squared distance).
This measure is a distance measure:
- Distances are always positive!
- Many scores are close to zero, indicating they are close to the mean of means.
- Few scores are very large, indicating their scores are very different from everyone else.
- This type of pattern is not normally distributed, but rather, \(\chi^2\).
How do we know what is very far away?
- Use a \(\chi^2\) distribution with df equal to the number of variables used to create the score.
- Use \(\alpha\) = .001 as our very strict criterion.
Again, because we save multiple datasets, we can test the analysis with and without outliers to help us determine their impact on our analyses.
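
The cutoff logic can be sketched on a small made-up data frame (toy is hypothetical) with two strongly related variables and one case that breaks the pattern. That case is not extreme on x at all, so a univariate check of x would not flag it:

```r
# Hypothetical toy data: y follows x closely, except for the last row
toy <- data.frame(x = c(1:30, 15),
                  y = c(1:30 + rep(c(-1, 1), 15), 40))

# Distance of each row from the centroid, given the covariance
d2 <- mahalanobis(toy, colMeans(toy), cov(toy))

# Chi-square cutoff: alpha = .001, df = number of variables
cutoff <- qchisq(1 - .001, df = ncol(toy))
cutoff

# Rows beyond the cutoff
which(d2 > cutoff)
```

Most distances are small and only one is very large, which is the right-skewed pattern described above.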

Outliers: Analyze and Eliminate

All variables must be numbers for this analysis, because we are calculating the mean of means.
You could also only analyze for outliers on the data that is provided by participants.
This stage will potentially eliminate all participants who have any remaining missing data because they will not receive a distance score.

## you can use all_columns or all_rows here
## however, all_rows has missing data, which will not get a score
str(all_columns)

## 'data.frame':    125 obs. of  20 variables:
##  $ Sex     : Factor w/ 2 levels "Women","Men": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age     : int  17 16 15 16 15 14 13 15 16 17 ...
##  $ Grade   : int  11 7 6 11 7 6 4 6 8 11 ...
##  $ SES     : Factor w/ 3 levels "Low","Medium",..: 2 3 3 2 3 3 1 2 2 3 ...
##  $ Absences: int  2 2 2 6 2 2 2 2 1 6 ...
##  $ RS1     : int  6 7 5 7 7 2 6 4 3 4 ...
##  $ RS2     : int  4 1 5 4 7 3 6 6 5 3 ...
##  $ RS3     : int  2 1 5 4 7 2 7 6 3 4 ...
##  $ RS4     : int  2 5 7 7 7 3 6 6 2 4 ...
##  $ RS5     : int  4 7 5 4 7 2 1 6 1 4 ...
##  $ RS6     : int  7 7 6 4 7 3 4 6 7 4 ...
##  $ RS7     : int  7 7 5 7 7 3 6 6 4 4 ...
##  $ RS8     : int  4 7 7 7 7 3 6 6 3 7 ...
##  $ RS9     : int  5 4 6 4 7 2 2 6 3 4 ...
##  $ RS10    : int  7 7 7 7 7 2 5 6 3 4 ...
##  $ RS11    : int  4 1 6 7 7 3 6 6 3 6 ...
##  $ RS12    : int  7 7 6 4 7 3 6 6 2 5 ...
##  $ RS13    : int  4 4 6 7 7 2 6 6 5 6 ...
##  $ RS14    : int  7 7 6 4 7 3 2 6 3 4 ...
##  $ Health  : int  6 6 2 6 4 6 1 2 3 1 ...

mahal <- mahalanobis(all_columns[ , -c(1,4)],
                    colMeans(all_columns[ , -c(1,4)], na.rm=TRUE),
                    cov(all_columns[ , -c(1,4)], use ="pairwise.complete.obs"))

Outliers: Analyze and Eliminate

## remember to match the number of columns
cutoff <- qchisq(1-.001, ncol(all_columns[ , -c(1,4)]))

## df and cutoff
ncol(all_columns[ , -c(1,4)])

## [1] 18

cutoff

## [1] 42.3124

##how many outliers? Look at FALSE
summary(mahal < cutoff)

##    Mode   FALSE    TRUE    NA's 
## logical       1     119       5

## eliminate
noout <- subset(all_columns, mahal < cutoff)
dim(all_columns)

## [1] 125  20

dim(noout)

## [1] 119  20
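
Only one case failed the cutoff, yet six rows were removed. subset() drops rows where the condition is NA as well as rows where it is FALSE, so the five cases without a distance score are excluded too. A sketch with made-up scores (toy and toy_cutoff are hypothetical) shows the difference and one way to keep the unscored rows if that is preferred:

```r
# Hypothetical toy scores; one is extreme and one is missing
toy <- data.frame(id = 1:4,
                  score = c(2.1, 50, NA, 3.3))
toy_cutoff <- 42

# subset() removes the extreme row AND the row with no score
dropped_na <- subset(toy, score < toy_cutoff)
nrow(dropped_na)

# Keep rows without a score by allowing NA explicitly
kept_na <- toy[is.na(toy$score) | toy$score < toy_cutoff, ]
nrow(kept_na)
```

Which version is appropriate depends on whether the later analyses can handle participants with missing data.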
