Programming Suggestion

We will be using ggplot2 to create our data visualizations.
The code can get very long, so you should stack it: put each layer on its own line!
Stacked code is easier to read and to troubleshoot, because you can see what each layer is doing.

textline +
  stat_summary(fun = mean,
               geom = "point") +
  stat_summary(fun = mean,
               geom = "line", 
               aes(group = Group)) +
  stat_summary(fun.data = mean_cl_normal,
               geom = "errorbar",
               width = .2) + 
  xlab("Measurement Time") +
  ylab("Mean Grammar Score") +
  cleanup +
  scale_color_manual(name = "Texting Option",
                     labels = c("All the texts", "None of the texts"),
                     values = c("Black", "Grey")) +
  scale_x_discrete(labels = c("Baseline", "Six Months"))
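
As a small, hedged illustration of why stacking helps with troubleshooting, the sketch below uses the built-in mtcars data (an assumption for illustration, not one of the lecture datasets). With one layer per line, a single layer can be commented out without touching the rest of the plot.

```r
library(ggplot2)

#mtcars ships with R; it stands in for the lecture data (illustration only)
stacked <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  #geom_smooth(method = "lm", formula = y ~ x) +  #switched off while troubleshooting
  xlab("Weight (1000 lbs)") +
  ylab("Miles per Gallon")

stacked #draws the scatterplot without the regression line
```

Removing the leading # on the geom_smooth() line turns that layer back on.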

Outline

Reminder how to import files
Two key issues for data preparation: factors and wide vs. long format
Graphs:
- Histograms
- Scatterplots
- Bar Graphs
- Line Graphs

Working with Files

Let’s load the SPSS file “ChickFlick.sav”.
Note that I have this saved in a data folder that is in the same folder as my Markdown file.
Remember that the rio library can usually interpret these files without much work using the import() function.

library(rio)
chickflick <- import("data/ChickFlick.sav")
str(chickflick)

## 'data.frame':    40 obs. of  3 variables:
##  $ gender : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Gender of Participant"
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "labels")= Named num [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "Male" "Female"
##  $ film   : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Name of Film"
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "display_width")= int 18
##   ..- attr(*, "labels")= Named num [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "Bridget Jones's Diary" "Memento"
##  $ arousal: num  22 13 16 10 18 24 13 14 19 23 ...
##   ..- attr(*, "label")= chr "Psychological Arousal During the Film"
##   ..- attr(*, "format.spss")= chr "F8.0"

Factor Categorical Variables

We saw on the previous slide that the gender and film variables have embedded value labels from SPSS.
However, import() leaves them as numeric variables rather than factors, and factors are what we want for making plots.

table(chickflick$gender)

## 
##  1  2 
## 20 20

table(chickflick$film)

## 
##  1  2 
## 20 20

How to Factor

chickflick$gender <- factor(chickflick$gender, #the variable you want to factor
                            levels = c(1,2), #the information already in the data
                            labels = c("Male", "Female")) #the labels for those levels

table(chickflick$gender)

## 
##   Male Female 
##     20     20
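
The levels and labels typed by hand above are the same values stored in the "labels" attribute that str() showed after import. A minimal sketch of reusing that attribute, assuming a small simulated column (gender_sim is a hypothetical stand-in, not the real ChickFlick data):

```r
#simulated stand-in for an imported SPSS column; the labels match the str() output
gender_sim <- c(1, 1, 2, 2, 1)
attr(gender_sim, "labels") <- c(Male = 1, Female = 2)

spss_labels <- attr(gender_sim, "labels") #named numeric vector
gender_f <- factor(gender_sim,
                   levels = unname(spss_labels), #the codes: 1, 2
                   labels = names(spss_labels))  #the text: "Male", "Female"
table(gender_f)
```

This avoids retyping the labels, but it only works when the attribute is actually present on the column.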

Data Structure Format

Wide datasets: rows are participants, and columns are variables.
- Each participant gets one row, while each variable gets one column.
Long datasets: rows are assessments, and columns are variables.
- Each time point or repeated assessment gets a separate row, while each column still represents a variable.
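
To make the difference concrete, here is a tiny base R sketch with three hypothetical participants measured twice (the names wide and long, and the scores, are made up for illustration):

```r
#wide: one row per participant, one column per measurement
wide <- data.frame(ID = 1:3,
                   Pre = c(53, 62, 52),
                   Post = c(74, 67, 33))

#long: one row per assessment, so each ID appears twice
long <- data.frame(ID = rep(wide$ID, times = 2),
                   Time = rep(c("Pre", "Post"), each = nrow(wide)),
                   Score = c(wide$Pre, wide$Post))
wide
long
```

The same six scores appear in both versions; only the arrangement changes.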

Rearrange Data: Wide to Long

I wanted to test the Disney philosophy that ‘Wishing upon a star makes all your dreams come true’.
I measured the success of 250 people using a composite measure involving their salary, quality of life, and how closely their life matches their aspirations.
Success was measured using a standardized technique ranging from 0 (complete failure) to 100 (complete success).
Participants were randomly allocated to either wish upon a star or work as hard as they can for the next 5 years.
Success was measured again after 5 years.

library(reshape) #note: modern alternative is tidyr::pivot_longer()
cricket <- import("data/Jiminy Cricket.csv")
head(cricket)

##   ID Strategy Success_Pre Success_Post
## 1  1        1          53           74
## 2  2        1          62           67
## 3  3        1          52           33
## 4  4        1          57           62
## 5  5        1          55           44
## 6  6        1          52           65

Rearrange Data: Wide to Long

id.vars = c("column", "column") - constant variables you do not want to change. These will stay in their own columns but get repeated when necessary.
measure.vars = c("column", "column") - dependent variables you want to combine into one column.

longcricket <- melt(cricket,  #name of dataset
                    id.vars = c("ID", "Strategy"), 
                    measure.vars = c("Success_Pre", "Success_Post"))
#you can actually leave measure.vars blank: all non-id columns are melted
head(longcricket)

##   ID Strategy    variable value
## 1  1        1 Success_Pre    53
## 2  2        1 Success_Pre    62
## 3  3        1 Success_Pre    52
## 4  4        1 Success_Pre    57
## 5  5        1 Success_Pre    55
## 6  6        1 Success_Pre    52

Rearrange Data: Wide to Long

Unfortunately, our melted columns become variable and value, which are not very useful names.
We should rename them for clarity!

colnames(longcricket)[3:4] #just to figure out which ones

## [1] "variable" "value"

colnames(longcricket)[3:4] <- c("Time", "Score")
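
The comment on the library(reshape) line mentions tidyr::pivot_longer() as a modern alternative. A hedged sketch of that route is below; it assumes the tidyr package is installed and uses only the six rows printed by head(cricket), stored in a hypothetical cricket_head data frame. The new columns are named in the same call, so no renaming step is needed.

```r
library(tidyr)

#the six rows shown by head(cricket); the full file is not reproduced here
cricket_head <- data.frame(ID = 1:6,
                           Strategy = rep(1, 6),
                           Success_Pre = c(53, 62, 52, 57, 55, 52),
                           Success_Post = c(74, 67, 33, 62, 44, 65))

longcricket2 <- pivot_longer(cricket_head,
                             cols = c(Success_Pre, Success_Post), #columns to stack
                             names_to = "Time",    #old column names go here
                             values_to = "Score")  #the scores go here

#pivot_longer() returns Time as text, so set the level order for plotting
longcricket2$Time <- factor(longcricket2$Time,
                            levels = c("Success_Pre", "Success_Post"))
head(longcricket2)
```

Note that the row order differs from melt(): each participant's two rows sit next to each other.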

The Art of Presenting Data

Graphs should (Tufte, 2001):
- Show the data.
- Induce the reader to think about the data being presented (rather than some other aspect of the graph).
- Avoid distorting the data.
- Present many numbers with minimum ink.
- Make large data sets (assuming you have one) coherent.
- Encourage the reader to compare different pieces of data.
- Reveal data.

Why is this Graph Bad?

Other Bad Design Choices

3D charts need to be well-made (check out plotly).
Patterns (depending)
Cylindrical bars
Bad axis labels
Overlays

Why is this Graph Better?

Do Not Deceive the Reader!

Plotting in R

The great thing about ggplot2 is that it is very flexible and well documented!
The bad thing about ggplot2 is that there’s a lot going on.
You will need Hmisc for confidence interval error bars.

library(ggplot2)
library(Hmisc)

Working with ggplot2

How ggplot works: First you define the basic structure of a plot you want.

#an example 
myGraph <- ggplot(dataset,
                  aes(x_axis, y_axis, 
                      color = legend_var, 
                      fill = legend_var))

Working with ggplot2

Now we have myGraph saved with all the variables, but no visualization.
Next, you can add options to create new layers on the graph.

#an example part 2
myGraph + 
  geom_bar() +
  geom_point() +
  xlab("X Axis Label") + 
  ylab("Y Axis Label")
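
The template above uses placeholder names (dataset, x_axis, y_axis, legend_var), so it will not run as written. A minimal runnable sketch of the same two-step pattern follows, assuming the built-in iris data as a stand-in. It uses only geom_point(), because geom_bar() counts rows by default and stops with an error when a y variable is also mapped.

```r
library(ggplot2)

#step 1: define the structure (iris is a built-in stand-in dataset)
irisGraph <- ggplot(iris,
                    aes(Sepal.Length, Sepal.Width,
                        color = Species))

#step 2: add layers and labels
irisGraph +
  geom_point() +
  xlab("Sepal Length (cm)") +
  ylab("Sepal Width (cm)")
```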

Histograms

Histograms plot:
- The continuous score (x-axis)
- The frequency (y-axis)
Histograms help us to identify:
- The shape of the distribution, described by skew and kurtosis.
- The spread or variation in scores.
- Unusual scores.

Histogram: Example

crickethist <- ggplot(data = cricket, #dataset
                      aes(x = Success_Pre) #only define X axis 
                      )
crickethist

Histogram: Example

Our current plot is blank, so let’s add the histogram bars:

crickethist + 
  geom_histogram()

Histogram: Example

Make the bins different:

crickethist + 
  geom_histogram(binwidth = 1)
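
To see what binwidth changes, the sketch below counts the bins ggplot2 builds for two different widths. It assumes a hypothetical toy data frame holding only the six Success_Pre values printed by head(cricket), and uses layer_data() to look at the computed bins instead of the picture.

```r
library(ggplot2)

#first six Success_Pre values from head(cricket); not the full dataset
toy <- data.frame(Success_Pre = c(53, 62, 52, 57, 55, 52))
toyhist <- ggplot(toy, aes(x = Success_Pre))

narrow <- layer_data(toyhist + geom_histogram(binwidth = 1))
wide <- layer_data(toyhist + geom_histogram(binwidth = 5))

nrow(narrow) #many narrow bins
nrow(wide)   #fewer, wider bins
sum(narrow$count) #every score is counted once
sum(wide$count)   #same total, different grouping
```

A smaller binwidth shows more detail but more noise; a larger one smooths the shape.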

Histogram: Example

You can change the color:
- Color is the outline of the bars
- Fill is the interior of the bars

crickethist + 
  geom_histogram(binwidth = 1, color = 'purple', fill = 'magenta')

Histogram: Example

Let’s add some labels:

crickethist + 
  geom_histogram(binwidth = 1, color = 'purple', fill = 'magenta') + 
  xlab("Success Pre Test") + 
  ylab("Frequency")

Histogram: Example 2

This dataset includes festival attendees who indicated their hygiene score on each day of the festival from 0 (eau de toilet) to 4 (eau de toilette).

festival <- import("data/festival.csv")
str(festival)

## 'data.frame':    810 obs. of  5 variables:
##  $ ticknumb: int  2111 2229 2338 2384 2401 2405 2467 2478 2490 2504 ...
##  $ gender  : chr  "Male" "Female" "Male" "Female" ...
##  $ day1    : num  2.64 0.97 0.84 3.03 0.88 0.85 1.56 3.02 2.29 1.11 ...
##  $ day2    : num  1.35 1.41 NA NA 0.08 NA NA NA NA 0.44 ...
##  $ day3    : num  1.61 0.29 NA NA NA NA NA NA NA 0.55 ...

Histogram: Example 2

In the previous example, we added layers one at a time to show what each one does.
You can run the whole graph at once!
Create the plot object and add all of the layers together:

festivalhist <- ggplot(data = festival, aes(x = day1)) 
festivalhist + 
  geom_histogram(binwidth = 1, color = 'blue') + 
  xlab("Day 1 of Festival Hygiene") +
  ylab("Frequency") +
  theme_bw() #theme_classic() also good!

Focus on these Facets

All graphs should have:
- X and Y axis labels.
- The use of titles depends on where the graph is used.
- Labels for the legend, facets, other group markers.
- Error bars when appropriate.
- Readable/cleaned up!

Clean Up?

theme_bw() and theme_classic() are great, easy themes to make graphs presentable.
Another important facet may be the font size, which is fairly small by default.
Additionally, we will want to eliminate the default gray background behind the legend keys.

cleanup <- theme(panel.grid.major = element_blank(), #no grid lines
                panel.grid.minor = element_blank(), #no grid lines
                panel.background = element_blank(), #no background
                axis.line.x = element_line(color = 'black'), #black x axis line
                axis.line.y = element_line(color = 'black'), #black y axis line
                legend.key = element_rect(fill = 'white'), #no legend background
                text = element_text(size = 15)) #bigger text size

Clean Up?

Save that code and then you can just do graph + cleanup.

festivalhist + 
  geom_histogram(binwidth = 1, color = 'blue') + 
  xlab("Day 1 of Festival Hygiene") +
  ylab("Frequency") +
  cleanup
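
Because cleanup is an ordinary theme object, the same saved code works on any plot. The sketch below repeats the definition so that it runs on its own and applies it to a histogram of the built-in mtcars data (a stand-in, not a lecture dataset). theme_classic(base_size = 15) is shown as a roughly similar built-in look, not an exact match.

```r
library(ggplot2)

#same definition as above, repeated so this block is self-contained
cleanup <- theme(panel.grid.major = element_blank(),
                 panel.grid.minor = element_blank(),
                 panel.background = element_blank(),
                 axis.line.x = element_line(color = 'black'),
                 axis.line.y = element_line(color = 'black'),
                 legend.key = element_rect(fill = 'white'),
                 text = element_text(size = 15))

mpghist <- ggplot(mtcars, aes(x = mpg)) + #mtcars is a built-in stand-in
  geom_histogram(binwidth = 2, color = 'blue') +
  xlab("Miles per Gallon") +
  ylab("Frequency")

mpghist + cleanup #custom theme
mpghist + theme_classic(base_size = 15) #similar built-in alternative
```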

Scatterplots

Simple scatter: X and Y are continuous variables
Grouped scatter: X and Y are continuous variables with legend for a third variable (generally categorical).

Scatterplots: Example

Anxiety and Exam Performance: data from 103 students on their exam anxiety, time spent revising for the exam, exam performance, and gender.

exam <- import("data/Exam Anxiety.csv")
str(exam)

## 'data.frame':    103 obs. of  5 variables:
##  $ Code   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Revise : int  4 11 27 53 4 22 16 21 25 18 ...
##  $ Exam   : int  40 65 80 80 40 70 20 55 50 40 ...
##  $ Anxiety: num  86.3 88.7 70.2 61.3 89.5 ...
##  $ Gender : int  1 2 1 1 1 2 2 2 2 2 ...

Scatterplots: Example

Gender is currently stored as a number (an integer), so we should convert it to a factor so that it is labeled appropriately when used on a graph.

table(exam$Gender)

## 
##  1  2 
## 52 51

exam$Gender <- factor(exam$Gender,
                     levels = c(1,2),
                     labels = c("Male", "Female"))
table(exam$Gender)

## 
##   Male Female 
##     52     51

Simple Scatterplot

scatter <- ggplot(exam, aes(Anxiety, Exam))
scatter +
  geom_point() +
  xlab("Anxiety Score") +
  ylab("Exam Score") +
  cleanup

Simple Scatterplot with Regression Line

scatter + 
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, 
              color = 'black', fill = 'blue') +
  xlab('Anxiety Score') +
  ylab('Exam Score') +
  cleanup
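
geom_smooth(method = 'lm', formula = y ~ x) draws the line from an ordinary least-squares fit of y on x, which is the same model that lm() fits. As a rough look at what that line represents, the sketch below fits lm() to the first five Anxiety and Exam values as displayed (rounded) in the str(exam) output. Five rows are far too few for real inference; exam_head is a hypothetical name used only for illustration.

```r
#first five rows of exam as printed by str(); Anxiety is rounded there
exam_head <- data.frame(Anxiety = c(86.3, 88.7, 70.2, 61.3, 89.5),
                        Exam = c(40, 65, 80, 80, 40))

fit <- lm(Exam ~ Anxiety, data = exam_head)
coef(fit) #intercept and slope of the fitted line

#height of the line at an anxiety score of 75
predict(fit, newdata = data.frame(Anxiety = 75))
```

In these five rows the slope is negative: higher anxiety goes with lower exam scores.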

Grouped Scatterplot

How to control the colors and fill with legends.
- scale_fill_manual(name, labels, values)
- scale_color_manual(name, labels, values)

Grouped Scatterplot with Regression Line

scatter2 <- ggplot(exam, aes(Anxiety, Exam, 
                             color = Gender, fill = Gender)) #why both?
scatter2 +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  xlab("Anxiety Score") +
  ylab("Exam Score") +
  cleanup + 
  scale_fill_manual(name = "Gender of Participant",
                    labels = c("Men", "Women"),
                    values = c("purple", "grey")) +
  scale_color_manual(name = "Gender of Participant",
                     labels = c("Men", "Women"),
                     values = c("purple", "grey10"))

`GGally` for Multiple Visualization

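GGally has a plotting function that uses ggplot2 as a back end.
The ggpairs() function creates a matrix of pairwise plots for the variables in a dataset: scatterplots for pairs of numeric variables, and other plot types when a categorical variable such as Gender is involved.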

library(GGally)
ggpairs(data = exam[ , -1], #no participant variable
        title = "Exam Anxiety, Scores, and Gender")

Bar Graphs

Bar graphs are often used to display mean scores to allow comparison between groups.
Error bars can be added to display the precision of the mean:
- The confidence interval
- The standard deviation
- The standard error
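
These three quantities are easy to compute by hand. The sketch below uses only the first ten arousal values shown in the str(chickflick) output (not the full 40 rows), stored in a hypothetical vector arousal_head. The confidence interval is based on the t distribution, which is also the default approach behind mean_cl_normal (a wrapper around Hmisc::smean.cl.normal).

```r
#first ten arousal scores as printed by str(chickflick)
arousal_head <- c(22, 13, 16, 10, 18, 24, 13, 14, 19, 23)

n <- length(arousal_head)
m <- mean(arousal_head)  #the mean
s <- sd(arousal_head)    #the standard deviation
se <- s / sqrt(n)        #the standard error of the mean

#95% confidence interval: mean plus or minus t critical value times SE
ci <- m + c(-1, 1) * qt(0.975, df = n - 1) * se

c(mean = m, sd = s, se = se, lower = ci[1], upper = ci[2])
```

Error bars based on the standard deviation are the widest of the three here, and standard error bars are the narrowest.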

Bar Graph: One Independent Variable

Is there such a thing as a ‘chick flick’?
- Twenty men and twenty women were assigned to watch one of two movies: A ‘chick flick’ (Bridget Jones’s Diary) or the control movie (Memento).
- Physiological arousal was used as an indicator of how much they enjoyed the film.

Bar Chart: One Independent Variable

Be sure to convert all factor variables, otherwise they will be treated as continuous (or you’ll get strange errors!).

str(chickflick) #already fixed gender

## 'data.frame':    40 obs. of  3 variables:
##  $ gender : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ film   : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Name of Film"
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "display_width")= int 18
##   ..- attr(*, "labels")= Named num [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "Bridget Jones's Diary" "Memento"
##  $ arousal: num  22 13 16 10 18 24 13 14 19 23 ...
##   ..- attr(*, "label")= chr "Psychological Arousal During the Film"
##   ..- attr(*, "format.spss")= chr "F8.0"

chickflick$film <- factor(chickflick$film,
                    levels = c(1,2),
                    labels = c("Bridget Jones", "Memento"))

Bar Chart: One Independent Variable Example

To display the means as bars, we can add a layer to chickbar using the stat_summary() function:

chickbar <- ggplot(chickflick, aes(film, arousal))
chickbar + 
  stat_summary(fun = mean,
               geom = "bar",
               fill = "White", 
               color = "Black") +
  cleanup
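
stat_summary(fun = mean, geom = "bar") computes one mean per x category and draws it as a bar. As a sanity check, the sketch below uses a small made-up data frame (toy holds hypothetical values, not the ChickFlick data) and compares the bar heights that ggplot2 computes with the group means from tapply().

```r
library(ggplot2)

#hypothetical scores: four viewers per film
toy <- data.frame(film = factor(rep(c("Bridget Jones", "Memento"), each = 4)),
                  arousal = c(22, 13, 16, 10, 25, 28, 21, 30))

toybar <- ggplot(toy, aes(film, arousal)) +
  stat_summary(fun = mean,
               geom = "bar")

bar_heights <- layer_data(toybar)$y #what ggplot2 will draw
group_means <- tapply(toy$arousal, toy$film, mean) #computed by hand

bar_heights
group_means
```

If the two sets of numbers ever disagree, check that the grouping variable is a factor.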

Bar Chart: One Independent Variable Example

To add error bars, add these as a layer using stat_summary():

chickbar + 
  stat_summary(fun = mean,
               geom = "bar",
               fill = "White", 
               color = "Black") +
  stat_summary(fun.data = mean_cl_normal, 
               geom = "errorbar", 
               position = position_dodge(width = 0.90), 
               width = 0.2) +
  cleanup

Bar Chart: One Independent Variable Example

Now, add the rest of the things we’ve been doing:

chickbar + 
  stat_summary(fun = mean,
               geom = "bar",
               fill = "White", 
               color = "Black") +
  stat_summary(fun.data = mean_cl_normal, 
               geom = "errorbar", 
               position = position_dodge(width = 0.90), 
               width = 0.2) +
  xlab("Movie Watched by Participant") +
  ylab("Arousal Level") +
  cleanup +
  scale_x_discrete(labels = c("Girl Film", "Guy Film"))

Bar Chart: Two Independent Variables

chickbar2 <- ggplot(chickflick, aes(film, arousal, fill = gender))
chickbar2 +
  stat_summary(fun = mean,
               geom = "bar",
               position = "dodge") +
  stat_summary(fun.data = mean_cl_normal,
               geom = "errorbar", 
               position = position_dodge(width = 0.90),
               width = .2) +
  xlab("Film Watched") +
  ylab("Arousal Level") + 
  cleanup +
  scale_fill_manual(name = "Gender of Participant", 
                    labels = c("Boys", "Girls"),
                    values = c("Gray30", "Gray"))

Line Graphs

When to use a line graph:
- When X is categorical but can be considered “mildly continuous”
- Usually with repeated measures data over time
- The example here is not quite appropriate for a line graph but does include repeated measures data for practicing restructuring data.

Line Graphs: One Independent Variable

How to cure hiccups? Fifteen participants were measured at baseline and after each of three hiccup cures.
The dependent variable was the number of hiccups in the minute after each procedure.

hiccups <- import("data/Hiccups.csv")
str(hiccups)

## 'data.frame':    15 obs. of  4 variables:
##  $ Baseline: int  15 13 9 7 11 14 20 9 17 19 ...
##  $ Tongue  : int  9 18 17 15 18 8 3 16 10 10 ...
##  $ Carotid : int  7 7 5 10 7 10 7 12 9 8 ...
##  $ Other   : int  2 4 4 5 4 3 3 3 4 4 ...

Line Graphs: One Independent Variable

These data are in the wrong format for ggplot2 to use.
We need all of the scores stacked up in a single column and then another variable that specifies the type of intervention.

longhiccups <- melt(hiccups, 
                    measure.vars = c("Baseline", "Tongue", "Carotid", "Other"))
str(longhiccups)

## 'data.frame':    60 obs. of  2 variables:
##  $ variable: Factor w/ 4 levels "Baseline","Tongue",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value   : int  15 13 9 7 11 14 20 9 17 19 ...

colnames(longhiccups) <- c("Intervention", "Hiccups")

Line Graphs: One Independent Variable

hiccupline <- ggplot(longhiccups, aes(Intervention, Hiccups))
hiccupline +
  stat_summary(fun = mean, ##adds the points
               geom = "point") +
  stat_summary(fun = mean, ##adds the line
               geom = "line",
               aes(group=1)) + ##necessary for mapping line to dots
  stat_summary(fun.data = mean_cl_normal, ##adds the error bars
               geom = "errorbar", 
               width = .2) +
  xlab("Intervention Type") +
  ylab("Number of Hiccups") + 
  cleanup

Line Graphs: Two Independent Variables

Is text-messaging bad for your grammar? Fifty children were split into two groups (between-subjects): text-messaging allowed or text-messaging forbidden.
The second variable was time (repeated measures): each child was measured at baseline and again six months later.
The dependent variable was the percent correct score on a grammar test.

Line Graphs: Two Independent Variables

texting <- import("data/Texting.xlsx")
str(texting)

## 'data.frame':    50 obs. of  3 variables:
##  $ Group     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Baseline  : num  52 68 85 47 73 57 63 50 66 60 ...
##  $ Six_months: num  32 48 62 16 63 53 59 58 59 57 ...

Line Graphs: Two Independent Variables

Two issues to clean up:
- Correctly label the groups by converting Group to a factor variable.
- Restructure the data so the repeated measures variable is in long format. When you restructure the data, you should also rename the columns.

texting$Group <- factor(texting$Group,
                       levels = c(1,2),
                       labels = c("Texting Allowed", "No Texting Allowed"))
longtexting <- melt(texting,
                   id.vars = c("Group"),
                   measure.vars = c("Baseline", "Six_months"))
str(longtexting)

## 'data.frame':    100 obs. of  3 variables:
##  $ Group   : Factor w/ 2 levels "Texting Allowed",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ variable: Factor w/ 2 levels "Baseline","Six_months": 1 1 1 1 1 1 1 1 1 1 ...
##  $ value   : num  52 68 85 47 73 57 63 50 66 60 ...

colnames(longtexting) <- c("Group", "Time", "Grammar_Score")

Line Graphs: Two Independent Variables

textline <- ggplot(longtexting, aes(Time, Grammar_Score, color = Group))
textline +
  stat_summary(fun = mean,
               geom = "point") +
  stat_summary(fun = mean,
               geom = "line", 
               aes(group = Group)) + #Group is the variable name
  stat_summary(fun.data = mean_cl_normal,
               geom = "errorbar",
               width = .2) + 
  xlab("Measurement Time") +
  ylab("Mean Grammar Score") +
  cleanup +
  scale_color_manual(name = "Texting Option",
                     labels = c("All the texts", "None of the texts"),
                     values = c("Black", "Grey")) +
  scale_x_discrete(labels = c("Baseline", "Six Months"))

Summary

In this lecture, you have learned:
- How to clean up your data for visualizations: factor() and melt()
- Ideas for good visualizations (Tufte principles)
- Histograms, scatterplots, bar charts, and line graphs
- ggplot2 structure and layered grammar of graphics
- Error bars with confidence intervals
- Custom themes for professional presentations
