Correlation

Ziyuan Huang

Last Updated: 2026-01-22

Machine: MEL_DEL - Windows 10 x64

What is a Correlation?

Visualizing Correlation

Using Scatterplots

# Load required libraries
library(rio)        # Data import
library(ggplot2)    # Visualization
library(corrplot)   # Correlation matrix visualization
library(GGally)     # ggpairs for scatterplot matrices
library(Hmisc)      # rcorr() function
library(ppcor)      # Partial correlations

# Define cleanup theme for professional graphs
cleanup <- theme(panel.grid.major = element_blank(), 
                panel.grid.minor = element_blank(), 
                panel.background = element_blank(), 
                axis.line.x = element_line(color = "black"),
                axis.line.y = element_line(color = "black"),
                legend.key = element_rect(fill = "white"),
                text = element_text(size = 15))

# Import datasets
exam <- import("data/exam_data.csv")
liar <- import("data/liar_data.csv")

Using Scatterplots

# Basic scatterplot (from the chapter 5 notes)
scatter <- ggplot(exam, aes(Anxiety, Exam))
scatter +
  geom_point() +
  xlab("Anxiety Score") +
  ylab("Exam Score") +
  cleanup

Using Scatterplots

# From the chapter 5 notes, plus coord_cartesian() to set the axis limits
scatter <- ggplot(exam, aes(Anxiety, Exam))
scatter +
  geom_point() +
  xlab("Anxiety Score") +
  ylab("Exam Score") +
  cleanup +
  # Example limits only; in practice, use the actual scale of the data
  coord_cartesian(xlim = c(50, 100), ylim = c(0, 100))

Modeling Relationships

Measuring Relationships

Revision of Variance

\[SD^2 = \frac {\sum(X_i-\bar{X})^2}{N-1}\]

\[SD^2 = \frac {\sum(X_i-\bar{X})(X_i-\bar{X})}{N-1}\]

Revision of Variance

\[Cov(x,y) = \frac {\sum(X_i-\bar{X})(Y_i-\bar{Y})}{N-1} \]

Revision of Variance: Calculating Variance

var(exam$Revise)
## [1] 329.7531
var(exam$Exam)
## [1] 672.9138

Revision of Variance: Calculating Covariance

cov(exam$Revise, exam$Exam)
## [1] 186.8784
plot(exam$Revise, exam$Exam)
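
The variance and covariance formulas can be checked by hand against var() and cov(). The following is a minimal sketch on a small made-up data set; the vectors `revise` and `score` are illustrative assumptions and are not taken from exam_data.csv.

```r
# Hypothetical toy data (not from exam_data.csv)
revise <- c(4, 11, 27, 53, 4, 22)
score  <- c(40, 65, 80, 80, 40, 70)

n <- length(revise)

# Covariance: sum of cross-products of deviations, divided by N - 1
cov_manual <- sum((revise - mean(revise)) * (score - mean(score))) / (n - 1)

# Variance: the same formula with a variable paired with itself
var_manual <- sum((revise - mean(revise))^2) / (n - 1)

cov_manual
cov(revise, score)   # built-in result, should match cov_manual
var_manual
var(revise)          # built-in result, should match var_manual
```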

Problems with Covariance

The Correlation Coefficient

\[r = \frac{Cov(x,y)}{S_xS_y}\]

\[r = \frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{(N-1)S_xS_y} \]
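
Because covariance is expressed in the units of the two variables, its size changes whenever a variable is rescaled, while r does not. A short sketch of that idea on simulated data; `hours`, `minutes`, and `score` are hypothetical names, not columns of the exam data.

```r
set.seed(1)  # hypothetical simulated data
hours <- rnorm(50, mean = 20, sd = 5)
score <- 40 + 1.5 * hours + rnorm(50, sd = 10)

# r = Cov(x, y) / (S_x * S_y)
r_manual <- cov(hours, score) / (sd(hours) * sd(score))

# Re-express the same revision time in minutes
minutes <- hours * 60

cov(hours, score)     # covariance in hours x score units
cov(minutes, score)   # 60 times larger, same relationship
cor(hours, score)     # matches r_manual
cor(minutes, score)   # unchanged by the rescaling
```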

The Correlation Coefficient

R versus r

Correlation: Example

cor(exam$Revise, exam$Exam, use = "complete.obs")
## [1] 0.3967207

Visualizing Correlations: Modern Approaches

# Quick scatterplot matrix with ggpairs()
ggpairs(exam[, c("Exam", "Anxiety", "Revise")],
        title = "Scatterplot Matrix of Exam Data",
        lower = list(continuous = "smooth"),
        diag = list(continuous = "barDiag"))

Assumptions for Correlation

Correlation: Understanding the NHST

Correlation: How to Calculate

Correlation: Three Main Functions in R

Function     Methods             Multiple Vars   p-values   CI                   Best Use
cor()        All 3               Yes             No         No                   Quick exploration
rcorr()      Pearson, Spearman   Yes             Yes        No                   Matrix with p-values
cor.test()   All 3               No              Yes        Yes (Pearson only)   Single pair, full stats
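
The difference between the three functions shows up in what each one returns. The sketch below uses the built-in mtcars data as a stand-in (an assumption made so the example runs without the course files), and it calls rcorr() only if the Hmisc package is installed.

```r
# Built-in mtcars stands in for the exam data (illustration only)
vars <- mtcars[, c("mpg", "wt", "hp")]

# cor(): a matrix of coefficients only, with any of the three methods
r_matrix <- cor(vars, method = "spearman")

# cor.test(): one pair at a time, with test statistic, p-value, and a CI for Pearson
pair_test <- cor.test(vars$mpg, vars$wt, method = "pearson")

# rcorr(): coefficient matrix plus a matrix of p-values (requires Hmisc)
if (requireNamespace("Hmisc", quietly = TRUE)) {
  rc <- Hmisc::rcorr(as.matrix(vars), type = "pearson")
  print(round(rc$P, 4))
}

round(r_matrix, 2)
pair_test$conf.int
```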

Correlation Matrix: Numeric Methods

# Pearson correlation matrix
cor(exam[ , -1], 
    use = "pairwise.complete.obs",  # Handle missing data
    method = "pearson")
##              Revise         Exam     Anxiety       Gender
## Revise   1.00000000  0.396720697 -0.70924926  0.085350584
## Exam     0.39672070  1.000000000 -0.44099341 -0.004674066
## Anxiety -0.70924926 -0.440993412  1.00000000 -0.002365840
## Gender   0.08535058 -0.004674066 -0.00236584  1.000000000
# Kendall's tau (better for small samples, tied ranks)
cor(exam[ , -1], 
    use = "pairwise.complete.obs", 
    method = "kendall")
##              Revise         Exam     Anxiety       Gender
## Revise   1.00000000  0.263325853 -0.48856004  0.099160691
## Exam     0.26332585  1.000000000 -0.28479188 -0.007164456
## Anxiety -0.48856004 -0.284791876  1.00000000  0.018699690
## Gender   0.09916069 -0.007164456  0.01869969  1.000000000
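
The `use` argument decides which rows enter each coefficient when data are missing. A minimal sketch, assuming a made-up data frame `df` with a single missing value, contrasts listwise deletion ("complete.obs") with pairwise deletion ("pairwise.complete.obs").

```r
# Hypothetical data with one missing value in z
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 5),
  z = c(NA, 5, 3, 6, 2, 4)
)

# Listwise deletion: row 1 is dropped for every pair, including x-y
listwise <- cor(df, use = "complete.obs")

# Pairwise deletion: x-y uses all 6 rows; only pairs involving z drop row 1
pairwise <- cor(df, use = "pairwise.complete.obs")

listwise["x", "y"]   # based on rows 2-6
pairwise["x", "y"]   # based on rows 1-6
```

Pairwise deletion keeps more data, but each coefficient in the matrix can then rest on a different subset of cases.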

Visualizing Correlation Matrix

# Create correlation matrix
cor_matrix <- cor(exam[ , -1], use = "pairwise.complete.obs")

# Visualize with corrplot
corrplot(cor_matrix, 
         method = "color",        # Color-coded cells
         type = "upper",          # Show upper triangle only
         addCoef.col = "black",   # Add correlation coefficients
         tl.col = "black",        # Text label color
         tl.srt = 45,             # Text label rotation
         diag = FALSE)            # Hide diagonal

Correlation: rcorr() with p-values

# rcorr() requires matrix input
rcorr(as.matrix(exam[ , -1]), type = "pearson")
##         Revise  Exam Anxiety Gender
## Revise    1.00  0.40   -0.71   0.09
## Exam      0.40  1.00   -0.44   0.00
## Anxiety  -0.71 -0.44    1.00   0.00
## Gender    0.09  0.00    0.00   1.00
## 
## n= 103 
## 
## 
## P
##         Revise Exam   Anxiety Gender
## Revise         0.0000 0.0000  0.3913
## Exam    0.0000        0.0000  0.9626
## Anxiety 0.0000 0.0000         0.9811
## Gender  0.3913 0.9626 0.9811

Correlation: cor.test() with Full Statistics

# Full statistical test for single correlation
cor.test(exam$Revise,
         exam$Exam,
         method = "pearson",
         alternative = "two.sided",  # or "greater", "less"
         conf.level = 0.95)          # 95% CI
## 
##  Pearson's product-moment correlation
## 
## data:  exam$Revise and exam$Exam
## t = 4.3434, df = 101, p-value = 3.343e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2200938 0.5481602
## sample estimates:
##       cor 
## 0.3967207

Interpreting cor.test() Output

## 
##  Pearson's product-moment correlation
## 
## data:  exam$Revise and exam$Exam
## t = 4.3434, df = 101, p-value = 3.343e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2200938 0.5481602
## sample estimates:
##       cor 
## 0.3967207
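
The numbers in this output are connected: the t statistic follows from r and the degrees of freedom, and the confidence interval comes from Fisher's z transformation. The sketch below recomputes them from the printed values only (r = 0.3967207, df = 101), so small rounding differences are expected.

```r
# Values copied from the cor.test() output
r  <- 0.3967207
df <- 101
n  <- df + 2   # the Pearson test uses df = n - 2

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% CI: transform r to z, build the interval, transform back
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

t_stat   # about 4.34
p_val    # about 3.3e-05
ci       # about 0.22 to 0.55
```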

Correlation Interpretation

Nonparametric Correlation

Correlation: Example - Liar Dataset

str(liar)
## 'data.frame':    68 obs. of  3 variables:
##  $ Creativity: int  53 36 31 43 30 41 32 54 47 50 ...
##  $ Position  : int  1 3 4 2 4 1 4 1 2 2 ...
##  $ Novice    : chr  "First Time" "Had entered Competition Before" "First Time" "First Time" ...
summary(liar)
##    Creativity       Position        Novice         
##  Min.   :21.00   Min.   :1.000   Length:68         
##  1st Qu.:35.00   1st Qu.:1.000   Class :character  
##  Median :39.00   Median :2.000   Mode  :character  
##  Mean   :39.99   Mean   :2.221                     
##  3rd Qu.:45.25   3rd Qu.:3.000                     
##  Max.   :56.00   Max.   :6.000

Why Non-Parametric for Liar Data?

Correlation: Comparing Methods on Liar Data

with(liar, cor.test(Creativity, Position, method = "spearman"))
## 
##  Spearman's rank correlation rho
## 
## data:  Creativity and Position
## S = 71948, p-value = 0.00172
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.3732184
with(liar, cor.test(Creativity, Position, method = "kendall"))
## 
##  Kendall's rank correlation tau
## 
## data:  Creativity and Position
## z = -3.2252, p-value = 0.001259
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.3002413
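
Spearman's rho is Pearson's r computed on ranks, with tied values given their average rank. A small sketch of that equivalence, using only the first eight rows shown in the str(liar) output, so the coefficients will not match the full-sample results.

```r
# First eight values of each variable, as printed by str(liar)
creativity <- c(53, 36, 31, 43, 30, 41, 32, 54)
position   <- c(1, 3, 4, 2, 4, 1, 4, 1)

# Built-in Spearman coefficient
rho_builtin <- cor(creativity, position, method = "spearman")

# Same value from Pearson's r on the ranks (ties receive average ranks)
rho_ranks <- cor(rank(creativity), rank(position))

# Kendall's tau is based on concordant and discordant pairs instead
tau <- cor(creativity, position, method = "kendall")

rho_builtin
rho_ranks
tau
```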

Interpreting Non-Parametric Results

Visualizing the Liar Data Relationship

scatter <- ggplot(liar, aes(Creativity, Position))
scatter +
  geom_point(size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  xlab("Creativity Score (0-60)") +
  ylab("Competition Position (1 = Winner)") +
  ggtitle("Negative Correlation: Higher Creativity → Better Placement") +
  cleanup +
  scale_y_reverse()  # Reverse Y so winners at top

Correlation: Biserial and Point-Biserial

Point-Biserial Example

# Convert character variable to numeric (1, 2)
liar$Novice2 <- as.numeric(as.factor(liar$Novice))
str(liar$Novice2) 
##  num [1:68] 1 2 1 1 2 1 1 2 2 1 ...
# Correlation between creativity and novice status
with(liar, cor.test(Creativity, Novice2))
## 
##  Pearson's product-moment correlation
## 
## data:  Creativity and Novice2
## t = -2.1969, df = 66, p-value = 0.03154
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.47020478 -0.02412131
## sample estimates:
##        cor 
## -0.2610451
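
A point-biserial correlation tests the same hypothesis as an independent-samples t-test with pooled variance: the two t statistics agree in size and the p-values are identical. A sketch on simulated data; `group` and `score` are hypothetical and do not come from the liar data.

```r
set.seed(42)  # hypothetical simulated data
group <- rep(c(1, 2), times = c(30, 35))
score <- rnorm(65, mean = ifelse(group == 1, 42, 38), sd = 8)

# Point-biserial correlation: Pearson's r with a two-level numeric variable
pb <- cor.test(score, group)

# Pooled-variance t-test comparing the same two groups
tt <- t.test(score ~ factor(group), var.equal = TRUE)

unname(pb$statistic)   # t from the correlation
unname(tt$statistic)   # t from the t-test (the sign depends on group order)
pb$p.value
tt$p.value
```

The sign of the coefficient depends only on how the two groups are coded, so swapping the 1 and 2 flips the sign without changing its size.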

Visualizing Point-Biserial Correlation

# Better visualization with boxplot
ggplot(liar, aes(x = Novice, y = Creativity, fill = Novice)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  xlab("Competitor Experience") +
  ylab("Creativity Score") +
  ggtitle("Point-Biserial: Creativity by Experience Level") +
  cleanup +
  scale_fill_manual(values = c("grey70", "grey30")) +
  theme(legend.position = "none")

Important Note on Binary Correlations

Comparing Correlations

Types of Correlation Comparisons

  1. Independent correlations:
    • Correlations from separate groups (e.g., men vs. women)
    • Same variables in both groups
    • Example: Is r(anxiety, performance) different between control vs. treatment?
  2. Dependent correlations - Overlapping:
    • Same people, different variables sharing one variable
    • Example: Compare r(X,Y) vs. r(X,Z) in same sample
  3. Dependent correlations - Non-overlapping:
    • Same people, completely different variable pairs
    • Example: Compare r(W,X) vs. r(Y,Z) in same sample

Independent Correlations

library(cocor)
new <- subset(liar, Novice == "First Time")
old <- subset(liar, Novice == "Had entered Competition Before")
ind_data <- list(new, old)
cocor(~Creativity + Position | Creativity + Position,
      data = ind_data)
## 
##   Results of a comparison of two correlations based on independent groups
## 
## Comparison between r1.jk (Creativity, Position) = -0.2123 and r2.hm (Creativity, Position) = -0.3802
## Difference: r1.jk - r2.hm = 0.1679
## Data: ind_data: j = Creativity, k = Position, h = Creativity, m = Position
## Group sizes: n1 = 33, n2 = 35
## Null hypothesis: r1.jk is equal to r2.hm
## Alternative hypothesis: r1.jk is not equal to r2.hm (two-sided)
## Alpha: 0.05
## 
## fisher1925: Fisher's z (1925)
##   z = 0.7268, p-value = 0.4673
##   Null hypothesis retained
## 
## zou2007: Zou's (2007) confidence interval
##   95% confidence interval for r1.jk - r2.hm: -0.2792 0.6027
##   Null hypothesis retained (Interval includes 0)
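
The fisher1925 result can be recomputed by hand: each r is converted to Fisher's z, and the difference is divided by its standard error. The sketch below uses the rounded correlations and group sizes printed in the output, so the result agrees only to about three decimals.

```r
# Values copied from the cocor output
r1 <- -0.2123   # first-time competitors
n1 <- 33
r2 <- -0.3802   # competitors who had entered before
n2 <- 35

# Fisher's r-to-z transformation of each correlation
z_diff <- atanh(r1) - atanh(r2)

# Standard error of the difference between two independent z values
se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

z <- z_diff / se
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

z   # about 0.73
p   # about 0.47
```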

Dependent Correlations

cocor(~Revise + Exam | Revise + Anxiety, 
      data = exam)
## 
##   Results of a comparison of two overlapping correlations based on dependent groups
## 
## Comparison between r.jk (Revise, Exam) = 0.3967 and r.jh (Revise, Anxiety) = -0.7092
## Difference: r.jk - r.jh = 1.106
## Related correlation: r.kh = -0.441
## Data: exam: j = Revise, k = Exam, h = Anxiety
## Group size: n = 103
## Null hypothesis: r.jk is equal to r.jh
## Alternative hypothesis: r.jk is not equal to r.jh (two-sided)
## Alpha: 0.05
## 
## pearson1898: Pearson and Filon's z (1898)
##   z = 10.1802, p-value = 0.0000
##   Null hypothesis rejected
## 
## hotelling1940: Hotelling's t (1940)
##   t = 9.3238, df = 100, p-value = 0.0000
##   Null hypothesis rejected
## 
## williams1959: Williams' t (1959)
##   t = 8.9261, df = 100, p-value = 0.0000
##   Null hypothesis rejected
## 
## olkin1967: Olkin's z (1967)
##   z = 10.1802, p-value = 0.0000
##   Null hypothesis rejected
## 
## dunn1969: Dunn and Clark's z (1969)
##   z = 8.0684, p-value = 0.0000
##   Null hypothesis rejected
## 
## hendrickson1970: Hendrickson, Stanley, and Hills' (1970) modification of Williams' t (1959)
##   t = 9.2710, df = 100, p-value = 0.0000
##   Null hypothesis rejected
## 
## steiger1980: Steiger's (1980) modification of Dunn and Clark's z (1969) using average correlations
##   z = 7.6646, p-value = 0.0000
##   Null hypothesis rejected
## 
## meng1992: Meng, Rosenthal, and Rubin's z (1992)
##   z = 7.6896, p-value = 0.0000
##   Null hypothesis rejected
##   95% confidence interval for r.jk - r.jh: 0.9727 1.6382
##   Null hypothesis rejected (Interval does not include 0)
## 
## hittner2003: Hittner, May, and Silver's (2003) modification of Dunn and Clark's z (1969) using a backtransformed average Fisher's (1921) Z procedure
##   z = 7.6392, p-value = 0.0000
##   Null hypothesis rejected
## 
## zou2007: Zou's (2007) confidence interval
##   95% confidence interval for r.jk - r.jh: 0.8698 1.3009
##   Null hypothesis rejected (Interval does not include 0)

Partial and Semi-Partial Correlations

Understanding Partial Correlation

Understanding Semi-Partial Correlation

Partial Correlations in R

# Using ppcor package
pcor(exam[ , -c(1)], method = "pearson")
## $estimate
##             Revise        Exam    Anxiety      Gender
## Revise   1.0000000  0.13438703 -0.6510138  0.12060588
## Exam     0.1343870  1.00000000 -0.2442316 -0.02247429
## Anxiety -0.6510138 -0.24423163  1.0000000  0.07479920
## Gender   0.1206059 -0.02247429  0.0747992  1.00000000
## 
## $p.value
##               Revise       Exam      Anxiety    Gender
## Revise  0.000000e+00 0.18029461 1.703877e-13 0.2296046
## Exam    1.802946e-01 0.00000000 1.384188e-02 0.8234728
## Anxiety 1.703877e-13 0.01384188 0.000000e+00 0.4572346
## Gender  2.296046e-01 0.82347277 4.572346e-01 0.0000000
## 
## $statistic
##            Revise       Exam    Anxiety     Gender
## Revise   0.000000  1.3493743 -8.5335224  1.2088373
## Exam     1.349374  0.0000000 -2.5059623 -0.2236729
## Anxiety -8.533522 -2.5059623  0.0000000  0.7463335
## Gender   1.208837 -0.2236729  0.7463335  0.0000000
## 
## $n
## [1] 103
## 
## $gp
## [1] 2
## 
## $method
## [1] "pearson"
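
A partial correlation can be read as the correlation between two sets of residuals, after the control variable has been removed from both variables. The sketch below shows this for a single control variable on simulated data (`x`, `y`, and `z` are hypothetical), whereas the pcor() output controls for all remaining variables at once.

```r
set.seed(7)  # hypothetical simulated data
n <- 200
z <- rnorm(n)
x <- 0.7 * z + rnorm(n)
y <- 0.5 * z + 0.3 * x + rnorm(n)

# Residual method: remove z from both x and y, then correlate what is left
res_x <- resid(lm(x ~ z))
res_y <- resid(lm(y ~ z))
partial_resid <- cor(res_x, res_y)

# Formula method for one control variable
r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)
partial_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

partial_resid
partial_formula   # should match partial_resid
```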

Semi-Partial Correlations in R

spcor(exam[ , -c(1)], method = "pearson")
## $estimate
##             Revise        Exam     Anxiety      Gender
## Revise   1.0000000  0.09406750 -0.59488837  0.08427039
## Exam     0.1206113  1.00000000 -0.22399075 -0.01999258
## Anxiety -0.5842845 -0.17158155  1.00000000  0.05110095
## Gender   0.1206031 -0.02231536  0.07446008  1.00000000
## 
## $p.value
##               Revise       Exam      Anxiety    Gender
## Revise  0.000000e+00 0.34943989 5.384648e-11 0.4021110
## Exam    2.295835e-01 0.00000000 2.433797e-02 0.8426994
## Anxiety 1.414541e-10 0.08622476 0.000000e+00 0.6118058
## Gender  2.296154e-01 0.82470106 4.592829e-01 0.0000000
## 
## $statistic
##            Revise       Exam    Anxiety     Gender
## Revise   0.000000  0.9401285 -7.3637762  0.8414729
## Exam     1.208892  0.0000000 -2.2867841 -0.1989634
## Anxiety -7.163533 -1.7329141  0.0000000  0.5091132
## Gender   1.208809 -0.2220903  0.7429308  0.0000000
## 
## $n
## [1] 103
## 
## $gp
## [1] 2
## 
## $method
## [1] "pearson"
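
A semi-partial correlation removes the control variable from only one of the two variables, so it can never be larger in absolute size than the matching partial correlation. A sketch for one control variable on simulated data; the names `x`, `y`, and `z` are hypothetical.

```r
set.seed(7)  # hypothetical simulated data
n <- 200
z <- rnorm(n)
x <- 0.7 * z + rnorm(n)
y <- 0.5 * z + 0.3 * x + rnorm(n)

r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)

# Semi-partial: z is removed from y only, x is left untouched
res_y <- resid(lm(y ~ z))
semi_resid   <- cor(x, res_y)
semi_formula <- (r_xy - r_xz * r_yz) / sqrt(1 - r_yz^2)

# Partial correlation for comparison: z is removed from both x and y
partial <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

semi_resid
semi_formula   # should match semi_resid
partial        # at least as large in absolute value
```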

Visualizing the Difference

Reporting Correlations (APA Style)
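
A common APA-style format gives the degrees of freedom in parentheses, r to two decimals without a leading zero, and the p-value, for example r(101) = .40, p < .001. The helper below is a hypothetical convenience function, not part of any package; it is applied to values printed in the cor.test() outputs earlier in these notes.

```r
# Hypothetical helper: format a Pearson correlation in APA style
apa_r <- function(r, df, p) {
  # Drop the leading zero, keeping any minus sign
  strip0 <- function(x) sub("^(-?)0\\.", "\\1.", x)
  r_txt <- strip0(sprintf("%.2f", r))
  p_txt <- if (p < .001) {
    "p < .001"
  } else {
    paste0("p = ", strip0(sprintf("%.3f", p)))
  }
  sprintf("r(%d) = %s, %s", as.integer(df), r_txt, p_txt)
}

apa_r(0.3967207, 101, 3.343e-05)   # Revise and Exam
apa_r(-0.2610451, 66, 0.03154)     # Creativity and Novice2
```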

Effect Sizes for Correlation

Effect Size   r             Interpretation
Small         0.10 - 0.29   Variables weakly related
Medium        0.30 - 0.49   Moderate relationship
Large         0.50 - 1.00   Strong relationship
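
These cutoffs apply to the absolute value of r, so a correlation of -0.71 counts as large. A small sketch of a labeling helper; the function name `effect_label` and the "Below small" label for values under 0.10 are assumptions added for illustration.

```r
# Hypothetical helper: label correlations using the cutoffs in the table
effect_label <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.10, 0.30, 0.50, 1),
      labels = c("Below small", "Small", "Medium", "Large"),
      right = FALSE,          # intervals are [lower, upper)
      include.lowest = TRUE)  # so that |r| = 1 falls in "Large"
}

# Coefficients taken from the Pearson matrix for the exam data
effect_label(c(0.39672070, -0.70924926, 0.08535058))
```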

Correlation Assumptions Summary

Summary: Key Takeaways

Additional Resources