Ziyuan Huang
Last Updated: 2026-01-22
Machine: MEL_DEL - Windows 10 x64
Definition: Correlation measures the extent to which two variables are related. It indicates both the direction and the strength of the relationship between the variables.
Plotting reminder:
ggplot(data, aes(X, Y, color/fill = categorical))
We can additionally control the limits of the X and Y axes with
coord_cartesian().
# Load required libraries
library(rio) # Data import
library(ggplot2) # Visualization
library(corrplot) # Correlation matrix visualization
library(GGally) # ggpairs for scatterplot matrices
library(Hmisc) # rcorr() function
library(ppcor) # Partial correlations
# Define cleanup theme for professional graphs
cleanup <- theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
legend.key = element_rect(fill = "white"),
text = element_text(size = 15))
# Import datasets
exam <- import("data/exam_data.csv")
liar <- import("data/liar_data.csv") # from chapter 5 notes
scatter <- ggplot(exam, aes(Anxiety, Exam))
scatter +
geom_point() +
xlab("Anxiety Score") +
ylab("Exam Score") +
cleanup
# From chapter 5 notes, now adding coord_cartesian()
scatter <- ggplot(exam, aes(Anxiety, Exam))
scatter +
geom_point() +
xlab("Anxiety Score") +
ylab("Exam Score") +
cleanup +
coord_cartesian(xlim = c(50,100), ylim = c(0,100))
Remember our all-in-one statistical equation:
\[outcome_i = (model) + error_i\]
We have previously defined our model as the mean, with the standard deviation or standard error as the error term.
We used the SE to determine if our model “fit” well.
Now, we can use:
\[outcome_i = (b X_i) + error_i\]
Correlation is a type of simple standardized regression, which is what this equation represents.
When you have one X and one Y variable, r and \(\beta\) are equal.
Now we are using r or \(\beta\) to determine the strength of the model.
Traditionally you don’t see the error values reported (though you sometimes see confidence intervals).
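A quick numeric check of the claim that r equals the standardized \(\beta\) — a minimal sketch with simulated data (not the exam dataset): regress z-scored y on z-scored x and compare the slope to cor().

```r
# Minimal sketch (simulated data, not the exam dataset):
# with one predictor, the standardized regression slope equals r.
set.seed(42)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

r <- cor(x, y)

# Regress z-scored y on z-scored x; the slope is the standardized beta
beta <- coef(lm(scale(y) ~ scale(x)))[2]

isTRUE(all.equal(unname(beta), r))  # TRUE
```

This works because with standardized variables both SDs are 1, so the slope b = r(s_y/s_x) collapses to r.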
How do we measure the direction and strength of the relationship of variables?
We need to see whether as one variable increases, the other increases, decreases or stays the same.
This relationship can be found by calculating the covariance.
\[SD^2 = \frac {\sum(X_i-\bar{X})^2}{N-1}\]
\[SD^2 = \frac {\sum(X_i-\bar{X})(X_i-\bar{X})}{N-1}\]
\[Cov(x,y) = \frac {\sum(X_i-\bar{X})(Y_i-\bar{Y})}{N-1} \]
Example output (the variances of the two variables, then their covariance):
## [1] 329.7531
## [1] 672.9138
## [1] 186.8784
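To see the covariance formula in action, here is a minimal sketch with made-up toy vectors, checking the hand computation against R's cov():

```r
# Covariance from the definition vs. R's cov() (toy vectors, made up)
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 5, 8, 12)
n <- length(x)

# Sum of cross-products of deviations, divided by N - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

cov_by_hand   # 12
cov(x, y)     # 12 -- identical
```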
Why can we not stop at covariance? Because its size depends upon the units of measurement.
One solution: standardize it!
The standardized version of covariance is known as the correlation coefficient.
\[r = \frac{Cov(x,y)}{S_xS_y}\]
\[r = \frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{(N-1)S_xS_y} \]
It varies between -1 and +1
It is an effect size
Coefficient of determination, \(r^2\)
For example, the correlation between revision time and exam performance:
## [1] 0.3967207
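Using the same kind of made-up toy vectors, a minimal sketch confirms that dividing the covariance by the product of the standard deviations reproduces cor(), and that squaring r gives the coefficient of determination:

```r
# r from the definition vs. R's cor() (toy vectors, made up)
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 5, 8, 12)

# Standardize the covariance by the two standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

isTRUE(all.equal(r_by_hand, cor(x, y)))  # TRUE

# Coefficient of determination: proportion of variance shared by x and y
r_squared <- cor(x, y)^2
```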
# Quick scatterplot matrix with ggpairs()
ggpairs(exam[, c("Exam", "Anxiety", "Revise")],
title = "Scatterplot Matrix of Exam Data",
lower = list(continuous = "smooth"),
diag = list(continuous = "barDiag"))
Correlation is an effect size and can be used for statistical testing. What should we check for during data screening?
Before running a correlation analysis, check the data for:
- Accuracy and missing values
- Outliers
- Linearity of the relationship
- Normality (required for the significance test of Pearson's r)
Reference: Field, Miles, & Field (2012), Chapter 6
Remember that we discussed NHST as a framework for understanding the results of a statistical test.
For correlation, we might consider using:
\[H_0: r = 0\]
\[H_1: r \neq 0\]
We can use many different forms of this type of set up (one- or two-tailed).
The main “rule” is that the two hypotheses are opposites and cover the entire range of possibilities. For example, here’s an incompatible set up: \(H_0: r = 0\) versus \(H_1: r > 0\), which leaves negative correlations uncovered.
We have already used cor() during data screening. R offers three main correlation functions:

| Function | Methods | Multiple Vars | p-values | CI | Best Use |
|---|---|---|---|---|---|
| cor() | All 3 | ✓ | ✗ | ✗ | Quick exploration |
| rcorr() | P, S | ✓ | ✓ | ✗ | Matrix with p-values |
| cor.test() | All 3 | ✗ | ✓ | ✓ | Single pair, full stats |

The cor() function will calculate Pearson, Spearman, or Kendall correlations for multiple variables at once:
# Pearson correlation matrix
cor(exam[ , -1],
use = "pairwise.complete.obs", # Handle missing data
method = "pearson")
## Revise Exam Anxiety Gender
## Revise 1.00000000 0.396720697 -0.70924926 0.085350584
## Exam 0.39672070 1.000000000 -0.44099341 -0.004674066
## Anxiety -0.70924926 -0.440993412 1.00000000 -0.002365840
## Gender 0.08535058 -0.004674066 -0.00236584 1.000000000
# Kendall's tau (better for small samples, tied ranks)
cor(exam[ , -1],
use = "pairwise.complete.obs",
method = "kendall")
## Revise Exam Anxiety Gender
## Revise 1.00000000 0.263325853 -0.48856004 0.099160691
## Exam 0.26332585 1.000000000 -0.28479188 -0.007164456
## Anxiety -0.48856004 -0.284791876 1.00000000 0.018699690
## Gender 0.09916069 -0.007164456 0.01869969 1.000000000
# Create correlation matrix
cor_matrix <- cor(exam[ , -1], use = "pairwise.complete.obs")
# Visualize with corrplot
corrplot(cor_matrix,
method = "color", # Color-coded cells
type = "upper", # Show upper triangle only
addCoef.col = "black", # Add correlation coefficients
tl.col = "black", # Text label color
tl.srt = 45, # Text label rotation
diag = FALSE) # Hide diagonal
The rcorr() function will calculate the correlation matrix, the sample size, and the p-values. It requires a numeric matrix, so convert the data with as.matrix():
rcorr(as.matrix(exam[ , -1]))
## Revise Exam Anxiety Gender
## Revise 1.00 0.40 -0.71 0.09
## Exam 0.40 1.00 -0.44 0.00
## Anxiety -0.71 -0.44 1.00 0.00
## Gender 0.09 0.00 0.00 1.00
##
## n= 103
##
##
## P
## Revise Exam Anxiety Gender
## Revise 0.0000 0.0000 0.3913
## Exam 0.0000 0.0000 0.9626
## Anxiety 0.0000 0.0000 0.9811
## Gender 0.3913 0.9626 0.9811
The cor.test() function will calculate the correlation coefficient, its t statistic, p-value, and confidence interval for a single pair of variables:
# Full statistical test for single correlation
cor.test(exam$Revise,
exam$Exam,
method = "pearson",
alternative = "two.sided", # or "greater", "less"
conf.level = 0.95) # 95% CI
##
## Pearson's product-moment correlation
##
## data: exam$Revise and exam$Exam
## t = 4.3434, df = 101, p-value = 3.343e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2200938 0.5481602
## sample estimates:
## cor
## 0.3967207
Even a strong correlation does not imply causation:
The third-variable problem: an unmeasured variable may be driving the changes in both measured variables.
Direction of causality: the coefficient says nothing about which variable influences the other.
When to use non-parametric correlations: violations of normality or the presence of outliers.
Field et al. (2012, pp. 277-279) describe two non-parametric alternatives:
Spearman’s Rho (rs): ranks the data first, then applies Pearson’s formula to the ranks.
Kendall’s Tau (τ): preferred over Spearman’s rho for small samples with many tied ranks.
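A minimal sketch (made-up toy vectors) of what Spearman's rho does under the hood — rank both variables, then compute Pearson's r on the ranks:

```r
# Spearman's rho = Pearson's r applied to the ranked data (toy vectors, made up)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 4, 9, 20, 50, 90)   # monotonic but clearly non-linear

isTRUE(all.equal(cor(rank(x), rank(y)),
                 cor(x, y, method = "spearman")))  # TRUE

# Perfectly monotonic, so rho = 1, while Pearson's r stays below 1
# because the relationship is not linear
cor(x, y, method = "spearman")
cor(x, y)
```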
Dataset from Field et al. (2012) - World’s Best Liar Competition:
Position: competition placement (1st, 2nd, 3rd, etc.) - Ordinal
Creativity: questionnaire score (0-60) - Continuous
Novice: first time vs. returning competitor - Binary
str(liar)
## 'data.frame': 68 obs. of 3 variables:
## $ Creativity: int 53 36 31 43 30 41 32 54 47 50 ...
## $ Position : int 1 3 4 2 4 1 4 1 2 2 ...
## $ Novice : chr "First Time" "Had entered Competition Before" "First Time" "First Time" ...
summary(liar)
## Creativity Position Novice
## Min. :21.00 Min. :1.000 Length:68
## 1st Qu.:35.00 1st Qu.:1.000 Class :character
## Median :39.00 Median :2.000 Mode :character
## Mean :39.99 Mean :2.221
## 3rd Qu.:45.25 3rd Qu.:3.000
## Max. :56.00 Max. :6.000
with(liar, cor.test(Creativity, Position, method = "spearman"))
##
## Spearman's rank correlation rho
##
## data: Creativity and Position
## S = 71948, p-value = 0.00172
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.3732184
with(liar, cor.test(Creativity, Position, method = "kendall"))
##
## Kendall's rank correlation tau
##
## data: Creativity and Position
## z = -3.2252, p-value = 0.001259
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.3002413
scatter <- ggplot(liar, aes(Creativity, Position))
scatter +
geom_point(size = 3, alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
xlab("Creativity Score (0-60)") +
ylab("Competition Position (1 = Winner)") +
ggtitle("Negative Correlation: Higher Creativity → Better Placement") +
cleanup +
scale_y_reverse() # Reverse Y so winners at top
All values must be numeric to run any of these correlations. Therefore, if you have any variables that import with labels, you will have to convert them to numbers using as.numeric().
What type of correlation should we use with binary predictors?
Field et al. (2012, pp. 280-282) discuss correlations with binary variables:
Point-Biserial Correlation (rpb): for a truly dichotomous variable (e.g., first-time vs. returning competitor); computed as Pearson’s r with the dichotomy coded numerically.
Biserial Correlation (rb): for a dichotomy that artificially splits an underlying continuous variable.
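A small sketch with simulated scores (hypothetical data, not the liar dataset) showing that the point-biserial is just Pearson's r with a numerically coded dichotomy, and that the particular coding (0/1 vs. 1/2) does not change r:

```r
# Point-biserial = Pearson's r with a numerically coded dichotomy
# (simulated scores, hypothetical data)
set.seed(1)
group01 <- rep(c(0, 1), each = 10)   # 0/1 coding
group12 <- group01 + 1               # 1/2 coding, like the Novice2 variable
score   <- c(rnorm(10, mean = 40), rnorm(10, mean = 50))

cor(score, group01)  # the point-biserial correlation

# A linear recoding of the dichotomy leaves r unchanged
isTRUE(all.equal(cor(score, group01), cor(score, group12)))  # TRUE
```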
# Convert character variable to numeric (1, 2)
liar$Novice2 <- as.numeric(as.factor(liar$Novice))
str(liar$Novice2)
## num [1:68] 1 2 1 1 2 1 1 2 2 1 ...
with(liar, cor.test(Creativity, Novice2))
##
## Pearson's product-moment correlation
##
## data: Creativity and Novice2
## t = -2.1969, df = 66, p-value = 0.03154
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.47020478 -0.02412131
## sample estimates:
## cor
## -0.2610451
# Better visualization with boxplot
ggplot(liar, aes(x = Novice, y = Creativity, fill = Novice)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.1, alpha = 0.3) +
xlab("Competitor Experience") +
ylab("Creativity Score") +
ggtitle("Point-Biserial: Creativity by Experience Level") +
cleanup +
scale_fill_manual(values = c("grey70", "grey30")) +
theme(legend.position = "none")
There’s no split/subset function within cocor. Therefore, you have to split the dataset on your independent variable and then put it back together in list format.
Subset the data, then create a list:
library(cocor)
new <- subset(liar, Novice == "First Time")
old <- subset(liar, Novice == "Had entered Competition Before")
ind_data <- list(new, old)
cocor(~Creativity + Position | Creativity + Position,
      data = ind_data)
##
## Results of a comparison of two correlations based on independent groups
##
## Comparison between r1.jk (Creativity, Position) = -0.2123 and r2.hm (Creativity, Position) = -0.3802
## Difference: r1.jk - r2.hm = 0.1679
## Data: ind_data: j = Creativity, k = Position, h = Creativity, m = Position
## Group sizes: n1 = 33, n2 = 35
## Null hypothesis: r1.jk is equal to r2.hm
## Alternative hypothesis: r1.jk is not equal to r2.hm (two-sided)
## Alpha: 0.05
##
## fisher1925: Fisher's z (1925)
## z = 0.7268, p-value = 0.4673
## Null hypothesis retained
##
## zou2007: Zou's (2007) confidence interval
## 95% confidence interval for r1.jk - r2.hm: -0.2792 0.6027
## Null hypothesis retained (Interval includes 0)
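The fisher1925 line above can be reproduced by hand. This sketch uses the rounded correlations printed in the output, so the result matches only approximately:

```r
# Fisher's z test for two independent correlations, by hand
# (r values taken, rounded, from the cocor output above)
r1 <- -0.2123; n1 <- 33   # first-time competitors
r2 <- -0.3802; n2 <- 35   # returning competitors

# Fisher's r-to-z transform is atanh(); the difference of transformed
# correlations is approximately normal with SE sqrt(1/(n1-3) + 1/(n2-3))
z <- (atanh(r1) - atanh(r2)) / sqrt(1/(n1 - 3) + 1/(n2 - 3))
p <- 2 * pnorm(-abs(z))

round(z, 2)  # ~0.73
round(p, 2)  # ~0.47
```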
For correlations from the same sample (dependent groups), cocor’s formula syntax indicates which pairs to compare:
cocor(~X + Y | Y + Z, data = data) - Overlapping correlations (the two correlations share variable Y)
~X + Y | Q + Z - Non-overlapping correlations (no shared variable)
Comparing the Revise-Exam correlation against the Revise-Anxiety correlation:
cocor(~Revise + Exam | Revise + Anxiety, data = exam)
##
## Results of a comparison of two overlapping correlations based on dependent groups
##
## Comparison between r.jk (Revise, Exam) = 0.3967 and r.jh (Revise, Anxiety) = -0.7092
## Difference: r.jk - r.jh = 1.106
## Related correlation: r.kh = -0.441
## Data: exam: j = Revise, k = Exam, h = Anxiety
## Group size: n = 103
## Null hypothesis: r.jk is equal to r.jh
## Alternative hypothesis: r.jk is not equal to r.jh (two-sided)
## Alpha: 0.05
##
## pearson1898: Pearson and Filon's z (1898)
## z = 10.1802, p-value = 0.0000
## Null hypothesis rejected
##
## hotelling1940: Hotelling's t (1940)
## t = 9.3238, df = 100, p-value = 0.0000
## Null hypothesis rejected
##
## williams1959: Williams' t (1959)
## t = 8.9261, df = 100, p-value = 0.0000
## Null hypothesis rejected
##
## olkin1967: Olkin's z (1967)
## z = 10.1802, p-value = 0.0000
## Null hypothesis rejected
##
## dunn1969: Dunn and Clark's z (1969)
## z = 8.0684, p-value = 0.0000
## Null hypothesis rejected
##
## hendrickson1970: Hendrickson, Stanley, and Hills' (1970) modification of Williams' t (1959)
## t = 9.2710, df = 100, p-value = 0.0000
## Null hypothesis rejected
##
## steiger1980: Steiger's (1980) modification of Dunn and Clark's z (1969) using average correlations
## z = 7.6646, p-value = 0.0000
## Null hypothesis rejected
##
## meng1992: Meng, Rosenthal, and Rubin's z (1992)
## z = 7.6896, p-value = 0.0000
## Null hypothesis rejected
## 95% confidence interval for r.jk - r.jh: 0.9727 1.6382
## Null hypothesis rejected (Interval does not include 0)
##
## hittner2003: Hittner, May, and Silver's (2003) modification of Dunn and Clark's z (1969) using a backtransformed average Fisher's (1921) Z procedure
## z = 7.6392, p-value = 0.0000
## Null hypothesis rejected
##
## zou2007: Zou's (2007) confidence interval
## 95% confidence interval for r.jk - r.jh: 0.8698 1.3009
## Null hypothesis rejected (Interval does not include 0)
Partial correlations measure the relationship between two variables while controlling for the other variables (here, each pair is controlled for the remaining two). Use pcor() from the ppcor package:
pcor(exam[ , -1])
## $estimate
## Revise Exam Anxiety Gender
## Revise 1.0000000 0.13438703 -0.6510138 0.12060588
## Exam 0.1343870 1.00000000 -0.2442316 -0.02247429
## Anxiety -0.6510138 -0.24423163 1.0000000 0.07479920
## Gender 0.1206059 -0.02247429 0.0747992 1.00000000
##
## $p.value
## Revise Exam Anxiety Gender
## Revise 0.000000e+00 0.18029461 1.703877e-13 0.2296046
## Exam 1.802946e-01 0.00000000 1.384188e-02 0.8234728
## Anxiety 1.703877e-13 0.01384188 0.000000e+00 0.4572346
## Gender 2.296046e-01 0.82347277 4.572346e-01 0.0000000
##
## $statistic
## Revise Exam Anxiety Gender
## Revise 0.000000 1.3493743 -8.5335224 1.2088373
## Exam 1.349374 0.0000000 -2.5059623 -0.2236729
## Anxiety -8.533522 -2.5059623 0.0000000 0.7463335
## Gender 1.208837 -0.2236729 0.7463335 0.0000000
##
## $n
## [1] 103
##
## $gp
## [1] 2
##
## $method
## [1] "pearson"
Semi-partial correlations control for the other variables in only one member of each pair, which is why the $estimate matrix below is not symmetric. Use spcor() from ppcor:
spcor(exam[ , -1])
## $estimate
## Revise Exam Anxiety Gender
## Revise 1.0000000 0.09406750 -0.59488837 0.08427039
## Exam 0.1206113 1.00000000 -0.22399075 -0.01999258
## Anxiety -0.5842845 -0.17158155 1.00000000 0.05110095
## Gender 0.1206031 -0.02231536 0.07446008 1.00000000
##
## $p.value
## Revise Exam Anxiety Gender
## Revise 0.000000e+00 0.34943989 5.384648e-11 0.4021110
## Exam 2.295835e-01 0.00000000 2.433797e-02 0.8426994
## Anxiety 1.414541e-10 0.08622476 0.000000e+00 0.6118058
## Gender 2.296154e-01 0.82470106 4.592829e-01 0.0000000
##
## $statistic
## Revise Exam Anxiety Gender
## Revise 0.000000 0.9401285 -7.3637762 0.8414729
## Exam 1.208892 0.0000000 -2.2867841 -0.1989634
## Anxiety -7.163533 -1.7329141 0.0000000 0.5091132
## Gender 1.208809 -0.2220903 0.7429308 0.0000000
##
## $n
## [1] 103
##
## $gp
## [1] 2
##
## $method
## [1] "pearson"
Field et al. (2012, p. 295) provides reporting guidelines:
Pearson’s: “There was a significant positive correlation between revision time and exam performance, r = .40, p < .001, 95% CI [.18, .58].”
Spearman’s: “Creativity and competition position were significantly negatively correlated, rs = -.37, p = .002.”
Key elements to report: the coefficient (r, rs, or τ), the degrees of freedom or sample size, the exact p-value, and the confidence interval where available.
| Effect Size | r | Interpretation |
|---|---|---|
| Small | 0.10 - 0.29 | Variables weakly related |
| Medium | 0.30 - 0.49 | Moderate relationship |
| Large | 0.50 - 1.00 | Strong relationship |