library(seedhash)
gen <- SeedHashGenerator$new("Week01 Introduction to Machine Learning - ANLY530")
seeds <- gen$generate_seeds(10)
set.seed(seeds[1])
cat("MD5 Hash:", gen$get_hash(), "\n")
## MD5 Hash: e3d845fda93294e2e6da4d1bf3d264aa
## Using seed: -443845793
This Week 01 notebook introduces the core concepts of Machine Learning covered in Lecture 1.
Book reference: Géron (2019) Chapter 1 — The Machine Learning Landscape provides a thorough treatment of ML taxonomy, the ML workflow, and common challenges.
Online access: The full textbook is available free through Harrisburg University’s O’Reilly subscription: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.). Log in with your @harrisburgu.edu credentials at learning.oreilly.com.
Local extracted copy:
Knowledge/extracted/chapters/01_CHAPTER 1 — The Machine Learning Landscape.md
By the end of this notebook, you will be able to:
# install.packages(c("tidyverse", "ggplot2", "knitr", "kableExtra", "seedhash"))
library(tidyverse) # Data manipulation and visualisation
library(ggplot2) # Plotting
library(knitr) # Table formatting
library(kableExtra) # Enhanced tables
library(seedhash) # Reproducible seed generation
cat("Seed:", seeds[1], " | Hash:", gen$get_hash(), "\n")
## Seed: -443845793 | Hash: e3d845fda93294e2e6da4d1bf3d264aa
The course introduces two complementary definitions:
“Learning is optimizing performance (based on some criterion) using example data or past experience.”
“Statistical learning refers to a vast set of tools for understanding data… supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.” (James et al., 2021)
More formally:
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” — Tom Mitchell (1997)
Key insight (Géron, 2019, Ch. 1): ML shines when a traditional rule-based solution would require an impractically long list of hand-crafted conditions — such as a spam filter — because an ML system automatically adapts when patterns shift.
# Illustrate Mitchell's T/E/P framework for a spam filter
framework <- tibble(
Component = c("Task (T)", "Experience (E)", "Performance Measure (P)"),
Description = c(
"Classify incoming emails as spam or not-spam. The program must learn how to sort emails without being explicitly programmed with rules like 'if email contains V1agra, then spam'.",
"A dataset of historical emails that have been manually labelled by users as 'spam' or 'not spam' (ham). This is the data the algorithm studies.",
"Accuracy: the proportion of incoming emails correctly classified. The algorithm tries to maximize this score."
)
)
kable(framework, caption = "Mitchell's T/E/P Framework — Spam Filter Example") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Component | Description |
|---|---|
| Task (T) | Classify incoming emails as spam or not-spam. The program must learn how to sort emails without being explicitly programmed with rules like ‘if email contains V1agra, then spam’. |
| Experience (E) | A dataset of historical emails that have been manually labelled by users as ‘spam’ or ‘not spam’ (ham). This is the data the algorithm studies. |
| Performance Measure (P) | Accuracy: the proportion of incoming emails correctly classified. The algorithm tries to maximize this score. |
Why it matters: In traditional programming, you write the rules to process the data and get answers. In machine learning, you provide the data and the answers (E), and the algorithm figures out the rules (T) in order to maximize performance (P).
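A minimal sketch of that contrast (hypothetical toy data; the threshold of 4 suspicious words and the logistic model are illustrative only):

```r
# Toy labelled data: number of suspicious words per email, plus the answer
emails <- data.frame(
  n_suspicious = c(0, 1, 5, 7, 0, 6, 2, 8),
  is_spam      = c(0, 0, 1, 1, 0, 1, 0, 1)
)

# Traditional programming: a human writes the rule
rule_based <- function(n) ifelse(n >= 4, 1, 0)

# Machine learning: give the algorithm the data and the answers (E),
# and it infers the rule (T) that maximizes performance (P)
fit <- glm(is_spam ~ n_suspicious, data = emails, family = binomial)
learned <- function(n) {
  as.integer(predict(fit, data.frame(n_suspicious = n), type = "response") > 0.5)
}

rule_based(c(1, 6))  # hand-crafted threshold
learned(c(1, 6))     # threshold inferred from the labelled examples
```

Both functions agree here, but only the learned one adapts automatically if spammers change their vocabulary and we retrain on fresh labels.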
Machine Learning systems can be classified according to the amount and type of supervision they get during training. The two most common categories are Supervised and Unsupervised learning.
In supervised learning, the algorithm is taught by example. You provide it with input features (e.g. house size, location) and the correct answers, called labels (e.g. the house price). The algorithm learns the relationship between the inputs and the labels so it can predict the label for new, unseen data.

* Goal: Prediction / Approximation
* Examples: Spam detection, house price prediction, image classification.
In unsupervised learning, the algorithm is given data without explicit instructions on what to do with it. There are no labeled answers. The system tries to learn without a teacher, uncovering hidden patterns, structures, or groupings in the data.

* Goal: Description / Discovery
* Examples: Customer segmentation, anomaly detection, topic modeling (grouping news articles).
types <- tibble(
Type = c("Supervised Learning", "Unsupervised Learning"),
Training_Data = c("Labelled (input + output pairs)", "Unlabelled (inputs only)"),
Goal = c(
"Learn a mapping f(X) → Y to predict outputs for new inputs. (You supply the answers during training).",
"Discover structure, clusters, or density in the data. (No answers provided)."
),
Examples = c(
"Spam detection (spam/not spam), house price prediction (price $), image classification (cat/dog)",
"Customer segmentation (grouping by buying habits), anomaly detection (finding rare events)"
)
)
kable(types, col.names = c("Type", "Training Data", "Goal", "Examples"),
caption = "Supervised vs. Unsupervised Learning") |>
  kable_styling(bootstrap_options = c("striped", "hover"))

| Type | Training Data | Goal | Examples |
|---|---|---|---|
| Supervised Learning | Labelled (input + output pairs) | Learn a mapping f(X) → Y to predict outputs for new inputs. (You supply the answers during training). | Spam detection (spam/not spam), house price prediction (price $), image classification (cat/dog) |
| Unsupervised Learning | Unlabelled (inputs only) | Discover structure, clusters, or density in the data. (No answers provided). | Customer segmentation (grouping by buying habits), anomaly detection (finding rare events) |
Géron (2019, Ch. 1): “Supervised learning is about approximation while unsupervised learning is about description.”
Regression is a supervised learning task where you want to predict a continuous numerical value.
Imagine you are trying to predict the Sales Revenue of a retail store based on how much they spend on Advertising.
Because we provide the algorithm with pairs of \((X, Y)\), it can learn to draw a line of best fit through the points. This is linear regression. Once trained, if we tell the model we plan to spend $8,000 on ads next week, it can trace up to the red line and predict our expected revenue.
set.seed(seeds[2])
# Simulate a simple supervised learning scenario
n <- 80
x <- runif(n, 0, 10) # Advertising spend (in thousands)
y <- 2.5 * x + 5 + rnorm(n, sd = 3) # Sales Revenue (in thousands)
df_sup <- tibble(x = x, y = y)
model <- lm(y ~ x, data = df_sup)
ggplot(df_sup, aes(x, y)) +
geom_point(colour = "#4C72B0", alpha = 0.7, size = 3) +
geom_smooth(method = "lm", colour = "#DD4444", se = TRUE, fill = "#DD4444", alpha = 0.2) +
labs(
title = "Supervised Learning — Linear Regression",
subtitle = sprintf("Learned Rule: Expected Sales = %.2f + %.2f * Ad Spend",
coef(model)[1], coef(model)[2]),
x = "Input Feature X (e.g. Ad Spend in $K)",
y = "Target Label Y (e.g. Sales in $K)"
) +
theme_minimal(base_size = 14) +
  theme(plot.title = element_text(face = "bold"))

Supervised Learning: The algorithm learns to map the input X to the target Y.
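A sketch of the "spend $8,000" prediction described above, re-simulating similar data with an arbitrary seed (coefficients will differ slightly from the plot):

```r
# Simulate ad spend vs. sales as above, then query the trained model
set.seed(42)
x <- runif(80, 0, 10)                  # Ad spend ($K)
y <- 2.5 * x + 5 + rnorm(80, sd = 3)   # Sales ($K)
model <- lm(y ~ x)

# Predicted expected sales for an $8K ad spend, with a prediction interval
predict(model, newdata = data.frame(x = 8), interval = "prediction")
```

The point estimate should land near 2.5 * 8 + 5 = 25 ($25K), and the interval quantifies the noise around the fitted line.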
Clustering is an unsupervised learning task.
Imagine you have a database of customers, recording their Income and Spending Score. But you don’t know who is who. You want to group them into distinct categories so your marketing team can target them differently.
The K-Means algorithm looks at the distances between points and groups them based on proximity. We ask it to find 3 groups (clusters), and it assigns a color to each group.
set.seed(seeds[3])
# Three unlabelled clusters
df_usup <- bind_rows(
tibble(x = rnorm(40, 2, 0.6), y = rnorm(40, 2, 0.6)), # Low Income, Low Spend
tibble(x = rnorm(40, 6, 0.6), y = rnorm(40, 5, 0.6)), # High Income, Average Spend
tibble(x = rnorm(40, 4, 0.6), y = rnorm(40, 9, 0.6)) # Average Income, High Spend
)
# Run K-Means Clustering and ask for 3 centers
km <- kmeans(df_usup, centers = 3, nstart = 20)
df_usup$cluster <- factor(km$cluster)
ggplot(df_usup, aes(x, y, colour = cluster)) +
geom_point(size = 3, alpha = 0.8) +
stat_ellipse(level = 0.85, linetype = "dashed", linewidth = 1) +
scale_color_brewer(palette = "Set1", labels=c("Budget Shoppers", "Luxury Buyers", "Standard Shoppers")) +
labs(
title = "Unsupervised Learning — K-Means Clustering (k = 3)",
subtitle = "The algorithm finds internal structure and groups similar data points together.",
x = "Feature 1 (e.g. Income)",
y = "Feature 2 (e.g. Spending Score)",
colour = "Discovered\nClusters"
) +
theme_minimal(base_size = 14) +
  theme(legend.position = "right", plot.title = element_text(face = "bold"))

Unsupervised Learning: The algorithm finds distinct subgroups (clusters) in the unlabelled data.
The lecture frames learning algorithms along four conceptual axes:
perspectives <- tibble(
Perspective = c("Information-based", "Similarity-based",
"Probability-based", "Error-based"),
Core_Idea = c(
"Use information to guide decisions; minimise entropy or misclassification",
"Group or predict based on closeness in some feature space",
"Estimate the probability of class membership rather than assigning hard labels",
"Build a model, measure its error against known answers, and iteratively reduce that error"
),
Canonical_Algorithm = c("Decision Tree", "k-Nearest Neighbours / Regression",
"Naïve Bayes", "Neural Networks / Gradient Descent")
)
kable(perspectives,
col.names = c("Perspective", "Core Idea", "Canonical Algorithm"),
caption = "Four Perspectives on Machine Learning") |>
kable_styling(bootstrap_options = c("striped", "hover")) |>
  column_spec(1, bold = TRUE)

| Perspective | Core Idea | Canonical Algorithm |
|---|---|---|
| Information-based | Use information to guide decisions; minimise entropy or misclassification | Decision Tree |
| Similarity-based | Group or predict based on closeness in some feature space | k-Nearest Neighbours / Regression |
| Probability-based | Estimate the probability of class membership rather than assigning hard labels | Naïve Bayes |
| Error-based | Build a model, measure its error against known answers, and iteratively reduce that error | Neural Networks / Gradient Descent |
Let’s visualize two of these perspectives to see how they “think” differently about the exact same data.
set.seed(seeds[7])
# Generate some data: two classes (red and blue)
df_persp <- tibble(
x1 = c(rnorm(20, 2, 0.5), rnorm(20, 4, 0.5)),
x2 = c(rnorm(20, 2, 0.5), rnorm(20, 4, 0.5)),
class = factor(rep(c("Class A", "Class B"), each = 20))
)
# New unknown data point we want to predict
new_point <- tibble(x1 = 3, x2 = 2.8)
# Find 3 closest points for the Similarity-Based plot
distances <- sqrt((df_persp$x1 - new_point$x1)^2 + (df_persp$x2 - new_point$x2)^2)
closest_idx <- order(distances)[1:3]
closest_points <- df_persp[closest_idx, ]
# 1. Similarity-based (k-Nearest Neighbors)
p_sim <- ggplot(df_persp, aes(x1, x2, color = class)) +
geom_point(size = 3, alpha = 0.5) +
geom_point(data = new_point, aes(x1, x2), color = "black", shape = 8, size = 5, inherit.aes = FALSE) +
geom_segment(data = closest_points, aes(x = x1, y = x2, xend = new_point$x1, yend = new_point$x2),
color = "purple", linetype="dashed", linewidth = 0.8) +
annotate("text", x = 3, y = 2.4, label = "Unknown Data Point", fontface="italic", color="black") +
scale_color_manual(values = c("Class A"="#d62728", "Class B"="#1f77b4")) +
labs(title="1. Similarity-Based Perspective", subtitle="'You are what your nearest neighbors are' (e.g. k-NN)\nAlgorithm finds the 3 closest points and takes a majority vote.", x="Feature 1", y="Feature 2") +
theme_minimal(base_size = 14) + theme(legend.position="none", plot.title=element_text(face="bold"))
# 2. Error-based (Linear boundary / Logistic Regression)
p_err <- ggplot(df_persp, aes(x1, x2, color = class)) +
geom_point(size = 3, alpha = 0.5) +
geom_point(data = new_point, aes(x1, x2), color = "black", shape = 8, size = 5, inherit.aes = FALSE) +
geom_abline(intercept = 6.4, slope = -1.1, color = "#2ca02c", linewidth = 1.5) +
annotate("text", x = 2.55, y = 4.34, label = "Learned\nBoundary", color = "#2ca02c", fontface="bold") +
annotate("text", x = 3, y = 2.4, label = "Unknown Data Point", fontface="italic", color="black") +
scale_color_manual(values = c("Class A"="#d62728", "Class B"="#1f77b4")) +
labs(title="2. Error-Based Perspective", subtitle="'Draw a line that minimizes errors on the training data'\nAlgorithm checks which side of the line the new point lands on.", x="Feature 1", y="Feature 2") +
theme_minimal(base_size = 14) + theme(legend.position="bottom", plot.title=element_text(face="bold"), legend.title=element_blank())
gridExtra::grid.arrange(p_sim, p_err, ncol=2)

Similarity-Based vs. Error-Based Learning. Two different ways to solve the exact same classification problem.
Before looking at specific equations and problem types, we must understand the core challenge of Machine Learning: building models that generalize well to new data.
The central tension in Machine Learning is between Underfitting and Overfitting.
set.seed(seeds[5])
# Generate a noisy Sine wave true pattern
x_curve <- seq(0, 10, length.out=50)
y_true <- sin(x_curve)
y_noisy <- y_true + rnorm(50, sd=0.4)
df_fit <- tibble(x=x_curve, y=y_noisy, y_true=y_true)
# 1. Underfit (Linear)
p_under <- ggplot(df_fit, aes(x, y)) +
geom_point(alpha=0.6, color="grey30") +
geom_smooth(method="lm", se=FALSE, color="#d62728", linewidth=1.5) +
labs(title="1. Underfitting", subtitle="Model is too simple (High Bias)", x="", y="") +
theme_minimal() + theme(plot.title=element_text(face="bold"))
# 2. Good Fit (Polynomial deg 3)
p_good <- ggplot(df_fit, aes(x, y)) +
geom_point(alpha=0.6, color="grey30") +
geom_smooth(method="lm", formula=y ~ poly(x, 3), se=FALSE, color="#2ca02c", linewidth=1.5) +
labs(title="2. Good Fit", subtitle="Model captures the true trend", x="", y="") +
theme_minimal() + theme(plot.title=element_text(face="bold"))
# 3. Overfit (Polynomial deg 25 / Spline with high df)
p_over <- ggplot(df_fit, aes(x, y)) +
geom_point(alpha=0.6, color="grey30") +
geom_smooth(method="lm", formula=y ~ splines::ns(x, 25), se=FALSE, color="#1f77b4", linewidth=1.5) +
labs(title="3. Overfitting", subtitle="Model memorizes the noise (High Variance)", x="", y="") +
theme_minimal() + theme(plot.title=element_text(face="bold"))
gridExtra::grid.arrange(p_under, p_good, p_over, ncol=3)

The Goldilocks Problem of Machine Learning: Underfitting, Overfitting, and the Good Fit.
Because overfitting is such a severe problem, we never evaluate a model on the data it was trained on.
If a student memorizes the exact answers to a practice test, getting 100% on the practice test doesn’t mean they will do well on the final exam.
Therefore, we split our data: 1. Training Set (e.g. 80%): Given to the algorithm so it can learn the patterns. 2. Testing Set (e.g. 20%): Hidden from the algorithm until the very end. We use this to evaluate how well the model generalizes to unseen data.
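A minimal sketch of performing such a split in base R (using the built-in `mtcars` data as a stand-in; the 80% fraction and the `mpg ~ wt` model are illustrative):

```r
# 80/20 train/test split with base R
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))   # 80% of row indices, at random
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]                   # the held-out 20%

model <- lm(mpg ~ wt, data = train)             # learn only from training rows
preds <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$mpg - preds)^2))       # evaluate only on unseen rows
rmse
```

The test-set RMSE is the honest estimate of generalization; the training-set RMSE would flatter the model.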
set.seed(seeds[6])
# Create simple visualization of data split
split_data <- tibble(
ID = 1:100,
Set = c(rep("Training Data (80%)", 80), rep("Testing Data (20%)", 20))
)
ggplot(split_data, aes(x = ID, y = 1, fill = Set)) +
geom_tile(color="white") +
scale_fill_manual(values = c("Testing Data (20%)" = "#ff7f0e", "Training Data (80%)" = "#1f77b4")) +
theme_void() +
labs(title="The Train / Test Split Concept", subtitle="We hold out a portion of our data to test the model's true performance.", fill="") +
theme(axis.text=element_blank(), axis.ticks=element_blank(), panel.grid=element_blank(),
        plot.title=element_text(face="bold", size=14), legend.position="bottom", legend.text=element_text(size=12))

Data split visualization
A simple Train/Test split relies heavily on chance: what if all the difficult or unusual data points end up in the test set? Your model would look artificially bad!
To solve this, we use K-Fold Cross-Validation. Instead of splitting the dataset just once, we divide it into \(K\) equal chunks (folds). We then train the model \(K\) times. In every iteration, a different chunk is kept hidden as the Testing set, while the remaining chunks form the Training set. Finally, we average the \(K\) different test scores to get a highly reliable measure of the model’s performance.
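A hand-rolled sketch of the procedure (base R, `mtcars` as stand-in data; in practice a package such as caret or rsample would manage the folds):

```r
# 5-fold cross-validation by hand
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # assign each row a fold

cv_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]    # K-1 folds form the training set
  test  <- mtcars[folds == i, ]    # the remaining fold is held out
  fit   <- lm(mpg ~ wt, data = train)
  sqrt(mean((test$mpg - predict(fit, test))^2))
})
cv_rmse          # K separate test scores
mean(cv_rmse)    # the averaged, more reliable performance estimate
```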
# Generate K-fold data structure
folds_df <- expand.grid(
Fold = factor(1:5, levels = 5:1),
Iteration = factor(paste("Iteration", 1:5), levels = paste("Iteration", 1:5))
)
# Assign Train/Test status based on whether Fold == Iteration number
folds_df$Status <- ifelse(as.numeric(folds_df$Fold) == 6 - as.numeric(folds_df$Iteration),
"Testing Fold (Held Out)", "Training Fold (Used to learn)")
ggplot(folds_df, aes(x = Iteration, y = Fold, fill = Status)) +
geom_tile(color = "white", linewidth=1.5) +
geom_text(aes(label = paste("Data Block", 6 - as.numeric(Fold))), color="white", fontface="bold", size=5) +
scale_fill_manual(values = c("Testing Fold (Held Out)" = "#ff7f0e", "Training Fold (Used to learn)" = "#1f77b4")) +
theme_minimal(base_size = 14) +
labs(
title = "K-Fold Cross Validation (K = 5)",
subtitle = "The model is trained 5 times. Each time, a different block of data is held out to test performance.",
x = "", y = "", fill = "Data Role during Iteration:"
) +
theme(
panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(face="bold", size=12),
plot.title = element_text(face = "bold"),
legend.position = "bottom"
  )

5-Fold Cross Validation Process. Notice how every block of data gets to act as the test set exactly once.
Once you know whether your problem is supervised or unsupervised, you must specify the exact output type the algorithm needs to produce.
problems <- tibble(
Category = c("Classification", "Regression", "Time Series", "Clustering", "Dimensionality Reduction"),
Type = c("Supervised", "Supervised", "Supervised", "Unsupervised", "Unsupervised"),
Goal = c("Assign a category or class", "Predict a numerical quantity", "Predict future values in a sequence", "Group similar items together", "Simplify data by combining features"),
Output = c("Discrete class label (e.g. 'Cat', 'Dog', 'Spam')", "Continuous numeric value (e.g. $250,500)", "Continuous numeric value at time t", "Group ID (1, 2, 3...)", "New abstract features (Component 1, Component 2)"),
Algorithm_Example = c("Logistic Regression, Random Forest", "Linear Regression, XGBoost", "ARIMA, LSTMs", "K-Means, DBSCAN", "PCA, t-SNE")
)
kable(problems,
col.names = c("Problem Category", "Learning Type", "Core Goal", "What it Outputs", "Common Algorithms"),
caption = "Machine Learning Problem Categories Explained") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) |>
column_spec(1, bold = TRUE, color = "white", background = "#4b6584") |>
row_spec(1:3, background = "#f1f8ff") |>
  row_spec(4:5, background = "#fff8f1")

| Problem Category | Learning Type | Core Goal | What it Outputs | Common Algorithms |
|---|---|---|---|---|
| Classification | Supervised | Assign a category or class | Discrete class label (e.g. ‘Cat’, ‘Dog’, ‘Spam’) | Logistic Regression, Random Forest |
| Regression | Supervised | Predict a numerical quantity | Continuous numeric value (e.g. $250,500) | Linear Regression, XGBoost |
| Time Series | Supervised | Predict future values in a sequence | Continuous numeric value at time t | ARIMA, LSTMs |
| Clustering | Unsupervised | Group similar items together | Group ID (1, 2, 3…) | K-Means, DBSCAN |
| Dimensionality Reduction | Unsupervised | Simplify data by combining features | New abstract features (Component 1, Component 2) | PCA, t-SNE |
Here is a side-by-side visual comparison of the two main types of Supervised Learning.
Scenario: We are predicting student outcomes based on their study habits.
set.seed(seeds[4])
# --- Classification: predict class from a threshold ---
df_clf <- tibble(
score = runif(100, 0, 100),
pass = factor(ifelse(score + rnorm(100, 0, 8) > 50, "Pass", "Fail"))
)
p1 <- ggplot(df_clf, aes(score, fill = pass)) +
geom_histogram(bins = 20, alpha = 0.8, color = "white", position = "identity") +
geom_vline(xintercept = 50, linetype = "dashed", color = "black", linewidth = 1.2) +
annotate("text", x = 40, y = 8, label = "Decision\nBoundary", angle = 90, vjust=0) +
scale_fill_manual(values = c("Pass" = "#2ca02c", "Fail" = "#d62728")) +
labs(title = "Classification Task", subtitle = "Predicting a discrete category (Pass/Fail)",
x = "Exam Score", y = "Number of Students", fill = "Prediction:") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", color = "#333333"),
legend.position = "bottom")
# --- Regression: predict continuous GPA ---
df_reg <- tibble(
study_hours = runif(100, 0, 10),
gpa = 1.5 + 0.25 * study_hours + rnorm(100, 0, 0.4)
)
p2 <- ggplot(df_reg, aes(study_hours, gpa)) +
geom_point(colour = "#1f77b4", alpha = 0.7, size = 3) +
geom_smooth(method = "lm", colour = "#ff7f0e", linewidth=1.5, fill = "#ff7f0e", alpha=0.2) +
annotate("text", x = 2, y = 3.5, label = "Regression Line", color = "#ff7f0e", fontface="bold") +
labs(title = "Regression Task", subtitle = "Predicting a continuous quantity (GPA 0.0 - 4.0)",
x = "Study Hours per Week", y = "Predicted Expected GPA") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", color = "#333333"))
gridExtra::grid.arrange(p1, p2, ncol = 2)

Comparing Classification (predicting categories) with Regression (predicting quantities).
Linear algebra is the mathematical language of ML. Vectors represent data points and model weights; matrices encode transformations and datasets; eigendecomposition underlies PCA and many optimisation algorithms.
Book reference: Géron (2019) uses linear algebra throughout — see especially Ch. 4 (gradient descent), Ch. 8 (PCA).
A vector is an ordered list of numbers. By convention, vectors are column vectors.
\[\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}\]
Vector length (L2 norm):
\[\|\mathbf{v}\| = \sqrt{\sum_{i=1}^{n} v_i^2}\]
v <- c(3, 4) # 2-D column vector
u <- c(1, 2)
# Vector length
v_length <- sqrt(sum(v^2))
cat("v =", v, "\n")
cat("||v|| =", v_length, "(should be 5)\n")
cat("v + u =", v + u, "\n")
cat("3 * v =", 3 * v, "\n")
cat("v · u =", sum(v * u), "\n")
## v = 3 4
## ||v|| = 5 (should be 5)
## v + u = 4 6
## 3 * v = 9 12
## v · u = 11
Instead of thinking of a vector as just a list of numbers, think of it as an arrow in space. The arrow starts at the origin (0, 0) and points to the coordinates \((v_1, v_2)\).
v <- c(3, 4)
u <- c(1, 2)
sum_vu <- v + u
# Compute angle between v and u
angle_deg <- round(acos(sum(v * u) / (sqrt(sum(v^2)) * sqrt(sum(u^2)))) * 180 / pi, 1)
# Build a long data frame for geom_segment
vec_df <- data.frame(
x = c(0, 0, 0),
y = c(0, 0, 0),
xend = c(v[1], u[1], sum_vu[1]),
yend = c(v[2], u[2], sum_vu[2]),
vec = c("v = (3, 4)", "u = (1, 2)", "v + u = (4, 6)")
)
# Dashed parallelogram helper lines
par_df <- data.frame(
x = c(v[1], u[1]),
y = c(v[2], u[2]),
xend = c(sum_vu[1], sum_vu[1]),
yend = c(sum_vu[2], sum_vu[2])
)
ggplot() +
# Dashed parallelogram lines
geom_segment(data = par_df, aes(x=x, y=y, xend=xend, yend=yend),
linetype="dashed", color="grey60", linewidth=0.8) +
# Main vectors
geom_segment(data = vec_df,
aes(x=x, y=y, xend=xend, yend=yend, color=vec),
arrow = arrow(length = unit(0.35, "cm"), type="closed"),
linewidth = 1.8) +
# Labels at arrow tips
annotate("label", x=3.2, y=4.3, label="v = (3, 4)", color="#d62728", fontface="bold", size=4.5, fill="#fff0f0") +
annotate("label", x=0.8, y=2.2, label="u = (1, 2)", color="#1f77b4", fontface="bold", size=4.5, fill="#f0f4ff") +
annotate("label", x=4.5, y=6.1, label="v + u = (4, 6)",color="#2ca02c", fontface="bold", size=4.5, fill="#f0fff0") +
annotate("text", x=2.5, y=0.3,
label=paste0("Dot product v\u00b7u = ", sum(v*u),
" | Angle between v & u = ", angle_deg, "\u00b0"),
color="#555555", fontface="italic", size=4.5) +
# Axes
geom_hline(yintercept=0, colour="grey50") +
geom_vline(xintercept=0, colour="grey50") +
scale_color_manual(values=c("v = (3, 4)"="#d62728", "u = (1, 2)"="#1f77b4", "v + u = (4, 6)"="#2ca02c")) +
coord_equal(xlim=c(-0.5, 5.5), ylim=c(-0.5, 7)) +
labs(
title = "Vectors as Arrows in 2D Space",
subtitle = "Vector addition chains arrows tip-to-tail (parallelogram rule).\nThe dot product measures how much two vectors 'agree' in direction.",
x = "X axis", y = "Y axis"
) +
theme_minimal(base_size = 14) +
theme(legend.position="none", plot.title=element_text(face="bold"),
        plot.subtitle=element_text(color="#555555"))

Vectors as arrows: v, u, their sum, and the angle between them.
A matrix is a rectangular array of numbers. For ML, the data matrix \(X\) is \(n \times p\) (n observations, p features).
Matrix addition requires identical dimensions. Matrix multiplication \(AB\) requires cols(A) = rows(B).
A <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
B <- matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2, byrow = TRUE)
cat("A:\n"); print(A)
cat("\nB:\n"); print(B)
cat("\nA + B:\n"); print(A + B)
cat("\nA %*% B:\n"); print(A %*% B)
cat("\nt(A):\n"); print(t(A))
## A:
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
##
## B:
## [,1] [,2]
## [1,] 5 6
## [2,] 7 8
##
## A + B:
## [,1] [,2]
## [1,] 6 8
## [2,] 10 12
##
## A %*% B:
## [,1] [,2]
## [1,] 19 22
## [2,] 43 50
##
## t(A):
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
This is perhaps the most powerful geometric intuition in all of linear algebra:
A matrix is a recipe for transforming space.
When you multiply a matrix by a set of points, you are stretching, rotating, shearing, or squishing those points. Think of the matrix as a special pair of glasses that changes how you see the coordinate system.
The plot below shows a regular grid of points (the before state) and where those same points land after being multiplied by the matrix \(M = \begin{pmatrix} 2 & 1 \\ 0.5 & 1.5 \end{pmatrix}\) (the after state).
# Matrix to apply
M <- matrix(c(2, 0.5, 1, 1.5), nrow=2) # column-major
# Create a regular grid of points
grid_pts <- expand.grid(x = seq(-2, 2, by=0.5), y = seq(-2, 2, by=0.5))
# Apply transformation: each point (x, y) -> M %*% c(x, y)
transformed <- t(M %*% t(as.matrix(grid_pts)))
grid_pts_t <- data.frame(x = transformed[,1], y = transformed[,2])
# Basis vectors before and after
basis_before <- data.frame(
x=c(0,0), y=c(0,0), xend=c(1,0), yend=c(0,1),
label=c("e1 = (1,0)", "e2 = (0,1)")
)
basis_after <- data.frame(
x=c(0,0), y=c(0,0),
xend = c(M[1,1], M[1,2]),
yend = c(M[2,1], M[2,2]),
label=c("M·e1", "M·e2")
)
p_before <- ggplot(grid_pts, aes(x, y)) +
geom_point(color="#1f77b4", alpha=0.5, size=1.5) +
geom_segment(data=basis_before, aes(x=x,y=y,xend=xend,yend=yend,color=label),
arrow=arrow(length=unit(0.3,"cm")), linewidth=2) +
scale_color_manual(values=c("e1 = (1,0)"="#d62728", "e2 = (0,1)"="#2ca02c")) +
geom_hline(yintercept=0, colour="grey60") + geom_vline(xintercept=0, colour="grey60") +
coord_equal(xlim=c(-3,3), ylim=c(-3,3)) +
labs(title="BEFORE: Original Grid", subtitle="The standard coordinate space", x="", y="", color="Basis Vector:") +
theme_minimal(base_size=13) + theme(plot.title=element_text(face="bold"), legend.position="bottom")
p_after <- ggplot(grid_pts_t, aes(x, y)) +
geom_point(color="#ff7f0e", alpha=0.5, size=1.5) +
geom_segment(data=basis_after, aes(x=x,y=y,xend=xend,yend=yend,color=label),
arrow=arrow(length=unit(0.3,"cm")), linewidth=2) +
scale_color_manual(values=c("M·e1"="#d62728", "M·e2"="#2ca02c")) +
geom_hline(yintercept=0, colour="grey60") + geom_vline(xintercept=0, colour="grey60") +
coord_equal(xlim=c(-5,5), ylim=c(-4,4)) +
labs(title="AFTER: Transformed Grid", subtitle="The same points after applying matrix M", x="", y="", color="Transformed Basis:") +
theme_minimal(base_size=13) + theme(plot.title=element_text(face="bold"), legend.position="bottom")
gridExtra::grid.arrange(p_before, p_after, ncol=2)

Left: original coordinate grid. Right: the same grid after applying matrix M. Arrows show where the basis vectors land.
Why this matters for ML: A neural network layer with a weight matrix \(W\) is literally doing this — it’s stretching and rotating your data into a new space where patterns become easier to detect. Understanding that “matrix multiplication = space transformation” is the key insight.
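As a minimal sketch (made-up weights and data, not a trained network), a single dense layer applies exactly this kind of transform, plus a bias shift and a nonlinearity:

```r
# One "layer" = matrix transform of the data, then a bias shift, then ReLU
set.seed(1)
X <- matrix(rnorm(8), nrow = 4, ncol = 2)   # 4 data points in 2-D feature space
W <- matrix(c(2, 0.5, 1, 1.5), nrow = 2)    # weight matrix (same M as the grid plot)
b <- c(0.1, -0.2)                           # bias vector, one entry per output feature
relu <- function(z) pmax(z, 0)              # clip negatives to zero

# Transform the space, shift it, squash it
H <- relu(sweep(X %*% W, 2, b, "+"))
dim(H)   # still 4 points, now living in the transformed 2-D space
```

Stacking several such layers chains several space transformations, which is why deep networks can untangle patterns a single linear map cannot.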
The determinant of a \(2 \times 2\) matrix:
\[\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc\]
The inverse \(A^{-1}\) satisfies \(A A^{-1} = I\). It exists only when \(\det(A) \neq 0\).
\[A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
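As a quick check, the closed-form \(2 \times 2\) formula above can be implemented directly and compared against R’s built-in `solve()` (a sketch; `inv2x2` is a made-up helper name):

```r
# 2x2 inverse via the adjugate formula: (1/det) * [[d, -b], [-c, a]]
inv2x2 <- function(A) {
  d <- A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]   # det = ad - bc
  stopifnot(d != 0)                            # inverse exists only if det != 0
  matrix(c(A[2, 2], -A[2, 1], -A[1, 2], A[1, 1]), nrow = 2) / d
}

A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
inv2x2(A)    # should match solve(A)
```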
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
A_inv <- solve(A)
cat("A:\n"); print(A)
cat("\ndet(A) =", det(A), "\n")
cat("\nA^(-1):\n"); print(round(A_inv, 4))
## A:
## [,1] [,2]
## [1,] 2 1
## [2,] 1 2
##
## det(A) = 3
##
## A^(-1):
## [,1] [,2]
## [1,] 0.6667 -0.3333
## [2,] -0.3333 0.6667
# Verify: A %*% A^(-1) should be the identity matrix
cat("\nA %*% A^(-1) (should be I):\n"); print(round(A %*% A_inv, 10))
##
## A %*% A^(-1) (should be I):
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
Given \(A\mathbf{x} = \mathbf{b}\), the solution is \(\mathbf{x} = A^{-1}\mathbf{b}\) (when \(A\) is invertible):
\[\begin{cases} 2x + y = 5 \\ x + 2y = 4 \end{cases}\]
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
b <- c(5, 4)
# Solve the linear system (solve(A, b) computes this directly,
# equivalent to A^(-1) %*% b when A is invertible)
x <- solve(A, b)
cat("Solution: x =", x[1], ", y =", x[2], "\n")
cat("Check A %*% x:", A %*% x, "(should be", b, ")\n")
## Solution: x = 2 , y = 1
## Check A %*% x: 5 4 (should be 5 4 )
Each linear equation in a \(2 \times 2\) system describes a straight line on a 2D plot. Solving the system means finding the single point where both lines cross simultaneously — the \((x, y)\) pair that satisfies both equations at once.
This is directly analogous to how ML models find optimal parameters: they are looking for the one point in parameter space that best satisfies all training constraints at the same time.
# System: 2x + y = 5 and x + 2y = 4 => solution (x=2, y=1)
x_vals <- seq(-1, 4, length.out=200)
line1 <- 5 - 2 * x_vals # y = 5 - 2x
line2 <- (4 - x_vals) / 2 # y = (4 - x) / 2
lines_df <- data.frame(
x = rep(x_vals, 2),
y = c(line1, line2),
eq = rep(c("Equation 1: 2x + y = 5", "Equation 2: x + 2y = 4"), each=200)
)
ggplot(lines_df, aes(x, y, color=eq)) +
geom_line(linewidth=1.8) +
geom_point(aes(x=2, y=1), color="black", size=5, shape=21, fill="gold", stroke=2,
inherit.aes=FALSE) +
annotate("label", x=2.3, y=1.4,
label="Solution\n(x = 2, y = 1)",
color="black", fontface="bold", size=4.5, fill="lightyellow") +
annotate("text", x=0.2, y=4.4, label="2x + y = 5", color="#d62728", fontface="bold", size=4.5) +
annotate("text", x=3.2, y=0.2, label="x + 2y = 4", color="#1f77b4", fontface="bold", size=4.5) +
scale_color_manual(values=c("Equation 1: 2x + y = 5"="#d62728", "Equation 2: x + 2y = 4"="#1f77b4")) +
geom_hline(yintercept=0, color="grey70") + geom_vline(xintercept=0, color="grey70") +
coord_cartesian(xlim=c(-0.5, 4), ylim=c(-0.5, 5.5)) +
labs(
title = "Geometric Interpretation of a Linear System",
subtitle = "Each equation is a line. The intersection point IS the solution.",
x = "x", y = "y", color=""
) +
theme_minimal(base_size=14) +
theme(legend.position="bottom", plot.title=element_text(face="bold"),
        plot.subtitle=element_text(color="#555555"))

Geometric interpretation: each equation is a line; the solution is where they intersect.
For a square matrix \(A\), if \(A\mathbf{v} = \lambda\mathbf{v}\), then \(\lambda\) is an eigenvalue and \(\mathbf{v}\) is its corresponding eigenvector.
The eigenvalues are found by solving \(\det(A - \lambda I) = 0\) (the characteristic equation).
Why this matters for ML: PCA (Principal Component Analysis) is entirely built on eigendecomposition — the principal components are the eigenvectors of the covariance matrix, ordered by their eigenvalues. See Géron (2019) Ch. 8.
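A small sketch of that characteristic equation in R, using the same \(2 \times 2\) matrix as the example below: plugging an eigenvalue into \(\det(A - \lambda I)\) should give (numerically) zero.

```r
# Characteristic polynomial evaluated numerically: det(A - lambda * I)
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
char_poly <- function(lambda) det(A - lambda * diag(2))

# For this A: det(A - lambda*I) = (2 - lambda)^2 - 1 = lambda^2 - 4*lambda + 3,
# whose roots are 3 and 1 — the eigenvalues
sapply(c(3, 1), char_poly)   # both effectively zero
sapply(c(0, 2), char_poly)   # non-eigenvalues give nonzero determinants
```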
# Match the lecture example
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
cat("A:\n"); print(A)
ea <- eigen(A)
cat("\nEigenvalues (λ):\n"); print(ea$values)
cat("\nEigenvectors (columns):\n"); print(ea$vectors)
## A:
## [,1] [,2]
## [1,] 2 1
## [2,] 1 2
##
## Eigenvalues (λ):
## [1] 3 1
##
## Eigenvectors (columns):
## [,1] [,2]
## [1,] 0.707107 -0.707107
## [2,] 0.707107 0.707107
# Verify: A * v = λ * v for the first eigenpair
lambda1 <- ea$values[1]
v1 <- ea$vectors[, 1]
cat("\nVerify A*v1 = lambda1*v1:\n")##
## Verify A*v1 = lambda1*v1:
## A*v1 = 2.12132 2.12132
## lambda1*v1 = 2.12132 2.12132
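A quick sanity check (not in the lecture slides): `eigen()` returns unit-length eigenvectors, and because \(A\) is symmetric, eigenvectors belonging to distinct eigenvalues are orthogonal — exactly the property PCA relies on when it builds orthogonal principal components.

```r
A  <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
ea <- eigen(A)
v1 <- ea$vectors[, 1]
v2 <- ea$vectors[, 2]

# Each eigenvector returned by eigen() is normalised to unit L2 length
cat("||v1|| =", sqrt(sum(v1^2)), " ||v2|| =", sqrt(sum(v2^2)), "\n")

# For a symmetric matrix, eigenvectors of distinct eigenvalues are orthogonal
cat("v1 . v2 =", sum(v1 * v2), "\n")
```

Both norms come out as 1 and the dot product is 0 (up to floating-point precision).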
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
ea <- eigen(A)
v1 <- ea$vectors[, 1] * ea$values[1] # scaled by eigenvalue
v2 <- ea$vectors[, 2] * ea$values[2]
ggplot() +
# Fix the plot extent so both eigenvectors are visible
expand_limits(x = c(-3, 3), y = c(-3, 3)) +
# Eigenvector 1
geom_segment(aes(x = 0, y = 0, xend = v1[1], yend = v1[2]),
arrow = arrow(length = unit(0.3, "cm")),
colour = "#d62728", linewidth = 1.2) +
annotate("text", x = v1[1] + 0.2, y = v1[2] + 0.15,
label = paste0("lambda[1] == ", as.integer(ea$values[1])),
parse = TRUE, colour = "#d62728", size = 5, fontface="bold") +
# Eigenvector 2
geom_segment(aes(x = 0, y = 0, xend = v2[1], yend = v2[2]),
arrow = arrow(length = unit(0.3, "cm")),
colour = "#1f77b4", linewidth = 1.2) +
annotate("text", x = v2[1] - 0.45, y = v2[2] - 0.15,
label = paste0("lambda[2] == ", as.integer(ea$values[2])),
parse = TRUE, colour = "#1f77b4", size = 5, fontface="bold") +
geom_hline(yintercept = 0, colour = "grey50") +
geom_vline(xintercept = 0, colour = "grey50") +
coord_equal() +
labs(
title = "Visualizing Eigenvectors of a Matrix",
subtitle = "Eigenvectors show the 'principal directions' of a matrix transformation.\nThe eigenvalues (lambda) show how much the matrix stretches data in that direction.\nThe red direction is stretched 3x; the blue direction is unchanged (1x).",
x = "X axis", y = "Y Axis"
) +
theme_minimal(base_size = 14) +
theme(plot.subtitle = element_text(color = "#555555", face = "italic"))
summary_tbl <- tibble(
Topic = c(
"Definition of Learning",
"Supervised Learning",
"Unsupervised Learning",
"Information-based",
"Similarity-based",
"Probability-based",
"Error-based",
"Classification",
"Regression",
"Clustering",
"Vector",
"Matrix multiplication",
"Eigendecomposition"
),
Key_Takeaway = c(
"Optimise performance on a task T using experience E, measured by metric P (Mitchell, 1997)",
"Learn f(X) → Y from labelled examples; goal is approximation",
"Find structure in unlabelled data; goal is description",
"Minimise entropy/error — e.g., Decision Trees",
"Predict by proximity in feature space — e.g., KNN",
"Estimate class probabilities — e.g., Naïve Bayes",
"Iteratively reduce prediction error — e.g., Neural Networks",
"Discrete output; supervised; accuracy / confusion matrix",
"Continuous output; supervised; MSE / R²",
"Group discovery; unsupervised; inertia / silhouette",
"Ordered list of numbers; length = L2 norm",
"Requires cols(A) = rows(B); result is (rows(A) × cols(B))",
"Av = λv; eigenvectors define PCA principal components"
)
)
kable(summary_tbl, col.names = c("Topic", "Key Takeaway"),
caption = "Week 01 — Concept Summary") |>
kable_styling(bootstrap_options = c("striped", "hover")) |>
column_spec(1, bold = TRUE, width = "22%")
| Topic | Key Takeaway |
|---|---|
| Definition of Learning | Optimise performance on a task T using experience E, measured by metric P (Mitchell, 1997) |
| Supervised Learning | Learn f(X) → Y from labelled examples; goal is approximation |
| Unsupervised Learning | Find structure in unlabelled data; goal is description |
| Information-based | Minimise entropy/error — e.g., Decision Trees |
| Similarity-based | Predict by proximity in feature space — e.g., KNN |
| Probability-based | Estimate class probabilities — e.g., Naïve Bayes |
| Error-based | Iteratively reduce prediction error — e.g., Neural Networks |
| Classification | Discrete output; supervised; accuracy / confusion matrix |
| Regression | Continuous output; supervised; MSE / R² |
| Clustering | Group discovery; unsupervised; inertia / silhouette |
| Vector | Ordered list of numbers; length = L2 norm |
| Matrix multiplication | Requires cols(A) = rows(B); result is (rows(A) × cols(B)) |
| Eigendecomposition | Av = λv; eigenvectors define PCA principal components |
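Two of the linear-algebra takeaways in the table can be demonstrated in a few lines of base R (a small sketch; the example matrices here are illustrative, not from the lecture):

```r
# Matrix multiplication: cols(A) must equal rows(B);
# the product has dimensions rows(A) x cols(B)
A <- matrix(1:6, nrow = 2)   # 2 x 3
B <- matrix(1:6, nrow = 3)   # 3 x 2
P <- A %*% B                 # conformable: (2 x 3)(3 x 2) -> 2 x 2
cat("dim(A %*% B):", dim(P), "\n")## dim(A %*% B): 2 2

# Vector length (L2 norm): square root of the sum of squared components
v <- c(3, 4)
cat("||v|| =", sqrt(sum(v^2)), "\n")## ||v|| = 5
```

Trying `B %*% B` instead would raise "non-conformable arguments", since cols(B) = 2 but rows(B) = 3.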