library(seedhash)
gen <- SeedHashGenerator$new("Week01 Introduction to Machine Learning - ANLY530")
seeds <- gen$generate_seeds(10)
set.seed(seeds[1])
cat("MD5 Hash:", gen$get_hash(), "\n")
## MD5 Hash: e3d845fda93294e2e6da4d1bf3d264aa
## Using seed: -443845793
This Week 01 notebook introduces the core concepts of Machine Learning covered in Lecture 1.
Book reference: Géron (2019) Chapter 1 — The Machine Learning Landscape provides a thorough treatment of ML taxonomy, the ML workflow, and common challenges.
Online access: The full textbook is available free through Harrisburg University’s O’Reilly subscription: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.). Log in with your @harrisburgu.edu credentials at learning.oreilly.com.
Local extracted copy:
Knowledge/extracted/chapters/01_CHAPTER 1 — The Machine Learning Landscape.md
By the end of this notebook, you will be able to:
# install.packages(c("tidyverse", "ggplot2", "knitr", "kableExtra", "seedhash"))
library(tidyverse) # Data manipulation and visualisation
library(ggplot2) # Plotting
library(knitr) # Table formatting
library(kableExtra) # Enhanced tables
library(seedhash) # Reproducible seed generation
cat("Seed:", seeds[1], " | Hash:", gen$get_hash(), "\n")
## Seed: -443845793 | Hash: e3d845fda93294e2e6da4d1bf3d264aa
The course introduces two complementary definitions:
“Learning is optimizing performance (based on some criterion) using example data or past experience.”
“Statistical learning refers to a vast set of tools for understanding data… supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.” (James et al., 2021)
More formally:
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” — Tom Mitchell (1997)
Key insight (Géron, 2019, Ch. 1): ML shines when a traditional rule-based solution would require an impractically long list of hand-crafted conditions — such as a spam filter — because an ML system automatically adapts when patterns shift.
# Illustrate Mitchell's T/E/P framework for a spam filter
framework <- tibble(
Component = c("Task (T)", "Experience (E)", "Performance Measure (P)"),
Description = c(
"Classify incoming emails as spam or not-spam. The program must learn how to sort emails without being explicitly programmed with rules like 'if email contains V1agra, then spam'.",
"A dataset of historical emails that have been manually labelled by users as 'spam' or 'not spam' (ham). This is the data the algorithm studies.",
"Accuracy: the proportion of incoming emails correctly classified. The algorithm tries to maximize this score."
)
)
kable(framework, caption = "Mitchell's T/E/P Framework — Spam Filter Example") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Component | Description |
|---|---|
| Task (T) | Classify incoming emails as spam or not-spam. The program must learn how to sort emails without being explicitly programmed with rules like ‘if email contains V1agra, then spam’. |
| Experience (E) | A dataset of historical emails that have been manually labelled by users as ‘spam’ or ‘not spam’ (ham). This is the data the algorithm studies. |
| Performance Measure (P) | Accuracy: the proportion of incoming emails correctly classified. The algorithm tries to maximize this score. |
Why it matters: In traditional programming, you write the rules to process the data and get answers. In machine learning, you provide the data and the answers (E), and the algorithm figures out the rules (T) in order to maximize performance (P).
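A minimal sketch of that contrast (hypothetical toy data; the threshold of 4 suspicious words and the logistic model are illustrative only):

```r
# Toy labelled data: number of suspicious words per email, plus the answer
emails <- data.frame(
  n_suspicious = c(0, 1, 5, 7, 0, 6, 2, 8),
  is_spam      = c(0, 0, 1, 1, 0, 1, 0, 1)
)

# Traditional programming: a human writes the rule
rule_based <- function(n) ifelse(n >= 4, 1, 0)

# Machine learning: give the algorithm the data and the answers (E),
# and it infers the rule (T) that maximizes performance (P)
fit <- glm(is_spam ~ n_suspicious, data = emails, family = binomial)
learned <- function(n) {
  as.integer(predict(fit, data.frame(n_suspicious = n), type = "response") > 0.5)
}

rule_based(c(1, 6))  # hand-crafted threshold
learned(c(1, 6))     # threshold inferred from the labelled examples
```

Both functions agree here, but only the learned one adapts automatically if spammers change their vocabulary and we retrain on fresh labels.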
Machine Learning systems can be classified according to the amount and type of supervision they get during training. The two most common categories are Supervised and Unsupervised learning.
In supervised learning, the algorithm is taught by example. You provide it with input features (e.g. house size, location) and the correct answers, called labels (e.g. the house price). The algorithm learns the relationship between the inputs and the labels so it can predict the label for new, unseen data.

* Goal: Prediction / Approximation
* Examples: Spam detection, house price prediction, image classification.
In unsupervised learning, the algorithm is given data without explicit instructions on what to do with it. There are no labeled answers. The system tries to learn without a teacher, uncovering hidden patterns, structures, or groupings in the data.

* Goal: Description / Discovery
* Examples: Customer segmentation, anomaly detection, topic modeling (grouping news articles).
types <- tibble(
Type = c("Supervised Learning", "Unsupervised Learning"),
Training_Data = c("Labelled (input + output pairs)", "Unlabelled (inputs only)"),
Goal = c(
"Learn a mapping f(X) → Y to predict outputs for new inputs. (You supply the answers during training).",
"Discover structure, clusters, or density in the data. (No answers provided)."
),
Examples = c(
"Spam detection (spam/not spam), house price prediction (price $), image classification (cat/dog)",
"Customer segmentation (grouping by buying habits), anomaly detection (finding rare events)"
)
)
kable(types, col.names = c("Type", "Training Data", "Goal", "Examples"),
caption = "Supervised vs. Unsupervised Learning") |>
  kable_styling(bootstrap_options = c("striped", "hover"))

| Type | Training Data | Goal | Examples |
|---|---|---|---|
| Supervised Learning | Labelled (input + output pairs) | Learn a mapping f(X) → Y to predict outputs for new inputs. (You supply the answers during training). | Spam detection (spam/not spam), house price prediction (price $), image classification (cat/dog) |
| Unsupervised Learning | Unlabelled (inputs only) | Discover structure, clusters, or density in the data. (No answers provided). | Customer segmentation (grouping by buying habits), anomaly detection (finding rare events) |
Géron (2019, Ch. 1): “Supervised learning is about approximation while unsupervised learning is about description.”
Regression is a supervised learning task where you want to predict a continuous numerical value.
Imagine you are trying to predict the Sales Revenue of a retail store based on how much they spend on Advertising.
Because we provide the algorithm with pairs of \((X, Y)\), it can learn to draw a line of best fit through the points. This is linear regression. Once trained, if we tell the model we plan to spend $8,000 on ads next week, it can trace up to the red line and predict our expected revenue.
set.seed(seeds[2])
# Simulate a simple supervised learning scenario
n <- 80
x <- runif(n, 0, 10) # Advertising spend (in thousands)
y <- 2.5 * x + 5 + rnorm(n, sd = 3) # Sales Revenue (in thousands)
df_sup <- tibble(x = x, y = y)
model <- lm(y ~ x, data = df_sup)
ggplot(df_sup, aes(x, y)) +
geom_point(colour = "#4C72B0", alpha = 0.7, size = 3) +
geom_smooth(method = "lm", colour = "#DD4444", se = TRUE, fill = "#DD4444", alpha = 0.2) +
labs(
title = "Supervised Learning — Linear Regression",
subtitle = sprintf("Learned Rule: Expected Sales = %.2f + %.2f * Ad Spend",
coef(model)[1], coef(model)[2]),
x = "Input Feature X (e.g. Ad Spend in $K)",
y = "Target Label Y (e.g. Sales in $K)"
) +
theme_minimal(base_size = 14) +
  theme(plot.title = element_text(face = "bold"))

Supervised Learning: The algorithm learns to map the input X to the target Y.
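A sketch of the "spend $8,000" prediction described above, re-simulating similar data with an arbitrary seed (coefficients will differ slightly from the plot):

```r
# Simulate ad spend vs. sales as above, then query the trained model
set.seed(42)
x <- runif(80, 0, 10)                  # Ad spend ($K)
y <- 2.5 * x + 5 + rnorm(80, sd = 3)   # Sales ($K)
model <- lm(y ~ x)

# Predicted expected sales for an $8K ad spend, with a prediction interval
predict(model, newdata = data.frame(x = 8), interval = "prediction")
```

The point estimate should land near 2.5 * 8 + 5 = 25 ($25K), and the interval quantifies the noise around the fitted line.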
Clustering is an unsupervised learning task.
Imagine you have a database of customers, recording their Income and Spending Score. But you don’t know who is who. You want to group them into distinct categories so your marketing team can target them differently.
The K-Means algorithm looks at the distances between points and groups them based on proximity. We ask it to find 3 groups (clusters), and it assigns a color to each group.
set.seed(seeds[3])
# Three unlabelled clusters
df_usup <- bind_rows(
tibble(x = rnorm(40, 2, 0.6), y = rnorm(40, 2, 0.6)), # Low Income, Low Spend
tibble(x = rnorm(40, 6, 0.6), y = rnorm(40, 5, 0.6)), # High Income, Average Spend
tibble(x = rnorm(40, 4, 0.6), y = rnorm(40, 9, 0.6)) # Average Income, High Spend
)
# Run K-Means Clustering and ask for 3 centers
km <- kmeans(df_usup, centers = 3, nstart = 20)
df_usup$cluster <- factor(km$cluster)
ggplot(df_usup, aes(x, y, colour = cluster)) +
geom_point(size = 3, alpha = 0.8) +
stat_ellipse(level = 0.85, linetype = "dashed", linewidth = 1) +
scale_color_brewer(palette = "Set1", labels=c("Budget Shoppers", "Luxury Buyers", "Standard Shoppers")) +
labs(
title = "Unsupervised Learning — K-Means Clustering (k = 3)",
subtitle = "The algorithm finds internal structure and groups similar data points together.",
x = "Feature 1 (e.g. Income)",
y = "Feature 2 (e.g. Spending Score)",
colour = "Discovered\nClusters"
) +
theme_minimal(base_size = 14) +
  theme(legend.position = "right", plot.title = element_text(face = "bold"))

Unsupervised Learning: The algorithm finds distinct subgroups (clusters) in the unlabelled data.
The lecture frames learning algorithms along four conceptual axes:
perspectives <- tibble(
Perspective = c("Information-based", "Similarity-based",
"Probability-based", "Error-based"),
Core_Idea = c(
"Use information to guide decisions; minimise entropy or misclassification",
"Group or predict based on closeness in some feature space",
"Estimate the probability of class membership rather than assigning hard labels",
"Build a model, measure its error against known answers, and iteratively reduce that error"
),
Canonical_Algorithm = c("Decision Tree", "k-Nearest Neighbours / Regression",
"Naïve Bayes", "Neural Networks / Gradient Descent")
)
kable(perspectives,
col.names = c("Perspective", "Core Idea", "Canonical Algorithm"),
caption = "Four Perspectives on Machine Learning") |>
kable_styling(bootstrap_options = c("striped", "hover")) |>
  column_spec(1, bold = TRUE)

| Perspective | Core Idea | Canonical Algorithm |
|---|---|---|
| Information-based | Use information to guide decisions; minimise entropy or misclassification | Decision Tree |
| Similarity-based | Group or predict based on closeness in some feature space | k-Nearest Neighbours / Regression |
| Probability-based | Estimate the probability of class membership rather than assigning hard labels | Naïve Bayes |
| Error-based | Build a model, measure its error against known answers, and iteratively reduce that error | Neural Networks / Gradient Descent |
Let’s visualize two of these perspectives to see how they “think” differently about the exact same data.
set.seed(seeds[7])
# Generate some data: two classes (red and blue)
df_persp <- tibble(
x1 = c(rnorm(20, 2, 0.5), rnorm(20, 4, 0.5)),
x2 = c(rnorm(20, 2, 0.5), rnorm(20, 4, 0.5)),
class = factor(rep(c("Class A", "Class B"), each = 20))
)
# New unknown data point we want to predict
new_point <- tibble(x1 = 3, x2 = 2.8)
# Find 3 closest points for the Similarity-Based plot
distances <- sqrt((df_persp$x1 - new_point$x1)^2 + (df_persp$x2 - new_point$x2)^2)
closest_idx <- order(distances)[1:3]
closest_points <- df_persp[closest_idx, ]
# 1. Similarity-based (k-Nearest Neighbors)
p_sim <- ggplot(df_persp, aes(x1, x2, color = class)) +
geom_point(size = 3, alpha = 0.5) +
geom_point(data = new_point, aes(x1, x2), color = "black", shape = 8, size = 5, inherit.aes = FALSE) +
geom_segment(data = closest_points, aes(x = x1, y = x2, xend = new_point$x1, yend = new_point$x2),
color = "purple", linetype="dashed", linewidth = 0.8) +
annotate("text", x = 3, y = 2.4, label = "Unknown Data Point", fontface="italic", color="black") +
scale_color_manual(values = c("Class A"="#d62728", "Class B"="#1f77b4")) +
labs(title="1. Similarity-Based Perspective", subtitle="'You are what your nearest neighbors are' (e.g. k-NN)\nAlgorithm finds the 3 closest points and takes a majority vote.", x="Feature 1", y="Feature 2") +
theme_minimal(base_size = 14) + theme(legend.position="none", plot.title=element_text(face="bold"))
# 2. Error-based (Linear boundary / Logistic Regression)
p_err <- ggplot(df_persp, aes(x1, x2, color = class)) +
geom_point(size = 3, alpha = 0.5) +
geom_point(data = new_point, aes(x1, x2), color = "black", shape = 8, size = 5, inherit.aes = FALSE) +
geom_abline(intercept = 6.4, slope = -1.1, color = "#2ca02c", linewidth = 1.5) +
annotate("text", x = 2.55, y = 4.34, label = "Learned\nBoundary", color = "#2ca02c", fontface="bold") +
annotate("text", x = 3, y = 2.4, label = "Unknown Data Point", fontface="italic", color="black") +
scale_color_manual(values = c("Class A"="#d62728", "Class B"="#1f77b4")) +
labs(title="2. Error-Based Perspective", subtitle="'Draw a line that minimizes errors on the training data'\nAlgorithm checks which side of the line the new point lands on.", x="Feature 1", y="Feature 2") +
theme_minimal(base_size = 14) + theme(legend.position="bottom", plot.title=element_text(face="bold"), legend.title=element_blank())
gridExtra::grid.arrange(p_sim, p_err, ncol=2)

Similarity-Based vs. Error-Based Learning. Two different ways to solve the exact same classification problem.
Before looking at specific equations and problem types, we must understand the core challenge of Machine Learning: building models that generalize well to new data.
The central tension in Machine Learning is between Underfitting and Overfitting.
set.seed(seeds[5])
# Generate a noisy Sine wave true pattern
x_curve <- seq(0, 10, length.out=50)
y_true <- sin(x_curve)
y_noisy <- y_true + rnorm(50, sd=0.4)
df_fit <- tibble(x=x_curve, y=y_noisy, y_true=y_true)
# 1. Underfit (Linear)
p_under <- ggplot(df_fit, aes(x, y)) +
geom_point(alpha=0.6, color="grey30") +
geom_smooth(method="lm", se=FALSE, color="#d62728", linewidth=1.5) +
labs(title="1. Underfitting", subtitle="Model is too simple (High Bias)", x="", y="") +
theme_minimal() + theme(plot.title=element_text(face="bold"))
# 2. Good Fit (Polynomial deg 3)
p_good <- ggplot(df_fit, aes(x, y)) +
geom_point(alpha=0.6, color="grey30") +
geom_smooth(method="lm", formula=y ~ poly(x, 3), se=FALSE, color="#2ca02c", linewidth=1.5) +
labs(title="2. Good Fit", subtitle="Model captures the true trend", x="", y="") +
theme_minimal() + theme(plot.title=element_text(face="bold"))
# 3. Overfit (Polynomial deg 25 / Spline with high df)
p_over <- ggplot(df_fit, aes(x, y)) +
geom_point(alpha=0.6, color="grey30") +
geom_smooth(method="lm", formula=y ~ splines::ns(x, 25), se=FALSE, color="#1f77b4", linewidth=1.5) +
labs(title="3. Overfitting", subtitle="Model memorizes the noise (High Variance)", x="", y="") +
theme_minimal() + theme(plot.title=element_text(face="bold"))
gridExtra::grid.arrange(p_under, p_good, p_over, ncol=3)

The Goldilocks Problem of Machine Learning: Underfitting, Overfitting, and the Good Fit.
Because overfitting is such a severe problem, we never evaluate a model on the data it was trained on.
If a student memorizes the exact answers to a practice test, getting 100% on the practice test doesn’t mean they will do well on the final exam.
Therefore, we split our data: 1. Training Set (e.g. 80%): Given to the algorithm so it can learn the patterns. 2. Testing Set (e.g. 20%): Hidden from the algorithm until the very end. We use this to evaluate how well the model generalizes to unseen data.
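A minimal sketch of performing such a split in base R (using the built-in `mtcars` data as a stand-in; the 80% fraction and the `mpg ~ wt` model are illustrative):

```r
# 80/20 train/test split with base R
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))   # 80% of row indices, at random
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]                   # the held-out 20%

model <- lm(mpg ~ wt, data = train)             # learn only from training rows
preds <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$mpg - preds)^2))       # evaluate only on unseen rows
rmse
```

The test-set RMSE is the honest estimate of generalization; the training-set RMSE would flatter the model.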
set.seed(seeds[6])
# Create simple visualization of data split
split_data <- tibble(
ID = 1:100,
Set = c(rep("Training Data (80%)", 80), rep("Testing Data (20%)", 20))
)
ggplot(split_data, aes(x = ID, y = 1, fill = Set)) +
geom_tile(color="white") +
scale_fill_manual(values = c("Testing Data (20%)" = "#ff7f0e", "Training Data (80%)" = "#1f77b4")) +
theme_void() +
labs(title="The Train / Test Split Concept", subtitle="We hold out a portion of our data to test the model's true performance.", fill="") +
theme(axis.text=element_blank(), axis.ticks=element_blank(), panel.grid=element_blank(),
        plot.title=element_text(face="bold", size=14), legend.position="bottom", legend.text=element_text(size=12))

Data split visualization
A simple Train/Test split relies heavily on chance: what if all the difficult or unusual data points end up in the test set? Your model would look artificially bad!
To solve this, we use K-Fold Cross-Validation. Instead of splitting the dataset just once, we divide it into \(K\) equal chunks (folds). We then train the model \(K\) times. In every iteration, a different chunk is kept hidden as the Testing set, while the remaining chunks form the Training set. Finally, we average the \(K\) different test scores to get a highly reliable measure of the model’s performance.
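A hand-rolled sketch of the procedure (base R, `mtcars` as stand-in data; in practice a package such as caret or rsample would manage the folds):

```r
# 5-fold cross-validation by hand
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # assign each row a fold

cv_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]    # K-1 folds form the training set
  test  <- mtcars[folds == i, ]    # the remaining fold is held out
  fit   <- lm(mpg ~ wt, data = train)
  sqrt(mean((test$mpg - predict(fit, test))^2))
})
cv_rmse          # K separate test scores
mean(cv_rmse)    # the averaged, more reliable performance estimate
```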
# Generate K-fold data structure
folds_df <- expand.grid(
Fold = factor(1:5, levels = 5:1),
Iteration = factor(paste("Iteration", 1:5), levels = paste("Iteration", 1:5))
)
# Assign Train/Test status based on whether Fold == Iteration number
folds_df$Status <- ifelse(as.numeric(folds_df$Fold) == 6 - as.numeric(folds_df$Iteration),
"Testing Fold (Held Out)", "Training Fold (Used to learn)")
ggplot(folds_df, aes(x = Iteration, y = Fold, fill = Status)) +
geom_tile(color = "white", linewidth=1.5) +
geom_text(aes(label = paste("Data Block", 6 - as.numeric(Fold))), color="white", fontface="bold", size=5) +
scale_fill_manual(values = c("Testing Fold (Held Out)" = "#ff7f0e", "Training Fold (Used to learn)" = "#1f77b4")) +
theme_minimal(base_size = 14) +
labs(
title = "K-Fold Cross Validation (K = 5)",
subtitle = "The model is trained 5 times. Each time, a different block of data is held out to test performance.",
x = "", y = "", fill = "Data Role during Iteration:"
) +
theme(
panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(face="bold", size=12),
plot.title = element_text(face = "bold"),
legend.position = "bottom"
  )

5-Fold Cross Validation Process. Notice how every block of data gets to act as the test set exactly once.
Once you know whether your problem is supervised or unsupervised, you must specify the exact output type the algorithm needs to produce.
problems <- tibble(
Category = c("Classification", "Regression", "Time Series", "Clustering", "Dimensionality Reduction"),
Type = c("Supervised", "Supervised", "Supervised", "Unsupervised", "Unsupervised"),
Goal = c("Assign a category or class", "Predict a numerical quantity", "Predict future values in a sequence", "Group similar items together", "Simplify data by combining features"),
Output = c("Discrete class label (e.g. 'Cat', 'Dog', 'Spam')", "Continuous numeric value (e.g. $250,500)", "Continuous numeric value at time t", "Group ID (1, 2, 3...)", "New abstract features (Component 1, Component 2)"),
Algorithm_Example = c("Logistic Regression, Random Forest", "Linear Regression, XGBoost", "ARIMA, LSTMs", "K-Means, DBSCAN", "PCA, t-SNE")
)
kable(problems,
col.names = c("Problem Category", "Learning Type", "Core Goal", "What it Outputs", "Common Algorithms"),
caption = "Machine Learning Problem Categories Explained") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) |>
column_spec(1, bold = TRUE, color = "white", background = "#4b6584") |>
row_spec(1:3, background = "#f1f8ff") |>
  row_spec(4:5, background = "#fff8f1")

| Problem Category | Learning Type | Core Goal | What it Outputs | Common Algorithms |
|---|---|---|---|---|
| Classification | Supervised | Assign a category or class | Discrete class label (e.g. ‘Cat’, ‘Dog’, ‘Spam’) | Logistic Regression, Random Forest |
| Regression | Supervised | Predict a numerical quantity | Continuous numeric value (e.g. $250,500) | Linear Regression, XGBoost |
| Time Series | Supervised | Predict future values in a sequence | Continuous numeric value at time t | ARIMA, LSTMs |
| Clustering | Unsupervised | Group similar items together | Group ID (1, 2, 3…) | K-Means, DBSCAN |
| Dimensionality Reduction | Unsupervised | Simplify data by combining features | New abstract features (Component 1, Component 2) | PCA, t-SNE |
Here is a side-by-side visual comparison of the two main types of Supervised Learning.
Scenario: We are predicting student outcomes based on their study habits.
set.seed(seeds[4])
# --- Classification: predict class from a threshold ---
df_clf <- tibble(
score = runif(100, 0, 100),
pass = factor(ifelse(score + rnorm(100, 0, 8) > 50, "Pass", "Fail"))
)
p1 <- ggplot(df_clf, aes(score, fill = pass)) +
geom_histogram(bins = 20, alpha = 0.8, color = "white", position = "identity") +
geom_vline(xintercept = 50, linetype = "dashed", color = "black", linewidth = 1.2) +
annotate("text", x = 40, y = 8, label = "Decision\nBoundary", angle = 90, vjust=0) +
scale_fill_manual(values = c("Pass" = "#2ca02c", "Fail" = "#d62728")) +
labs(title = "Classification Task", subtitle = "Predicting a discrete category (Pass/Fail)",
x = "Exam Score", y = "Number of Students", fill = "Prediction:") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", color = "#333333"),
legend.position = "bottom")
# --- Regression: predict continuous GPA ---
df_reg <- tibble(
study_hours = runif(100, 0, 10),
gpa = 1.5 + 0.25 * study_hours + rnorm(100, 0, 0.4)
)
p2 <- ggplot(df_reg, aes(study_hours, gpa)) +
geom_point(colour = "#1f77b4", alpha = 0.7, size = 3) +
geom_smooth(method = "lm", colour = "#ff7f0e", linewidth=1.5, fill = "#ff7f0e", alpha=0.2) +
annotate("text", x = 2, y = 3.5, label = "Regression Line", color = "#ff7f0e", fontface="bold") +
labs(title = "Regression Task", subtitle = "Predicting a continuous quantity (GPA 0.0 - 4.0)",
x = "Study Hours per Week", y = "Predicted Expected GPA") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", color = "#333333"))
gridExtra::grid.arrange(p1, p2, ncol = 2)

Comparing Classification (predicting categories) with Regression (predicting quantities).
Linear algebra is the mathematical language of ML. Vectors represent data points and model weights; matrices encode transformations and datasets; eigendecomposition underlies PCA and many optimisation algorithms.
Book reference: Géron (2019) uses linear algebra throughout — see especially Ch. 4 (gradient descent), Ch. 8 (PCA).
A vector is an ordered list of numbers. By convention, vectors are column vectors.
\[\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}\]
Vector length (L2 norm):
\[\|\mathbf{v}\| = \sqrt{\sum_{i=1}^{n} v_i^2}\]
v <- c(3, 4) # 2-D column vector
u <- c(1, 2)
# Vector length
v_length <- sqrt(sum(v^2))
cat("v =", v, "\n")
cat("||v|| =", v_length, "(should be 5)\n")
cat("v + u =", v + u, "\n")
cat("3 * v =", 3 * v, "\n")
cat("v · u =", sum(v * u), "\n")
## v = 3 4
## ||v|| = 5 (should be 5)
## v + u = 4 6
## 3 * v = 9 12
## v · u = 11
Instead of thinking of a vector as just a list of numbers, think of it as an arrow in space. The arrow starts at the origin (0, 0) and points to the coordinates \((v_1, v_2)\).
v <- c(3, 4)
u <- c(1, 2)
sum_vu <- v + u
# Compute angle between v and u
angle_deg <- round(acos(sum(v * u) / (sqrt(sum(v^2)) * sqrt(sum(u^2)))) * 180 / pi, 1)
# Build a long data frame for geom_segment
vec_df <- data.frame(
x = c(0, 0, 0),
y = c(0, 0, 0),
xend = c(v[1], u[1], sum_vu[1]),
yend = c(v[2], u[2], sum_vu[2]),
vec = c("v = (3, 4)", "u = (1, 2)", "v + u = (4, 6)")
)
# Dashed parallelogram helper lines
par_df <- data.frame(
x = c(v[1], u[1]),
y = c(v[2], u[2]),
xend = c(sum_vu[1], sum_vu[1]),
yend = c(sum_vu[2], sum_vu[2])
)
ggplot() +
# Dashed parallelogram lines
geom_segment(data = par_df, aes(x=x, y=y, xend=xend, yend=yend),
linetype="dashed", color="grey60", linewidth=0.8) +
# Main vectors
geom_segment(data = vec_df,
aes(x=x, y=y, xend=xend, yend=yend, color=vec),
arrow = arrow(length = unit(0.35, "cm"), type="closed"),
linewidth = 1.8) +
# Labels at arrow tips
annotate("label", x=3.2, y=4.3, label="v = (3, 4)", color="#d62728", fontface="bold", size=4.5, fill="#fff0f0") +
annotate("label", x=0.8, y=2.2, label="u = (1, 2)", color="#1f77b4", fontface="bold", size=4.5, fill="#f0f4ff") +
annotate("label", x=4.5, y=6.1, label="v + u = (4, 6)",color="#2ca02c", fontface="bold", size=4.5, fill="#f0fff0") +
annotate("text", x=2.5, y=0.3,
label=paste0("Dot product v\u00b7u = ", sum(v*u),
" | Angle between v & u = ", angle_deg, "\u00b0"),
color="#555555", fontface="italic", size=4.5) +
# Axes
geom_hline(yintercept=0, colour="grey50") +
geom_vline(xintercept=0, colour="grey50") +
scale_color_manual(values=c("v = (3, 4)"="#d62728", "u = (1, 2)"="#1f77b4", "v + u = (4, 6)"="#2ca02c")) +
coord_equal(xlim=c(-0.5, 5.5), ylim=c(-0.5, 7)) +
labs(
title = "Vectors as Arrows in 2D Space",
subtitle = "Vector addition chains arrows tip-to-tail (parallelogram rule).\nThe dot product measures how much two vectors 'agree' in direction.",
x = "X axis", y = "Y axis"
) +
theme_minimal(base_size = 14) +
theme(legend.position="none", plot.title=element_text(face="bold"),
        plot.subtitle=element_text(color="#555555"))

Vectors as arrows: v, u, their sum, and the angle between them.
A matrix is a rectangular array of numbers. For ML, the data matrix \(X\) is \(n \times p\) (n observations, p features).
Matrix addition requires identical dimensions. Matrix multiplication \(AB\) requires cols(A) = rows(B).
A <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
B <- matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2, byrow = TRUE)
cat("A:\n"); print(A)
cat("\nB:\n"); print(B)
cat("\nA + B:\n"); print(A + B)
cat("\nA %*% B:\n"); print(A %*% B)
cat("\nt(A):\n"); print(t(A))
## A:
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
##
## B:
## [,1] [,2]
## [1,] 5 6
## [2,] 7 8
##
## A + B:
## [,1] [,2]
## [1,] 6 8
## [2,] 10 12
##
## A %*% B:
## [,1] [,2]
## [1,] 19 22
## [2,] 43 50
##
## t(A):
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
This is perhaps the most powerful geometric intuition in all of linear algebra:
A matrix is a recipe for transforming space.
When you multiply a matrix by a set of points, you are stretching, rotating, shearing, or squishing those points. Think of the matrix as a special pair of glasses that changes how you see the coordinate system.
The plot below shows a regular grid of points (the before state) and where those same points land after being multiplied by the matrix \(M = \begin{pmatrix} 2 & 1 \\ 0.5 & 1.5 \end{pmatrix}\) (the after state).
# Matrix to apply
M <- matrix(c(2, 0.5, 1, 1.5), nrow=2) # column-major
# Create a regular grid of points
grid_pts <- expand.grid(x = seq(-2, 2, by=0.5), y = seq(-2, 2, by=0.5))
# Apply transformation: each point (x, y) -> M %*% c(x, y)
transformed <- t(M %*% t(as.matrix(grid_pts)))
grid_pts_t <- data.frame(x = transformed[,1], y = transformed[,2])
# Basis vectors before and after
basis_before <- data.frame(
x=c(0,0), y=c(0,0), xend=c(1,0), yend=c(0,1),
label=c("e1 = (1,0)", "e2 = (0,1)")
)
basis_after <- data.frame(
x=c(0,0), y=c(0,0),
xend = c(M[1,1], M[1,2]),
yend = c(M[2,1], M[2,2]),
label=c("M·e1", "M·e2")
)
p_before <- ggplot(grid_pts, aes(x, y)) +
geom_point(color="#1f77b4", alpha=0.5, size=1.5) +
geom_segment(data=basis_before, aes(x=x,y=y,xend=xend,yend=yend,color=label),
arrow=arrow(length=unit(0.3,"cm")), linewidth=2) +
scale_color_manual(values=c("e1 = (1,0)"="#d62728", "e2 = (0,1)"="#2ca02c")) +
geom_hline(yintercept=0, colour="grey60") + geom_vline(xintercept=0, colour="grey60") +
coord_equal(xlim=c(-3,3), ylim=c(-3,3)) +
labs(title="BEFORE: Original Grid", subtitle="The standard coordinate space", x="", y="", color="Basis Vector:") +
theme_minimal(base_size=13) + theme(plot.title=element_text(face="bold"), legend.position="bottom")
p_after <- ggplot(grid_pts_t, aes(x, y)) +
geom_point(color="#ff7f0e", alpha=0.5, size=1.5) +
geom_segment(data=basis_after, aes(x=x,y=y,xend=xend,yend=yend,color=label),
arrow=arrow(length=unit(0.3,"cm")), linewidth=2) +
scale_color_manual(values=c("M·e1"="#d62728", "M·e2"="#2ca02c")) +
geom_hline(yintercept=0, colour="grey60") + geom_vline(xintercept=0, colour="grey60") +
coord_equal(xlim=c(-5,5), ylim=c(-4,4)) +
labs(title="AFTER: Transformed Grid", subtitle="The same points after applying matrix M", x="", y="", color="Transformed Basis:") +
theme_minimal(base_size=13) + theme(plot.title=element_text(face="bold"), legend.position="bottom")
gridExtra::grid.arrange(p_before, p_after, ncol=2)

Left: original coordinate grid. Right: the same grid after applying matrix M. Arrows show where the basis vectors land.
Why this matters for ML: A neural network layer with a weight matrix \(W\) is literally doing this — it’s stretching and rotating your data into a new space where patterns become easier to detect. Understanding that “matrix multiplication = space transformation” is the key insight.
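As a minimal sketch (made-up weights and data, not a trained network), a single dense layer applies exactly this kind of transform, plus a bias shift and a nonlinearity:

```r
# One "layer" = matrix transform of the data, then a bias shift, then ReLU
set.seed(1)
X <- matrix(rnorm(8), nrow = 4, ncol = 2)   # 4 data points in 2-D feature space
W <- matrix(c(2, 0.5, 1, 1.5), nrow = 2)    # weight matrix (same M as the grid plot)
b <- c(0.1, -0.2)                           # bias vector, one entry per output feature
relu <- function(z) pmax(z, 0)              # clip negatives to zero

# Transform the space, shift it, squash it
H <- relu(sweep(X %*% W, 2, b, "+"))
dim(H)   # still 4 points, now living in the transformed 2-D space
```

Stacking several such layers chains several space transformations, which is why deep networks can untangle patterns a single linear map cannot.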
The determinant of a \(2 \times 2\) matrix:
\[\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc\]
The inverse \(A^{-1}\) satisfies \(A A^{-1} = I\). It exists only when \(\det(A) \neq 0\).
\[A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
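As a quick check, the closed-form \(2 \times 2\) formula above can be implemented directly and compared against R’s built-in `solve()` (a sketch; `inv2x2` is a made-up helper name):

```r
# 2x2 inverse via the adjugate formula: (1/det) * [[d, -b], [-c, a]]
inv2x2 <- function(A) {
  d <- A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]   # det = ad - bc
  stopifnot(d != 0)                            # inverse exists only if det != 0
  matrix(c(A[2, 2], -A[2, 1], -A[1, 2], A[1, 1]), nrow = 2) / d
}

A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
inv2x2(A)    # should match solve(A)
```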
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
A_inv <- solve(A)
cat("A:\n"); print(A)
cat("\ndet(A) =", det(A), "\n")
cat("\nA^(-1):\n"); print(round(A_inv, 4))
## A:
## [,1] [,2]
## [1,] 2 1
## [2,] 1 2
##
## det(A) = 3
##
## A^(-1):
## [,1] [,2]
## [1,] 0.6667 -0.3333
## [2,] -0.3333 0.6667
# Verify: A %*% A^(-1) should be the identity matrix
cat("\nA %*% A^(-1) (should be I):\n"); print(round(A %*% A_inv, 10))
##
## A %*% A^(-1) (should be I):
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
Given \(A\mathbf{x} = \mathbf{b}\), the solution is \(\mathbf{x} = A^{-1}\mathbf{b}\) (when \(A\) is invertible):
\[\begin{cases} 2x + y = 5 \\ x + 2y = 4 \end{cases}\]
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
b <- c(5, 4)
# Solve the linear system (solve(A, b) computes this directly,
# equivalent to A^(-1) %*% b when A is invertible)
x <- solve(A, b)
cat("Solution: x =", x[1], ", y =", x[2], "\n")
cat("Check A %*% x:", A %*% x, "(should be", b, ")\n")
## Solution: x = 2 , y = 1
## Check A %*% x: 5 4 (should be 5 4 )
Each linear equation in a \(2 \times 2\) system describes a straight line on a 2D plot. Solving the system means finding the single point where both lines cross simultaneously — the \((x, y)\) pair that satisfies both equations at once.
This is directly analogous to how ML models find optimal parameters: they are looking for the one point in parameter space that best satisfies all training constraints at the same time.
# System: 2x + y = 5 and x + 2y = 4 => solution (x=2, y=1)
x_vals <- seq(-1, 4, length.out=200)
line1 <- 5 - 2 * x_vals # y = 5 - 2x
line2 <- (4 - x_vals) / 2 # y = (4 - x) / 2
lines_df <- data.frame(
x = rep(x_vals, 2),
y = c(line1, line2),
eq = rep(c("Equation 1: 2x + y = 5", "Equation 2: x + 2y = 4"), each=200)
)
ggplot(lines_df, aes(x, y, color=eq)) +
geom_line(linewidth=1.8) +
geom_point(aes(x=2, y=1), color="black", size=5, shape=21, fill="gold", stroke=2,
inherit.aes=FALSE) +
annotate("label", x=2.3, y=1.4,
label="Solution\n(x = 2, y = 1)",
color="black", fontface="bold", size=4.5, fill="lightyellow") +
annotate("text", x=0.2, y=4.4, label="2x + y = 5", color="#d62728", fontface="bold", size=4.5) +
annotate("text", x=3.2, y=0.2, label="x + 2y = 4", color="#1f77b4", fontface="bold", size=4.5) +
scale_color_manual(values=c("Equation 1: 2x + y = 5"="#d62728", "Equation 2: x + 2y = 4"="#1f77b4")) +
geom_hline(yintercept=0, color="grey70") + geom_vline(xintercept=0, color="grey70") +
coord_cartesian(xlim=c(-0.5, 4), ylim=c(-0.5, 5.5)) +
labs(
title = "Geometric Interpretation of a Linear System",
subtitle = "Each equation is a line. The intersection point IS the solution.",
x = "x", y = "y", color=""
) +
theme_minimal(base_size=14) +
theme(legend.position="bottom", plot.title=element_text(face="bold"),
        plot.subtitle=element_text(color="#555555"))

Geometric interpretation: each equation is a line; the solution is where they intersect.
For a square matrix \(A\), if \(A\mathbf{v} = \lambda\mathbf{v}\), then \(\lambda\) is an eigenvalue and \(\mathbf{v}\) is its corresponding eigenvector.
The eigenvalues are found by solving \(\det(A - \lambda I) = 0\) (the characteristic equation).
Why this matters for ML: PCA (Principal Component Analysis) is entirely built on eigendecomposition — the principal components are the eigenvectors of the covariance matrix, ordered by their eigenvalues. See Géron (2019) Ch. 8.
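A small sketch of that characteristic equation in R, using the same \(2 \times 2\) matrix as the example below: plugging an eigenvalue into \(\det(A - \lambda I)\) should give (numerically) zero.

```r
# Characteristic polynomial evaluated numerically: det(A - lambda * I)
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
char_poly <- function(lambda) det(A - lambda * diag(2))

# For this A: det(A - lambda*I) = (2 - lambda)^2 - 1 = lambda^2 - 4*lambda + 3,
# whose roots are 3 and 1 — the eigenvalues
sapply(c(3, 1), char_poly)   # both effectively zero
sapply(c(0, 2), char_poly)   # non-eigenvalues give nonzero determinants
```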
# Match the lecture example
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
cat("A:\n"); print(A)
ea <- eigen(A)
cat("\nEigenvalues (λ):\n"); print(ea$values)
cat("\nEigenvectors (columns):\n"); print(ea$vectors)
## A:
## [,1] [,2]
## [1,] 2 1
## [2,] 1 2
##
## Eigenvalues (λ):
## [1] 3 1
##
## Eigenvectors (columns):
## [,1] [,2]
## [1,] 0.707107 -0.707107
## [2,] 0.707107 0.707107
# Verify: A * v = λ * v for the first eigenpair
lambda1 <- ea$values[1]
v1 <- ea$vectors[, 1]
cat("\nVerify A*v1 = lambda1*v1:\n")##
## Verify A*v1 = lambda1*v1:
## A*v1 = 2.12132 2.12132
## lambda1*v1 = 2.12132 2.12132
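A quick sanity check (not in the lecture slides): `eigen()` returns unit-length eigenvectors, and because \(A\) is symmetric, eigenvectors belonging to distinct eigenvalues are orthogonal — exactly the property PCA relies on when it builds orthogonal principal components.

```r
A  <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
ea <- eigen(A)
v1 <- ea$vectors[, 1]
v2 <- ea$vectors[, 2]

# Each eigenvector returned by eigen() is normalised to unit L2 length
cat("||v1|| =", sqrt(sum(v1^2)), " ||v2|| =", sqrt(sum(v2^2)), "\n")

# For a symmetric matrix, eigenvectors of distinct eigenvalues are orthogonal
cat("v1 . v2 =", sum(v1 * v2), "\n")
```

Both norms come out as 1 and the dot product is 0 (up to floating-point precision).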
A <- matrix(c(2, 1, 1, 2), nrow = 2, byrow = TRUE)
ea <- eigen(A)
v1 <- ea$vectors[, 1] * ea$values[1] # scaled by eigenvalue
v2 <- ea$vectors[, 2] * ea$values[2]
ggplot() +
# Fix the plot extent so both eigenvectors are visible
expand_limits(x = c(-3, 3), y = c(-3, 3)) +
# Eigenvector 1
geom_segment(aes(x = 0, y = 0, xend = v1[1], yend = v1[2]),
arrow = arrow(length = unit(0.3, "cm")),
colour = "#d62728", linewidth = 1.2) +
annotate("text", x = v1[1] + 0.2, y = v1[2] + 0.15,
label = paste0("lambda[1] == ", as.integer(ea$values[1])),
parse = TRUE, colour = "#d62728", size = 5, fontface="bold") +
# Eigenvector 2
geom_segment(aes(x = 0, y = 0, xend = v2[1], yend = v2[2]),
arrow = arrow(length = unit(0.3, "cm")),
colour = "#1f77b4", linewidth = 1.2) +
annotate("text", x = v2[1] - 0.45, y = v2[2] - 0.15,
label = paste0("lambda[2] == ", as.integer(ea$values[2])),
parse = TRUE, colour = "#1f77b4", size = 5, fontface="bold") +
geom_hline(yintercept = 0, colour = "grey50") +
geom_vline(xintercept = 0, colour = "grey50") +
coord_equal() +
labs(
title = "Visualizing Eigenvectors of a Matrix",
subtitle = "Eigenvectors show the 'principal directions' of a matrix transformation.\nThe eigenvalues (lambda) show how much the matrix stretches data in that direction.\nThe red direction is stretched 3x; the blue direction is unchanged (1x).",
x = "X axis", y = "Y Axis"
) +
theme_minimal(base_size = 14) +
theme(plot.subtitle = element_text(color = "#555555", face = "italic"))
summary_tbl <- tibble(
Topic = c(
"Definition of Learning",
"Supervised Learning",
"Unsupervised Learning",
"Information-based",
"Similarity-based",
"Probability-based",
"Error-based",
"Classification",
"Regression",
"Clustering",
"Vector",
"Matrix multiplication",
"Eigendecomposition"
),
Key_Takeaway = c(
"Optimise performance on a task T using experience E, measured by metric P (Mitchell, 1997)",
"Learn f(X) → Y from labelled examples; goal is approximation",
"Find structure in unlabelled data; goal is description",
"Minimise entropy/error — e.g., Decision Trees",
"Predict by proximity in feature space — e.g., KNN",
"Estimate class probabilities — e.g., Naïve Bayes",
"Iteratively reduce prediction error — e.g., Neural Networks",
"Discrete output; supervised; accuracy / confusion matrix",
"Continuous output; supervised; MSE / R²",
"Group discovery; unsupervised; inertia / silhouette",
"Ordered list of numbers; length = L2 norm",
"Requires cols(A) = rows(B); result is (rows(A) × cols(B))",
"Av = λv; eigenvectors define PCA principal components"
)
)
kable(summary_tbl, col.names = c("Topic", "Key Takeaway"),
caption = "Week 01 — Concept Summary") |>
kable_styling(bootstrap_options = c("striped", "hover")) |>
column_spec(1, bold = TRUE, width = "22%")
| Topic | Key Takeaway |
|---|---|
| Definition of Learning | Optimise performance on a task T using experience E, measured by metric P (Mitchell, 1997) |
| Supervised Learning | Learn f(X) → Y from labelled examples; goal is approximation |
| Unsupervised Learning | Find structure in unlabelled data; goal is description |
| Information-based | Minimise entropy/error — e.g., Decision Trees |
| Similarity-based | Predict by proximity in feature space — e.g., KNN |
| Probability-based | Estimate class probabilities — e.g., Naïve Bayes |
| Error-based | Iteratively reduce prediction error — e.g., Neural Networks |
| Classification | Discrete output; supervised; accuracy / confusion matrix |
| Regression | Continuous output; supervised; MSE / R² |
| Clustering | Group discovery; unsupervised; inertia / silhouette |
| Vector | Ordered list of numbers; length = L2 norm |
| Matrix multiplication | Requires cols(A) = rows(B); result is (rows(A) × cols(B)) |
| Eigendecomposition | Av = λv; eigenvectors define PCA principal components |
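Two of the linear-algebra takeaways in the table can be demonstrated in a few lines of base R (a small sketch; the example matrices here are illustrative, not from the lecture):

```r
# Matrix multiplication: cols(A) must equal rows(B);
# the product has dimensions rows(A) x cols(B)
A <- matrix(1:6, nrow = 2)   # 2 x 3
B <- matrix(1:6, nrow = 3)   # 3 x 2
P <- A %*% B                 # conformable: (2 x 3)(3 x 2) -> 2 x 2
cat("dim(A %*% B):", dim(P), "\n")## dim(A %*% B): 2 2

# Vector length (L2 norm): square root of the sum of squared components
v <- c(3, 4)
cat("||v|| =", sqrt(sum(v^2)), "\n")## ||v|| = 5
```

Trying `B %*% B` instead would raise "non-conformable arguments", since cols(B) = 2 but rows(B) = 3.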