training example, an ideal function $f$ would output values that induce the same ranking of documents as given by labels. Each example $\mathcal{X}_i$, $i = 1, \ldots, N$, is a collection of feature vectors with labels: $\mathcal{X}_i = \{(\mathbf{x}_{i,j}, y_{i,j})\}_{j=1}^{r_i}$. Features in a feature vector $\mathbf{x}_{i,j}$ represent the document $j = 1, \ldots, r_i$.
For example, $x^{(1)}_{i,j}$ could represent how recent the document is, $x^{(2)}_{i,j}$ could reflect whether the words of the query can be found in the document title, $x^{(3)}_{i,j}$ could represent the size of the document, and so on. The label $y_{i,j}$ could be the rank ($1, 2, \ldots, r_i$) or a score. For example, the lower the score, the higher the document should be ranked.
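To make this representation concrete, here is a minimal sketch (not from the book) of how one training example could be stored in Python; the feature names are hypothetical and chosen only to match the examples above:

```python
import numpy as np

# One training example X_i: a query with r_i = 3 candidate documents,
# each described by a feature vector x_{i,j} and a relevance label y_{i,j}.
# Hypothetical features per document: [recency_days, query_in_title, size_kb]
X_i = np.array([
    [ 3.0, 1.0, 120.0],   # document j = 1
    [45.0, 0.0,  80.0],   # document j = 2
    [ 7.0, 1.0, 300.0],   # document j = 3
])
# Labels y_{i,j}: here, ranks (1 is the best position)
y_i = np.array([1, 3, 2])
```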
There are three principal approaches to solve such a learning problem: pointwise, pairwise,
and listwise.
The pointwise approach transforms each training example into multiple examples: one example per document. The learning problem becomes a standard supervised learning problem, either regression or logistic regression. In each example $(\mathbf{x}, y)$ of the pointwise learning problem, $\mathbf{x}$ is the feature vector of some document, and $y$ is the original score (if $y_{i,j}$ is a score) or a synthetic score obtained from the ranking (the higher the rank, the lower the synthetic score). Any supervised learning algorithm can be used in this case. The solution is usually far from perfect. Principally, this is because each document is considered in isolation, while the original ranking (given by the labels $y_{i,j}$ of the original training set) could optimize the positions of the whole set of documents. For example, if we have already given a high rank to a Wikipedia page in some collection of documents, we would not give a high rank to another Wikipedia page for the same query.
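As an illustration of the pointwise transformation, the following hedged Python sketch (my own, not the book's code) flattens ranking examples into ordinary regression examples; the function name and the synthetic-score rule are illustrative choices, assuming labels are ranks with 1 being the best position:

```python
import numpy as np

def to_pointwise(examples):
    """Flatten ranking examples into (features, target) regression pairs.

    examples: list of (X_i, y_i) pairs, where X_i is an array of feature
    vectors and y_i contains ranks (1 = best position).
    """
    X_rows, targets = [], []
    for X_i, y_i in examples:
        r_i = len(y_i)
        for j in range(r_i):
            X_rows.append(X_i[j])
            # Synthetic score: the worse the position (larger rank value),
            # the lower the target value.
            targets.append(float(r_i - y_i[j]))
    return np.array(X_rows), np.array(targets)

# Any regressor could then be fit on the flattened data, e.g.:
# from sklearn.linear_model import LinearRegression
# model = LinearRegression().fit(*to_pointwise(examples))
```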
In the pairwise approach, documents are also considered in isolation from the complete ranking; however, in this case, a pair of documents is considered at once. Given a pair of documents $(\mathbf{x}_i, \mathbf{x}_k)$, we want to build a model $f$ that, given $(\mathbf{x}_i, \mathbf{x}_k)$ as input, outputs a value close to 1 if $\mathbf{x}_i$ has to be put higher than $\mathbf{x}_k$ in the ranking. Otherwise, $f$ outputs a value close to 0. At test time, given a model, the final ranking for an unlabeled example $\mathcal{X}$ is obtained by aggregating the predictions for all pairs of documents in $\mathcal{X}$. Such an approach works better than pointwise, but is still far from perfect.
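The pairwise idea can be sketched as follows. This is an illustrative Python snippet (not the book's code), assuming labels are ranks where a smaller value means a higher position, and a generic binary classifier with a standard predict method:

```python
import numpy as np

def to_pairwise(X_i, y_i):
    """Build binary examples from document pairs.

    The label is 1 if the first document of the pair should be ranked
    higher (smaller rank value) than the second, and 0 otherwise.
    """
    pairs, labels = [], []
    r_i = len(y_i)
    for a in range(r_i):
        for b in range(r_i):
            if a == b:
                continue
            pairs.append(np.concatenate([X_i[a], X_i[b]]))
            labels.append(1 if y_i[a] < y_i[b] else 0)
    return np.array(pairs), np.array(labels)

def rank_by_aggregation(model, X_new):
    """Aggregate pairwise predictions at test time: score each document by
    summing the model's outputs over all pairs in which it appears first."""
    n = len(X_new)
    scores = np.zeros(n)
    for a in range(n):
        for b in range(n):
            if a != b:
                pair = np.concatenate([X_new[a], X_new[b]]).reshape(1, -1)
                scores[a] += model.predict(pair)[0]
    # Documents with higher aggregated scores are placed higher in the ranking.
    return np.argsort(-scores)
```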
The state-of-the-art rank learning algorithms, such as LambdaMART, implement the listwise approach. In the listwise approach, we try to optimize the model directly on some metric that reflects the quality of ranking. There are various metrics for assessing search engine result ranking, including precision and recall. One popular metric that combines both precision and recall is called mean average precision (MAP).
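As a preview of the definitions that follow, here is a hedged Python sketch (my illustration, not the book's formula) of average precision for a single query with binary relevance labels; MAP is simply the mean of this quantity over all test queries:

```python
def average_precision(relevance_in_ranked_order):
    """relevance_in_ranked_order: 0/1 labels, ordered by the model's ranking.

    Simplification: we average precision over the relevant documents that
    appear in the returned list.
    """
    hits, precisions = 0, []
    for k, rel in enumerate(relevance_in_ranked_order, start=1):
        if rel == 1:
            hits += 1
            precisions.append(hits / k)  # precision at position k
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example: relevant documents returned at positions 1 and 3
print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.83
```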
To define MAP, let us ask judges (Google calls these people rankers) to examine a collection of search results for a query and assign a relevancy label to each search result. Labels could be binary (1 for “relevant” and 0 for “irrelevant”) or on some scale, say from 1 to 5: the higher the value, the more relevant the document is to the search query. Let our judges build such a relevancy labeling for a collection of 100 queries. Now, let us test our ranking model on this collection. The precision of our model for some query is given by: