Chapter 5

Machine Learning Basics
Deep learning is a specific kind of machine learning. To understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that are applied throughout the rest of the book.
Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to section 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.
We begin with a definition of what a learning algorithm is and present an example: the linear regression algorithm. We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters, which must be determined outside the learning algorithm itself; we discuss how to set these using additional data.
Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference.
Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category. Most deep learning algorithms are based on an optimization algorithm called stochastic gradient
descent. We describe how to combine various algorithm components, such as an optimization algorithm, a cost function, a model, and a dataset, to build a machine learning algorithm. Finally, in section 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.
5.1 Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides a succinct definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
One can imagine a wide variety of experiences E, tasks T, and performance measures P, and we do not attempt in this book to formally define what may be used for each of these entities. Instead, in the following sections, we provide intuitive descriptions and examples of the different kinds of tasks, performance measures, and experiences that can be used to construct machine learning algorithms.
5.1.1 The Task, T
Machine learning enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of it entails developing our understanding of the principles that underlie intelligence.
In this relatively formal definition of the word "task," the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.
Machine learning tasks are usually described in terms of how the machine learning system should process an example. An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector x ∈ ℝⁿ, where each entry xᵢ of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.
Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:
• Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function f : ℝⁿ → {1, …, k}. When y = f(x), the model assigns an input described by vector x to a category identified by numeric code y. There are other variants of the classification task, for example, where f outputs a probability distribution over classes.
An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010).
Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object recognition is the same basic technology that enables computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and for computers to interact more naturally with their users.
• Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. To solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying x with a different subset of its inputs missing.
This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all the relevant variables, then solve the classification task by marginalizing out the missing variables.
With n input variables, we can now obtain all 2ⁿ different classification functions needed for each possible set of missing inputs, but the computer program needs to learn only a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way.
Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
• Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function f : ℝⁿ → ℝ. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.
• Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d).
Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies, including Microsoft, IBM and Google (Hinton et al., 2012b).
• Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language. This is commonly applied to natural languages, such as translating from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).
• Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category and subsumes the transcription and translation tasks described above, as well as many other tasks.
One example is parsing: mapping a natural language sentence into a tree that describes its grammatical structure by tagging nodes of the trees as being verbs, nouns, adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category.
For example, deep learning can be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010). The output form need not mirror the structure of the input as closely as in these annotation-style tasks. For example, in image captioning, the computer program observes an image and outputs a natural language sentence describing the image (Kiros et al., 2014a; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015).
These tasks are called structured output tasks because the program must output several values that are all tightly interrelated. For example, the words produced by an image captioning program must form a valid sentence.
• Anomaly detection: In this type of task, the computer program sifts through a set of events or objects and flags some of them as being unusual or atypical.
An example of an anomaly detection task is credit card fraud detection. By modeling your purchasing habits, a credit card company can detect misuse of your cards. If a thief steals your credit card or credit card information, the thief's purchases will often come from a different probability distribution over purchase types than your own. The credit card company can prevent fraud by placing a hold on an account as soon as that card has been used for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of anomaly detection methods.
• Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications when generating large volumes of content by hand would be expensive, boring, or require too much time.
For example, video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel (Luo et al., 2013).
In some cases, we want the sampling or synthesis procedure to generate a specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.
• Imputation of missing values: In this type of task, the machine learning algorithm is given a new example x ∈ ℝⁿ, but with some entries xᵢ of x
missing. The algorithm must provide a prediction of the values of the missing entries.
• Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example x̃ ∈ ℝⁿ obtained by an unknown corruption process from a clean example x ∈ ℝⁿ. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).
• Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function p_model : ℝⁿ → ℝ, where p_model(x) can be interpreted as a probability density function (if x is continuous) or a probability mass function (if x is discrete) on the space that the examples were drawn from.
To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur.
Most of the tasks described above require the learning algorithm to at least implicitly capture the structure of the probability distribution. Density estimation enables us to explicitly capture that distribution.
In principle, we can then perform computations on that distribution to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task. If a value xᵢ is missing, and all the other values, denoted x₋ᵢ, are given, then we know the distribution over it is given by p(xᵢ | x₋ᵢ).
In practice, density estimation does not always enable us to solve all these related tasks, because in many cases the required operations on p(x) are computationally intractable.
Of course, many other tasks and types of tasks are possible. The types of tasks we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.
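As a concrete sketch of the marginalization approach described above for classification with missing inputs, consider a toy discrete joint distribution. The variable names, class labels, and probabilities below are invented for illustration; a real model would learn this joint from data rather than writing it down by hand.

```python
from itertools import product

# Hypothetical joint distribution p(y, x1, x2) over a binary class y
# and two binary input features, stored as a lookup table.
joint = {
    # (y, x1, x2): probability
    (0, 0, 0): 0.20, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05, (1, 1, 0): 0.15, (1, 1, 1): 0.30,
}

def classify(evidence):
    """Return argmax_y p(y | evidence), marginalizing out any feature
    that is absent from the `evidence` dict (e.g. {"x1": 1})."""
    scores = {}
    for y in (0, 1):
        total = 0.0
        for x1, x2 in product((0, 1), repeat=2):
            assignment = {"x1": x1, "x2": x2}
            # Sum probability over all settings consistent with the evidence.
            if all(assignment[k] == v for k, v in evidence.items()):
                total += joint[(y, x1, x2)]
        scores[y] = total  # proportional to p(y, evidence)
    return max(scores, key=scores.get)

# One learned joint handles every pattern of missing inputs:
print(classify({"x1": 1, "x2": 1}))  # both features observed
print(classify({"x1": 0}))           # x2 missing: marginalized out
```

The single table `joint` plays the role of the one function describing the joint probability distribution; all 2ⁿ classification functions fall out of it by marginalization.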
5.1.2 The Performance Measure, P
To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.

For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the
proportion of examples for which the model produces the correct output. We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not.
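The relationship between the 0-1 loss, error rate, and accuracy can be sketched in a few lines; the labels below are made up for illustration:

```python
# Hypothetical true labels and model outputs for six examples.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

# 0-1 loss per example: 0 if correctly classified, 1 if not.
losses = [0 if p == t else 1 for p, t in zip(y_pred, y_true)]

error_rate = sum(losses) / len(losses)  # average (expected) 0-1 loss
accuracy = 1.0 - error_rate             # proportion of correct outputs

print(accuracy, error_rate)
```

Accuracy and error rate carry the same information: they always sum to 1.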
For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples.
Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
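A minimal way to hold data out for such a test set is to shuffle example indices and split them; the split fraction and random seed below are arbitrary choices for illustration, not a prescription:

```python
import random

n_examples = 10
indices = list(range(n_examples))

rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(indices)

n_test = n_examples // 5      # hold out 20% for evaluation only
test_idx = indices[:n_test]   # used solely to measure performance
train_idx = indices[n_test:]  # used to fit the model

# The two sets are disjoint and together cover every example.
assert set(test_idx).isdisjoint(train_idx)
print(len(train_idx), len(test_idx))
```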
The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system. In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.
In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical. For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.
5.1.3 The Experience, E
Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process. Most of the learning algorithms in this book can be understood as being allowed
to experience an entire dataset. A dataset is a collection of many examples, as defined in section 5.1.1. Sometimes we call examples data points.
One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example. The features within each example are the measurements of each part of the plant: the sepal length, sepal width, petal length and petal width. The dataset also records which species each plant belonged to. Three different species are represented in the dataset.
Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly, as in density estimation, or implicitly, for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.
Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target. For example, the Iris dataset is annotated with the species of each iris plant. A supervised learning algorithm can study the Iris dataset and learn to classify iris plants into three different species based on their measurements.
Roughly speaking, unsupervised learning involves observing several examples of a random vector x and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution; while supervised learning involves observing several examples of a random vector x and an associated value or vector y, then learning to predict y from x, usually by estimating p(y | x).
The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector x ∈ ℝⁿ, the joint distribution can be decomposed as

    p(x) = ∏_{i=1}^{n} p(xᵢ | x₁, …, x_{i−1}).    (5.1)

This decomposition means that we can solve the ostensibly unsupervised problem of
modeling p(x) by splitting it into n supervised learning problems. Alternatively, we can solve the supervised learning problem of learning p(y | x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y), then inferring

    p(y | x) = p(x, y) / Σ_{y′} p(x, y′).    (5.2)
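Equations 5.1 and 5.2 can be checked numerically on a toy table of two binary variables; the probabilities below are invented for illustration:

```python
# Hypothetical joint p(x1, x2) over two binary variables.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Marginal p(x1) and conditional p(x2 | x1): the chain-rule factors.
p_x1 = {v: joint[(v, 0)] + joint[(v, 1)] for v in (0, 1)}
p_x2_given_x1 = {(a, b): joint[(a, b)] / p_x1[a]
                 for a in (0, 1) for b in (0, 1)}

# Chain rule (eq. 5.1): the product of factors recovers the joint.
for (a, b), p in joint.items():
    assert abs(p_x1[a] * p_x2_given_x1[(a, b)] - p) < 1e-12

# Conditioning by normalizing the joint (the idea behind eq. 5.2):
cond = joint[(1, 1)] / (joint[(1, 0)] + joint[(1, 1)])
print(cond)  # p(x2 = 1 | x1 = 1)
```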
Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help roughly categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.
Other variants of the learning paradigm are possible. For example, in semi-supervised learning, some examples include a supervision target, but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled. For a recent example of multi-instance learning with deep models, see Kotzias et al. (2015).
Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences. Such algorithms are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al. (2013) for the deep learning approach to reinforcement learning.
Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.

One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature.
For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix X ∈ ℝ^{150×4}, where X_{i,1} is the sepal length of plant i, X_{i,2} is the sepal width of plant i, etc.
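A design matrix of this form can be sketched with NumPy. The three rows below are made-up stand-ins for real Iris measurements, in the feature order just described (sepal length, sepal width, petal length, petal width):

```python
import numpy as np

# Each row is one example; each column is one feature.
# These values are invented for illustration, not the real Fisher data.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])

m, n = X.shape           # m examples, n features
sepal_lengths = X[:, 0]  # a column selects one feature across all examples
third_example = X[2]     # a row selects all features of one example

print(m, n)
```

The full Iris design matrix would simply have 150 such rows instead of 3.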
We describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets.
Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all the photographs may be described with the same
length of vector. In section 9.7 and chapter 10, we describe how to handle different types of such heterogeneous data.
In cases like these, rather than describing the dataset as a matrix with m rows, we describe it as a set containing m elements: {x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾}. This notation does not imply that any two example vectors x⁽ⁱ⁾ and x⁽ʲ⁾ have the same size.
In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, and so forth.
Often when working with a dataset containing a design matrix of feature observations X, we also provide a vector of labels y, with yᵢ providing the label for example i. Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.
Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.
5.1.4 Example: Linear Regression
Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program's performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. We will return to this example repeatedly as we introduce more machine learning concepts that help to understand the algorithm's behavior.
As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector x ∈ ℝⁿ as input and predict the value of a scalar y ∈ ℝ as its output. The output of linear regression is a linear function of the input.
Let ŷ be the value that our model predicts y should take on. We define the output to be

    ŷ = wᵀx,    (5.3)
where w ∈ ℝⁿ is a vector of parameters. Parameters are values that control the behavior of the system. In this case, wᵢ is the coefficient that we multiply by feature xᵢ before summing up the contributions from all the features. We can think of w as a set of weights that determine how
each feature affects the prediction. If a feature xᵢ receives a positive weight wᵢ, then increasing the value of that feature increases the value of our prediction ŷ. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, it has no effect on the prediction.

We thus have a definition of our task T: to predict y from x by outputting ŷ = wᵀx.
Next we need a definition of our performance measure, P. Suppose that we have a design matrix of m example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of y for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as X^{(test)} and the vector of regression targets as y^{(test)}.
One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If ŷ^{(test)} gives the predictions of the model on the test set, then the mean squared error is given by

    MSE_test = (1/m) Σᵢ (ŷ^{(test)} − y^{(test)})ᵢ².    (5.4)
Intuitively, one can see that this error measure decreases to 0 when ŷ^{(test)} = y^{(test)}. We can also see that

    MSE_test = (1/m) ||ŷ^{(test)} − y^{(test)}||₂²,    (5.5)

so the error increases whenever the Euclidean distance between the predictions and the targets increases.
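Equations 5.4 and 5.5 compute the same quantity; a quick sketch with invented predictions and targets confirms the two forms agree:

```python
import numpy as np

# Hypothetical test-set predictions and targets for m = 3 examples.
y_hat = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 2.0])

m = len(y)
mse_elementwise = np.mean((y_hat - y) ** 2)    # eq. 5.4: average squared error
mse_norm = np.linalg.norm(y_hat - y) ** 2 / m  # eq. 5.5: squared L2 norm / m

print(mse_elementwise)
```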
To make a machine learning algorithm, we need to design an algorithm that will improve the weights w in a way that reduces MSE_test when the algorithm is allowed to gain experience by observing a training set (X^{(train)}, y^{(train)}).
One intuitive way of doing this (which we justify later, in section 5.5.1) is just to minimize the mean squared error on the training set, MSE_train.
To minimize MSE_train, we can simply solve for where its gradient is 0:

    ∇_w MSE_train = 0    (5.6)

    ∇_w (1/m) ||ŷ^{(train)} − y^{(train)}||₂² = 0    (5.7)

    (1/m) ∇_w ||X^{(train)} w − y^{(train)}||₂² = 0    (5.8)

    ∇_w (X^{(train)} w − y^{(train)})ᵀ (X^{(train)} w − y^{(train)}) = 0    (5.9)

    ∇_w (wᵀ X^{(train)}ᵀ X^{(train)} w − 2 wᵀ X^{(train)}ᵀ y^{(train)} + y^{(train)}ᵀ y^{(train)}) = 0    (5.10)

    2 X^{(train)}ᵀ X^{(train)} w − 2 X^{(train)}ᵀ y^{(train)} = 0    (5.11)

    w = (X^{(train)}ᵀ X^{(train)})⁻¹ X^{(train)}ᵀ y^{(train)}    (5.12)
The system of equations whose solution is given by equation 5.12 is known as the normal equations. Evaluating equation 5.12 constitutes a simple learning algorithm. For an example of the linear regression learning algorithm in action, see figure 5.1.
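Equation 5.12 can be implemented in a few lines. The training data below is synthetic, generated from a known weight vector, so we can check that the normal equations recover it; the seed and problem sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: m = 50 examples, n = 2 features, with targets
# an exact linear function y = X w_true (no noise), so the normal
# equations should recover w_true.
w_true = np.array([2.0, -3.0])
X_train = rng.normal(size=(50, 2))
y_train = X_train @ w_true

# Normal equations (eq. 5.12): w = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the inverse.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

print(w)  # close to w_true
```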
It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter, an intercept term b. In this model,

    ŷ = wᵀx + b,    (5.13)
so the mapping from parameters to predictions is still a linear function, but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin.

Figure 5.1: A linear regression problem, with a training set consisting of ten data points, each containing one feature. Because there is only one feature, the weight vector w contains only a single parameter to learn, w₁. (Left) Observe that linear regression learns to set w₁ such that the line y = w₁x comes as close as possible to passing through all the training points. (Right) The plotted point indicates the value of w₁ found by the normal equations, which we can see minimizes the mean squared error on the training set.
Instead of adding the bias parameter b, one can continue to use the model with only weights, but augment x with an extra entry that is always set to 1. The weight corresponding to the extra entry plays the role of the bias parameter. We frequently use the term "linear" when referring to affine functions throughout this book.
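The trick of absorbing the bias into the weight vector can be sketched as follows; the particular values are arbitrary:

```python
import numpy as np

# Affine model with an explicit bias: y_hat = w^T x + b (eq. 5.13).
w = np.array([0.5, -1.0])
b = 2.0
x = np.array([4.0, 3.0])
y_affine = w @ x + b

# Equivalent "linear" model: append a constant 1 to x and append b to w.
x_aug = np.append(x, 1.0)  # extra entry always set to 1
w_aug = np.append(w, b)    # its weight plays the role of the bias
y_linear = w_aug @ x_aug

print(y_affine, y_linear)  # identical predictions
```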
The intercept term b is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being b in the absence of any input. This term is different from the idea of statistical bias, in which a statistical estimation algorithm's expected estimate of a quantity is not equal to the true quantity.
Linear regression is of course an extremely simple and limited learning algorithm, but it provides an example of how a learning algorithm can work. In subsequent sections, we describe some of the basic principles underlying learning algorithm design and demonstrate how these principles can be used to build more complicated learning algorithms.
5.2 Capacity, Overfitting and Underfitting
The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs, not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error; and we reduce this training error.
So far, what we have described is simply an optimization problem. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well. The generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
In our linear regression example, we trained the model by minimizing the training error,

    (1/m^{(train)}) ||X^{(train)} w − y^{(train)}||₂²,    (5.14)
but we actually care about the test error, (1/m^{(test)}) ||X^{(test)} w − y^{(test)}||₂².
How can we affect performance on the test set when we can observe only the training set? The field of statistical learning theory provides some answers. If the training and the test set are collected arbitrarily, there is indeed little we can do. If we are allowed to make some assumptions about how the training and test set are collected, then we can make some progress.
The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the training set and test set are identically distributed, drawn from the same probability distribution as each other.
This assumption enables us to describe the data-generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data-generating distribution, denoted p_data. This probabilistic framework and the i.i.d. assumptions enable us to mathematically study the relationship between training error and test error.
One immediate connection we can observe between training error and test error is that the expected training error of a randomly selected model is equal to the expected test error of that model. Suppose we have a probability distribution p(x, y) and we sample from it repeatedly to generate the training set and the test set. For some fixed value w, the expected training set error is exactly the same as the expected test set error, because both expectations are formed using the same dataset sampling process. The only difference between the two conditions is the name we assign to the dataset we sample.
Of course, when we use a machine learning algorithm, we do not fix the parameters ahead of time, then sample both datasets. We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error.
The factors determining how well a machine learning algorithm will perform are its ability to:

1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.
We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model's capacity is its ability to fit a wide variety of functions.
Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space.
We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity. A polynomial of degree one gives us the linear regression model with which we are already familiar, with the prediction
$$\hat{y} = b + wx. \tag{5.15}$$
By introducing $x^2$ as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of $x$:
$$\hat{y} = b + w_1 x + w_2 x^2. \tag{5.16}$$
Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of $x$ as additional features, for example, to obtain a polynomial of degree 9:
$$\hat{y} = b + \sum_{i=1}^{9} w_i x^i. \tag{5.17}$$
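The closed-form fit via the normal equations can be sketched for small polynomial models. The helpers below (feature map, Gaussian elimination solver, fit routine) are our own illustrative implementations using plain Python lists, not code from the book.

```python
# Sketch: solving the normal equations (Phi^T Phi) w = Phi^T y in closed form
# for polynomial regression of a chosen degree. All helpers are illustrative.

def poly_features(x, degree):
    """Map scalar x to [1, x, x^2, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_poly(xs, ys, degree):
    """Least squares fit of a degree-`degree` polynomial via normal equations."""
    Phi = [poly_features(x, degree) for x in xs]
    n = degree + 1
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(n)]
    return solve(A, b)

def train_mse(w, xs, ys, degree):
    preds = [sum(wi * f for wi, f in zip(w, poly_features(x, degree)))
             for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

# Data from a quadratic y = x^2: degree 1 underfits, degree 2 fits exactly.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]
err1 = train_mse(fit_poly(xs, ys, 1), xs, ys, 1)
err2 = train_mse(fit_poly(xs, ys, 2), xs, ys, 2)
print(err1, err2)
```

Raising the degree enlarges the hypothesis space, which is exactly the capacity knob discussed in the text: the degree-1 model cannot drive the training error to zero on quadratic data, while the degree-2 model can.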
Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit.
Figure 5.2 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem where the true underlying
Figure 5.2: We fit three models to this example training set. The training data was generated synthetically, by randomly sampling $x$ values and choosing $y$ deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all the training points exactly, but we have not been lucky enough for it to extract the correct structure. It now has a deep valley between two training points that does not appear in the true underlying function. It also increases sharply on the left side of the data, while the true function decreases in this area.
function is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits.
The degree-9 predictor is capable of representing the correct function, but it is also capable of representing infinitely many other functions that pass exactly through the training points, because we have more parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist.
In this example, the quadratic model is perfectly matched to the true structure of the task, so it generalizes well to new data.
So far we have described only one way of changing a model's capacity: changing the number of input features it has, and simultaneously adding new parameters associated with those features. There are in fact many ways to change a model's capacity. Capacity is not determined only by the choice of model.
The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a difficult optimization problem. In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error.
These additional limitations, such as the imperfection of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.
Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287–1347). This principle states that among competing hypotheses that explain known observations equally well, we should choose the "simplest" one.
This idea was formalized and made more precise in the twentieth century by the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).
Statistical learning theory provides various means of quantifying model capacity. Among these, the most well known is the Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of $m$ for which there exists a training set of $m$ different $\boldsymbol{x}$ points that the classifier can label arbitrarily.
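The definition can be checked by brute force for very simple hypothesis classes. The sketch below (our own illustration, not from the book) considers one-dimensional threshold classifiers $h_t(x) = \mathbb{1}[x \ge t]$: they can shatter any single point but no set of two points, so their VC dimension is 1.

```python
# Illustrative brute-force check of the VC dimension of 1-D threshold
# classifiers h_t(x) = 1 if x >= t else 0.

from itertools import product

def achievable_labelings(points):
    """All labelings of `points` realizable by some threshold t."""
    # Candidate thresholds: below all points, between each pair, above all.
    xs = sorted(points)
    candidates = ([xs[0] - 1.0]
                  + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
    labelings = set()
    for t in candidates:
        labelings.add(tuple(1 if x >= t else 0 for x in points))
    return labelings

def shatters(points):
    """True if thresholds realize all 2^m labelings of `points`."""
    return achievable_labelings(points) == set(product([0, 1], repeat=len(points)))

print(shatters([0.5]))        # a single point can be shattered
print(shatters([0.2, 0.8]))   # two points cannot: (1, 0) is unrealizable
```

For two points $x_1 < x_2$, the labeling that assigns 1 to $x_1$ and 0 to $x_2$ cannot be produced by any threshold, which is why the brute-force check fails for every two-point set.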
Quantifying the capacity of the model enables statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).
These bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when working with deep learning algorithms. This is in part because the bounds are often quite loose and in part because it can be quite difficult to determine the capacity of deep learning algorithms. The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the general nonconvex optimization problems involved in deep learning.
We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error), we must still choose a sufficiently complex hypothesis to achieve low training error.
Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). Typically,
Figure 5.3: Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
error
has
U-shap
ed
curv
as
function
of
mo
del
capacit
This
is
illustrated
in
figure
5.3
reac
the
most
extreme
case
of
arbitrarily
high
capacit
, w
in
tro
duce
the
concept
of
nonparametric
mo
dels
So far, we have seen only parametric models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Nonparametric models have no such limitation.
Sometimes, nonparametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical nonparametric models by making their complexity a function of the training set size.
One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the $\boldsymbol{X}$ and $\boldsymbol{y}$ from the training set. When asked to classify a test point $\boldsymbol{x}$, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, $\hat{y} = y_i$ where
$$i = \arg\min \left\| \boldsymbol{X}_{i,:} - \boldsymbol{x} \right\|_2^2.$$
The algorithm can also be generalized to distance metrics other than the $L^2$ norm, such as learned distance metrics (Goldberger et al., 2005).
If the algorithm is allowed to break ties by averaging the $y_i$ values for all $\boldsymbol{X}_{i,:}$ that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.
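The rule above, including tie-breaking by averaging, fits in a few lines. This sketch uses one-dimensional inputs to stay short; the helper name and data are our own illustration.

```python
# Sketch of nearest neighbor regression with tie-breaking by averaging.
# 1-D inputs keep the example short; the data is illustrative.

def nn_predict(x_train, y_train, x):
    """Return the y of the nearest training x, averaging over exact ties."""
    dists = [(xi - x) ** 2 for xi in x_train]
    best = min(dists)
    tied = [y for d, y in zip(dists, y_train) if d == best]
    return sum(tied) / len(tied)

x_train = [0.0, 1.0, 1.0, 3.0]   # two identical inputs at x = 1.0 ...
y_train = [0.0, 2.0, 4.0, 9.0]   # ... with different targets

print(nn_predict(x_train, y_train, 2.9))  # nearest is x = 3.0, so 9.0
print(nn_predict(x_train, y_train, 1.0))  # tie at x = 1.0: average of 2 and 4
```

Note that at $x = 1.0$ the training error is necessarily nonzero: the two identical inputs carry targets 2 and 4, so the best the model can do is predict their average, which matches the parenthetical caveat in the text.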
Finally, we can also create a nonparametric learning algorithm by wrapping a parametric learning algorithm inside another algorithm that increases the number of parameters as needed. For example, we could imagine an outer loop of learning that changes the degree of the polynomial learned by linear regression on top of a polynomial expansion of the input.
The ideal model is an oracle that simply knows the true probability distribution that generates the data. Even such a model will still incur some error on many problems, because there may still be some noise in the distribution. In the case of supervised learning, the mapping from $\boldsymbol{x}$ to $y$ may be inherently stochastic, or $y$ may be a deterministic function that involves other variables besides those included in $\boldsymbol{x}$.
The error incurred by an oracle making predictions from the true distribution $p(\boldsymbol{x}, y)$ is called the Bayes error.
Training and generalization error vary as the size of the training set varies. Expected generalization error can never increase as the number of training examples increases. For nonparametric models, more data yield better generalization until the best possible error is achieved.
Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error. See figure 5.4 for an illustration.
Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and generalization errors. In this situation, we may be able to reduce this gap by gathering more training examples.
5.2.1 The No Free Lunch Theorem
Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set.
In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern. Unfortunately, even this does not resolve the entire problem.
The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data-generating distributions, every classification algorithm has the
Figure 5.4: The effect of the training dataset size on the train and test error, as well as on the optimal model capacity. We constructed a synthetic regression problem based on adding a moderate amount of noise to a degree-5 polynomial, generated a single test set, and then generated several different sizes of training set. For each size, we generated 40 different training sets in order to plot error bars showing 95 percent confidence intervals. (Top) The MSE on the training and test set for two different models: a quadratic model, and a model with degree chosen to minimize the test error. Both are fit in closed form. For the quadratic model, the training error increases as the size of the training set increases. This is because larger datasets are harder to fit. Simultaneously, the test error decreases, because fewer incorrect hypotheses are consistent with the training data. The quadratic model does not have enough capacity to solve the task, so its test error asymptotes to a high value. The test error at optimal capacity asymptotes to the Bayes error. The training error can fall below the Bayes error, due to the ability of the training algorithm to memorize specific instances of the training set. As the training size increases to infinity, the training error of any fixed-capacity model (here, the quadratic model) must rise to at least the Bayes error. (Bottom) As the training set size increases, the optimal capacity (shown here as the degree of the optimal polynomial regressor) increases. The optimal capacity plateaus after reaching sufficient complexity to solve the task.
same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data-generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.
This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data-generating distributions we care about.
5.2.2 Regularization
The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems that we ask the algorithm to solve, it performs better.
So far, the only method of modifying a learning algorithm that we have discussed concretely is to increase or decrease the model's representational capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose from. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described so far is oversimplified.
The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions. The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input.
These linear functions can be useful for problems where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a very nonlinear fashion. For example, linear regression would not perform well if we tried to use it to predict $\sin(x)$ from $x$.
thus
con
trol
the
erformance
of
our
algorithms
choosing
what
kind
of
functions
allo
them
to
draw
solutions
from,
as
well
as
by
con
trolling
the
amoun
of
these
functions.
We can also give a learning algorithm a preference for one solution over another in its hypothesis space. This means that both functions are eligible, but one is preferred. The unpreferred solution will be chosen only if it fits the training data significantly better than the preferred solution.
For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize a sum $J(\boldsymbol{w})$ comprising both the mean squared error on the training set and a criterion that expresses a preference for the weights to have a smaller squared $L^2$ norm. Specifically,
$$J(\boldsymbol{w}) = \text{MSE}_{\text{train}} + \lambda \boldsymbol{w}^\top \boldsymbol{w}, \tag{5.18}$$
where $\lambda$ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When $\lambda = 0$, we impose no preference, and larger $\lambda$ forces the weights to become smaller.
Minimizing $J(\boldsymbol{w})$ results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or that put weight on fewer of the features.
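The tradeoff in equation 5.18 has a closed form in the simplest case. For a bias-free one-dimensional model $\hat{y} = wx$, minimizing $\text{MSE}_{\text{train}} + \lambda w^2$ gives $w = \sum x y / (\sum x^2 + \lambda m)$; the sketch below (our own illustration) shows the weight shrinking as $\lambda$ grows.

```python
# Sketch of the weight decay criterion of equation 5.18 for a bias-free
# 1-D model y ~ w*x. Setting dJ/dw = 0 yields the closed form
# w = sum(x*y) / (sum(x*x) + lambda*m). Data is illustrative.

def ridge_w(xs, ys, lam):
    m = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam * m)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]     # true slope is 2
ws = [ridge_w(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
print(ws)  # larger lambda shrinks the weight toward zero
```

With $\lambda = 0$ the ordinary least squares solution is recovered exactly; increasing $\lambda$ trades training fit for a smaller weight, which is the "smaller slope" behavior described above.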
As an example of how we can control a model's tendency to overfit or underfit via weight decay, we can train a high-degree polynomial regression model with different values of $\lambda$. See figure 5.5 for the results.
More generally, we can regularize a model that learns a function $f(\boldsymbol{x}; \boldsymbol{\theta})$ by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is $\Omega(\boldsymbol{w}) = \boldsymbol{w}^\top \boldsymbol{w}$. In chapter 7, we will see that many other regularizers are possible.
Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space. We can think of excluding a function from a hypothesis space as expressing an infinitely strong preference against that function. In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize.
There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization. Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.
The no free lunch theorem has made it clear that there is no best machine learning algorithm, and, in particular, no best form of regularization. Instead we must choose a form of regularization that is well suited to the particular task we want to solve.
The philosophy of deep learning in general and this book in particular is that a wide range of tasks (such as all the intellectual tasks that people can do) may all be solved effectively using very general-purpose forms of regularization.
5.3 Hyperparameters and Validation Sets
Most machine learning algorithms have hyperparameters, settings that we can use to control the algorithm's behavior. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure in which one learning algorithm learns the best hyperparameters for another learning algorithm).
The polynomial regression example in figure 5.2 has a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The $\lambda$ value used to control the strength of weight decay is another example of a hyperparameter.
Figure 5.5: We fit a high-degree polynomial regression model to our example training set from figure 5.2. The true function is quadratic, but here we use only models with degree 9. We vary the amount of weight decay to prevent these high-degree models from overfitting. (Left) With very large $\lambda$, we can force the model to learn a function with no slope at all. This underfits because it can only represent a constant function. (Center) With a medium value of $\lambda$, the learning algorithm recovers a curve with the right general shape. Even though the model is capable of representing functions with much more complicated shapes, weight decay has encouraged it to use a simpler function described by smaller coefficients. (Right) With weight decay approaching zero (i.e., using the Moore-Penrose pseudoinverse to solve the underdetermined problem with minimal regularization), the degree-9 polynomial overfits significantly, as we saw in figure 5.2.
Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because the setting is difficult to optimize. More frequently, the setting must be a hyperparameter because it is not appropriate to learn that hyperparameter on the training set.
This applies to all hyperparameters that control model capacity. If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in overfitting (refer to figure 5.3). For example, we can always fit the training set better with a higher-degree polynomial and a weight decay setting of $\lambda = 0$ than we could with a lower-degree polynomial and a positive weight decay setting.
To solve this problem, we need a validation set of examples that the training algorithm does not observe. Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed.
It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data.
Specifically, we split the training data into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The subset of data used to guide the selection of hyperparameters is called the validation set.
Typically, one uses about 80 percent of the training data for training and 20 percent for validation.
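The 80/20 split described above can be sketched in a few lines of plain Python; the function name, seed, and data are our own illustration.

```python
# Sketch of an 80/20 train/validation split over a shuffled copy of the
# training data. Names and data are illustrative.

import random

def train_val_split(examples, val_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then split into train and validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(10))
train, val = train_val_split(data)
print(len(train), len(val))  # 8 and 2: disjoint subsets of the data
```

Shuffling before splitting matters when the examples are stored in some meaningful order; the two returned subsets are disjoint, as the text requires.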
Since the validation set is used to "train" the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error does. After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.
5.3.1 Cross-Validation
Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task. When the dataset has hundreds of thousands of examples or more, this is not a serious issue.
When the dataset is too small, alternative procedures enable one to use all the examples in the estimation of the mean test error, at the price of increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset.
The most common of these is the $k$-fold cross-validation procedure, shown in algorithm 5.1, in which a partition of the dataset is formed by splitting it into $k$ nonoverlapping subsets. The test error may then be estimated by taking the average test error across $k$ trials. On trial $i$, the $i$-th subset of the data is used as the test set, and the rest of the data is used as the training set.
One problem is that no unbiased estimators of the variance of such average error estimators exist (Bengio and Grandvalet, 2004), but approximations are typically used.
5.4 Estimators, Bias and Variance
The field of statistics gives us many tools to achieve the machine learning goal of solving a task not only on the training set but also to generalize. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.
5.4.1 Point Estimation
Point estimation is the attempt to provide the single "best" prediction of some quantity of interest. In general, the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example in section 5.1.4, but it can also be a whole function. To distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter $\boldsymbol{\theta}$ by $\hat{\boldsymbol{\theta}}$. Let $\{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(m)}\}$ be a set of $m$ independent and identically distributed
Algorithm 5.1 The $k$-fold cross-validation algorithm. It can be used to estimate generalization error of a learning algorithm $A$ when the given dataset $\mathbb{D}$ is too small for a simple train/test or train/valid split to yield accurate estimation of generalization error, because the mean of a loss $L$ on a small test set may have too high variance. The dataset $\mathbb{D}$ contains as elements the abstract examples $\boldsymbol{z}^{(i)}$ (for the $i$-th example), which could stand for an (input, target) pair $\boldsymbol{z}^{(i)} = (\boldsymbol{x}^{(i)}, y^{(i)})$ in the case of supervised learning, or for just an input $\boldsymbol{z}^{(i)} = \boldsymbol{x}^{(i)}$ in the case of unsupervised learning. The algorithm returns the vector of errors $\boldsymbol{e}$ for each example in $\mathbb{D}$, whose mean is the estimated generalization error. The errors on individual examples can be used to compute a confidence interval around the mean (equation 5.47). Though these confidence intervals are not well justified after the use of cross-validation, it is still common practice to use them to declare that algorithm $A$ is better than algorithm $B$ only if the confidence interval of the error of algorithm $A$ lies below and does not intersect the confidence interval of algorithm $B$.

Define KFoldXV($\mathbb{D}$, $A$, $L$, $k$):
Require: $\mathbb{D}$, the given dataset, with elements $\boldsymbol{z}^{(i)}$
Require: $A$, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: $L$, the loss function, seen as a function from a learned function $f$ and an example $\boldsymbol{z}^{(i)} \in \mathbb{D}$ to a scalar in $\mathbb{R}$
Require: $k$, the number of folds
Split $\mathbb{D}$ into $k$ mutually exclusive subsets $\mathbb{D}_i$, whose union is $\mathbb{D}$
for $i$ from 1 to $k$ do
    $f_i = A(\mathbb{D} \setminus \mathbb{D}_i)$
    for $\boldsymbol{z}^{(j)}$ in $\mathbb{D}_i$ do
        $e_j = L(f_i, \boldsymbol{z}^{(j)})$
    end for
end for
Return $\boldsymbol{e}$
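Algorithm 5.1 can be rendered as a short, runnable Python sketch. The learner $A$ and loss $L$ below are deliberately trivial stand-ins chosen for illustration (a constant mean predictor and squared error); they are not part of the original algorithm.

```python
# Runnable sketch of algorithm 5.1 (KFoldXV) in plain Python.

def k_fold_xv(D, A, L, k):
    """Return the list of per-example errors e, one per element of D."""
    # Split D into k mutually exclusive index subsets whose union covers D.
    folds = [list(range(i, len(D), k)) for i in range(k)]
    e = [0.0] * len(D)
    for fold in folds:
        held_out = set(fold)
        train = [D[j] for j in range(len(D)) if j not in held_out]
        f = A(train)                      # f_i = A(D \ D_i)
        for j in fold:
            e[j] = L(f, D[j])             # e_j = L(f_i, z^(j))
    return e

def A(train):
    """Trivial stand-in learner: a constant predictor (mean of training y)."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda z: mean_y

def L(f, z):
    """Squared error of the prediction against the target in z = (x, y)."""
    return (f(z) - z[1]) ** 2

D = [(0, 1.0), (0, 2.0), (0, 3.0), (0, 4.0)]
errors = k_fold_xv(D, A, L, 2)
print(sum(errors) / len(errors))          # estimated generalization error
```

Every example is held out exactly once, so the mean of `errors` uses all the data, which is the point of the procedure when the dataset is small.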
(i.i.d.) data points. A point estimator or statistic is any function of the data:
$$\hat{\boldsymbol{\theta}}_m = g(\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(m)}). \tag{5.19}$$
The definition does not require that $g$ return a value that is close to the true $\boldsymbol{\theta}$ or even that the range of $g$ be the same as the set of allowable values of $\boldsymbol{\theta}$. This definition of a point estimator is very general and would enable the designer of an estimator great flexibility. While almost any function thus qualifies as an estimator,
a good estimator is a function whose output is close to the true underlying $\boldsymbol{\theta}$ that generated the training data.
For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value $\boldsymbol{\theta}$ is fixed but unknown, while the point estimate $\hat{\boldsymbol{\theta}}$ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore $\hat{\boldsymbol{\theta}}$ is a random variable.
Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.
Function Estimation. Sometimes we are interested in performing function estimation (or function approximation). Here, we are trying to predict a variable $\boldsymbol{y}$ given an input vector $\boldsymbol{x}$. We assume that there is a function $f(\boldsymbol{x})$ that describes the approximate relationship between $\boldsymbol{y}$ and $\boldsymbol{x}$. For example, we may assume that $\boldsymbol{y} = f(\boldsymbol{x}) + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ stands for the part of $\boldsymbol{y}$ that is not predictable from $\boldsymbol{x}$. In function estimation, we are interested in approximating $f$ with a model or estimate $\hat{f}$. Function estimation is really just the same as estimating a parameter $\boldsymbol{\theta}$; the function estimator $\hat{f}$ is simply a point estimator in function space.
The linear regression example (discussed in section 5.1.4) and the polynomial regression example (discussed in section 5.2) both illustrate scenarios that may be interpreted as either estimating a parameter $\boldsymbol{w}$ or estimating a function $\hat{f}$ mapping from $\boldsymbol{x}$ to $y$. We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.
5.4.2 Bias
The bias of an estimator is defined as
$$\text{bias}(\hat{\boldsymbol{\theta}}_m) = \mathbb{E}(\hat{\boldsymbol{\theta}}_m) - \boldsymbol{\theta}, \tag{5.20}$$
where the expectation is over the data (seen as samples from a random variable) and $\boldsymbol{\theta}$ is the true underlying value of $\boldsymbol{\theta}$ used to define the data-generating distribution. An estimator $\hat{\boldsymbol{\theta}}_m$ is said to be unbiased if $\text{bias}(\hat{\boldsymbol{\theta}}_m) = \boldsymbol{0}$, which implies that $\mathbb{E}(\hat{\boldsymbol{\theta}}_m) = \boldsymbol{\theta}$. An estimator $\hat{\boldsymbol{\theta}}_m$ is said to be asymptotically unbiased if $\lim_{m \to \infty} \text{bias}(\hat{\boldsymbol{\theta}}_m) = \boldsymbol{0}$, which implies that $\lim_{m \to \infty} \mathbb{E}(\hat{\boldsymbol{\theta}}_m) = \boldsymbol{\theta}$.
Example: Bernoulli Distribution. Consider a set of samples $\{x^{(1)}, \dots, x^{(m)}\}$ that are independently and identically distributed according to a Bernoulli distribution with mean $\theta$:
$$P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})}. \tag{5.21}$$
A common estimator for the $\theta$ parameter of this distribution is the mean of the training samples:
$$\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}. \tag{5.22}$$
To determine whether this estimator is biased, we can substitute equation 5.22 into equation 5.20:
$$\begin{aligned}
\text{bias}(\hat{\theta}_m) &= \mathbb{E}[\hat{\theta}_m] - \theta \\
&= \mathbb{E}\left[ \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right] - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[ x^{(i)} \right] - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} \sum_{x^{(i)}=0}^{1} \left( x^{(i)} \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})} \right) - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} (\theta) - \theta \\
&= \theta - \theta = 0
\end{aligned}$$
(equations 5.23–5.28). Since $\text{bias}(\hat{\theta}_m) = 0$, we say that our estimator $\hat{\theta}_m$ is unbiased.
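The derivation can be checked by simulation: averaging the estimator over many independently drawn Bernoulli training sets should come out close to the true $\theta$. The parameters and seed below are our own illustrative choices.

```python
# Simulation sketch of the unbiasedness result above: the average of
# theta_hat over many independent training sets approaches the true theta.

import random

rng = random.Random(0)
theta, m, trials = 0.3, 10, 20000

estimates = []
for _ in range(trials):
    sample = [1 if rng.random() < theta else 0 for _ in range(m)]
    estimates.append(sum(sample) / m)     # theta_hat for this training set

mean_estimate = sum(estimates) / trials
print(mean_estimate)  # close to 0.3, consistent with zero bias
```

Each individual $\hat{\theta}_m$ fluctuates around $\theta$, but the fluctuations cancel on average, which is exactly what zero bias asserts.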
Example: Gaussian Distribution Estimator of the Mean. Now, consider a set of samples $\{x^{(1)}, \dots, x^{(m)}\}$ that are independently and identically distributed according to a Gaussian distribution $p(x^{(i)}) = \mathcal{N}(x^{(i)}; \mu, \sigma^2)$, where $i \in \{1, \dots, m\}$. Recall that the Gaussian probability density function is given by
$$p(x^{(i)}; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(x^{(i)} - \mu)^2}{\sigma^2} \right). \tag{5.29}$$
A common estimator of the Gaussian mean parameter is known as the sample mean:
$$\hat{\mu}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}. \tag{5.30}$$
To determine the bias of the sample mean, we are again interested in calculating its expectation:
$$\begin{aligned}
\text{bias}(\hat{\mu}_m) &= \mathbb{E}[\hat{\mu}_m] - \mu \\
&= \mathbb{E}\left[ \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right] - \mu \\
&= \left( \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[ x^{(i)} \right] \right) - \mu \\
&= \left( \frac{1}{m} \sum_{i=1}^{m} \mu \right) - \mu \\
&= \mu - \mu = 0
\end{aligned}$$
(equations 5.31–5.35). Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.
Example: Estimators of the Variance of a Gaussian Distribution. For this example, we compare two different estimators of the variance parameter $\sigma^2$ of a Gaussian distribution. We are interested in knowing if either estimator is biased. The first estimator of $\sigma^2$ we consider is known as the sample variance:
$$\hat{\sigma}_m^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2, \tag{5.36}$$
where $\hat{\mu}_m$ is the sample mean.
More
formally
we
are
in
terested
in
computing
bias(
) =
(5.37)
egin
ev
aluating
the
term
] =
=1
(5.38)
(5.39)
Returning
to
equation
5.37
conclude
that
the
bias
of
is
/m
Therefore,
the
sample
ariance
is
biased
estimator.
124
CHAPTER
5.
MA
CHINE
LEARNING
BASICS
The
unbiased
sample
ariance
estimator
=1
(5.40)
pro
vides
an
alternativ
approac
h.
As
the
name
suggests
this
estimator
is
unbiased.
That
is,
we
find
that
] =
] =
=1
(5.41)
(5.42)
(5.43)
(5.44)
ha
tw
estimators:
one
is
biased,
and
the
other
is
not.
While
un
biased
estimators
are
clearly
desirable,
they
are
not
alwa
ys
the
“b
est”
estimators.
As
we
will
see
we
often
use
biased
estimators
that
ossess
other
imp
ortant
prop
erties.
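The gap between the two estimators can be seen numerically. This short simulation (with an assumed true variance and sample size chosen for the demo) estimates the expectation of each estimator over many resampled datasets; the biased version concentrates near $(m-1)\sigma^2/m$ and the unbiased version near $\sigma^2$.

```python
import numpy as np

# Compare the biased sample variance (divide by m, eq. 5.36) with the
# unbiased estimator (divide by m-1, eq. 5.40) on Gaussian data.
# E[sigma_hat^2] should be (m-1)/m * sigma^2; E[sigma_tilde^2] should be sigma^2.
rng = np.random.default_rng(1)
sigma2 = 4.0   # true variance (assumed for the demo)
m = 5
n_trials = 500_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, m))
mu_hat = x.mean(axis=1, keepdims=True)
ss = ((x - mu_hat) ** 2).sum(axis=1)   # sum of squared deviations per dataset

biased = (ss / m).mean()          # approx (m-1)/m * sigma^2 = 3.2
unbiased = (ss / (m - 1)).mean()  # approx sigma^2 = 4.0
print(f"biased:   {biased:.3f} (expected {(m - 1) / m * sigma2:.3f})")
print(f"unbiased: {unbiased:.3f} (expected {sigma2:.3f})")
```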
5.4.3 Variance and Standard Error

Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance
$$\mathrm{Var}(\hat{\theta}), \tag{5.45}$$
where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted $\mathrm{SE}(\hat{\theta})$.

The variance, or the standard error, of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data-generating process. Just as we might like an estimator to exhibit low bias, we would also like it to have relatively low variance.

When we compute any statistic using a finite number of samples, our estimate of the true underlying parameter is uncertain, in the sense that we could have obtained other samples from the same distribution and their statistics would have been different. The expected degree of variation in any estimator is a source of error that we want to quantify.

The standard error of the mean is given by
$$\mathrm{SE}(\hat{\mu}_m) = \sqrt{ \mathrm{Var}\left[ \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right] } = \frac{\sigma}{\sqrt{m}}, \tag{5.46}$$
where $\sigma^2$ is the true variance of the samples $x^{(i)}$. The standard error is often estimated using an estimate of $\sigma$. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provides an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation but are still used in practice. The square root of the unbiased estimator of the variance is less of an underestimate. For large $m$, the approximation is quite reasonable.
The standard error of the mean is very useful in machine learning experiments. We often estimate the generalization error by computing the sample mean of the error on the test set. The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the central limit theorem, which tells us that the mean will be approximately distributed with a normal distribution, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95 percent confidence interval centered on the mean $\hat{\mu}_m$ is
$$\left( \hat{\mu}_m - 1.96\,\mathrm{SE}(\hat{\mu}_m),\; \hat{\mu}_m + 1.96\,\mathrm{SE}(\hat{\mu}_m) \right), \tag{5.47}$$
under the normal distribution with mean $\hat{\mu}_m$ and variance $\mathrm{SE}(\hat{\mu}_m)^2$. In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95 percent confidence interval for the error of algorithm A is less than the lower bound of the 95 percent confidence interval for the error of algorithm B.
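As a sketch of how this is used in practice, the snippet below computes the standard error of the mean and the 95 percent confidence interval of equation 5.47 for a simulated test set of 0/1 errors; the error rate of 0.12 is an assumed value for the demo.

```python
import numpy as np

# Standard error of the mean (eq. 5.46) and the 95% confidence interval
# (eq. 5.47) for a test-set error estimate. The per-example 0/1 errors
# here are simulated; in practice they come from a real test set.
rng = np.random.default_rng(2)
errors = rng.binomial(1, 0.12, size=1000).astype(float)  # assumed error rate

mean = errors.mean()
se = errors.std(ddof=1) / np.sqrt(len(errors))  # sigma_hat / sqrt(m)
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"test error = {mean:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```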
Example: Bernoulli Distribution. We once again consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently and identically from a Bernoulli distribution (recall $P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})}$). This time we are interested in computing the variance of the estimator $\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$:
$$\mathrm{Var}\left( \hat{\theta}_m \right) = \mathrm{Var}\left( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right) \tag{5.48}$$
$$= \frac{1}{m^2} \sum_{i=1}^{m} \mathrm{Var}\left( x^{(i)} \right) \tag{5.49}$$
$$= \frac{1}{m^2} \sum_{i=1}^{m} \theta (1 - \theta) \tag{5.50}$$
$$= \frac{1}{m^2} \, m \, \theta (1 - \theta) \tag{5.51}$$
$$= \frac{1}{m} \theta (1 - \theta). \tag{5.52}$$
The variance of the estimator decreases as a function of $m$, the number of examples in the dataset. This is a common property of popular estimators that we will return to when we discuss consistency (see section 5.4.5).
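Equation 5.52 predicts that the estimator's variance shrinks in proportion to $1/m$, which is easy to confirm by simulation (all constants below are arbitrary demo choices):

```python
import numpy as np

# The variance of the Bernoulli sample-mean estimator is theta(1-theta)/m
# (eq. 5.52), so quadrupling m should roughly quarter the variance.
rng = np.random.default_rng(3)
theta, n_trials = 0.3, 50_000

results = {}
for m in (10, 40, 160):
    theta_hat = rng.binomial(1, theta, size=(n_trials, m)).mean(axis=1)
    results[m] = theta_hat.var()
    print(f"m={m:4d}  empirical Var={results[m]:.5f}  "
          f"theory={theta * (1 - theta) / m:.5f}")
```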
5.4.4 Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For example, imagine that we are interested in approximating the function shown in figure 5.2 and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?

The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:
$$\mathrm{MSE} = \mathbb{E}\left[ (\hat{\theta}_m - \theta)^2 \right] \tag{5.53}$$
$$= \mathrm{Bias}(\hat{\theta}_m)^2 + \mathrm{Var}(\hat{\theta}_m). \tag{5.54}$$
The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter $\theta$. As is clear from equation 5.54, evaluating the MSE incorporates both the bias and the variance. Desirable estimators are those with small MSE, and these are estimators that manage to keep both their bias and variance somewhat in check.

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting, and overfitting. When generalization
[Figure 5.6: As capacity increases (x-axis), bias (dotted) tends to decrease and variance (dashed) tends to increase, yielding another U-shaped curve for generalization error (bold curve), with an underfitting zone below the optimal capacity and an overfitting zone above it. If we vary capacity along one axis, there is an optimal capacity, with underfitting when the capacity is below this optimum and overfitting when it is above. This relationship is similar to the relationship between capacity, underfitting, and overfitting, discussed in section 5.2 and figure 5.3.]
error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in figure 5.6, where we see again the U-shaped curve of generalization error as a function of capacity.
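The decomposition in equation 5.54 can be checked numerically. The simulation below (an illustrative sketch, with demo constants) estimates the MSE, bias, and variance of the biased sample-variance estimator and confirms that MSE equals Bias squared plus Var up to floating-point error.

```python
import numpy as np

# Numerical check of the decomposition MSE = Bias^2 + Var (eq. 5.54)
# for the biased sample-variance estimator of a Gaussian's sigma^2.
rng = np.random.default_rng(4)
sigma2, m, n_trials = 1.0, 8, 400_000

x = rng.normal(0.0, 1.0, size=(n_trials, m))
est = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # eq. 5.36

mse = ((est - sigma2) ** 2).mean()
bias = est.mean() - sigma2          # theory: -sigma2/m
var = est.var()
print(f"MSE={mse:.4f}  bias^2+Var={bias ** 2 + var:.4f}")  # the two agree
```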
5.4.5 Consistency

So far we have discussed the properties of various estimators for a training set of fixed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points $m$ in our dataset increases, our point estimates converge to the true value of the corresponding parameters. More formally, we would like that
$$\mathrm{plim}_{m \to \infty} \, \hat{\theta}_m = \theta. \tag{5.55}$$
The symbol $\mathrm{plim}$ indicates convergence in probability, meaning that for any $\epsilon > 0$, $P(|\hat{\theta}_m - \theta| > \epsilon) \to 0$ as $m \to \infty$. The condition described by equation 5.55 is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of $\hat{\theta}$ to $\theta$. Almost sure convergence of a sequence of random variables $x^{(1)}, x^{(2)}, \ldots$ to a value $x$ occurs when $p(\lim_{m \to \infty} x^{(m)} = x) = 1$.

Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows. However, the reverse is not true: asymptotic unbiasedness does not imply consistency. For example, consider estimating the mean parameter $\mu$ of a normal distribution $\mathcal{N}(x; \mu, \sigma^2)$, with a dataset consisting of $m$ samples: $\{x^{(1)}, \ldots, x^{(m)}\}$. We could use the first sample $x^{(1)}$ of the dataset as an unbiased estimator: $\hat{\theta} = x^{(1)}$. In that case, $\mathbb{E}(\hat{\theta}_m) = \theta$, so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator, as it is not the case that $\hat{\theta}_m \to \theta$ as
$m \to \infty$.

5.5 Maximum Likelihood Estimation

We have seen some definitions of common estimators and analyzed their properties. But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.

The most common such principle is the maximum likelihood principle.

Consider a set of $m$ examples $\mathbb{X} = \{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently from the true but unknown data-generating distribution $p_{\mathrm{data}}(x)$.

Let $p_{\mathrm{model}}(x; \theta)$ be a parametric family of probability distributions over the same space indexed by $\theta$. In other words, $p_{\mathrm{model}}(x; \theta)$ maps any configuration $x$ to a real number estimating the true probability $p_{\mathrm{data}}(x)$.

The maximum likelihood estimator for $\theta$ is then defined as
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \, p_{\mathrm{model}}(\mathbb{X}; \theta) \tag{5.56}$$
$$= \arg\max_{\theta} \prod_{i=1}^{m} p_{\mathrm{model}}(x^{(i)}; \theta). \tag{5.57}$$

This product over many probabilities can be inconvenient for various reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum:
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\mathrm{model}}(x^{(i)}; \theta). \tag{5.58}$$

Because the arg max does not change when we rescale the cost function, we can divide by $m$ to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution $\hat{p}_{\mathrm{data}}$ defined by the training data:
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \, \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\mathrm{data}}} \log p_{\mathrm{model}}(x; \theta). \tag{5.59}$$
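The underflow problem motivating equation 5.58 is easy to demonstrate: in float64, a product of a thousand probabilities collapses to zero, while the sum of their logarithms remains perfectly representable. The probabilities below are randomly generated stand-ins for $p_{\mathrm{model}}(x^{(i)}; \theta)$.

```python
import numpy as np

# Why we maximize the log-likelihood (eq. 5.58) instead of the raw
# likelihood product (eq. 5.57): the product of many probabilities
# underflows to 0.0 in float64, while the sum of logs stays finite.
rng = np.random.default_rng(5)
probs = rng.uniform(0.01, 0.5, size=1000)  # stand-in per-example likelihoods

product = np.prod(probs)          # underflows to exactly 0.0
log_sum = np.sum(np.log(probs))   # finite (a large negative number)

print(f"product of 1000 probabilities: {product}")
print(f"sum of log-probabilities:      {log_sum:.1f}")
```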
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution $\hat{p}_{\mathrm{data}}$, defined by the training set, and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by
$$D_{\mathrm{KL}}(\hat{p}_{\mathrm{data}} \,\|\, p_{\mathrm{model}}) = \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\mathrm{data}}} \left[ \log \hat{p}_{\mathrm{data}}(x) - \log p_{\mathrm{model}}(x) \right]. \tag{5.60}$$
The term on the left is a function only of the data-generating process, not the model. This means that when we train the model to minimize the KL divergence, we need only minimize
$$-\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\mathrm{data}}} \left[ \log p_{\mathrm{model}}(x) \right], \tag{5.61}$$
which is of course the same as the maximization in equation 5.59.

Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. Many authors use the term "cross-entropy" to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.

We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution $\hat{p}_{\mathrm{data}}$. Ideally, we would like to match the true data-generating distribution $p_{\mathrm{data}}$, but we have no direct access to this distribution.

While the optimal $\theta$ is the same regardless of whether we are maximizing the likelihood or minimizing the KL divergence, the values of the objective functions are different. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross-entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when $x$ is real-valued.
5.5.1 Conditional Log-Likelihood and Mean Squared Error

The maximum likelihood estimator can readily be generalized to estimate a conditional probability $P(y \mid x; \theta)$ in order to predict $y$ given $x$. This is actually the most common situation, because it forms the basis for most supervised learning. If $X$ represents all our inputs and $Y$ all our observed targets, then the conditional maximum likelihood estimator is
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \, P(Y \mid X; \theta). \tag{5.62}$$
If the examples are assumed to be i.i.d., then this can be decomposed into
$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta). \tag{5.63}$$
Example: Linear Regression as Maximum Likelihood. Linear regression, introduced in section 5.1.4, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input $x$ and produce an output value $\hat{y}$. The mapping from $x$ to $\hat{y}$ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction $\hat{y}$, we now think of the model as producing a conditional distribution $p(y \mid x)$. We can imagine that with an infinitely large training set, we might see several training examples with the same input value $x$ but different values of $y$. The goal of the learning algorithm is now to fit the distribution $p(y \mid x)$ to all those different $y$ values that are all compatible with $x$. To derive the same linear regression algorithm we obtained before, we define $p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2)$. The function $\hat{y}(x; w)$ gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant $\sigma^2$ chosen by the user. We will see that this choice of the functional form of $p(y \mid x)$ causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (equation 5.63) is given by
$$\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) \tag{5.64}$$
$$= -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{\left\| \hat{y}^{(i)} - y^{(i)} \right\|^2}{2\sigma^2}, \tag{5.65}$$
where $\hat{y}^{(i)}$ is the output of the linear regression on the $i$-th input $x^{(i)}$ and $m$ is the number of the training examples. Comparing the log-likelihood with the mean squared error,
$$\mathrm{MSE}_{\mathrm{train}} = \frac{1}{m} \sum_{i=1}^{m} \left\| \hat{y}^{(i)} - y^{(i)} \right\|^2, \tag{5.66}$$
we immediately see that maximizing the log-likelihood with respect to $w$ yields the same estimate of the parameters $w$ as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
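The equivalence can be verified numerically: on synthetic data, gradient ascent on the Gaussian conditional log-likelihood (with $\sigma^2$ fixed to 1) converges to the same weights as the least-squares solution of the normal equations. This is an illustrative sketch; the data, iteration count, and step size are arbitrary choices.

```python
import numpy as np

# Maximizing the conditional log-likelihood under a fixed-variance
# Gaussian (eq. 5.65) gives the same w as minimizing MSE (eq. 5.66):
# both are solved by the normal equations. Data here is synthetic.
rng = np.random.default_rng(6)
m, n = 100, 3
X = rng.normal(size=(m, n))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=m)

# Least-squares / maximum-likelihood solution via the normal equations.
w_mse = np.linalg.solve(X.T @ X, X.T @ y)

# Direct numerical maximization of the log-likelihood (sigma^2 = 1 fixed)
# by gradient ascent; converges to the same point.
w = np.zeros(n)
for _ in range(5000):
    grad = X.T @ (y - X @ w)  # gradient of the log-likelihood w.r.t. w
    w += 0.001 * grad
print(np.allclose(w, w_mse, atol=1e-4))  # True
```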
5.5.2 Properties of Maximum Likelihood

The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples $m \to \infty$, in terms of its rate of convergence as $m$ increases.

Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see section 5.4.5), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are as follows:

• The true distribution $p_{\mathrm{data}}$ must lie within the model family $p_{\mathrm{model}}(\cdot; \theta)$. Otherwise, no estimator can recover $p_{\mathrm{data}}$.

• The true distribution $p_{\mathrm{data}}$ must correspond to exactly one value of $\theta$. Otherwise, maximum likelihood can recover the correct $p_{\mathrm{data}}$ but will not be able to determine which value of $\theta$ was used by the data-generating process.

There are other inductive principles besides the maximum likelihood estimator, many of which share the property of being consistent estimators. Consistent estimators can differ, however, in their statistical efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples $m$, or equivalently, may require fewer examples to obtain a fixed level of generalization error. Statistical efficiency is typically studied in the parametric case (as in linear regression), where our goal is to estimate the value of a parameter (assuming it is possible to identify the true parameter), not the value of a function. A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter values, where the expectation is over $m$ training samples from the data-generating distribution. That parametric mean squared error decreases as $m$ increases, and for $m$ large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower MSE than the maximum likelihood estimator.

For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.
5.6 Bayesian Statistics

So far we have discussed frequentist statistics and approaches based on estimating a single value of $\theta$, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of $\theta$ when making a prediction. The latter is the domain of Bayesian statistics.

As discussed in section 5.4.1, the frequentist perspective is that the true parameter value $\theta$ is fixed but unknown, while the point estimate $\hat{\theta}$ is a random variable on account of it being a function of the dataset (which is seen as random).

The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty in states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter $\theta$ is unknown or uncertain and thus is represented as a random variable.

Before observing the data, we represent our knowledge of $\theta$ using the prior probability distribution, $p(\theta)$ (sometimes referred to as simply "the prior"). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e., with high entropy) to reflect a high degree of uncertainty in the value of $\theta$ before observing any data. For example, one might assume a priori that $\theta$ lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for "simpler" solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).

Now consider that we have a set of data samples $\{x^{(1)}, \ldots, x^{(m)}\}$. We can recover the effect of data on our belief about $\theta$ by combining the data likelihood $p(x^{(1)}, \ldots, x^{(m)} \mid \theta)$ with the prior via Bayes' rule:
$$p(\theta \mid x^{(1)}, \ldots, x^{(m)}) = \frac{p(x^{(1)}, \ldots, x^{(m)} \mid \theta) \, p(\theta)}{p(x^{(1)}, \ldots, x^{(m)})}. \tag{5.67}$$
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.
Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood approach that makes predictions using a point estimate of $\theta$, the Bayesian approach is to make predictions using a full distribution over $\theta$. For example, after observing $m$ examples, the predicted distribution over the next data sample, $x^{(m+1)}$, is given by
$$p(x^{(m+1)} \mid x^{(1)}, \ldots, x^{(m)}) = \int p(x^{(m+1)} \mid \theta) \, p(\theta \mid x^{(1)}, \ldots, x^{(m)}) \, d\theta. \tag{5.68}$$
Here each value of $\theta$ with positive probability density contributes to the prediction of the next example, with the contribution weighted by the posterior density itself. After having observed $\{x^{(1)}, \ldots, x^{(m)}\}$, if we are still quite uncertain about the value of $\theta$, then this uncertainty is incorporated directly into any predictions we might make.

In section 5.4, we discussed how the frequentist approach addresses the uncertainty in a given point estimate of $\theta$ by evaluating its variance. The variance of the estimator is an assessment of how the estimate might change with alternative samplings of the observed data. The Bayesian answer to the question of how to deal with the uncertainty in the estimator is to simply integrate over it, which tends to protect well against overfitting. This integral is of course just an application of the laws of probability, making the Bayesian approach simple to justify, while the frequentist machinery for constructing an estimator is based on the rather ad hoc decision to summarize all knowledge contained in the dataset with a single point estimate.

The second important difference between the Bayesian approach to estimation and the maximum likelihood approach is due to the contribution of the Bayesian prior distribution. The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. In practice, the prior often expresses a preference for models that are simpler or more smooth. Critics of the Bayesian approach identify the prior as a source of subjective human judgment affecting the predictions.

Bayesian methods typically generalize much better when limited training data is available but typically suffer from high computational cost when the number of training examples is large.
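For most models the posterior in equation 5.67 and the integral in equation 5.68 have no closed form, but a Beta prior on a Bernoulli parameter is a classic conjugate case where both are trivial. The sketch below uses that conjugate pair purely as an illustration (it is not the chapter's own worked example, which uses linear regression); the prior hyperparameters and data are demo choices.

```python
import numpy as np

# Bayes' rule (eq. 5.67) and the posterior predictive (eq. 5.68) in a
# conjugate case: a Beta(a, b) prior on a Bernoulli parameter theta.
# The posterior is Beta(a + heads, b + tails), and the predictive
# integral reduces to the posterior mean of theta.
rng = np.random.default_rng(7)
a, b = 2.0, 2.0                       # broad, high-entropy prior
theta_true = 0.7                      # assumed for the demo
x = rng.binomial(1, theta_true, size=50)

heads, tails = x.sum(), len(x) - x.sum()
a_post, b_post = a + heads, b + tails

# Posterior predictive p(x_new = 1 | data) = E_posterior[theta].
p_next = a_post / (a_post + b_post)
print(f"posterior: Beta({a_post:.0f}, {b_post:.0f}), "
      f"p(x_new=1 | data) = {p_next:.3f}")
```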
Example: Bayesian Linear Regression. Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector $x \in \mathbb{R}^n$ to predict the value of a scalar $y \in \mathbb{R}$. The prediction is parametrized by the vector $w \in \mathbb{R}^n$:
$$\hat{y} = w^\top x. \tag{5.69}$$
Given a set of $m$ training samples $(X^{(\mathrm{train})}, y^{(\mathrm{train})})$, we can express the prediction of $y$ over the entire training set as
$$\hat{y}^{(\mathrm{train})} = X^{(\mathrm{train})} w. \tag{5.70}$$
Expressed as a Gaussian conditional distribution on $y^{(\mathrm{train})}$, we have
$$p(y^{(\mathrm{train})} \mid X^{(\mathrm{train})}, w) = \mathcal{N}(y^{(\mathrm{train})}; X^{(\mathrm{train})} w, I) \tag{5.71}$$
$$\propto \exp\left( -\frac{1}{2} \left( y^{(\mathrm{train})} - X^{(\mathrm{train})} w \right)^\top \left( y^{(\mathrm{train})} - X^{(\mathrm{train})} w \right) \right), \tag{5.72}$$
where we follow the standard MSE formulation in assuming that the Gaussian variance on $y$ is one. In what follows, to reduce the notational burden, we refer to $(X^{(\mathrm{train})}, y^{(\mathrm{train})})$ as simply $(X, y)$.

To determine the posterior distribution over the model parameter vector $w$, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution, expressing a high degree of uncertainty about $\theta$. For real-valued parameters it is common to use a Gaussian as a prior distribution,
$$p(w) = \mathcal{N}(w; \mu_0, \Lambda_0) \propto \exp\left( -\frac{1}{2} (w - \mu_0)^\top \Lambda_0^{-1} (w - \mu_0) \right), \tag{5.73}$$
where $\mu_0$ and $\Lambda_0$ are the prior distribution mean vector and covariance matrix, respectively. (Unless there is a reason to use a particular covariance structure, we typically assume a diagonal covariance matrix $\Lambda_0 = \mathrm{diag}(\lambda_0)$.)

With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters:
$$p(w \mid X, y) \propto p(y \mid X, w) \, p(w) \tag{5.74}$$
$$\propto \exp\left( -\frac{1}{2} (y - Xw)^\top (y - Xw) \right) \exp\left( -\frac{1}{2} (w - \mu_0)^\top \Lambda_0^{-1} (w - \mu_0) \right) \tag{5.75}$$
$$\propto \exp\left( -\frac{1}{2} \left( -2 y^\top X w + w^\top X^\top X w + w^\top \Lambda_0^{-1} w - 2 \mu_0^\top \Lambda_0^{-1} w \right) \right). \tag{5.76}$$
We now define $\Lambda_m = \left( X^\top X + \Lambda_0^{-1} \right)^{-1}$ and $\mu_m = \Lambda_m \left( X^\top y + \Lambda_0^{-1} \mu_0 \right)$. Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:
$$p(w \mid X, y) \propto \exp\left( -\frac{1}{2} (w - \mu_m)^\top \Lambda_m^{-1} (w - \mu_m) + \frac{1}{2} \mu_m^\top \Lambda_m^{-1} \mu_m \right) \tag{5.77}$$
$$\propto \exp\left( -\frac{1}{2} (w - \mu_m)^\top \Lambda_m^{-1} (w - \mu_m) \right). \tag{5.78}$$
All terms that do not include the parameter vector $w$ have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1. Equation 3.23 shows how to normalize a multivariate Gaussian distribution.

Examining this posterior distribution enables us to gain some intuition for the effect of Bayesian inference. In most situations, we set $\mu_0$ to $\mathbf{0}$. If we set $\Lambda_0 = \frac{1}{\alpha} I$, then $\mu_m$ gives the same estimate of $w$ as does frequentist linear regression with a weight decay penalty of $\alpha w^\top w$. One difference is that the Bayesian estimate is undefined if $\alpha$ is set to zero; we are not allowed to begin the Bayesian learning process with an infinitely wide prior on $w$. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of $w$ are, rather than providing only the estimate $\mu_m$.
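The correspondence with weight decay noted above can be checked directly: with $\mu_0 = 0$ and $\Lambda_0 = \frac{1}{\alpha}I$, the posterior mean $\mu_m$ coincides with the ridge-regression solution. The snippet below (synthetic data, arbitrary $\alpha$) computes both.

```python
import numpy as np

# Posterior mean for Bayesian linear regression (mu_m from the text,
# with mu_0 = 0 and Lambda_0 = (1/alpha) I) compared against frequentist
# ridge regression / weight decay: the two coincide. Synthetic data.
rng = np.random.default_rng(8)
m, n, alpha = 50, 4, 0.5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + rng.normal(scale=0.1, size=m)

# Posterior parameters: Lambda_m = (X^T X + Lambda_0^{-1})^{-1},
# mu_m = Lambda_m (X^T y + Lambda_0^{-1} mu_0) with mu_0 = 0.
Lambda0_inv = alpha * np.eye(n)
Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)
mu_m = Lambda_m @ (X.T @ y)

# Ridge regression with an alpha * w^T w penalty gives the same solution.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)
print(np.allclose(mu_m, w_ridge))  # True
```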
5.6.1 Maximum a Posteriori (MAP) Estimation

While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter $\theta$, it is still often desirable to have a single point estimate. One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal posterior probability (or maximal probability density in the more common case of continuous $\theta$):
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \, p(\theta \mid x) = \arg\max_{\theta} \left[ \log p(x \mid \theta) + \log p(\theta) \right]. \tag{5.79}$$
We recognize, on the righthand side, $\log p(x \mid \theta)$, that is, the standard log-likelihood term, and $\log p(\theta)$, corresponding to the prior distribution.

As an example, consider a linear regression model with a Gaussian prior on the weights $w$. If this prior is given by $\mathcal{N}(w; \mathbf{0}, \frac{1}{\lambda} I)$, then the log-prior term in equation 5.79 is proportional to the familiar $\lambda w^\top w$ weight decay penalty, plus a term that does not depend on $w$ and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.

As with full Bayesian inference, MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data. This additional information helps to reduce the variance in the MAP point estimate (in comparison to the ML estimate). However, it does so at the price of increased bias.

Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference. This view applies when the regularization consists of adding an extra term to the objective function that corresponds to $\log p(\theta)$. Not all regularization penalties correspond to MAP Bayesian inference. For example, some regularizer terms may not be the logarithm of a probability distribution. Other regularization terms depend on the data, which of course a prior probability distribution is not allowed to do.

MAP Bayesian inference provides a straightforward way to design complicated yet interpretable regularization terms. For example, a more complicated penalty term can be derived by using a mixture of Gaussians, rather than a single Gaussian distribution, as the prior (Nowlan and Hinton, 1992).
5.7 Supervised Learning Algorithms

Recall from section 5.1.3 that supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs $x$ and outputs $y$. In many cases the outputs $y$ may be difficult to collect automatically and must be provided by a human "supervisor," but the term still applies even when the training set targets were collected automatically.
5.7.1 Probabilistic Supervised Learning

Most supervised learning algorithms in this book are based on estimating a probability distribution $p(y \mid x)$. We can do this simply by using maximum likelihood estimation to find the best parameter vector $\theta$ for a parametric family of distributions $p(y \mid x; \theta)$.

We have already seen that linear regression corresponds to the family
$$p(y \mid x; \theta) = \mathcal{N}(y; \theta^\top x, I). \tag{5.80}$$
We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.

The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:
$$P(y = 1 \mid x; \theta) = \sigma(\theta^\top x). \tag{5.81}$$
This approach is known as logistic regression (a somewhat strange name since we use the model for classification rather than regression).

In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult. There is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood using gradient descent.

This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
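A minimal sketch of logistic regression trained exactly as described: gradient descent on the mean negative log-likelihood. The toy dataset, learning rate, and iteration count are assumptions made for the demo.

```python
import numpy as np

# Logistic regression (eq. 5.81): no closed-form solution, so we
# minimize the negative log-likelihood by gradient descent on a toy
# linearly separable dataset generated for the demo.
rng = np.random.default_rng(9)
m = 200
X = rng.normal(size=(m, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # assumed true boundary

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ theta)     # p(y=1 | x; theta) for every example
    grad = X.T @ (p - y) / m   # gradient of the mean NLL
    theta -= lr * grad

acc = ((sigmoid(X @ theta) > 0.5) == (y == 1)).mean()
print(f"training accuracy: {acc:.2f}")
```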
5.7.2 Support Vector Machines

One of the most influential approaches to supervised learning is the support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to logistic regression in that it is driven by a linear function $w^\top x + b$. Unlike logistic regression, the support vector machine does not provide probabilities, but only outputs a class identity. The SVM predicts that the positive class is present when $w^\top x + b$ is positive. Likewise, it predicts that the negative class is present when $w^\top x + b$ is negative.

One key innovation associated with support vector machines is the kernel trick. The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples. For example, it can be shown that the linear function used by the support vector machine can be re-written as
$$w^\top x + b = b + \sum_{i=1}^{m} \alpha_i x^\top x^{(i)}, \tag{5.82}$$
where $x^{(i)}$ is a training example, and $\alpha$ is a vector of coefficients. Rewriting the learning algorithm this way enables us to replace $x$ with the output of a given feature function $\phi(x)$ and the dot product with a function $k(x, x^{(i)}) = \phi(x) \cdot \phi(x^{(i)})$ called a kernel. The $\cdot$ operator represents an inner product analogous to $\phi(x)^\top \phi(x^{(i)})$. For some feature spaces, we may not use literally the vector inner product. In some infinite dimensional spaces, we need to use other kinds of inner products, for example, inner products based on integration rather than summation. A complete development of these kinds of inner products is beyond the scope of this book.

After replacing dot products with kernel evaluations, we can make predictions using the function
$$f(x) = b + \sum_i \alpha_i k(x, x^{(i)}). \tag{5.83}$$
This function is nonlinear with respect to $x$, but the relationship between $\phi(x)$ and $f(x)$ is linear. Also, the relationship between $\alpha$ and $f(x)$ is linear. The kernel-based function is exactly equivalent to preprocessing the data by applying $\phi(x)$ to all inputs, then learning a linear model in the new transformed space.

The kernel trick is powerful for two reasons. First, it enables us to learn models that are nonlinear as a function of $x$ using convex optimization techniques that are guaranteed to converge efficiently. This is possible because we consider $\phi$ fixed and optimize only $\alpha$; that is, the optimization algorithm can view the decision function as being linear in a different space. Second, the kernel function $k$ often admits an implementation that is significantly more computationally efficient than naively constructing two $\phi(x)$ vectors and explicitly taking their dot product.
In some cases, $\phi(x)$ can even be infinite dimensional, which would result in an infinite computational cost for the naive, explicit approach. In many cases, $k(x, x')$ is a nonlinear, tractable function of $x$ even when $\phi(x)$ is intractable. As an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping $\phi(x)$ over the nonnegative integers $x$. Suppose that this mapping returns a vector containing $x$ ones followed by infinitely many zeros. We can write a kernel function $k(x, x^{(i)}) = \min(x, x^{(i)})$ that is exactly equivalent to the corresponding infinite-dimensional dot product.

The most commonly used kernel is the Gaussian kernel,
$$k(u, v) = \mathcal{N}(u - v; \mathbf{0}, \sigma^2 I), \tag{5.84}$$
where $\mathcal{N}(x; \mu, \Sigma)$ is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in $v$ space radiating outward from $u$. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.

We can think of the Gaussian kernel as performing a kind of template matching. A training example $x$ associated with training label $y$ becomes a template for class $y$. When a test point $x'$ is near $x$ according to Euclidean distance, the Gaussian kernel has a large response, indicating that $x'$ is very similar to the $x$ template. The model then puts a large weight on the associated training label $y$. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
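A sketch of the kernel predictor of equation 5.83 with a Gaussian kernel (written here directly as a squared-exponential rather than through the normal density of equation 5.84). Setting each $\alpha_i$ to the example's ±1 label turns the predictor into the template matcher described above; a trained SVM would instead learn a sparse $\alpha$. The clusters and bandwidth are demo choices.

```python
import numpy as np

# Kernel predictor f(x) = b + sum_i alpha_i k(x, x_i) (eq. 5.83) with a
# Gaussian/RBF kernel. Here alpha is simply the +-1 training labels,
# which makes the predictor a similarity-weighted vote, not a fitted SVM.
def rbf_kernel(u, v, sigma=1.0):
    # k(u, v) decays with the squared Euclidean distance between u and v.
    return np.exp(-np.sum((u - v) ** 2, axis=-1) / (2 * sigma ** 2))

rng = np.random.default_rng(10)
X_train = np.vstack([rng.normal(-2, 1, size=(20, 2)),   # negative class
                     rng.normal(+2, 1, size=(20, 2))])  # positive class
alpha = np.concatenate([-np.ones(20), +np.ones(20)])    # labels as weights
b = 0.0

def f(x):
    return b + np.sum(alpha * rbf_kernel(x, X_train))

print(f(np.array([-2.0, -2.0])) < 0, f(np.array([2.0, 2.0])) > 0)  # True True
```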
Support vector machines are not the only algorithm that can be enhanced using the kernel trick. Many other linear models can be enhanced in this way. The category of algorithms that employ the kernel trick is known as kernel machines, or kernel methods (Williams and Rasmussen, 1996; Schölkopf et al., 1999).
A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the i-th example contributes a term α_i k(x, x^(i)) to the decision function. Support vector machines are able to mitigate this by learning an α vector that contains mostly zeros.
Classifying a new example then requires evaluating the kernel function only for the training examples that have nonzero α_i. These training examples are known as support vectors.
Kernel machines also suffer from a high computational cost of training when the dataset is large. We revisit this idea in section 5.9. Kernel machines with generic kernels struggle to generalize well. We explain why in section 5.11.
The modern incarnation of deep learning was designed to overcome these limitations of kernel machines. The current deep learning renaissance began when Hinton et al. (2006) demonstrated that a neural network could outperform the RBF kernel SVM on the MNIST benchmark.
5.7.3 Other Simple Supervised Learning Algorithms
We have already briefly encountered another nonprobabilistic supervised learning algorithm, nearest neighbor regression. More generally, k-nearest neighbors is a family of techniques that can be used for classification or regression. As a nonparametric learning algorithm, k-nearest neighbors is not restricted to a fixed number of parameters. We usually think of the k-nearest neighbors algorithm as not having any parameters, but rather implementing a simple function of the training data.
In fact, there is not even really a training stage or learning process. Instead, at test time, when we want to produce an output y for a new test input x, we find the k nearest neighbors to x in the training data X. We then return the average of the corresponding y values in the training set.
This works for essentially any kind of supervised learning where we can define an average over y values. In the case of classification, we can average over one-hot code vectors c with c_y = 1 and c_i = 0 for all other values of i. We can then interpret the average over these one-hot codes as giving a probability distribution over classes.
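The procedure above can be sketched in a few lines (illustrative code, not from the book): average the one-hot codes of the k nearest labels to obtain a distribution over classes.

```python
import numpy as np

def knn_class_probs(x, X_train, y_train, k, n_classes):
    # Find the k training points nearest to x in Euclidean distance.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Average their one-hot label codes: the result sums to 1 and can be
    # read as a probability distribution over classes.
    return np.eye(n_classes)[y_train[nearest]].mean(axis=0)

X_train = np.array([[0.0], [0.2], [5.0], [5.2]])
y_train = np.array([0, 0, 1, 1])
# Two of the three nearest neighbors of 0.1 have label 0.
probs = knn_class_probs(np.array([0.1]), X_train, y_train, k=3, n_classes=2)
```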
As a nonparametric learning algorithm, k-nearest neighbors can achieve very high capacity. For example, suppose we have a multiclass classification task and measure performance with 0-1 loss. In this setting, 1-nearest neighbor converges to double the Bayes error as the number of training examples approaches infinity. The error in excess of the Bayes error results from choosing a single neighbor by breaking ties between equally distant neighbors randomly.
When there is infinite training data, all test points x will have infinitely many training set neighbors at distance zero. If we allow the algorithm to use all these neighbors to vote, rather than randomly choosing one of them, the procedure converges to the Bayes error rate.

The high capacity of k-nearest neighbors enables it to obtain high accuracy given a large training set. It does so at high computational cost, however, and it may generalize very badly given a small, finite training set.

One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with x ∈ R^100 drawn from an isotropic Gaussian distribution, but only a single variable x_1 is relevant to the output. Suppose further that this feature simply encodes the output directly, that is, that y = x_1 in all cases. Nearest neighbor regression will not be able to detect this simple pattern. The nearest neighbor of most points x will be determined by the large number of features x_2 through x_100, not by the lone feature x_1. Thus the output on small training sets will essentially be random.
Another type of learning algorithm that also breaks the input space into regions, with separate parameters for each region, is the decision tree (Breiman et al., 1984) and its many variants.
As shown in figure 5.7, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one subregion for each child of the node (typically using an axis-aligned cut). Space is thus subdivided into nonoverlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output.
Decision trees are usually trained with specialized algorithms that are beyond the scope of this book. The learning algorithm can be considered nonparametric if it is allowed to learn a tree of arbitrary size, though decision trees are usually regularized with size constraints that turn them into parametric models in practice.
Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node, struggle to solve some problems that are easy even for logistic regression. For example, suppose we have a two-class problem, and the positive class occurs wherever x_2 > x_1. The decision boundary is then not axis aligned. The decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps.
As we have seen, nearest neighbor predictors and decision trees have many limitations. Nonetheless, they are useful learning algorithms when computational resources are constrained. We can also build intuition for more sophisticated learning algorithms by thinking about the similarities and differences between sophisticated algorithms and k-nearest neighbors or decision tree baselines.
See Murphy (2012), Bishop (2006), Hastie et al. (2001), or other machine learning textbooks for more material on traditional supervised learning algorithms.
5.8 Unsupervised Learning Algorithms
Recall from section 5.1.3 that unsupervised algorithms are those that experience only "features" but not a supervision signal. The distinction between supervised and unsupervised algorithms is not formally and rigidly defined, because there is no objective test for distinguishing whether a value is a feature or a target provided by a supervisor. Informally, unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate
Figure 5.7: Diagrams describing how a decision tree works. (Top) Each node of the tree chooses to send the input example to the child node on the left (0) or to the child node on the right (1). Internal nodes are drawn as circles and leaf nodes as squares. Each node is displayed with a binary string identifier corresponding to its position in the tree, obtained by appending a bit to its parent identifier (0 = choose left or top, 1 = choose right or bottom). (Bottom) The tree divides space into regions. The 2-D plane shows how a decision tree might divide R^2. The nodes of the tree are plotted in this plane, with each internal node drawn along the dividing line it uses to categorize examples, and leaf nodes drawn in the center of the region of examples they receive. The result is a piecewise-constant function, with one piece per leaf. Each leaf requires at least one training example to define, so it is not possible for the decision tree to learn a function that has more local maxima than the number of training examples.
examples. The term is usually associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding the manifold that the data lies near, or clustering the data into groups of related examples.
A classic unsupervised learning task is to find the "best" representation of the data. By "best" we can mean different things, but generally speaking, we are looking for a representation that preserves as much information about x as possible while obeying some penalty or constraint aimed at keeping the representation simpler or more accessible than x itself.
There are multiple ways of defining a simpler representation. Three of the most common include lower-dimensional representations, sparse representations, and independent representations. Low-dimensional representations attempt to compress as much information about x as possible in a smaller representation.
Sparse representations (Barlow, 1989; Olshausen and Field, 1996; Hinton and Ghahramani, 1997) embed the dataset into a representation whose entries are mostly zeros for most inputs. The use of sparse representations typically requires increasing the dimensionality of the representation, so that the representation becoming mostly zeros does not discard too much information. This results in an overall structure of the representation that tends to distribute data along the axes of the representation space.
Independent representations attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent. Of course, these three criteria are certainly not mutually exclusive.
Low-dimensional representations often yield elements that have fewer or weaker dependencies than the original high-dimensional data. This is because one way to reduce the size of a representation is to find and remove redundancies. Identifying and removing more redundancy enables the dimensionality reduction algorithm to achieve more compression while discarding less information.
The notion of representation is one of the central themes of deep learning and therefore one of the central themes in this book. In this section, we develop some simple examples of representation learning algorithms. Together, these example algorithms show how to operationalize all three of the criteria above. Most of the remaining chapters introduce additional representation learning algorithms that develop these criteria in different ways or introduce other criteria.
Figure 5.8: PCA learns a linear projection that aligns the direction of greatest variance with the axes of the new space. (Left) The original data consist of samples of x. In this space, the variance might occur along directions that are not axis aligned. (Right) The transformed data z = x⊤W now varies most along the axis z_1. The direction of second-most variance is now along z_2.
5.8.1 Principal Components Analysis
In section 2.12, we saw that the principal components analysis algorithm provides a means of compressing data. We can also view PCA as an unsupervised learning algorithm that learns a representation of data. This representation is based on two of the criteria for a simple representation described above. PCA learns a representation that has lower dimensionality than the original input. It also learns a representation whose elements have no linear correlation with each other. This is a first step toward the criterion of learning representations whose elements are statistically independent. To achieve full independence, a representation learning algorithm must also remove the nonlinear relationships between variables.
PCA learns an orthogonal, linear transformation of the data that projects an input x to a representation z, as shown in figure 5.8.
In section 2.12, we saw that we could learn a one-dimensional representation that best reconstructs the original data (in the sense of mean squared error) and that this representation actually corresponds to the first principal component of the data. Thus we can use PCA as a simple and effective dimensionality reduction method that preserves as much of the information in the data as possible (again, as measured by least-squares reconstruction error). In the following, we will study how the PCA representation decorrelates the original data representation X.
Let us consider the m × n design matrix X. We will assume that the data has a mean of zero, E[x] = 0. If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.

The unbiased sample covariance matrix associated with X is given by

Var[x] = (1/(m − 1)) X⊤X. (5.85)
PCA finds a representation (through linear transformation) z = x⊤W where Var[z] is diagonal.

In section 2.12, we saw that the principal components of a design matrix X are given by the eigenvectors of X⊤X. From this view,

X⊤X = W Λ W⊤. (5.86)
In this section, we exploit an alternative derivation of the principal components. The principal components may also be obtained via the singular value decomposition (SVD). Specifically, they are the right singular vectors of X. To see this, let W be the right singular vectors in the decomposition X = UΣW⊤. We then recover the original eigenvector equation with W as the eigenvector basis:

X⊤X = (UΣW⊤)⊤ UΣW⊤ = W Σ² W⊤. (5.87)
The SVD is helpful for showing that PCA results in a diagonal Var[z]. Using the SVD of X, we can express the variance of X as

Var[x] = (1/(m − 1)) X⊤X (5.88)
= (1/(m − 1)) (UΣW⊤)⊤ UΣW⊤ (5.89)
= (1/(m − 1)) W Σ⊤ U⊤ U Σ W⊤ (5.90)
= (1/(m − 1)) W Σ² W⊤, (5.91)

where we use the fact that U⊤U = I, because the U matrix of the singular value decomposition is defined to be orthogonal. This shows that the covariance of z is diagonal as required:

Var[z] = (1/(m − 1)) Z⊤Z (5.92)
= (1/(m − 1)) W⊤ X⊤ X W (5.93)
= (1/(m − 1)) W⊤ W Σ² W⊤ W (5.94)
= (1/(m − 1)) Σ², (5.95)

where this time we use the fact that W⊤W = I, again from the definition of the SVD.
The above analysis shows that when we project the data x to z via the linear transformation W, the resulting representation has a diagonal covariance matrix (as given by Σ²), which immediately implies that the individual elements of z are mutually uncorrelated.

This ability of PCA to transform data into a representation where the elements are mutually uncorrelated is a very important property of PCA. It is a simple example of a representation that attempts to disentangle the unknown factors of variation underlying the data. In the case of PCA, this disentangling takes the form of finding a rotation of the input space (described by W) that aligns the principal axes of variance with the basis of the new representation space associated with z.
While correlation is an important category of dependency between elements of the data, we are also interested in learning representations that disentangle more complicated forms of feature dependencies. For this, we will need more than what can be done with a simple linear transformation.
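The derivation above can be verified numerically. In this sketch (illustrative code, not from the book), the right singular vectors of centered data give a projection whose sample covariance is diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data; PCA assumes the data has been centered.
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.2], [0.0, 0.5]])
X = X - X.mean(axis=0)
m = X.shape[0]

# The right singular vectors of X are the principal components (eq. 5.87).
U, S, Wt = np.linalg.svd(X, full_matrices=False)
Z = X @ Wt.T                     # z = x^T W for each example

# Var[z] is diagonal (eq. 5.95), so the elements of z are uncorrelated.
cov_z = Z.T @ Z / (m - 1)
is_diagonal = np.allclose(cov_z, np.diag(np.diag(cov_z)))
```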
5.8.2 k-means Clustering
Another example of a simple representation learning algorithm is k-means clustering. The k-means clustering algorithm divides the training set into k different clusters of examples that are near each other. We can thus think of the algorithm as providing a k-dimensional one-hot code vector h representing an input x. If x belongs to cluster i, then h_i = 1, and all other entries of the representation h are zero.
The one-hot code provided by k-means clustering is an example of a sparse representation, because the majority of its entries are zero for every input. Later, we develop other algorithms that learn more flexible sparse representations, where more than one entry can be nonzero for each input x.
One-hot codes are an extreme example of sparse representations that lose many of the benefits of a distributed representation. The one-hot code still confers some statistical advantages (it naturally conveys the idea that all examples in the same cluster are similar to each other), and it confers the computational advantage that the entire representation may be captured by a single integer.
The k-means algorithm works by initializing k different centroids {μ^(1), ..., μ^(k)} to different values, then alternating between two different steps until convergence. In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid μ^(i). In the other step, each centroid μ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
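The two alternating steps can be sketched in a few lines (illustrative code, not from the book):

```python
import numpy as np

def kmeans(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the k centroids to distinct training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: each example joins its nearest centroid's cluster.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned examples.
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = X[assign == i].mean(axis=0)
    return assign, centroids

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
assign, _ = kmeans(X, k=2)
h = np.eye(2)[assign]   # the one-hot code h for each example
```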
One difficulty pertaining to clustering is that the clustering problem is inherently ill posed, in the sense that there is no single criterion that measures how well a clustering of the data corresponds to the real world. We can measure properties of the clustering, such as the average Euclidean distance from a cluster centroid to the members of the cluster. This enables us to tell how well we are able to reconstruct the training data from the cluster assignments. We do not know how well the cluster assignments correspond to properties of the real world.
Moreover, there may be many different clusterings that all correspond well to some property of the real world. We may hope to find a clustering that relates to one feature but obtain a different, equally valid clustering that is not relevant to our task.
For example, suppose that we run two clustering algorithms on a dataset consisting of images of red trucks, images of red cars, images of gray trucks, and images of gray cars. If we ask each clustering algorithm to find two clusters, one algorithm may find a cluster of cars and a cluster of trucks, while another may find a cluster of red vehicles and a cluster of gray vehicles.
Suppose we also run a third clustering algorithm, which is allowed to determine the number of clusters. This may assign the examples to four clusters: red cars, red trucks, gray cars, and gray trucks. This new clustering at least captures information about both attributes, but it has lost information about similarity. Red cars are in a different cluster from gray cars, just as they are in a different cluster from gray trucks. The output of the clustering algorithm does not tell us that red cars are more similar to gray cars than they are to gray trucks. They are different from both things, and that is all we know.
These issues illustrate some of the reasons that we may prefer a distributed representation to a one-hot representation. A distributed representation could have two attributes for each vehicle: one representing its color and one representing whether it is a car or a truck. It is still not entirely clear what the optimal distributed representation is (how can the learning algorithm know whether the two attributes we are interested in are color and car-versus-truck rather than manufacturer and age?), but having many attributes reduces the burden on the algorithm to guess which single attribute we care about, and it gives us the ability to measure similarity between objects in a fine-grained way by comparing many attributes instead of just testing whether one attribute matches.
5.9 Stochastic Gradient Descent
Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent (SGD). Stochastic gradient descent is an extension of the gradient descent algorithm introduced in section 4.3.

A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.
The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as

J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^m L(x^(i), y^(i), θ), (5.96)

where L is the per-example loss L(x, y, θ) = −log p(y | x; θ).
these
additiv
cost
functions,
gradient
descent
requires
computing
) =
=1
(5.97)
The
computational
cost
of
this
op
eration
is
As
the
training
set
size
gro
ws
to
billions
of
examples,
the
time
to
tak
single
gradien
step
ecomes
prohibitively
long.
The insight of SGD is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples B = {x^(1), ..., x^(m')} drawn uniformly from the training set. The minibatch size m' is typically chosen to be a relatively small number of examples, ranging from one to a few hundred.
Crucially, m' is usually held fixed as the training set size m grows. We may fit a training set with billions of examples using updates computed on only a hundred examples.
The estimate of the gradient is formed as

g = (1/m') ∇_θ Σ_{i=1}^{m'} L(x^(i), y^(i), θ), (5.98)

using examples from the minibatch B. The stochastic gradient descent algorithm then follows the estimated gradient downhill:

θ ← θ − εg, (5.99)

where ε is the learning rate.
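Equations 5.98 and 5.99 amount to a short loop. The sketch below (illustrative code, not from the book) applies them to linear regression with a mean squared error loss:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 5                       # training set size, input dimension
w_true = rng.normal(size=n)
X = rng.normal(size=(m, n))
y = X @ w_true + 0.01 * rng.normal(size=m)

w = np.zeros(n)
eps = 0.05                             # learning rate
m_prime = 64                           # minibatch size, fixed regardless of m
for step in range(3000):
    idx = rng.choice(m, size=m_prime)            # sample a minibatch
    # Minibatch gradient of the MSE loss (eq. 5.98).
    g = 2.0 / m_prime * X[idx].T @ (X[idx] @ w - y[idx])
    w = w - eps * g                              # follow -g downhill (eq. 5.99)

# Each update cost O(m'), never O(m), yet w approaches w_true.
```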
Gradient descent in general has often been regarded as slow or unreliable. In the past, the application of gradient descent to nonconvex optimization problems was regarded as foolhardy or unprincipled. Today, we know that the machine learning models described in part II work very well when trained with gradient descent. The optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the cost function quickly enough to be useful.
Stochastic gradient descent has many important uses outside the context of deep learning. It is the main way to train large linear models on very large datasets.
For a fixed model size, the cost per SGD update does not depend on the training set size m. In practice, we often use a larger model as the training set size increases, but we are not forced to do so. The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing m further will not extend the amount of training time needed to reach the model's best possible test error.
From this point of view, one can argue that the asymptotic cost of training a model with SGD is O(1) as a function of m.
Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model. Many kernel learning algorithms require constructing an m × m matrix G_{i,j} = k(x^(i), x^(j)). Constructing this matrix has computational cost O(m²), which is clearly undesirable for datasets with billions of examples.
In academia, starting in 2006, deep learning was initially interesting because it was able to generalize to new examples better than competing algorithms when trained on medium-sized datasets with tens of thousands of examples. Soon after, deep learning garnered additional interest in industry, because it provided a scalable way of training nonlinear models on large datasets.
Stochastic gradient descent and many enhancements to it are described further in chapter 8.
5.10 Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure, and a model.
For example, the linear regression algorithm combines a dataset consisting of X and y, the cost function

J(w, b) = −E_{x,y∼p̂_data} log p_model(y | x), (5.100)

the model specification p_model(y | x) = N(y; x⊤w + b, 1), and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.

By realizing that we can replace any of these components mostly independently from the others, we can obtain a wide range of algorithms.
The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.

The cost function may also include additional terms, such as regularization terms. For example, we can add weight decay to the linear regression cost function to obtain

J(w, b) = λ||w||₂² − E_{x,y∼p̂_data} log p_model(y | x). (5.101)
This still allows closed form optimization. If we change the model to be nonlinear, then most cost functions can no longer be optimized in closed form. This requires us to choose an iterative numerical optimization procedure, such as gradient descent.
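As a sketch of the closed-form case (illustrative code, not from the book): setting the gradient of the weight-decay cost to zero yields modified normal equations. Here the data term is a sum of squared errors, so the effective scaling of λ differs from equation 5.101 by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

lam = 0.1
# Minimizing ||X w - y||^2 + lam ||w||^2 in closed form: the gradient
# is zero where (X^T X + lam I) w = X^T y.
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# For this small lam, w stays close to the ordinary least-squares fit.
```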
The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning. The linear regression example shows how to support supervised learning. Unsupervised learning can be supported by defining a dataset that contains only X and providing an appropriate unsupervised cost and model. For example, we can obtain the first PCA vector by specifying that our loss function is

J(w) = E_{x∼p̂_data} ||x − r(x; w)||₂² (5.102)

while our model is defined to have w with norm one and reconstruction function r(x) = w⊤x w.
In some cases, the cost function may be a function that we cannot actually evaluate, for computational reasons. In these cases, we can still approximately minimize it using iterative numerical optimization, as long as we have some way of approximating its gradients.
Most machine learning algorithms make use of this recipe, though it may not be immediately obvious. If a machine learning algorithm seems especially unique or hand designed, it can usually be understood as using a special-case optimizer. Some models, such as decision trees and k-means, require special-case optimizers because their cost functions have flat regions that make them inappropriate for minimization by gradient-based optimizers.
Recognizing that most machine learning algorithms can be described using this recipe helps us to see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
5.11 Challenges Motivating Deep Learning
The simple machine learning algorithms described in this chapter work well on a wide variety of important problems. They have not succeeded, however, in solving the central problems in AI, such as recognizing speech or recognizing objects. The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

This section is about how the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data, and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often impose high computational costs. Deep learning was designed to overcome these and other obstacles.
5.11.1 The Curse of Dimensionality
Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases. The curse of dimensionality arises in many places in computer science, and especially in machine learning.
Figure 5.9: As the number of relevant dimensions of the data increases (from left to right), the number of configurations of interest may grow exponentially. (Left) In this one-dimensional example, we have one variable for which we only care to distinguish 10 regions of interest. With enough examples falling within each of these regions (each region corresponds to a cell in the illustration), learning algorithms can easily generalize correctly. A straightforward way to generalize is to estimate the value of the target function within each region (and possibly interpolate between neighboring regions). (Center) With two dimensions, it is more difficult to distinguish 10 different values of each variable. We need to keep track of up to 10 × 10 = 100 regions, and we need at least that many examples to cover all those regions. (Right) With three dimensions, this grows to 10³ = 1,000 regions and at least that many examples. For d dimensions and v values to be distinguished along each axis, we seem to need O(v^d) regions and examples. This is an instance of the curse of dimensionality. Figure graciously provided by Nicolas Chapados.
One challenge posed by the curse of dimensionality is a statistical challenge. As illustrated in figure 5.9, a statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples. To understand the issue, let us consider that the input space is organized into a grid, as in the figure. We can describe low-dimensional space with a small number of grid cells that are mostly occupied by the data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same cell as the new input.
For example, if estimating the probability density at some point x, we can just return the number of training examples in the same unit volume cell as x, divided by the total number of training examples. If we wish to classify an example, we can return the most common class of training examples in the same cell. If we are doing regression, we can average the target values observed over the examples in that cell.
But what about the cells for which we have seen no example? Because in high-dimensional spaces the number of configurations is huge, much larger than our number of examples, a typical grid cell has no training example associated with it. How could we possibly say something meaningful about these new configurations? Many traditional machine learning algorithms simply assume that the output at a new point should be approximately the same as the output at the nearest training point.
5.11.2 Local Constancy and Smoothness Regularization
To generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. We have seen these priors incorporated as explicit beliefs in the form of probability distributions over parameters of the model. More informally, we may also discuss prior beliefs as directly influencing the function itself and influencing the parameters only indirectly, as a result of the relationship between the parameters and the function. Additionally, we informally discuss prior beliefs as being expressed implicitly by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even be possible to express) in terms of a probability distribution representing our degree of belief in various functions.
Among the most widely used of these implicit "priors" is the smoothness prior, or local constancy prior. This prior states that the function we learn should not change very much within a small region.

Many simpler algorithms rely exclusively on this prior to generalize well, and as a result, they fail to scale to the statistical challenges involved in solving AI-level tasks. Throughout this book, we describe how deep learning introduces additional (explicit and implicit) priors in order to reduce the generalization error on sophisticated tasks. Here, we explain why the smoothness prior alone is insufficient for these tasks.
There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All these different methods are designed to encourage the learning process to learn a function f* that satisfies the condition

f*(x) ≈ f*(x + ε) (5.103)

for most configurations x and small change ε.
In other words, if we know a good answer for an input x (for example, if x is a labeled training example), then that answer is probably good in the neighborhood of x. If we have several good answers in some neighborhood, we would combine them (by some form of averaging or interpolation) to produce an answer that agrees with as many of them as much as possible.
An extreme example of the local constancy approach is the k-nearest neighbors family of learning algorithms. These predictors are literally constant over each region containing all the points x that have the same set of k nearest neighbors in the training set. For k = 1, the number of distinguishable regions cannot be more than the number of training examples.
While
the
-nearest
neighbors
algorithm
copies
the
output
from
nearby
training
examples,
most
kernel
machines
in
terp
olate
et
een
training
set
outputs
asso
ciated
with
nearb
training
examples.
An
imp
ortant
class
of
kernels
is
the
family
of
lo
cal
ernels
where
is
large
when
and
decreases
as
and
gro
further
apart
from
eac
other.
A local kernel can be thought of as a similarity function that performs template matching, measuring how closely a test example x resembles each training example x(i). Much of the modern motivation for deep learning is derived from studying the limitations of local template matching and how deep models are able to succeed in cases where local template matching fails (Bengio et al., 2006b).
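To make the template-matching view concrete, here is a small sketch (with hypothetical values, not an example from the text) of kernel regression with a Gaussian local kernel: each prediction is a similarity-weighted average of training outputs, so it interpolates between nearby training examples.

```python
import math

def gaussian_kernel(u, v, bandwidth=0.2):
    """Local kernel: largest when u == v, decaying as u and v move apart."""
    return math.exp(-((u - v) ** 2) / (2 * bandwidth ** 2))

def kernel_predict(train_x, train_y, x):
    """Similarity-weighted average of training outputs (template matching)."""
    weights = [gaussian_kernel(xi, x) for xi in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

train_x = [0.0, 1.0]
train_y = [0.0, 1.0]

# Halfway between the two training points, the kernel weights are equal,
# so the prediction interpolates to the midpoint of the outputs.
assert abs(kernel_predict(train_x, train_y, 0.5) - 0.5) < 1e-9
# Very near a training point, the prediction is close to that point's output.
assert kernel_predict(train_x, train_y, 0.05) < 0.01
```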
Decision trees also suffer from the limitations of exclusively smoothness-based learning, because they break the input space into as many regions as there are leaves and use a separate parameter (or sometimes many parameters for extensions of decision trees) in each region. If the target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree. A multiple of n is needed to achieve some level of statistical confidence in the predicted output.
In general, to distinguish O(k) regions in input space, all these methods require O(k) examples. Typically there are O(k) parameters, with O(1) parameters associated with each of the O(k) regions. The nearest neighbor scenario, in which each training example can be used to define at most one region, is illustrated in figure 5.10.
Is there a way to represent a complex function that has many more regions to be distinguished than the number of training examples? Clearly, assuming only smoothness of the underlying function will not allow a learner to do that.
For example, imagine that the target function is a kind of checkerboard. A checkerboard contains many variations, but there is a simple structure to them. Imagine what happens when the number of training examples is substantially smaller than the number of black and white squares on the checkerboard.
Based on only local generalization and the smoothness or local constancy prior, the learner would be guaranteed to correctly guess the color of a new point if it lay within the same checkerboard square as a training example. There is no guarantee, however, that the learner could correctly extend the checkerboard pattern to points lying in squares that do not contain training examples.
With this prior alone, the only information that an example tells us is the color of its square, and the only way to get the colors of the entire checkerboard right is to cover each of its cells with at least one example.
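This failure mode can be sketched directly (a hypothetical toy setup, not code from the text): a nearest-neighbor learner trained on two checkerboard squares answers correctly inside those squares, but simply extends the nearest color into squares it has never seen.

```python
def checkerboard(x, y):
    """Target function: color (0 or 1) of the unit square containing (x, y)."""
    return (int(x) + int(y)) % 2

def nn_predict(train, x, y):
    """1-nearest-neighbor prediction from ((x, y), label) training pairs."""
    _, label = min(train, key=lambda p: (p[0][0] - x) ** 2 + (p[0][1] - y) ** 2)
    return label

# One training example in each of two adjacent squares.
train = [((0.5, 0.5), checkerboard(0.5, 0.5)),
         ((1.5, 0.5), checkerboard(1.5, 0.5))]

# Inside a square that contains a training example, the guess is correct.
assert nn_predict(train, 0.9, 0.5) == checkerboard(0.9, 0.5)
# In an unseen square, local generalization copies the nearest color and fails.
assert nn_predict(train, 2.5, 0.5) != checkerboard(2.5, 0.5)
```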
The smoothness assumption and the associated nonparametric learning algorithms work extremely well as long as there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned. This is generally true when the function to be learned is smooth enough and varies in few enough dimensions.
In high dimensions, even a very smooth function can change smoothly but in a different way along each dimension. If the function additionally behaves differently in various regions, it can become extremely complicated to describe with a set of training examples.
If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well?
The answer to both of these questions (whether it is possible to represent a complicated function efficiently, and whether it is possible for the estimated function to generalize well to new inputs) is yes. The key insight is that a very large number of regions, such as O(2^k), can be defined with O(k) examples, so long as we
Figure 5.10: Illustration of how the nearest neighbor algorithm breaks up the input space into regions. An example (represented here by a circle) within each region defines the region boundary (represented here by the lines). The y value associated with each example defines what the output should be for all points x within the corresponding region. The regions defined by nearest neighbor matching form a geometric pattern called a Voronoi diagram. The number of these contiguous regions cannot grow faster than the number of training examples. While this figure illustrates the behavior of the nearest neighbor algorithm specifically, other machine learning algorithms that rely exclusively on the local smoothness prior for generalization exhibit similar behaviors: each training example only informs the learner about how to generalize in some neighborhood immediately surrounding that example.
introduce some dependencies between the regions through additional assumptions about the underlying data-generating distribution. In this way, we can actually generalize nonlocally (Bengio and Monperrus, 2005; Bengio et al., 2006c).
Many different deep learning algorithms provide implicit or explicit assumptions that are reasonable for a broad range of AI tasks in order to capture these advantages.
Other approaches to machine learning often make stronger, task-specific assumptions. For example, we could easily solve the checkerboard task by providing the assumption that the target function is periodic. Usually we do not include such strong, task-specific assumptions in neural networks, so that they can generalize to a much wider variety of structures.
AI tasks have structure that is much too complex to be limited to simple, manually specified properties such as periodicity, so we want learning algorithms that embody more general-purpose assumptions.
The core idea in deep learning is that we assume that the data was generated by the composition of factors, or features, potentially at multiple levels in a hierarchy. Many other similarly generic assumptions can further improve deep learning algorithms.
These apparently mild assumptions allow an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished. We describe these exponential gains more precisely in sections 6.4.1, 15.4 and 15.5.
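The flavor of this gain can be sketched with a toy distributed representation (hypothetical thresholds, not an example from the text): k binary features, each cheap to learn on its own, jointly index 2^k distinct regions, whereas purely local methods would need roughly one example per region.

```python
k = 10

def region_id(x, thresholds):
    """Each of the k thresholded features contributes one bit of the region index."""
    bits = [int(xi > t) for xi, t in zip(x, thresholds)]
    return sum(b << i for i, b in enumerate(bits))

thresholds = [0.5] * k

# k = 10 binary features distinguish 2**10 = 1024 regions.
assert 2 ** k == 1024
# A point above every threshold lands in the all-ones region.
assert region_id([0.7] * k, thresholds) == 1023
# A point below every threshold lands in region 0.
assert region_id([0.3] * k, thresholds) == 0
```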
The exponential advantages conferred by the use of deep distributed representations counter the exponential challenges posed by the curse of dimensionality.
5.11.3 Manifold Learning
An important concept underlying many ideas in machine learning is that of a manifold. A manifold is a connected region. Mathematically, it is a set of points associated with a neighborhood around each point.
From any given point, the manifold locally appears to be a Euclidean space. In everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.
The concept of a neighborhood surrounding each point implies the existence of transformations that can be applied to move on the manifold from one position to a neighboring one. In the example of the world’s surface as a manifold, one can walk north, south, east, or west.
Although there is a formal mathematical meaning to the term “manifold,” in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of
Figure 5.11: Data sampled from a distribution in a two-dimensional space that is actually concentrated near a one-dimensional manifold, like a twisted string. The solid line indicates the underlying manifold that the learner should infer.
degrees of freedom, or dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation. See figure 5.11 for an example of training data lying near a one-dimensional manifold embedded in two-dimensional space.
In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.
Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of R^n.
Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another.
Manifold learning was introduced in the case of continuous-valued data and in the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting: the key assumption remains that probability mass is highly concentrated.
The assumption that the data lies along a low-dimensional manifold may not always be correct or useful. We argue that in the context of AI tasks, such as those that involve processing images, sounds, or text, the manifold assumption is
Figure 5.12: Sampling images uniformly at random (by randomly picking each pixel according to a uniform distribution) gives rise to noisy images. Although there is a nonzero probability of generating an image of a face or of any other object frequently encountered in AI applications, we never actually observe this happening in practice. This suggests that the images encountered in AI applications occupy a negligible proportion of the volume of image space.
at least approximately correct. The evidence in favor of this assumption consists of two categories of observations.
The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated.
Uniform noise essentially never resembles structured inputs from these domains. Figure 5.12 shows how, instead, uniformly sampled points look like the patterns of static that appear on analog television sets when no signal is available.
Similarly, if you generate a document by picking letters uniformly at random, what is the probability that you will get meaningful English-language text? Almost zero, again, because most of the long sequences of letters do not correspond to a natural language sequence: the distribution of natural language sequences occupies very little volume in the total space of sequences of letters.
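A back-of-the-envelope version of this argument (with a deliberately tiny, hypothetical word list standing in for an English dictionary) already shows the effect: even for three-letter strings, valid words occupy a vanishing fraction of the 26^3 possible sequences, and the fraction shrinks exponentially with length.

```python
# A tiny stand-in for an English dictionary (hypothetical, for illustration).
three_letter_words = {"cat", "dog", "the", "and", "car", "art", "ant", "act"}

total_sequences = 26 ** 3            # all strings of three lowercase letters
fraction_valid = len(three_letter_words) / total_sequences

# Even with this generous toy dictionary, well under 0.1% of sequences are words.
assert total_sequences == 17576
assert fraction_valid < 0.001
```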
Of course, concentrated probability distributions are not sufficient to show that the data lies on a reasonably small number of manifolds. We must also establish that the examples we encounter are connected to each other by other examples, with each example surrounded by other highly similar examples that can be reached by applying transformations to traverse the manifold.

The second argument in favor of the manifold hypothesis is that we can imagine such neighborhoods and transformations, at least informally.
In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, and so forth.
Multiple manifolds are likely involved in most applications. For example, the manifold of human face images may not be connected to the manifold of cat face images.
These thought experiments convey some intuitive reasons supporting the manifold hypothesis. More rigorous experiments (Cayton, 2005; Narayanan and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004) clearly support the hypothesis for a large class of datasets of interest in AI.
When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in R^n. In everyday life, we can think of roads as 1-D manifolds embedded in 3-D space. We give directions to specific addresses in terms of address numbers along these 1-D roads, not in terms of coordinates in 3-D space.
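As a minimal sketch of a manifold coordinate (assuming data lying exactly on the unit circle; real manifold learning algorithms must discover the coordinate rather than being handed the formula), a point in 2-D ambient space can be described by a single angle, and the two representations convert back and forth:

```python
import math

def to_manifold_coordinate(x, y):
    """1-D coordinate (angle) of a point on the unit circle in 2-D space."""
    return math.atan2(y, x)

def to_ambient(theta):
    """Back from the 1-D manifold coordinate to 2-D ambient coordinates."""
    return (math.cos(theta), math.sin(theta))

theta = 1.0
x, y = to_ambient(theta)

# One number suffices to describe the point; the round trip recovers it.
assert abs(to_manifold_coordinate(x, y) - theta) < 1e-9
assert abs(x ** 2 + y ** 2 - 1.0) < 1e-9
```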
Extracting these manifold coordinates is challenging but holds the promise of improving many machine learning algorithms. This general principle is applied in many contexts.
Figure 5.13 shows the manifold structure of a dataset consisting of faces. By the end of this book, we will have developed the methods necessary to learn such a manifold structure. In figure 20.6, we will see how a machine learning algorithm can successfully accomplish this goal.
This concludes part I, which has provided the basic concepts in mathematics and machine learning that are employed throughout the remaining parts of the book. You are now prepared to embark on your study of deep learning.
Figure 5.13: Training examples from the QMUL Multiview Face Dataset (Gong et al., 2000), for which the subjects were asked to move in such a way as to cover the two-dimensional manifold corresponding to two angles of rotation. We would like learning algorithms to be able to discover and disentangle such manifold coordinates. Figure 20.6 illustrates such a feat.