This chapter contains some mathematical material
that you will likely have seen, but some may not have stayed with
you. I have also relegated the detailed discussion of how one
splits a node in a decision tree to this chapter.
15.1 Useful Material About Matrices
Terminology:
- A matrix $\mathcal{A}$ is symmetric if $\mathcal{A} = \mathcal{A}^T$. A symmetric matrix is necessarily square.
- We write $\mathcal{I}$ for the identity matrix.
- A matrix is diagonal if the only non-zero elements appear on the diagonal. A diagonal matrix is necessarily symmetric.
- A symmetric matrix $\mathcal{S}$ is positive semidefinite if, for any $\mathbf{x}$ such that $\mathbf{x}^T\mathbf{x} > 0$ (i.e. this vector has at least one non-zero component), we have $\mathbf{x}^T\mathcal{S}\mathbf{x} \geq 0$.
- A symmetric matrix $\mathcal{S}$ is positive definite if, for any $\mathbf{x}$ such that $\mathbf{x}^T\mathbf{x} > 0$, we have $\mathbf{x}^T\mathcal{S}\mathbf{x} > 0$.
- A matrix $\mathcal{R}$ is orthonormal if $\mathcal{R}^T\mathcal{R} = \mathcal{I} = \mathcal{R}\mathcal{R}^T$. Orthonormal matrices are necessarily square.
Orthonormal matrices: You should think of orthonormal matrices as rotations, because they do not change lengths or angles. For $\mathbf{x}$ a vector, $\mathcal{R}$ an orthonormal matrix, and $\mathbf{u} = \mathcal{R}\mathbf{x}$, we have
$$\mathbf{u}^T\mathbf{u} = \mathbf{x}^T\mathcal{R}^T\mathcal{R}\mathbf{x} = \mathbf{x}^T\mathcal{I}\mathbf{x} = \mathbf{x}^T\mathbf{x}.$$
This means that $\mathcal{R}$ doesn't change lengths. For $\mathbf{y}$, $\mathbf{z}$ both unit vectors, we have that the cosine of the angle between them is $\mathbf{y}^T\mathbf{z}$; but, by the same argument as above, the inner product of $\mathcal{R}\mathbf{y}$ and $\mathcal{R}\mathbf{z}$ is the same as $\mathbf{y}^T\mathbf{z}$. This means that $\mathcal{R}$ doesn't change angles, either.
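As a quick check in NumPy (a sketch; the particular rotation matrix and variable names here are mine, not from the text):

```python
import numpy as np

# A 2D rotation by 30 degrees is orthonormal: R^T R = I.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([3.0, 4.0])
y = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])

print(np.allclose(R.T @ R, np.eye(2)))            # True: R is orthonormal
print(np.linalg.norm(R @ x), np.linalg.norm(x))   # same length
print(np.dot(R @ y, R @ z), np.dot(y, z))         # same cosine of angle (up to round-off)
```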
Eigenvectors and Eigenvalues: Assume $\mathcal{S}$ is a $d \times d$ symmetric matrix, $\mathbf{v}$ is a $d \times 1$ vector, and $\lambda$ is a scalar. If we have
$$\mathcal{S}\mathbf{v} = \lambda\mathbf{v}$$
then $\mathbf{v}$ is referred to as an eigenvector of $\mathcal{S}$ and $\lambda$ is the corresponding eigenvalue. Matrices don't have to be symmetric to have eigenvectors and eigenvalues, but the symmetric case is the only one of interest to us.

In the case of a symmetric matrix, the eigenvalues are real numbers, and there are $d$ distinct eigenvectors that are normal to one another, and can be scaled to have unit length. They can be stacked into a matrix $\mathcal{U} = \left[\mathbf{v}_1, \ldots, \mathbf{v}_d\right]$. This matrix is orthonormal, meaning that $\mathcal{U}^T\mathcal{U} = \mathcal{I}$. This means that there is a diagonal matrix $\Lambda$ such that
$$\mathcal{S}\mathcal{U} = \mathcal{U}\Lambda.$$
In fact, there is a large number of such matrices, because we can reorder the eigenvectors in the matrix $\mathcal{U}$, and the equation still holds with a new $\Lambda$, obtained by reordering the diagonal elements of the original $\Lambda$. There is no reason to keep track of this complexity. Instead, we adopt the convention that the elements of $\mathcal{U}$ are always ordered so that the elements of $\Lambda$ are sorted along the diagonal, with the largest value coming first.
Diagonalizing a symmetric matrix: This gives us a particularly important procedure. We can convert any symmetric matrix $\mathcal{S}$ to a diagonal form by computing
$$\mathcal{U}^T\mathcal{S}\mathcal{U} = \Lambda.$$
This procedure is referred to as diagonalizing a matrix. Again, we assume that the elements of $\mathcal{U}$ are always ordered so that the elements of $\Lambda$ are sorted along the diagonal, with the largest value coming first. Diagonalization allows us to show that positive definiteness is equivalent to having all positive eigenvalues, and positive semidefiniteness is equivalent to having all non-negative eigenvalues.
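These facts are easy to check numerically. A minimal NumPy sketch, using numpy.linalg.eigh on a random symmetric matrix (eigh returns eigenvalues in ascending order, so they are re-sorted here to follow the largest-first convention; all names are illustrative):

```python
import numpy as np

# Build a random symmetric matrix S.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S = (A + A.T) / 2

# eigh handles symmetric matrices; flip the order so the largest eigenvalue comes first.
vals, vecs = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]
Lam = np.diag(vals[order])   # Lambda
U = vecs[:, order]           # columns are unit eigenvectors

print(np.allclose(S @ U, U @ Lam))        # S U = U Lambda
print(np.allclose(U.T @ S @ U, Lam))      # diagonalization: U^T S U = Lambda
print(np.allclose(U.T @ U, np.eye(4)))    # U is orthonormal
```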
Factoring a matrix: Assume that $\mathcal{S}$ is symmetric and positive semidefinite. We have that
$$\mathcal{S} = \mathcal{U}\Lambda\mathcal{U}^T$$
and all the diagonal elements of $\Lambda$ are non-negative. Now construct a diagonal matrix whose diagonal entries are the positive square roots of the diagonal elements of $\Lambda$; call this matrix $\Lambda^{(1/2)}$. We have $\Lambda^{(1/2)}\Lambda^{(1/2)} = \Lambda$ and $\left(\Lambda^{(1/2)}\right)^T = \Lambda^{(1/2)}$. Then we have that
$$\mathcal{S} = \left(\mathcal{U}\Lambda^{(1/2)}\right)\left(\Lambda^{(1/2)}\mathcal{U}^T\right) = \left(\mathcal{U}\Lambda^{(1/2)}\right)\left(\mathcal{U}\Lambda^{(1/2)}\right)^T$$
so we can factor $\mathcal{S}$ into the form $\mathcal{X}\mathcal{X}^T$ by computing the eigenvectors and eigenvalues.
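A minimal sketch of this factoring in NumPy (clipping tiny negative round-off eigenvalues before taking square roots is an implementation detail I have added, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
S = B @ B.T                       # B B^T is symmetric positive semidefinite

vals, vecs = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]
U = vecs[:, order]

# Positive square roots of the eigenvalues, guarding against round-off.
root = np.diag(np.sqrt(np.clip(vals[order], 0.0, None)))
X = U @ root                      # the factor

print(np.allclose(X @ X.T, S))    # S = X X^T
```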
15.1.1 The Singular Value Decomposition
For any $m \times p$ matrix $\mathcal{X}$, it is possible to obtain a decomposition
$$\mathcal{X} = \mathcal{U}\Sigma\mathcal{V}^T$$
where $\mathcal{U}$ is $m \times m$, $\mathcal{V}$ is $p \times p$, and $\Sigma$ is $m \times p$ and is diagonal. If you don't recall what a diagonal matrix looks like when the matrix isn't square, it's simple. All entries are zero, except the $i, i$ entries for $i$ in the range 1 to $\min(m, p)$. So if $\Sigma$ is tall and thin, the top square is diagonal and everything else is zero; if $\Sigma$ is short and wide, the left square is diagonal and everything else is zero. Both $\mathcal{U}$ and $\mathcal{V}$ are orthonormal (i.e. $\mathcal{U}\mathcal{U}^T = \mathcal{I}$ and $\mathcal{V}\mathcal{V}^T = \mathcal{I}$).

Notice that there is a relationship between forming an SVD and diagonalizing a matrix. In particular, $\mathcal{X}^T\mathcal{X}$ is symmetric, and it can be diagonalized as
$$\mathcal{X}^T\mathcal{X} = \mathcal{V}\Sigma^T\Sigma\mathcal{V}^T.$$
Similarly, $\mathcal{X}\mathcal{X}^T$ is symmetric, and it can be diagonalized as
$$\mathcal{X}\mathcal{X}^T = \mathcal{U}\Sigma\Sigma^T\mathcal{U}^T.$$
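A quick numerical illustration with numpy.linalg.svd (the matrix and its shape are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))                    # an m x p matrix, here 5 x 3

U, s, Vt = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)                     # "diagonal" for a non-square matrix

print(np.allclose(U @ Sigma @ Vt, X))          # X = U Sigma V^T

# X^T X is diagonalized by V, with eigenvalues s**2 (the extra eigenvalues
# of X X^T are zero).
print(np.allclose(Vt.T @ np.diag(s**2) @ Vt, X.T @ X))
```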
15.1.2 Approximating A Symmetric Matrix
Assume we have a $k \times k$ symmetric matrix $\mathcal{T}$, and we wish to construct a matrix $\mathcal{A}$ that approximates it. We require that (a) the rank of $\mathcal{A}$ is precisely $r < k$ and (b) the approximation should minimize the Frobenius norm, that is,
$$\left\|\mathcal{T} - \mathcal{A}\right\|_F^2 = \sum_{ij}\left(T_{ij} - A_{ij}\right)^2.$$
It turns out that there is a straightforward construction that yields $\mathcal{A}$.

The first step is to notice that if $\mathcal{U}$ is orthonormal and $\mathcal{M}$ is any matrix, then
$$\left\|\mathcal{U}\mathcal{M}\right\|_F = \left\|\mathcal{M}\right\|_F.$$
This is true because $\mathcal{U}$ is a rotation (as is $\mathcal{U}^T = \mathcal{U}^{-1}$), and rotations do not change the length of vectors. So, for example, if we write $\mathcal{M}$ as a table of column vectors $\mathcal{M} = \left[\mathbf{m}_1, \mathbf{m}_2, \ldots, \mathbf{m}_k\right]$, then $\mathcal{U}\mathcal{M} = \left[\mathcal{U}\mathbf{m}_1, \mathcal{U}\mathbf{m}_2, \ldots, \mathcal{U}\mathbf{m}_k\right]$. Now $\left\|\mathcal{M}\right\|_F^2 = \sum_j \left\|\mathbf{m}_j\right\|^2$, so $\left\|\mathcal{U}\mathcal{M}\right\|_F^2 = \sum_j \left\|\mathcal{U}\mathbf{m}_j\right\|^2$. But rotations do not change lengths, so $\left\|\mathbf{m}_j\right\|^2 = \left\|\mathcal{U}\mathbf{m}_j\right\|^2$, and so $\left\|\mathcal{U}\mathcal{M}\right\|_F = \left\|\mathcal{M}\right\|_F$. To see the result for the case of $\mathcal{M}\mathcal{U}$, just think of $\mathcal{M}$ as a table of row vectors.
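A small numerical illustration of this step; here the orthonormal matrix comes from a QR factorization, which is just a convenient way to produce one (not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(4, 4))

# The Q factor of a QR decomposition is orthonormal.
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))

print(np.linalg.norm(M, 'fro'))
print(np.linalg.norm(U @ M, 'fro'))   # same value: rotation preserves lengths
print(np.linalg.norm(M @ U, 'fro'))   # same again, for the M U case
```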
Notice that, if $\mathcal{U}$ is the orthonormal matrix whose columns are eigenvectors of $\mathcal{T}$, then we have
$$\left\|\mathcal{T} - \mathcal{A}\right\|_F = \left\|\mathcal{U}^T\left(\mathcal{T} - \mathcal{A}\right)\mathcal{U}\right\|_F.$$
Now write $\mathcal{D}$ for $\mathcal{U}^T\mathcal{A}\mathcal{U}$, and $\Lambda$ for the diagonal matrix of eigenvalues of $\mathcal{T}$. Then we have
$$\left\|\mathcal{T} - \mathcal{A}\right\|_F = \left\|\Lambda - \mathcal{D}\right\|_F,$$
an expression that is easy to solve for $\mathcal{D}$. We know that $\Lambda$ is diagonal, so the best $\mathcal{D}$ is diagonal, too. The rank of $\mathcal{A}$ must be $r$, so the rank of $\mathcal{D}$ must be $r$ as well. To get the best $\mathcal{D}$, we keep the $r$ largest diagonal values of $\Lambda$, and set the rest to zero; $\mathcal{D}$ has rank $r$ because it has only $r$ non-zero entries on the diagonal, and every other entry is zero.
Now to recover $\mathcal{A}$ from $\mathcal{D}$, we know that $\mathcal{U}^T\mathcal{U} = \mathcal{U}\mathcal{U}^T = \mathcal{I}$ (remember, $\mathcal{I}$ is the identity). We have $\mathcal{D} = \mathcal{U}^T\mathcal{A}\mathcal{U}$, so
$$\mathcal{A} = \mathcal{U}\mathcal{D}\mathcal{U}^T.$$
We can clean up this representation in a useful way. Notice that only the first $r$ columns of $\mathcal{U}$ (and the corresponding rows of $\mathcal{U}^T$) contribute to $\mathcal{A}$. The remaining $k - r$ are each multiplied by one of the zeros on the diagonal of $\mathcal{D}$. Remember that, by convention, $\Lambda$ was sorted so that the diagonal values are in descending order (i.e. the largest value is in the top left corner). We now keep only the top left $r \times r$ block of $\mathcal{D}$, which we write $\Lambda_r$. We then write $\mathcal{U}_r$ for the $k \times r$ matrix consisting of the first $r$ columns of $\mathcal{U}$. Then
$$\mathcal{A} = \mathcal{U}_r\Lambda_r\mathcal{U}_r^T.$$
This is so useful a result, I have displayed it in a box; you should remember it.
Procedure 15.1 (Approximating a Symmetric Matrix with a Low Rank Matrix)
Assume we have a symmetric $k \times k$ matrix $\mathcal{T}$. We wish to approximate $\mathcal{T}$ with a matrix $\mathcal{A}$ that has rank $r < k$. Write $\mathcal{U}$ for the matrix whose columns are eigenvectors of $\mathcal{T}$, and $\Lambda$ for the diagonal matrix of eigenvalues of $\mathcal{T}$ (so $\mathcal{T} = \mathcal{U}\Lambda\mathcal{U}^T$). Remember that, by convention, $\Lambda$ was sorted so that the diagonal values are in descending order (i.e. the largest value is in the top left corner).

Now construct $\Lambda_r$ from $\Lambda$ by setting the $k - r$ smallest values of $\Lambda$ to zero, and keeping only the top left $r \times r$ block. Construct $\mathcal{U}_r$, the $k \times r$ matrix consisting of the first $r$ columns of $\mathcal{U}$. Then
$$\mathcal{A} = \mathcal{U}_r\Lambda_r\mathcal{U}_r^T$$
is the best possible rank $r$ approximation to $\mathcal{T}$ in the Frobenius norm.
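A direct NumPy transcription of this procedure, as a sketch (the function name low_rank_approx and the test matrix are mine):

```python
import numpy as np

def low_rank_approx(T, r):
    """Rank-r approximation of a symmetric matrix T, following Procedure 15.1:
    keep the r largest eigenvalues and the corresponding eigenvectors."""
    vals, vecs = np.linalg.eigh(T)
    order = np.argsort(vals)[::-1]       # sort eigenvalues, largest first
    Lam_r = np.diag(vals[order][:r])     # top-left r x r block of Lambda
    U_r = vecs[:, order][:, :r]          # first r columns of U
    return U_r @ Lam_r @ U_r.T

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 6))
T = (A + A.T) / 2
A2 = low_rank_approx(T, 2)
print(np.linalg.matrix_rank(A2))         # rank of the approximation
print(np.linalg.norm(T - A2, 'fro'))     # approximation error
```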

Now if $\mathcal{T}$ is positive semidefinite (i.e. if at least the $r$ largest eigenvalues of $\mathcal{T}$ are non-negative), then we can factor $\mathcal{A}$ as in the previous section. This yields a procedure to approximate a symmetric matrix by factors. This is so useful a result, I have displayed it in a box; you should remember it.
Procedure 15.2 (Approximating a Symmetric Matrix with Low Dimensional Factors)
Assume we have a symmetric $k \times k$ matrix $\mathcal{T}$. We wish to approximate $\mathcal{T}$ with a matrix $\mathcal{A}$ that has rank $r < k$. We assume that at least the $r$ largest eigenvalues of $\mathcal{T}$ are non-negative. Write $\mathcal{U}$ for the matrix whose columns are eigenvectors of $\mathcal{T}$, and $\Lambda$ for the diagonal matrix of eigenvalues of $\mathcal{T}$ (so $\mathcal{T} = \mathcal{U}\Lambda\mathcal{U}^T$). Remember that, by convention, $\Lambda$ was sorted so that the diagonal values are in descending order (i.e. the largest value is in the top left corner).

Now construct $\Lambda_r$ from $\Lambda$ by setting the $k - r$ smallest values of $\Lambda$ to zero and keeping only the top left $r \times r$ block. Construct $\Lambda_r^{(1/2)}$ by replacing each diagonal element of $\Lambda_r$ with its positive square root. Construct $\mathcal{U}_r$, the $k \times r$ matrix consisting of the first $r$ columns of $\mathcal{U}$. Then write $\mathcal{V} = \mathcal{U}_r\Lambda_r^{(1/2)}$;
$$\mathcal{A} = \mathcal{V}\mathcal{V}^T$$
is the best possible rank $r$ approximation to $\mathcal{T}$ in the Frobenius norm.
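The corresponding sketch for this procedure (again, the function name and the test matrix are mine):

```python
import numpy as np

def factor_approx(T, r):
    """Rank-r factor V with V V^T approximating symmetric T (Procedure 15.2).
    Assumes the r largest eigenvalues of T are non-negative."""
    vals, vecs = np.linalg.eigh(T)
    order = np.argsort(vals)[::-1]
    top = vals[order][:r]                # r largest eigenvalues
    U_r = vecs[:, order][:, :r]          # first r columns of U
    return U_r @ np.diag(np.sqrt(top))   # V = U_r Lambda_r^(1/2)

rng = np.random.default_rng(5)
B = rng.normal(size=(6, 3))
T = B @ B.T                              # positive semidefinite, rank 3
V = factor_approx(T, 2)
print(V.shape)                           # (6, 2): low dimensional factors
print(np.linalg.norm(T - V @ V.T, 'fro'))
```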
15.2 Some Special Functions
Error functions and Gaussians: The error function is defined by
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$$
and programming environments can typically evaluate the error function. This fact is made useful to us by a simple change of variables. We get
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-\frac{u^2}{2}}\,du = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right).$$
A particularly useful manifestation of this fact comes by noticing that
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0} e^{-\frac{t^2}{2}}\,dt = \frac{1}{2}$$
(because $\frac{1}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}$ is a probability density function, and is symmetric about 0). As a result, we get
$$\frac{1}{\sqrt{2\pi}}\int_{0}^{x} e^{-\frac{t^2}{2}}\,dt = \frac{1}{2}\mathrm{erf}\left(\frac{x}{\sqrt{2}}\right).$$
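In Python, for instance, math.erf gives the error function, so the change of variables above becomes a one-line normal CDF (the function name gaussian_cdf is mine):

```python
from math import erf, sqrt

def gaussian_cdf(x):
    """Standard normal CDF via the error function, using the identity above."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(gaussian_cdf(0.0))    # 0.5, since the density is symmetric about 0
print(gaussian_cdf(1.96))   # about 0.975
```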

Inverse error
functions: We sometimes wish to know the value of
x such that
for some given p. The
relevant function of p is
known as the probit function
or the normal quantile function. We write
The probit function
can be expressed in terms of the inverse error function. Most programming environments can evaluate the
inverse error function (which is the inverse of the error
function). We have that
One problem we solve with some regularity is: choose u such that





Notice that
so that

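A sketch of these two recipes in Python; here the inverse error function is taken from scipy.special.erfinv, one of several environments that provide it (the function names are mine):

```python
from math import sqrt
from scipy.special import erfinv   # inverse error function

def probit(p):
    """Normal quantile: the x with P(standard normal <= x) = p."""
    return sqrt(2.0) * erfinv(2.0 * p - 1.0)

def symmetric_interval(p):
    """The u with P(-u <= standard normal <= u) = p."""
    return sqrt(2.0) * erfinv(p)

print(probit(0.975))             # about 1.96
print(symmetric_interval(0.95))  # about 1.96
```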

Gamma functions: The gamma function $\Gamma(z)$ is defined by a series of steps. First, we have that for $n$ an integer,
$$\Gamma(n) = (n - 1)!$$
and then for $z$ a complex number with positive real part (which includes positive real numbers), we have
$$\Gamma(z) = \int_0^{\infty} t^{z-1}e^{-t}\,dt.$$
By doing this, we get a function on positive real numbers that is a smooth interpolate of the factorial function. We won't do any real work with this function, so won't expand on this definition. In practice, we'll either look up a value in tables or require a software environment to produce it.
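For example, Python's math.gamma produces it directly:

```python
from math import gamma, pi, sqrt

print(gamma(5))               # 24.0, i.e. 4!
print(gamma(0.5), sqrt(pi))   # both about 1.7725: Gamma(1/2) = sqrt(pi)
```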



15.3 Splitting a Node in a Decision Tree
We want to choose a split that yields the most
information about the classes. To do so, we need to be able to
account for information. The proper measure is entropy (described
in more detail below). You should think of entropy as the number of
bits, on average, that would be required to determine the value of
a random variable. Filling in the details will allow us to
determine which of two splits is better, and to tell whether it is
worth splitting at all. At a high level, it is easy to compute
which of two splits is better. We determine the entropy of the
class conditioned on each split, then take the split which yields
the lowest entropy. This works because less information (fewer
bits) is required to determine the value of the class once we have
that split. Similarly, it is easy to compute whether to split or
not. We compare the entropy of the class conditioned on each split
to the entropy of the class without a split, and choose the case
with the lowest entropy, because less information (fewer bits) is
required to determine the value of the class in that case.
15.3.1 Accounting for Information with Entropy
It turns out to be
straightforward to keep track of information, in simple cases. We
will start with an example. Assume I have 4 classes. There are 8
examples in class 1, 4 in class 2, 2 in class 3, and 2 in class 4.
How much information on
average will you need to send me to tell me the class of a
given example? Clearly, this depends on how you communicate the
information. You could send me the complete works of Edward Gibbon
to communicate class 1; the Encyclopaedia for class 2; and so on.
But this would be redundant. The question is how little can you
send me. Keeping track of the amount of information is easier if we
encode it with bits (i.e. you can send me sequences of ‘0’s and
‘1’s).
Imagine the following scheme. If an example is in
class 1, you send me a ‘1’. If it is in class 2, you send me ‘01’;
if it is in class 3, you send me ‘001’; and in class 4, you send me
‘000’. Then the expected number of bits you will send me is
$$p(1) \times 1 + p(2) \times 2 + p(3) \times 3 + p(4) \times 3 = \frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{8} \times 3 + \frac{1}{8} \times 3,$$
which is 1.75 bits. This number doesn't have to be an integer,
because it's an expectation.

Notice that for the $i$'th class, you have sent me $-\log_2 p(i)$ bits. We can write the expected number of bits you need to send me as
$$-\sum_i p(i)\log_2 p(i).$$
This expression handles other simple cases correctly, too. You should notice that it isn't really important how many objects appear in each class. Instead, the fraction of all examples that appear in the class is what matters. This fraction is the prior probability that an item will belong to the class. You should try what happens if you have two classes, with an even number of examples in each; 256 classes, with an even number of examples in each; and 5 classes, with $p(1) = 1/2$, $p(2) = 1/4$, $p(3) = 1/8$, $p(4) = 1/16$ and $p(5) = 1/16$. If you try other examples, you may find it hard to construct a scheme where you can send as few bits on average as this expression predicts. It turns out that, in general, the smallest number of bits you will need to send me is given by the expression
$$-\sum_i p(i)\log_2 p(i)$$
under all conditions, though it may be hard or impossible to determine what representation is required to achieve this number.


The entropy of a probability distribution is a number that scores how many bits, on average, would need to be known to identify an item sampled from that probability distribution. For a discrete probability distribution, the entropy is computed as
$$-\sum_i p(i)\log_2 p(i)$$
where $i$ ranges over all the numbers where $p(i)$ is not zero. For example, if we have two classes and $p(1) = 0.99$, then the entropy is 0.0808, meaning you need very little information to tell which class an object belongs to. This makes sense, because there is a very high probability it belongs to class 1; you need very little information to tell you when it is in class 2. If you are worried by the prospect of having to send 0.0808 bits, remember this is an average, so you can interpret the number as meaning that, if you want to tell which class each of $10^4$ independent objects belongs to, you could do so in principle with only 808 bits.
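A short Python helper makes these numbers easy to check, including the distributions suggested as exercises above (the function name is mine):

```python
from math import log2

def entropy(probs):
    """Entropy in bits of a discrete distribution, skipping zero-probability terms."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([8/16, 4/16, 2/16, 2/16]))            # 1.75, the coding example above
print(entropy([0.5, 0.25, 0.125, 0.0625, 0.0625]))  # 1.875, the five-class exercise
print(round(entropy([0.99, 0.01]), 4))              # 0.0808
print(entropy([0.5, 0.5]))                          # 1.0
```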

Generally, the entropy is
larger if the class of an item is more uncertain. Imagine we have
two classes and $p(1) = 0.5$, then the entropy is 1, and this is the largest possible value
for a probability distribution on two classes. You can always tell
which of two classes an object belongs to with just one bit (though
you might be able to tell with even less than one bit).
15.3.2 Choosing a Split with Information Gain
Write $\mathcal{P}$ for the set of all data at the node. Write $\mathcal{P}_l$ for the left pool, and $\mathcal{P}_r$ for the right pool. The entropy of a pool $\mathcal{C}$ scores how many bits would be required to represent the class of an item in that pool, on average. Write $n(i;\mathcal{C})$ for the number of items of class $i$ in the pool, and $N(\mathcal{C})$ for the number of items in the pool. Then the entropy $H(\mathcal{C})$ of the pool $\mathcal{C}$ is
$$H(\mathcal{C}) = -\sum_i \frac{n(i;\mathcal{C})}{N(\mathcal{C})}\log_2\frac{n(i;\mathcal{C})}{N(\mathcal{C})}.$$
It is straightforward that $H(\mathcal{P})$ bits are required to classify an item in the parent pool $\mathcal{P}$. For an item in the left pool, we need $H(\mathcal{P}_l)$ bits; for an item in the right pool, we need $H(\mathcal{P}_r)$ bits. If we split the parent pool, we expect to encounter items in the left pool with probability
$$\frac{N(\mathcal{P}_l)}{N(\mathcal{P})}$$
and items in the right pool with probability
$$\frac{N(\mathcal{P}_r)}{N(\mathcal{P})}.$$

This means that, on average, we must supply
$$\frac{N(\mathcal{P}_l)}{N(\mathcal{P})}H(\mathcal{P}_l) + \frac{N(\mathcal{P}_r)}{N(\mathcal{P})}H(\mathcal{P}_r)$$
bits to classify data items if we split the parent pool. Now a good split is one that results in left and right pools that are informative. In turn, we should need fewer bits to classify once we have split than we need before the split. You can see the difference
$$H(\mathcal{P}) - \left(\frac{N(\mathcal{P}_l)}{N(\mathcal{P})}H(\mathcal{P}_l) + \frac{N(\mathcal{P}_r)}{N(\mathcal{P})}H(\mathcal{P}_r)\right)$$
as the information gain caused by the split. This is the average number of bits that you don't have to supply if you know which side of the split an example lies on. Better splits have larger information gain.
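A minimal sketch of this computation in Python (the helper names and the toy pools are mine):

```python
from math import log2
from collections import Counter

def pool_entropy(labels):
    """H(C): entropy in bits of the class labels in a pool."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """H(parent) minus the weighted entropy of the two child pools."""
    n = len(parent)
    return (pool_entropy(parent)
            - (len(left) / n) * pool_entropy(left)
            - (len(right) / n) * pool_entropy(right))

parent = ['a'] * 8 + ['b'] * 8
left, right = parent[:8], parent[8:]           # a perfect split
print(information_gain(parent, left, right))   # 1.0 bit gained
```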


Recall that our decision function is to choose a feature at random, then test its value against a threshold. Any data point where the value is larger goes to the left pool; where the value is smaller goes to the right. This may sound much too simple to work, but it is actually effective and popular. Assume that we are at a node, which we will label $k$. We have the pool of training examples that have reached that node. The $i$'th example has a feature vector $\mathbf{x}_i$, and each of these feature vectors is a $d$ dimensional vector.

We choose an integer $j$ in the range $1 \ldots d$ uniformly and at random. We will split on this feature, and we store $j$ in the node. Recall we write $x_i^{(j)}$ for the value of the $j$'th component of the $i$'th feature vector. We will choose a threshold $t_k$, and split by testing the sign of $x_i^{(j)} - t_k$. Choosing the value of $t_k$ is easy. Assume there are $N_k$ examples in the pool. Then there are $N_k - 1$ possible values of $t_k$ that lead to different splits. To see this, sort the $N_k$ examples by $x^{(j)}$, then choose values of $t_k$ halfway between example values. For each of these values, we compute the information gain of the split. We then keep the threshold with the best information gain.
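A sketch of this threshold search in Python, under the assumption that class labels are stored in a list y aligned with the rows of an array X of feature vectors (all names are mine):

```python
import numpy as np
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(X, y, j):
    """Try thresholds halfway between sorted values of feature j and keep the
    one with the largest information gain, as in the procedure above."""
    order = np.argsort(X[:, j])
    xs = X[order, j]
    ys = [y[i] for i in order]
    H_parent, n = entropy(ys), len(ys)
    best_gain, best_t = -1.0, None
    for i in range(n - 1):
        if xs[i] == xs[i + 1]:
            continue                           # identical values give no new split
        t = (xs[i] + xs[i + 1]) / 2
        left, right = ys[i + 1:], ys[:i + 1]   # larger values go to the left pool
        gain = (H_parent
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 5))
y = [int(v > 0) for v in X[:, 2]]    # class depends on feature 2
j = int(rng.integers(0, X.shape[1])) # choose a feature at random
print(j, best_threshold(X, y, j))
```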
We can elaborate this procedure in a useful way, by choosing $m$ features at random, finding the best split for each, then keeping the feature and threshold value that is best. It is important that $m$ is a lot smaller than the total number of features; a usual rule of thumb is that $m$ is about the square root of the total number of features. It is usual to choose a single $m$, and use it for all the splits.