
Then p star maximizes H(p) over all probability distributions p on S, subject to the constraints in 1.

Note that, because of its exponential form, p star of x is always strictly positive, and so the support of p star is equal to S.

If we let q_i equal e to the power minus lambda_i, then we can write p star of x as e to the power minus lambda_0, times e to the power minus lambda_1 times r_1(x), all the way to e to the power minus lambda_m times r_m(x), which is equal to q_0 times q_1 to the power of r_1(x), all the way to q_m to the power of r_m(x).
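As a sanity check on this exponential form, here is a small numerical sketch. The alphabet S = {0, ..., 4}, the single constraint function r_1(x) = x, and the target mean 1.5 are all hypothetical choices, not from the lecture; the code solves for q_1 so that the constraint holds and confirms that the resulting p star has larger entropy than another feasible distribution.

```python
import math

# Hypothetical setup (not from the lecture): S = {0,...,4}, a single
# constraint function r_1(x) = x, and target mean 1.5.
S = range(5)
target = 1.5

def p_star(q1):
    # p*(x) = q_0 * q_1**x, with q_0 absorbed into the normalization
    w = [q1 ** x for x in S]
    z = sum(w)
    return [v / z for v in w]

# Solve for q_1 by bisection so that the mean constraint is satisfied
# (the mean of p* is increasing in q_1).
lo, hi = 1e-9, 100.0
for _ in range(200):
    mid = (lo + hi) / 2
    if sum(x * p for x, p in zip(S, p_star(mid))) < target:
        lo = mid
    else:
        hi = mid
p = p_star((lo + hi) / 2)

def H(dist):
    # Shannon entropy in nats
    return -sum(px * math.log(px) for px in dist if px > 0)

# Another distribution on S with the same mean, e.g. half on 0 and half on 3:
other = [0.5, 0.0, 0.0, 0.5, 0.0]
print(H(p) > H(other))  # True: p* attains the larger entropy
```

The strict inequality reflects the fact that p star is the unique maximizer among all feasible distributions on S.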


Then p star maximizes the entropy of p over all probability distributions p defined on S, subject to the constraints summation over x in S_p of p(x) times r_i(x) equals summation over x in S of p star of x times r_i(x), for all i between 1 and m.


The proof goes as follows.

Let a_i be the summation over x in S of p star of x times r_i(x), for all i.


Then, by construction, the parameters lambda_0, lambda_1, up to lambda_m in p star of x are such that p star satisfies the constraints summation over x in S_p of p(x) times r_i(x) equals a_i, for all i.


Then the corollary is implied by theorem 2.50.


Example 2.52 is an illustration of an application of theorem 2.50.

Let S be finite and let the set of constraints be empty.

Then p star of x is equal to e to

the power of minus lambda_0 which does not depend on x.

Therefore, p star is simply the uniform distribution over S, that

is p star of x is equal to the reciprocal of the size of S for all x in S.

This is consistent with our previous result, that over a

finite alphabet, the entropy is maximized by the uniform distribution.
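A quick numerical check of this statement; the alphabet size 4 and the comparison pmf are arbitrary illustrative choices.

```python
import math

def H(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

n = 4
uniform = [1.0 / n] * n
print(H(uniform))  # log(4) ~= 1.386, the maximum over all pmfs on 4 symbols

# Any non-uniform pmf on the same alphabet has strictly smaller entropy:
print(H([0.5, 0.3, 0.1, 0.1]) < H(uniform))  # True
```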


The next example is a little bit more elaborate.

Let S be the set of nonnegative integers 0, 1, 2, and so on.

And let the set of constraints be summation x p(x) times x equal to a

where a is greater than or equal to 0.
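The excerpt moves on before giving the answer for this example, but the maximizing distribution in this case is known to be the geometric distribution with mean a. As a hedged sketch under that assumption (the value a = 2 and the comparison distribution are illustrative choices, and the infinite sums are truncated):

```python
import math

# Known result (background, not stated in this excerpt): over the
# nonnegative integers with mean fixed at a, entropy is maximized by the
# geometric distribution p*(x) = (1 - q) * q**x with q = a / (1 + a).
a = 2.0
q = a / (1 + a)
N = 2000  # truncation point for the numerical sums

p_geom = [(1 - q) * q ** x for x in range(N)]

mean = sum(x * p for x, p in zip(range(N), p_geom))
H_geom = -sum(p * math.log(p) for p in p_geom if p > 0)

# Compare against another distribution with the same mean,
# e.g. uniform on {0,...,4}, which also has mean 2:
p_unif = [0.2] * 5
H_unif = -sum(p * math.log(p) for p in p_unif)
print(mean, H_geom > H_unif)  # mean ~= 2.0, and True
```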


We consider maximizing the differential entropy of f subject to the constraint that the integral of x square times f(x) dx, that is, the expectation of X square, is equal to kappa. Then by theorem 10.41, f star of x has the form a times e to the power minus b x square, which is a Gaussian distribution with zero mean.


In order to satisfy the second moment constraint, the only choice is a equals 1 over square root 2 pi kappa and b equals 1 over 2 kappa.
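These choices can be checked by crude numerical integration; the sketch below (kappa = 3 is an arbitrary illustrative value) verifies that f star integrates to 1 and has second moment kappa.

```python
import math

# f*(x) = a * exp(-b x^2) with a = 1/sqrt(2*pi*kappa), b = 1/(2*kappa).
kappa = 3.0
a = 1.0 / math.sqrt(2 * math.pi * kappa)
b = 1.0 / (2 * kappa)

def f_star(x):
    return a * math.exp(-b * x * x)

# Riemann sum on [-L, L]; L is large enough for the tails to be negligible.
L, n = 12 * math.sqrt(kappa), 100000
dx = 2 * L / n
xs = [-L + i * dx for i in range(n + 1)]
total = sum(f_star(x) for x in xs) * dx
second_moment = sum(x * x * f_star(x) for x in xs) * dx
print(total, second_moment)  # ~1.0 and ~3.0
```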


Next we are going to illustrate an application of corollary 10.42.

We start with a pdf of a Gaussian

random variable, with mean 0 and variance sigma square

and call this density function f star of x, which is equal to 1 over square

root 2 pi sigma square, e to the power minus x square over 2 sigma square.


To see what maximization problem f star solves, we write f star of x as e to the power minus lambda_0, times e to the power minus lambda_1 times x square,


where lambda_0 is equal to 1 half log 2 pi sigma

square, and lambda_1 is equal to 1 over 2 sigma square.

Then according to Corollary 10.42, f star maximizes the differential entropy,

over all density functions, subject to the second moment equal to sigma square.


The next theorem says that for a continuous random variable X with mean mu and variance sigma square, the differential entropy is upper bounded by one half log 2 pi e sigma square, with equality if and only if X is a Gaussian random variable with mean mu and variance sigma square.


The proof goes as follows.

Let X prime equals X minus mu.


Then by theorem 10.43, the differential entropy of X prime is less than or equal to one half log 2 pi e sigma square, because the second moment of X prime is equal to sigma square. Since differential entropy is invariant under translation, the differential entropy of X is equal to that of X prime, and so the same upper bound holds for X.


Equality holds if and only if X prime is a Gaussian random variable with mean 0 and variance sigma square, or equivalently, X is a Gaussian random variable with mean mu and variance sigma square.
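The theorem can be illustrated by comparing closed-form differential entropies of a few common densities with the same variance; the sketch below uses the standard entropy formulas for the uniform and Laplace densities (the value sigma square = 2 is arbitrary).

```python
import math

# Closed-form differential entropies (in nats) of three densities that all
# have variance sigma2; the Gaussian should come out on top, matching the
# bound h(X) <= 0.5*log(2*pi*e*sigma2).
sigma2 = 2.0

h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma2)
# Uniform on [-c, c] has variance c^2/3 and entropy log(2c).
c = math.sqrt(3 * sigma2)
h_uniform = math.log(2 * c)
# Laplace with scale b has variance 2*b^2 and entropy 1 + log(2b).
b = math.sqrt(sigma2 / 2)
h_laplace = 1 + math.log(2 * b)

print(h_gaussian > h_uniform and h_gaussian > h_laplace)  # True
```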


The following remark is somewhat subtle.

Theorem 10.43 says that with the constraints

that the second moment is equal to kappa,

the differential entropy is maximized by the

Gaussian distribution with mean 0 and variance kappa.


If we impose the additional constraint that the mean is equal to 0, then both the variance of X and the second moment of X are equal to kappa.


By theorem 10.44, the differential entropy is still maximized

by the Gaussian distribution with 0 mean and variance kappa.


Now we discuss a relation between differential

entropy, and the spread of the distribution.

From theorem 10.44, we have that the differential entropy of a random variable X is less than or equal to one half log 2 pi e sigma square, where sigma square is equal to the variance of X. Writing one half log 2 pi e sigma square as log sigma plus one half log 2 pi e, we see that the differential entropy is upper bounded by log sigma, the log of the standard deviation, which is a measure of the spread of the distribution, plus a constant, that is, one half log 2 pi e.

In particular, as sigma, the standard deviation, tends

to 0, the differential entropy tends to minus infinity.
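A one-line computation makes this concrete: evaluating the bound one half log 2 pi e sigma square for shrinking sigma shows it decreasing without bound (the particular sigma values are arbitrary).

```python
import math

# The upper bound 0.5*log(2*pi*e*sigma^2) = log(sigma) + 0.5*log(2*pi*e)
# drops without bound as the standard deviation sigma shrinks:
for sigma in (1.0, 0.1, 0.01, 0.001):
    print(sigma, 0.5 * math.log(2 * math.pi * math.e * sigma ** 2))
```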


The next two theorems are the vector generalizations of theorems 10.43 and 10.44, respectively.

Let X be a vector of n continuous random variables with correlation matrix K tilde. Then the differential entropy of X is upper bounded by one half log of 2 pi e to the power n times the determinant of the correlation matrix K tilde, with equality if and only if X is a Gaussian vector with mean 0 and covariance matrix K tilde.
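As a consistency check on the right-hand side of this bound (the matrices below are arbitrary examples): for a diagonal K tilde the formula must reduce to the sum of n scalar Gaussian entropies, and introducing correlation can only decrease the determinant (Hadamard's inequality), hence the bound.

```python
import numpy as np

# For a Gaussian vector, the differential entropy is exactly
# 0.5 * log((2*pi*e)^n * det(K)).
def gaussian_vector_entropy(K):
    n = K.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))

# Diagonal case: must agree with the sum of scalar Gaussian entropies
# 0.5 * log(2*pi*e*sigma_i^2).
K_diag = np.diag([1.0, 4.0, 9.0])
h_joint = gaussian_vector_entropy(K_diag)
h_sum = sum(0.5 * np.log(2 * np.pi * np.e * s2) for s2 in (1.0, 4.0, 9.0))
print(np.isclose(h_joint, h_sum))  # True

# With off-diagonal correlation, det(K) <= prod(diag(K)) (Hadamard),
# so the joint entropy can only be smaller:
K_corr = np.array([[1.0, 0.5], [0.5, 1.0]])
print(gaussian_vector_entropy(K_corr)
      < gaussian_vector_entropy(np.diag(np.diag(K_corr))))  # True
```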


Theorem 10.46 says that for a random vector X with mean mu and covariance matrix K, the differential entropy is upper bounded by one half log of 2 pi e to the power n times the determinant of K, with equality if and only if X is a Gaussian vector with mean mu and covariance matrix K.


We now prove theorem 10.45.

Define the function r_{ij}(x) to be x_i times x_j, and let

the (i,j)-th element of the matrix K tilde be k tilde ij.

Then the constraints on the pdf of the random vector X,

namely the requirement that the correlation matrix is equal to K

tilde, are equivalent to setting the integral of r_{ij}(x)

f(x)dx over the support of f to k tilde ij.


This is because r_{ij}(x) is equal to x_i times x_j, and so this integral is equal to the expectation of X_i times X_j, that is, the correlation between X_i and X_j, for all i and j between 1 and n.


Now by theorem 10.41, the joint pdf that

maximizes the differential entropy, has the form,

f star of x equals e to the power minus lambda_0,

minus summation over all i and j, lambda_{ij} x_i times x_j,

where x_i times x_j is r_{ij}(x).


Here, the summation over all i,j, lambda_{ij} times x_i

times x_j can be written as x transpose L times x


where L is an n by n matrix, with the (i,j)-th element equal to lambda_{ij}.


Thus, f star is the joint pdf

of a multivariate Gaussian distribution with 0 mean.


To see this, we only need to compare the form of f star of x with the pdf of a Gaussian distribution with mean 0.
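This comparison can be sketched numerically: matching e to the power minus lambda_0 minus x transpose L x against the zero-mean Gaussian pdf forces L to be one half times the inverse of the covariance matrix, and e to the power minus lambda_0 to be the Gaussian normalizing constant. The matrix K below is an arbitrary positive definite example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matching f*(x) = exp(-lambda_0 - x^T L x) to the zero-mean Gaussian pdf
# forces L = 0.5 * inv(K) and lambda_0 = 0.5 * log((2*pi)^n * det(K)).
K = np.array([[2.0, 0.6], [0.6, 1.0]])
n = K.shape[0]
L_mat = 0.5 * np.linalg.inv(K)
lambda0 = 0.5 * np.log((2 * np.pi) ** n * np.linalg.det(K))

def f_star(x):
    return np.exp(-lambda0 - x @ L_mat @ x)

def gaussian_pdf(x):
    z = np.sqrt((2 * np.pi) ** n * np.linalg.det(K))
    return np.exp(-0.5 * x @ np.linalg.inv(K) @ x) / z

# The two expressions agree at arbitrary sample points:
xs = rng.normal(size=(5, n))
print(all(np.isclose(f_star(x), gaussian_pdf(x)) for x in xs))  # True
```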


Then for all i and j between 1 and n, the covariance between X_i and X_j is equal to the expectation of X_i times X_j minus the expectation of X_i times the expectation of X_j, where the expectation of X_i and the expectation of X_j are both equal to 0. So the covariance between X_i and X_j reduces to the expectation of X_i times X_j, which is constrained to be k tilde ij. Hence f star is the joint pdf of the Gaussian distribution with mean 0 and covariance matrix K tilde.

Hence, by theorem 10.20, which gives the differential entropy of a Gaussian distribution, we have proved the upper bound on the differential entropy of the random vector X in terms of the correlation matrix K tilde, with equality if and only if X is a Gaussian vector with mean 0 and covariance matrix K tilde.