0:00

Previously, we looked at

information entropy when the probability is uniformly distributed.

So this was Hartley's construction,

Hartley's formula from 1928, for the uniform distribution case.

Again, that's where the uniform distribution comes from.

It comes from the modeling of the outcome distribution.

Now, what happens when this is not true, when the probability distribution is not uniform?

So, let's look into that.

And we will look at a more generalized equation

to express information entropy.

Let's build on what we have on

this sheet, the example of the weather in Gotham City.

We will be using the same weather information as the random phenomenon,

but let's use a different distribution.

Let's actually use a different city altogether.

I'm just going to get rid of some of the parts that are less relevant.

And this was the uniform case.

And we'll be using the same variable definitions

with small m being the number of independent events.

Instead of Gotham City,

which has a very random weather,

let's use Colorado Springs.

And Colorado is known to be sunny.

So there will be a bias towards the sunny weather.

Again, let's suppose in Colorado Springs,

the weather is either sunny,

rainy, snowy or cloudy.

Let's make that probability distribution non-uniform.

And again, as I mentioned,

Colorado itself is famous for being a sunny state, a sunny Colorado.

So let's suppose that the probability of Colorado Springs being sunny is

one half, and the probability

of the weather being rainy is 0.125 or 1/8,

the probability of the weather being snowy is also 1/8 or 0.125,

the probability of being cloudy is 0.25 or a quarter.

And these four probabilities will sum up to one.
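As a quick check, here is a minimal Python sketch of this distribution (the outcome labels and variable names are my own choices):

```python
# Hypothetical non-uniform weather distribution for Colorado Springs
p = {"sunny": 0.5, "rainy": 0.125, "snowy": 0.125, "cloudy": 0.25}

# The four probabilities must sum to one
print(sum(p.values()))  # 1.0
```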

Similarly to what we did before,

we'll ask the question about how many bits are needed to

communicate the weather in Colorado Springs.

So it's no longer Gotham City,

but it will be Colorado Springs.

And we would want to make

this communication or the bit transfer as efficient as possible.

We could keep this way of encoding;

however, such an encoding scheme is not going to be efficient.

Let's actually choose another encoding scheme which is more efficient.

4:01

And the way that we're going to do that is we're going to have

the more probable outcomes take a smaller number of bits.

So if the bit is zero,

then the weather in Colorado Springs will be sunny.

And let's use a blue pen for that.

And if the first bit is one,

then the weather in Colorado Springs will not be sunny.

So, meaning that it will be rainy or snowy or cloudy.

Now, if that first bit is zero,

then we, as an outsider,

know that the weather in Colorado Springs is sunny,

but if the bit is one,

then we don't know for certain

which of the three weather conditions Colorado Springs is experiencing.

So if the first bit is one,

then we need another bit.

And if that second bit is zero,

let's suppose that the weather condition in Colorado Springs is cloudy,

and if the bit is one,

then the weather condition is rainy or snowy.

And we're not done if the bit sequence is 1 1.

So, after the two bits 1 1,

if the third bit is zero,

then the weather condition in Colorado Springs is rainy,

and if the third bit is also one,

then the weather condition in Colorado Springs is snowy.

In this case, we actually have a different number of bits for different weather outcomes.

The sunny weather uses only one bit,

the bit 0; the cloudy weather is encoded as 1 0.

6:11

So, the weather being cloudy is encoded with two bits, and likewise,

for rainy, it corresponds to 1 1 0,

and for snowy, 1 1 1.

Now, we actually see

a different number of bits for different outcomes, more specifically,

one bit for the sunny case and the sunny is most probable,

two bits for cloudy,

and three bits for rainy and snowy.
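The scheme just described is a prefix-free code: no codeword is a prefix of another, so a stream of bits can be decoded unambiguously. A minimal Python sketch of it (the function and variable names are my own choices):

```python
# Prefix-free code from the lecture: more probable outcomes get shorter codewords
code = {"sunny": "0", "cloudy": "10", "rainy": "110", "snowy": "111"}
inverse = {bits: outcome for outcome, bits in code.items()}

def encode(days):
    """Concatenate the codewords for a sequence of daily weather outcomes."""
    return "".join(code[day] for day in days)

def decode(bits):
    """Scan the bit string left to right; since no codeword is a prefix
    of another, the first codeword that matches is always correct."""
    days, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            days.append(inverse[buffer])
            buffer = ""
    return days

message = encode(["sunny", "cloudy", "snowy"])
print(message)          # 010111
print(decode(message))  # ['sunny', 'cloudy', 'snowy']
```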

Let's look at the information entropy.

For one day or one weather event,

7:26

so for one day,

we see that the information entropy is equal to

one bit times the probability of the weather being sunny,

plus two bits times the probability of the weather being cloudy,

plus three bits times the probability of the weather being rainy,

plus three bits times the probability of the weather being snowy.

And this is the expected number of

bits that you would use for communication.

Notice that this first part, the number of bits, is

log base two of one over P_sub_i, and the second part is P_sub_i.

Again, the first part is log of one

over P_sub_i and the second part is the probability.

To recap, we can express that as a summation.

This boils down to the sum from i equals one to capital N,

where small i indexes the outcomes,

of P_sub_i times log base two of one over P_sub_i.

And given that one over P_sub_i is P_sub_i to the minus first power inside the logarithm,

this is equal to

minus the sum from i equals one to N of P_sub_i

times log base two of P_sub_i.

So this is still information entropy for one day.
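For this particular distribution, the expected code length and the one-day entropy coincide, because each codeword's length equals log base two of one over its probability. A small Python check (the variable names are my own choices):

```python
import math

# Outcome probabilities and codeword lengths from the lecture's scheme
p = {"sunny": 0.5, "cloudy": 0.25, "rainy": 0.125, "snowy": 0.125}
bits = {"sunny": 1, "cloudy": 2, "rainy": 3, "snowy": 3}

# Expected number of bits per day under the variable-length code
expected_bits = sum(p[o] * bits[o] for o in p)

# One-day information entropy: sum of p_i * log2(1 / p_i)
H = sum(p_i * math.log2(1 / p_i) for p_i in p.values())

print(expected_bits)  # 1.75
print(H)              # 1.75
```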

And given small m independent weather events, or m independent days,

we actually have the entropy for those m independent days.

We have the information entropy to be m times this expression,

so m times the sum from i equals

one to N of P_sub_i times log base two of one over P_sub_i.

And this equation was established by Claude Shannon in 1948,

and it's a generalization and can be used even in

the cases where the probability distribution is not uniform.

Again, to recap that,

H is equal to m times the sum over

the possible outcomes of P_sub_i times log

base two, because we're encoding them in bits, of one over P_sub_i.

And another form of this would be to pull the minus sign out of the logarithm

11:30

to get minus m times the sum, over all the outcomes,

of P_sub_i times log base two of P_sub_i.

So again, this was constructed or established by Claude Shannon in 1948.
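As a sketch, here is Shannon's formula for m independent days in both of the forms above (the value of m is my own choice for illustration):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]  # outcome probabilities
m = 7                          # number of independent days (arbitrary choice)

# Shannon's formula: H = m * sum_i p_i * log2(1 / p_i)
H = m * sum(p_i * math.log2(1 / p_i) for p_i in p)

# Equivalent form with the minus sign pulled out: H = -m * sum_i p_i * log2(p_i)
H_alt = -m * sum(p_i * math.log2(p_i) for p_i in p)

print(H)      # 12.25
print(H_alt)  # 12.25
```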

Previously, I talked about the efficiency of

such an encoding scheme, and we actually had

it so that the more probable outcomes use fewer bits.

This is common practice

in digital encoding when we want to make the encoding scheme efficient.

An example of this is Morse code.

When we take English letters and encode them in Morse code,

which uses an alphabet of two symbols,

dots and dashes, we take

the most probable English letters and encode them using a smaller number of dots and dashes.

For example, in Morse code,

we encode the letter E as a single dot and the letter T

as a single dash, and the rest of the letters use more than one dot or dash.
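As a small illustration, a handful of standard ITU Morse codewords in Python (only a few letters shown; the point is that the most frequent letters, E and T, get the shortest codes):

```python
# A few ITU Morse codewords: frequent letters (E, T) get the shortest codes
morse = {"E": ".", "T": "-", "A": ".-", "N": "-.", "I": "..", "Q": "--.-"}

for letter, code in morse.items():
    print(letter, code, "length:", len(code))
```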