In this lecture we're going to talk about what it means to log data and what impact
that has when you do things like take arithmetic means of logged data and create
confidence intervals and this sort of thing.
So we'll talk about logs, we'll talk about the geometric mean, which is intrinsically
related to taking logs of data and taking arithmetic means.
And we'll talk about the geometric mean and its relationship with the law of large
numbers and the central limit theorem. And then we'll go through some of the
existing techniques that we've already gone through, like creating T confidence
intervals, but go over how they're interpreted with respect to logged data. And then we'll finish by talking about the
log normal distribution. So just to remind everyone a little bit
about logs. Log base b of the number x is the number y such that b to the y equals x. And log base b of one is always going to be zero, because b to the zero equals one. And log base b of x goes to minus infinity as x goes to zero. And then, you know, for the class, we've always been writing just log when the base is e, for Euler's number; sometimes people write ln in that case.
There are basically only three bases for logs that people ever use.
Base E has a lot of nice mathematical properties.
Base ten is nice because then the log speaks in orders of magnitude, right.
Log base ten of ten is one. Log base ten of 100 is two.
Log base ten of 1,000 is three, and so on. And then log base two is often very useful as well; because two is a smaller base than ten, you get smaller powers to work with. And just to remind everyone, log AB is log
A plus log B, log A raised to the B power is B log A, and log A divided by B is log
A minus log B. In other words, log turns multiplication
into addition, division into subtraction, and powers into multiplication.
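If you want to convince yourself numerically, here's a minimal sketch in Python; the numbers a and b are just arbitrary positive values for illustration:

```python
import numpy as np

a, b = 3.0, 7.0  # arbitrary positive numbers

# log of a product is the sum of the logs
assert np.isclose(np.log(a * b), np.log(a) + np.log(b))
# log of a power pulls the exponent out front
assert np.isclose(np.log(a ** b), b * np.log(a))
# log of a quotient is the difference of the logs
assert np.isclose(np.log(a / b), np.log(a) - np.log(b))
```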
So hopefully none of this is news to you. So that's sort of the mathematical
properties of the log. But statistically, why do we take logs of
data? The most common reason to take a log of
data is if the data is sort of skewed high.
And what I mean by that, for example, is incomes are a great traditional example of
things that tend to be skewed high. You have a lot of people making very
little money and a handful of people making a lot of money.
And so that distribution looks like a hump towards zero.
And it spreads out with a long tail towards high values.
And so sometimes you might take logs of income data to try and make it look more bell-shaped.
This occurs frequently in biostatistics, for example with health expenditures.
A lot of people tend to spend very little on healthcare until healthcare becomes a
problem, then they spend a lot. So distributions like healthcare
expenditures and other things like that tend to be right skewed especially because
they're bounded from below by zero. In settings where errors are feasibly multiplicative, when dealing with things like concentrations and rates, it's natural to take logs because then it turns that multiplication into addition. Whenever you're considering ratios, it's useful to take logs because then you have differences rather than ratios. And then if you are dealing with something
where you're not so concerned about the specific number but more concerned about
orders of magnitude, say using log base ten; for example, if you are considering astronomical distances, you might be more concerned with the order of magnitude than with the actual specific number. Then you might often take logs. And then, counts are often logged if your
data are the number of say, infections at a hospital or something like that.
You might log data like that. Notice if you have logged several counts
and one of them is zero then you have a problem with taking logs so you have to
come up with some solution for that. So, let me talk a little bit about the
geometric mean. I say the sample geometric mean just so we're using the same notation as when we talked about the sample mean of data. The sample geometric mean of a data set X1 to Xn is: you take the product of the observations, pi from i equals one to n of Xi, and then you raise it to the one over nth power.
And notice, if all the Xs are positive, which is generally the case if you're thinking about geometric means, then the log of the geometric mean is the arithmetic mean: one over n times the summation of log Xi. So it's the arithmetic mean of the logged observations.
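Here's a quick numerical illustration of that relationship; the data values are just made-up positive numbers:

```python
import numpy as np

x = np.array([1.2, 3.5, 0.8, 2.4, 5.1])  # any positive data will do

geometric_mean = np.prod(x) ** (1.0 / len(x))

# the log of the geometric mean equals the arithmetic mean of the logs
assert np.isclose(np.log(geometric_mean), np.mean(np.log(x)))
# equivalently, exponentiating the mean of the logs recovers the geometric mean
assert np.isclose(np.exp(np.mean(np.log(x))), geometric_mean)
```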
So let me just repeat that: the log of the geometric mean is the arithmetic mean of the logged observations. And because of that, on the log scale the geometric mean has all the properties we already talked about for sample arithmetic means.
So the law of large numbers applies and the central limit theorem applies. I have a parenthesis here that says, under what assumptions? Well, under whatever assumptions applied for the arithmetic mean to have the law of large numbers and the central limit theorem. The geometric mean is always less than or equal to the sample arithmetic mean, just as a general property. So let me just give you a quick example of
using geometric means. In some domains people use the geometric mean so frequently that when they talk about the mean, they're referring to the geometric mean, not the arithmetic mean.
So, as an example, when what you're thinking about is inherently
multiplicative, you would often think of the geometric mean.
So suppose that in a population of interest, the prevalence of a disease rose 2% one year. And then the next year it fell 1%, then the year after that it rose 2%, and then it rose 1% again. Well, if you were thinking about what's the ending prevalence of the disease given the starting prevalence, inherently you would multiply the starting prevalence times 1.02 times 0.99 times 1.02 times 1.01, and you would get the ending prevalence.
So the geometric mean of this collection of increases and decreases would be a relevant quantity to study. And so that geometric mean would be the
product of them raised to the one-fourth power.
And what's interesting about that, then, is: if you take the starting prevalence and you multiply it by 1.02, 0.99, 1.02 and 1.01, you get the ending prevalence after the four years. If you take the geometric mean and multiply the starting prevalence by it four times, you get the same number. So that's what the geometric mean is, considered in the same sense as the arithmetic mean. The arithmetic mean is the number you would have to add four times to get the same end result. The geometric mean is the one you have to multiply by four times to get the same result. And that's why it's useful.
So if you're thinking about things that are inherently multiplicative, like percent increases and decreases, then it's common to take the geometric mean. In certain financial sectors, for example, if they say mean, they are referring to the geometric mean because it's the more natural thing to talk about.
Okay, so just rehashing some of these points. Multiplying the initial prevalence by 1.01 to the fourth power, in other words, multiplying it by 1.01 four times, is the same thing as multiplying by the original four numbers in sequence. So 1.01, the geometric mean, is the constant factor by which you would need to multiply the initial prevalence each year to achieve the same overall increase or decrease in prevalence over a four-year period. Take that in contrast to the arithmetic mean; that's the constant you would have to add each year to achieve the same total increase. And in this case, it's clear to me at least, that the geometric mean makes a lot more sense than the arithmetic mean to talk about.
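Here's that arithmetic written out; the starting prevalence of 10% is hypothetical, just so there's a concrete number to multiply:

```python
import numpy as np

changes = np.array([1.02, 0.99, 1.02, 1.01])  # +2%, -1%, +2%, +1%
start = 0.10                                  # hypothetical starting prevalence

end = start * np.prod(changes)                # ending prevalence after four years
gm = np.prod(changes) ** (1.0 / 4)            # geometric mean, about 1.0099

# multiplying the start by the geometric mean four times gives the same end point
assert np.isclose(start * gm ** 4, end)

# contrast with the arithmetic mean of the changes, which is slightly larger
print(gm, np.mean(changes))
```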
On the next slide, I was thinking about how to explain this.
I googled the geometric mean and the arithmetic mean and I found this great
example at the University of Toronto's website and it has a really fun geometric
interpretation of the arithmetic mean and the geometric mean.
So if you have a rectangle, and A and B are the lengths of its sides, then the arithmetic mean, A plus B over two, is the length of the side of the square that has the same perimeter as the rectangle. The geometric mean, A times B raised to the one-half, is the length of the side of the square that has the same area. So if you're sort of interested in
multiplicative things like areas, you want the geometric mean of the sides.
If you're interested in additive things like perimeters, you want the arithmetic
means. I thought that was really cool when I read that. So, back to statistics, the log of the
sample geometric mean is just an average. And so, provided the expected value of log X exists, that average has to converge, just by the law of large numbers, to what I'm defining here as mu, the expected value of log X. Remember, the log of the geometric mean is itself just an arithmetic mean, and we have the law of large numbers, which tells us what the arithmetic mean converges to: it converges to the population mean. So therefore, the log geometric mean converges to the expected value of log X, where X is a draw from the original population on the natural unit scale, not on the log scale. Therefore, if you want to know what the geometric mean converges to: the geometric mean is the exponential of the log of the geometric mean, of course, because e to the log x is x. So it would be nice if that worked out to be the expected value of X. But it doesn't, because the exponent can't move inside the expected value of log X. So we get something which is exactly e to the mu, and this is not the expected value of X.
And this quantity, e to the mu, which is the exponential of the expected value of log X, doesn't really have a name. But I like to call it the population geometric mean because, you know, if the sample arithmetic mean converges to the population mean, the sample variance converges to the population variance, and the sample median converges to the population median, then by that logic, the sample geometric mean should converge to something called the population geometric mean. So, I'm going to call it that. I don't see that too often in books, but what the heck, I'm going to do it.
So to reiterate, the exponential of the expected value of log X is not equal to the expected value of the exponential of log X, which is just the expected value of X.
So, what I'm referring to as the population geometric mean, is not equal to
the population mean that we defined earlier.
It is, however, interesting to note what happens if the distribution of log X is symmetric. Remember, that was one of the reasons we stated at the beginning of the lecture for taking logs of data: to turn skewed data into data that's more symmetric. So if the distribution of log X is symmetric, consider the median. We can write 0.5 equal to the probability that log X is less than or equal to mu, and in this case, because log X is symmetric, mu, the mean on the log scale, is in fact also the median. So this first statement, 0.5 equal to the probability of log X less than or equal to mu, is just reiterating that for a distribution that's symmetric on the log scale, the mean and the median on the log scale are equal. But now, on the interior of this probability statement, because everything's positive and because the exponential function is monotonic, we can exponentiate both sides of the inequality and get that the probability that X, on the natural scale, not on the log scale, is less than or equal to e to the mu is also 50%. So the conclusion is that, for log-symmetric distributions, the geometric mean is estimating the median.
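A small simulation makes this concrete. Here the data are generated to be symmetric on the log scale (log normal, which we'll define shortly), with mu and sigma chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                        # arbitrary log-scale mean and sd
x = np.exp(rng.normal(mu, sigma, 100_000))  # data symmetric on the log scale

sample_geometric_mean = np.exp(np.mean(np.log(x)))

# the geometric mean tracks the median (about e^mu), not the arithmetic mean
print(sample_geometric_mean, np.median(x), np.exp(mu), np.mean(x))
```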
So why am I saying all this? I am making fairly simple ideas rather complicated. The idea is you have data, you log it, and you just do all the normal stuff you do with your data, you're just doing it on the logged data.
And what I'm trying to say is, I'm trying to relate the quantities that you get from
doing that. They have interpretations back on the
natural scale, that's what we're trying to say.
You don't have to discard the natural scale units when you log data, you get a
lot of interesting interpretations back on the natural scale.
So, at any rate, if you use the central limit theorem to create a confidence interval for the log measurements, then your interval is estimating mu, the expected value of log X, in log units. If you exponentiate the interval, then you're estimating e to the mu, the population geometric mean, as I'm calling it. And then, in the event the distribution of the log data is itself symmetric, your exponentiated interval is also estimating the median.
So this is kind of a backhanded way of getting a confidence interval for the median. If you're willing to assume that the population from which your data is drawn is symmetric on the log scale, then when you take the log of the data, create the confidence interval, and then exponentiate the endpoints, you wind up with a confidence interval for the median.
And remember, we also talked about getting a confidence interval for the median using
bootstrapping, but this is a lot easier, it just uses the ordinary T confidence
interval. And then this is especially useful for
paired data when their ratio is of interest.
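Here's a minimal sketch of that recipe in Python, assuming positive data; the function name and the example data are my own, not from the lecture:

```python
import numpy as np
from scipy import stats

def exp_t_interval(x, conf=0.95):
    """Log the data, form a one-sample t interval, exponentiate the endpoints.

    The raw interval estimates mu = E[log X]; the exponentiated interval
    estimates e^mu, the population geometric mean (and the median, if the
    data are symmetric on the log scale)."""
    logx = np.log(x)
    n = len(logx)
    se = np.std(logx, ddof=1) / np.sqrt(n)
    t = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    return np.exp([np.mean(logx) - t * se, np.mean(logx) + t * se])

# example with made-up positive measurements
print(exp_t_interval(np.array([2.3, 1.8, 3.1, 2.7, 2.0, 2.5])))
```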
So, let's just quickly go through an example.
So, remember, I quoted before this book by Rosner, Fundamentals of Biostatistics, which I like. It's very thorough and covers a huge chunk of biostatistical topics. At any rate, on page 298 of the version that I have, which unfortunately I think is an edition prior to the current one, it gives a paired design where it compared systolic blood pressure for people taking oral contraceptives and matched controls.
And so a paired design is where you have a person and a bunch of covariates that you're concerned with, and you want to compare, say, oral contraceptive use to controls. You're worried that the group of people who take oral contraceptives is different than the group of people who don't. So what you might do is take this list of things that you think might explain that difference and match on them, so that the person taking the oral contraceptive has a twin, in a sense, in the control group who, at least insofar as the other variables you can measure, is very close. That's this idea of matching.
Matching taken to the extreme: you know, you couldn't do it in this experiment, but imagine if you were investigating aspirin. You would, say, give a person an aspirin and then, after a suitable washout period,
give them a placebo. And then that person would be perfectly
matched to themselves as their own control.
So that's the extreme version of this, but let's suppose you're in a circumstance like this one, where you can't really randomize people to contraceptive use. You couldn't do a crossover experiment like that. So you match people as closely as you can on all the other things that you think might differentiate contraceptive users from controls.
Anyway, that's a matched design. But the point for our discussion is that person one, who is in the oral contraceptive group, and person one, who is in the control group, are tied together. And so we want to utilize the information that they're similar. So what we might do is take the systolic blood pressure for person one in the oral contraceptive group and the systolic blood pressure for person one in the control group, and analyze their ratios, right? And so we might be interested in ratios,
because we just might be interested in the interpretation of, well, what percent increase or decrease does a person in the contraceptive group have over their
associated controls. So imagine if we took ratios, and then
logged the ratio. Well, that would just be the difference of the logs of the two measurements. Then we could just do an ordinary one sample t confidence interval for the log of the ratios, done matched pair by matched pair. And so in this case, the geometric mean of
the ratios works out to be 1.04, which, given the order in which I was dividing, implied a 4% increase in systolic blood pressure for the oral contraceptive users. And then I did a t interval on the log scale. So I took the log of each oral contraceptive user's measurement and the log of the matched control's measurement and took the difference, pair by pair. I wound up with n measurements, where I started with 2n total measurements in pairs. With my n measurements on the log scale, I calculated an ordinary t interval and I got 0.010 and 0.067. In this case, the units would be log millimeters of mercury. What we're interested in on the log scale is
whether zero is in this interval or not, right?
Zero is the important thing on the log scale.
If we exponentiate the interval, we get 1.01 to 1.069.
So we've estimated, via a 95% confidence interval, a 1% to 7% increase in systolic blood pressure for the oral contraceptive users relative to the controls. And so on the exponentiated scale we're
interested in whether one is in the interval.
On the log scale, we're interested in whether zero is in the interval.
By the way, if your numbers are kind of small, like in this case 0.010 and 0.067, exponentiating is about like taking one plus the number. If you're a math person, you take the Taylor expansion of e to the x and go out one term, and you see that it is pretty close to one plus x. So you can actually exponentiate things very quickly just by taking one plus. And then, obviously, if the number you're looking at is pretty close to one and you want to log it, you can do the number minus one; same thing, take the Taylor expansion for the log and go out one term. So if the number is close to zero and you want to exponentiate it, one plus works pretty well as an approximation, and if the number is pretty close to one and you want to log it, subtracting one does pretty well as well. That's a trick that's very useful, like when you do logistic regression and things like this, where you need to take exponents quickly.
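Using the endpoints from this example, you can see how good the quick approximation is:

```python
import numpy as np

endpoints = np.array([0.010, 0.067])   # log-scale interval from the example

print(np.exp(endpoints))               # exact: about [1.0100, 1.0693]
print(1 + endpoints)                   # quick approximation: [1.010, 1.067]

# and in the other direction, for a number close to one
print(np.log(1.04), 1.04 - 1)          # about 0.0392 versus 0.04
```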
So let me just talk about this example just a little bit more. This estimate, 1.01 to 1.07, this 1% to 7% estimated increase between the two groups, is a confidence interval for this sort of paired ratio of geometric means. And that's why it's useful: we're estimating a ratio here. So now let's just go through the same
exact exercise, but instead of having paired observations, we have two independent groups. If you log the data from group one, log the data from group two, create a confidence interval for the difference in the group means on the log scale, and then exponentiate it, then what you're estimating, that confidence interval, is an estimate of e to the mu one divided by e to the mu two; that confidence interval is exactly an estimate of the ratio of the population geometric means. Of course, on the log scale it's an estimate of the difference in the expected values, the means on the log scale. But if you exponentiate it, then you get exactly an interval for the ratio of the
population geometric means. And if you're willing to assume that the data is symmetric on the log scale, then this is also an interval for the ratio of the population medians.
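As a sketch of what that two-group computation might look like in Python, assuming two vectors of positive measurements (the function and the simulated data are mine for illustration, not from the lecture; I've used a Welch-style interval on the log scale):

```python
import numpy as np
from scipy import stats

def ratio_of_geometric_means_ci(x1, x2, conf=0.95):
    """Welch-style t interval for the difference of log-scale means,
    exponentiated to give an interval for the ratio of geometric means."""
    l1, l2 = np.log(x1), np.log(x2)
    n1, n2 = len(l1), len(l2)
    v1, v2 = np.var(l1, ddof=1), np.var(l2, ddof=1)
    se = np.sqrt(v1 / n1 + v2 / n2)
    df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    t = stats.t.ppf(1 - (1 - conf) / 2, df)
    diff = np.mean(l1) - np.mean(l2)
    return np.exp([diff - t * se, diff + t * se])

# quick check on simulated log-normal groups
rng = np.random.default_rng(2)
print(ratio_of_geometric_means_ci(np.exp(rng.normal(1.05, 0.2, 50)),
                                  np.exp(rng.normal(1.00, 0.2, 50))))
```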
There's one distribution where, when you take logs of things, they wind up as Gaussian, and it's so important that we give it a name: we call it the log normal distribution. A random variable is log normally distributed if its log is a normally distributed random variable. Note, it's not the log of a normal random variable, as its name kind of implies.
You can't take the log of a normal random variable because those can be negative and
you can't take the log of them. So if you want to remember what's a log
normal random variable, remember this phrase: "I am log normal" means "take logs of me and then I'll be normal," and then you'll remember the correct order. But also think, when you are assuming something is log normal, that if you are taking the log of something that's possibly negative, then you're doing it wrong. Okay, so again, log normal random variables are not logs of normal random variables. As I say here, you can't even take the log of a normal random variable, because it can be negative.
So formally, X is log normal, and it depends on two parameters, mu and sigma squared, if log of X is normal with mean mu and variance sigma squared. And again, that mirrors kind of what we're often doing with logs: we're trying to take logs of things so that, on the log scale, the data are symmetric, and then hopefully the population distribution is also symmetric. So log of X is normal when X is log normal. And conversely, if Y is normal with mean mu and variance sigma squared, then e to the Y is log normal. So you can generate a log normal by
generating a normal random variable and exponentiating it.
I give you the log normal density here if you want it; it depends on mu and sigma squared. Its mean is e to the quantity mu plus sigma squared over two, where mu and sigma squared are the mean and variance on the log scale. Its variance is e to the quantity two mu plus sigma squared, times the quantity e to the sigma squared minus one. And its median is e to the mu. And of course, its geometric mean, what I'm calling its population geometric mean, is e to the mu as well.
So you can see, this gives you an exact example where the expected value of X and e to the expected value of log X are two different things. The expected value of X in this case, when X is log normal, is e to the quantity mu plus sigma squared over two. E to the expected value of log X is e to the mu. Okay.
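You can check those formulas by simulation; mu and sigma here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 0.75

# generate log normal draws by exponentiating normals
x = np.exp(rng.normal(mu, sigma, 1_000_000))

print(np.mean(x),   np.exp(mu + sigma**2 / 2))                          # mean
print(np.var(x),    np.exp(2*mu + sigma**2) * (np.exp(sigma**2) - 1))   # variance
print(np.median(x), np.exp(mu))                                         # median
print(np.exp(np.mean(np.log(x))), np.exp(mu))                           # geometric mean
```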
So if X1 to Xn are log normal with parameters mu and sigma squared, then log X1 to log Xn, which I'm calling Y1 up to Yn, are normally distributed with mean mu and variance sigma squared. So they satisfy the conditions to create a t confidence interval. And then mu is the log of the median of the Xi, so e to the mu gives the median on the original scale. It also gives you the population geometric mean. And then, again, assuming log normality, exponentiating a t confidence interval for the difference of two log-scale means implies that your confidence interval is estimating a ratio of geometric means.
So, let's just go through a quick example of doing this.
Now I'm assuming you can do the arithmetic of this because you already know how to
create two group T confidence intervals. So all that we're doing is logging the
data and doing something you already know how to do.
So I just want to go through the interpretation real quick.
So imagine if you took gray matter volumes.
I actually did this for some data that I have.
I have brain gray matter volumes for a young and an old group defined as younger
than 65 and older than 65. But of course this doesn't account for
being young at heart or whatever. Young and old, as per my definition, but
if you're 65, rest assured, I don't think you're old.
It's just the definition I'm doing here. So we did two separate group intervals.
And for the old group we got 13.24 to 13.27, and for the younger group we got 13.29 to 13.31, both in units of log cubic centimeters. If you exponentiate those intervals, you get about 564 to 578 cubic centimeters and about 592 to 606 cubic centimeters, for old and young respectively. So both of these intervals estimate the population geometric mean gray matter volume among
the older and younger groups respectively. If we're willing to assume that the
population of brain volumes on the log scale are symmetric then both of these
intervals estimate the population median gray matter volume for old and young
respectively. Then if we were to take the two groups and
do a two group T-interval on the log measurements yields 0.032 to 0.066, log
cubic centimeters, expedentiate this, you get an interval of 1.032 to 1.068, you
know, again, remember the trick, you add one when you expedentiate a, close to
zero. You wind with about a three% to seven%
higher. Geometric mean brain volume among the
younger group than the older group or if we're talking about medians, if we're
willing to assume that individual populations are symmetrically distributed,
then that would be estimated between three and seven percent.
Increase in grey matter volume for the younger group.
This, of course, being the case because as we age, we start to lose a little bit of gray matter volume over time. Of course, you develop more neuronal connections, so you get wiser. So you have, maybe, more neuronal connections, but a decrease in volume. So, anyway, what I hope you learned from
this was when you take logs of measurements and do what we talked about
in terms of creating confidence intervals, and exponentiate the intervals.
I hope you know what the estimates are then referring to.
And it's a common practice, people do this all the time, but I'm not sure if people always understand exactly what they're doing. And that's why I devote an entire lecture to the subject of logging, which, in practice, is a trivial extension of what we've already done.
Take logs of your data, do what we already do, and then exponentiate the intervals.
So no change in what we're doing. But I wanted everyone to understand
exactly what the implications of those things were.
And why the log is sort of special in the sense that it yields uniquely interpretable results, as opposed to other functions. You could, say, take the cube root of the data, create the confidence interval on the cube root scale, and then raise the interval to the third power, and you wouldn't get the same nice interpretations like you do with the log.
Log is special that way. Alright, well thanks troops.
This was our last lecture. I hope you enjoyed the class.
And hope you survived the intense biostatistical training.
And I hope you go on to do great things with this knowledge and all the other courses you take from Coursera.