In this module, we're going to talk about what we call spatial scan statistics.

So, in previous modules,

we talked about spatial autocorrelation for identifying clustering,

and whether the regions might be auto-correlated.

But we also want to be able to determine whether spatial events within a data

set could have occurred in this pattern by chance, at random.

And so spatial scan statistics are a method for determining whether

measurements occurred in this geographic distribution by accident or not.

What I mean by that is really,

how likely was it that this pattern could have occurred at random in nature?

So what I mean by this is,

imagine here that I have a map on my screen.

So I've got roads,

and counties, and buildings,

and things, and each point here is a measurement.

It could be a measurement by county,

but in this case, let's think of this as illness.

So some people got the flu and went to the doctor.

Some people thought they got the flu and went to the doctor,

but they didn't have it; they just had a bad cold.

So the triangles are people with the flu,

for example, and the control

are other people that the doctor saw that didn't have the flu.

So this is just our population of people that went to the doctor.

And these are their home addresses, where they live.

And what I want to determine is,

if this cluster of cases could have occurred by chance or not.

How likely is this pattern to have occurred in reality?

We could think about this as crime.

What if I have a crime spree?

And how do we determine whether this crime spree is occurring by chance or not?

And scan statistics are going to essentially,

for each point, draw a window,

bigger and bigger every time,

capturing all the possible windows in the data set,

and see if the patterns of

distribution of points in those windows could have occurred by chance or not.
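That window-enumeration idea can be sketched in Python (the point representation and function name here are illustrative, not from the lecture):

```python
def candidate_windows(cases, all_points):
    """For each case location, grow a circle outward so that it covers
    successively more points; each resulting window is the set of points
    inside one such circle. (Illustrative sketch, not the lecture's code.)"""
    windows = set()
    for cx, cy in cases:
        # Rank every point by squared distance from this case's location.
        ranked = sorted(all_points,
                        key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
        # Growing the circle means taking successively longer prefixes.
        for k in range(1, len(ranked) + 1):
            windows.add(frozenset(ranked[:k]))
    return windows
```

Each window then gets its own likelihood calculation; because circles centered on different cases can cover the same set of points, the `frozenset` deduplicates identical windows.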

And so, here we start with just one of our points, one of our cases.

We don't do this for the controls.

We only care about how the cases are distributed.

So, for each case, I draw a window,

the circle window, that gets bigger and bigger.

And for each circle, for each window,

I compute this likelihood function.

For the Bernoulli distribution,

the likelihood function is defined in terms of the number of cases and

the total population, both inside and outside the window.

So this is the cases in the window.

So if it's little c,

it's cases in the window.

If it's big C, it's cases total in the world.

And if it's little n, this is the population in the window,

and if it's big N, this is the population in the world.

So let's look at this example real quick.

So in my first circle,

my little c is three,

because I have three triangles in there.

And let's say my little n is four because I have four things in there.

I've got one square and three triangles,

and then I can count everything else.

So I've got one, two, three,

four, five, six, seven, eight,

nine, 10, 11, 12, 13,

14, 15, 16, 17, 18.

So big N is 18.

And big C is one,

two, three, four, five, six, seven.

So big C is seven.

Seven cases, three within this window,

a population of four in this window,

and a total population of 18.

So I can plug these into my Bernoulli likelihood,

and I get a value, and that gives me my likelihood value L_0.

And this L_0 is the likelihood of this occurring with this Bernoulli distribution.
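The lecture doesn't show the formula on screen, but the standard Bernoulli likelihood used in Kulldorff-style scan statistics is built from exactly these four counts; here is a sketch under that assumption:

```python
def bernoulli_likelihood(c, n, C, N):
    """Bernoulli window likelihood in the usual Kulldorff-style form:
    points inside the window are cases at rate c/n, points outside at
    rate (C - c)/(N - n), and the likelihood is the product of the two
    binomial-style terms. (Assumed standard form, not shown verbatim
    in the lecture.)"""
    def term(k, m):
        # Likelihood contribution of k cases among m people at rate k/m.
        if m == 0 or k == 0 or k == m:
            return 1.0          # degenerate rates 0 or 1 contribute 1
        r = k / m
        return r ** k * (1.0 - r) ** (m - k)
    return term(c, n) * term(C - c, N - n)

# The worked example from the lecture: c = 3, n = 4, C = 7, N = 18.
L0 = bernoulli_likelihood(3, 4, 7, 18)
```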

And then what I do is,

I take all the data, and I randomly redistribute it.

So now notice my cases and controls have moved around.

So I got my original distribution,

and I take all the data,

and I randomly throw it back on the screen,

and I keep my window in the same spot though.

And now for that same window,

I calculate a new likelihood value for all the data that fell in there.

And I take my data and throw it back on the screen again,

my window didn't move, and I calculate a new likelihood function.

And I do this a whole bunch of times.

And then what I do is, I sort the list,

I sort these likelihoods from low to high,

and I want to see where my original likelihood value fell,

and that sorted position in the list divided by the length of the list,

gives me the probability of this occurring by chance or not.

And so, by doing this, I can find a p-value of how likely this distribution,

my original distribution was, compared to chance.
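That Monte Carlo step can be sketched as follows, using a Bernoulli likelihood of the same form as above (the function and parameter names are illustrative, and the +1 correction is a common Monte Carlo convention rather than something stated in the lecture):

```python
import random

def bernoulli_likelihood(c, n, C, N):
    """Bernoulli window likelihood (standard Kulldorff-style form; assumed)."""
    def term(k, m):
        if m == 0 or k == 0 or k == m:
            return 1.0
        r = k / m
        return r ** k * (1.0 - r) ** (m - k)
    return term(c, n) * term(C - c, N - n)

def permutation_p_value(in_window, is_case, reps=999, seed=0):
    """Keep the window fixed, shuffle the case/control labels over all
    points, recompute the likelihood each time, and rank the original
    likelihood L0 among the shuffled ones, as described in the lecture."""
    rng = random.Random(seed)
    N, C, n = len(is_case), sum(is_case), sum(in_window)
    c0 = sum(1 for w, s in zip(in_window, is_case) if w and s)
    L0 = bernoulli_likelihood(c0, n, C, N)
    at_least = 0
    labels = list(is_case)
    for _ in range(reps):
        rng.shuffle(labels)     # randomly redistribute cases and controls
        c = sum(1 for w, s in zip(in_window, labels) if w and s)
        if bernoulli_likelihood(c, n, C, N) >= L0:
            at_least += 1
    return (at_least + 1) / (reps + 1)
```

A tight cluster, where every case falls inside a small window, should produce a small p-value, while a random scatter should not.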

The problem is, this can be really expensive to compute.

I have to randomly redistribute all of my data, over and over.

The nice thing is, every calculation is completely independent.

So if I draw all my windows at once,

and randomly redistribute all of my values,

I can calculate these in parallel really quickly.
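Because each window's calculation reads only its own inputs, the scores can be farmed out in parallel; here is a minimal sketch with Python's standard-library thread pool (with CPython's GIL a process pool would be needed for true CPU parallelism, but a thread pool keeps the sketch simple; the counts are made-up illustrative numbers):

```python
from concurrent.futures import ThreadPoolExecutor

def score(counts):
    """Stand-in for one window's full likelihood / Monte Carlo calculation.
    It depends only on its own inputs, which is what makes the scan parallel."""
    c, n, C, N = counts
    def term(k, m):
        if m == 0 or k == 0 or k == m:
            return 1.0
        r = k / m
        return r ** k * (1.0 - r) ** (m - k)
    return term(c, n) * term(C - c, N - n)

# One (c, n, C, N) tuple per candidate window (illustrative numbers).
windows = [(3, 4, 7, 18), (2, 5, 7, 18), (4, 6, 7, 18)]

with ThreadPoolExecutor() as pool:
    scores = list(pool.map(score, windows))
```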

You can actually get software to do this,