In cluster analysis, variables with large values contribute more to the distance calculations. Variables measured on different scales should therefore be standardized prior to clustering, so that the solution is not driven by the variables measured on the largest scales.
We use the following code to standardize the clustering variables to have a mean of 0 and a standard deviation of 1. First, we create a copy of the cluster data frame and name it clustervar. Then, for each clustering variable, we use the preprocessing.scale function: we list the name of the variable, then an equal sign, and preprocessing.scale; in parentheses we type the name of the variable again with .astype appended, and in another set of parentheses, 'float64' in quotes. astype('float64') ensures that the clustering variable has a numeric format. We do this for all of the clustering variables.
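The standardization step might look like the following sketch; the data frame and the column names income and age are made-up placeholders for the actual clustering variables.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical cluster data frame with two clustering variables
cluster = pd.DataFrame({
    'income': [35000.0, 52000.0, 61000.0, 48000.0],
    'age':    [23, 45, 31, 52],
})

# Copy the data frame so the original values are preserved
clustervar = cluster.copy()

# Scale each variable to mean 0 and standard deviation 1;
# astype('float64') ensures the variable has a numeric format
for col in ['income', 'age']:
    clustervar[col] = preprocessing.scale(clustervar[col].astype('float64'))
```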
Next, we use the train_test_split function in the sklearn cross-validation library to randomly split the clustering variable data set into a training data set consisting of 70% of the total observations and a test data set consisting of the other 30% of the observations.
First, we type the name of the training data set, which we'll call clus_train, followed by the name of the test data set, which we'll call clus_test. Then we type the function name, train_test_split, and in parentheses we type the name of the full standardized clustering variable data set, which we called clustervar. The test_size option tells Python to randomly place 0.3, that is 30%, of the observations in the test data set that we named clus_test. By default, the other 70% of the observations are placed in the clus_train training data set. The random_state option specifies a random number seed to ensure that the data are split the same way each time the code is run.
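A sketch of the 70/30 split is below. Note that in current scikit-learn versions train_test_split lives in sklearn.model_selection (the older sklearn.cross_validation module has been removed); the clustervar data frame here is an illustrative stand-in for the standardized clustering variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the standardized clustering variable data set
clustervar = pd.DataFrame({'x1': range(10), 'x2': range(10, 20)})

# test_size=.3 puts 30% of the observations in clus_test;
# the remaining 70% go to clus_train. random_state fixes the
# seed so the split is reproducible on reruns.
clus_train, clus_test = train_test_split(clustervar,
                                         test_size=.3,
                                         random_state=123)
```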
Now we are ready to run the cluster analysis. Because we don't know how many clusters actually exist in the population, we will run the analysis for a range of values for the number of clusters. Before we begin, we'll import the cdist function from the scipy.spatial.distance library. In this example, we will use it to calculate the average distance of the observations from the cluster centroids.
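To see what cdist computes, here is a small made-up example: it returns the matrix of pairwise distances between the rows of two arrays, such as observations and a centroid.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Three observations in two dimensions (illustrative values)
points = np.array([[0.0, 0.0],
                   [4.0, 0.0],
                   [0.0, 3.0]])

# A single centroid at the origin
centroid = np.array([[0.0, 0.0]])

# Euclidean distance from each observation to the centroid;
# the result has one row per observation, one column per centroid
dist = cdist(points, centroid, 'euclidean')
```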
Later, we can plot this average distance measure to help us figure out how many clusters may be optimal. Then we will create an object called clusters that holds the numbers in the range from 1 up to, but not including, 10. We will use this object when we specify the numbers of clusters we want to test, which will give us the cluster solutions for k = 1 through k = 9. In the next line of code, we create an object called meandist that will store the average distance values we calculate for the 1- to 9-cluster solutions.
The for k in clusters: code tells Python to run the cluster analysis code below for each value of k in the clusters object.
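The loop described above can be sketched as follows, assuming k-means clustering via sklearn.cluster.KMeans; the small random array stands in for the standardized clus_train data from the earlier split.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Stand-in for the standardized training data
np.random.seed(123)
clus_train = np.random.randn(50, 3)

clusters = range(1, 10)   # k = 1 through k = 9
meandist = []             # average distance for each solution

for k in clusters:
    model = KMeans(n_clusters=k, random_state=123, n_init=10)
    model.fit(clus_train)
    # Average distance of each observation to its nearest
    # cluster centroid for this k-cluster solution
    meandist.append(
        np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'),
               axis=1).sum() / clus_train.shape[0]
    )
```

Plotting meandist against clusters then gives the elbow plot used to judge how many clusters may be optimal.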