Also, the approach of using Solver for
the cluster analysis is only practical for datasets that are relatively small.
So we just want to show that it is possible to use Excel to approach
cluster analysis from the point of view of an optimization problem.
To make it more interesting we're going to show how to use Excel for
cluster analysis using an example.
As we mention before, companies are often interested in segmenting their customers
to better target a specific product offerings to meet specific customer needs.
Let’s assume that an online retailer has collected data
on three contents of her customers.
This data is in the Excel file Market Segmentation- Excel.
You might want to pause the video to locate the file and
open it so you can follow along or
just continue to watch the video and play with the model later, it's up to you.
Let us start in the original Data tab to examine what data was collected for
each of the three candy customers.
You can see that for each customer, there are two demographic variables,
income level and number of dependence as well as two buying behavior variables,
number of purchases made last year, and the average value of each purchase.
Let's assume that the online retailers wants to know
if the data supports the hypothesis that there are three segments in her market.
So we're going to use closer analysis to help her answer this question.
The first step in our analysis is to normalize the data.
Recall that normalization is a transformation of each variable
in the dataset.
In our table, the variables are in the columns label Income,
Average Purchase, Last year purchases and the Number of Dependents.
The normalization consists of subtracting the mean and
dividing by the standard deviation.
Click on the normalized Data tab to see the normalized values.
The normalized values have a mean of 0 and a standard deviation of 1.
Now, click on the Clusters tab to look at the model.
The model is based on the idea of choosing three customers to represent the centroids
of the three clusters.
Cells H5, H6, and H7 have the three customers that haven't chosen a centroids.
The arbitrary solution that is shown on the spreadsheet consists of customers 10,
20, and 30 as the centroids of the three clusters.
The center values in the table correspond to the normalized values for
the selected customer.
For example, if we change 10 to 15, we see that the central values change.
In this other table, the model uses Euclidean
distances to assign the customers to the closest cluster.
For example, we see that the first six customers are assigned to cluster 2.
Customer 7 is assigned to cluster 3 and customer 10 is assigned to cluster 1.
The model also calculates the total squared distance,
which is the sum of all the distances in column E.
Now we're ready to optimize, which is nothing more than searching for
the set of three customers that are the best centroids.
Clearly, the best set is the one that minimizes the total sum of a squares.
We don't want to do this by manually changing the values because believe it or
not, there are more than 4.4 million ways of choosing three customers
from a set of 300.
So we're going to use an Excel tool called Solver.
To access this tool, we got to the Analyze Group in the Data tab.
If Solver is not there, you need to first load it.
If you Google adding solver to Excel, you will find instructions on how
to add Solver to the different versions of Excel for Windows and for the Mac.
You can pause this video and come back here once you have loaded Solver.
Okay, if your Solver is ready to go, then click on it and
you will see that the model is all ready there.
This is a simple model for the set of objective,
SJ9 that is the objective function is the total sum of squares.
We want to minimize this value so Min is chosen.
The changing cells are the values that the solver can manipulate to minimize your
objective function.
In this case, there are only three cells, H5 to H7, which correspond
to the three customers that are going to be selected as centroids.
And then, there are three constraints related to the changing cells.
The values in the changing cells must be integer.
They must be greater than or equal to 1, and they must be less than or
equal to 300.
For reasons that go beyond the scope of this course for
this particular problem, we must choose the Evolutionary Solver.
You can now click on the Solve button.
The solution process starts, and
search progress is shown on the status bar at the bottom of the spreadsheet.
The solution time depends on the computer speed.
So if it is taking too long,
you can always press this K key to stop the search.
The Evolutionary Solver is a so called Metaheuristic.
What this means is that it cannot guarantee that when it stops,
it has found the very best solution.
Also called the optimal solution.
When the Solver stops, it simply reports the best solution that it could find.
It is states that the Solver can not improve the current solution.
We're going to press the Escape key and assert the solver solution.
Since we may end up with different solutions,
we're going to change the solution that we have to one that I found, and
that I would like to discuss with you.
The solution has customer 103, as the first Centroid,
customer 112 as the second, and customer 170 as the third.
Let's enter those values in the corresponding cells and
examine the normalized centroid.