Now that the data is in memory, I'll use

SWAT functionality to view it.

I'll print the dimension of the table and the first few rows

to make sure it's been read in correctly,

and I'll also find the overall mean rating in this data.

As expected, the table is 10,000,000 by 3,

the user_id and movie_id variables are anonymized,

and the average overall rating is 2.51.

I'll save this value as mean_rating for later use.
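Here's a minimal sketch of these first checks using SWAT's pandas-style interface. The connection object conn and the table and column names (ratings, rating) are assumptions, not shown in the narration:

```python
# Sketch only: `conn` is the existing CAS session; "ratings" and "rating"
# are assumed table/column names.
ratings = conn.CASTable('ratings')

print(ratings.shape)                     # expect (10000000, 3)
print(ratings.head())                    # first few rows
mean_rating = ratings['rating'].mean()   # overall average, about 2.51
```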

Next, I'll load the simple action set

and then use the distinct action to find

the number of distinct values for all three

variables in the data.
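A sketch of those two calls, under the same assumed names:

```python
conn.loadactionset('simple')
conn.simple.distinct(table='ratings')    # distinct counts for every column
```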

As expected, there are 10,000 users, 10,000 movies,

and five possible ratings.

I'll also use the freq action from the simple action

set to find the frequency of each rating in the data

and save the table as tbl.

Remember, this is an instance where

the table is copied to the client,

so I can use native open source functionality

to add the proportion of ratings for each level.
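A sketch of the freq call and the client-side proportion column, assuming the count column in the returned DataFrame is named Frequency (SWAT's usual convention):

```python
result = conn.simple.freq(table='ratings', inputs=['rating'])
tbl = result['Frequency']                # copied to the client as a DataFrame
tbl['Proportion'] = tbl['Frequency'] / tbl['Frequency'].sum()
print(tbl)
```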

The largest proportion of ratings is for levels 2 and 3,

and surprisingly, there are only about 50,000 level

5 ratings out of the 10,000,000 observations.

Since tbl is on the client, I can plot the table

as well to visually see the proportion by rating level.
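For example, a quick matplotlib bar chart (the name of the column holding the rating level can vary by release, so check tbl.columns first):

```python
import matplotlib.pyplot as plt

# 'FmtVar' is assumed to hold the formatted rating level
tbl.plot.bar(x='FmtVar', y='Proportion', legend=False)
plt.xlabel('Rating')
plt.ylabel('Proportion')
plt.show()
```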

Levels 2 and 3 have similar proportions, as well as

levels 1 and 4.

Next, I'll load the fedsql action set

to explore the biases within the data.

Biases occur because users innately rate

on different scales.

For example, a five-star rating does not necessarily

mean the same thing for two different users,

or a person might rate a movie poorly simply

because he or she routinely doesn't like the genre.

Or perhaps some people tend to simply give low scores

because they have high standards.

Factorization machines account for these biases

in their predictions.

They account for a global bias (the rating

versus the overall average rating), a user

bias (the average rating given by the user), an item bias

(the average rating given to the movie),

and a pairwise interaction bias, which

accounts for the interaction between a user and a movie.
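In standard factorization machine notation (a general reference, not anything specific to this demo), the predicted rating for user u and movie i is roughly r-hat(u, i) = w0 + w_u + w_i + <v_u, v_i>, where w0 is the global bias, w_u and w_i are the user and item biases, and the inner product of the latent factor vectors v_u and v_i captures the pairwise interaction.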

Let's first query the data to find the user bias.

I'll select the user_id variable, count

the number of ratings for each user

and name it frequency, and find the average rating

for each user.

I'll group the results by user and order

them in ascending order.

We can use open source syntax here

to add the user_bias variable to the table returned by the query.

The user bias is simply the average rating

minus the mean rating that we found before,

and we can use the head function to view this new table.
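A sketch of that query and the client-side bias column (column-name case in the returned DataFrame can differ by release):

```python
conn.loadactionset('fedsql')
out = conn.fedsql.execdirect(query='''
    select user_id,
           count(*)    as frequency,
           avg(rating) as avg_rating
    from ratings
    group by user_id
    order by user_id
''')
users = out['Result Set']                 # client-side DataFrame
users['user_bias'] = users['avg_rating'] - mean_rating   # adjust name case if needed
print(users.head())
```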

For example, the first user rated 1,015 movies

and gave those movies an average rating of 1.53,

which equates to a bias that is almost 1

whole unit below the overall average rating.

We calculate the movie bias similarly

by replacing user_id with movie_id.

The first movie here was rated 992 times

and had an average rating of 1.58,

which again is almost a whole unit below the average.

Next, let's partition the data before we build

the factorization machine.

I'll load the sampling action set and use the srs action

to do so.

I'll use 90% of the data for training,

which equates to 9,000,000 observations,

and pass the partition indicator into the ratings table.
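A sketch of the partitioning step (the seed is a placeholder):

```python
conn.loadactionset('sampling')
conn.sampling.srs(
    table='ratings',
    samppct=90,                     # 90% of rows flagged for training
    partind=True,                   # adds the _PartInd_ indicator column
    seed=12345,                     # hypothetical seed
    output=dict(casout=dict(name='ratings', replace=True),
                copyvars='all')     # write the indicator back to ratings
)
```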

Now I'll load the factmac action set to build the model.

I'll specify the table name and use the partition indicator

to pass in only the training data as usual.

The target here is the rating variable,

and the inputs and nominals are the same,

both user_id and movie_id.

For the purposes of speed for this demonstration,

I'll choose to use only 10 factors,

optimize the factors over only 5 iterations,

and use a learning rate of 0.1 for the optimization.

With more time available, you'll want

to try more factors and optimization

iterations to improve the model.

I'll set a seed and use the saveState argument

to save the model for scoring under the name factmac_model,

and the output argument enables us to save the scored training

data.
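A sketch of the factmac call with the settings from the narration; the seed and the output table name are placeholders:

```python
conn.loadactionset('factmac')
conn.factmac.factmac(
    table=dict(name='ratings', where='_PartInd_ = 1'),   # training rows only
    target='rating',
    inputs=['user_id', 'movie_id'],
    nominals=['user_id', 'movie_id'],
    nfactors=10,                    # small, for speed
    maxiter=5,
    learnstep=0.1,
    seed=12345,                     # hypothetical seed
    savestate=dict(name='factmac_model', replace=True),
    output=dict(casout=dict(name='factmac_scored_train', replace=True),
                copyvars='all')
)
```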

Submit the cell to build the factorization machine.

The output displays the model specification,

the number of observations, the mean and standard deviation

of the target, the number of levels

for the nominal inputs, the optimization history

(where you can see how the loss is reduced each iteration),

the final loss, and the new output CAS table containing

the scored training data.

Using the fetch action, we can see the contents of the new CAS

table.
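For example, using the output table name assumed above:

```python
conn.table.fetch(table='factmac_scored_train', to=5)
```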

It includes the three original variables, the partition

indicator, and the predicted rating,

which appears to be reasonably close to the actual rating,

at least for these five observations.

Because we saved the model as an analytic store,

we need to load the aStore action set

to score the validation data.

To do so, we'll use the score action,

specify only the validation data in the table argument,

specify the factmac_model table in the rstore argument, copy all the variables

into the output table for comparison purposes,

and save the scored validation as factmac_scored.
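A sketch of the scoring call, reusing the names from above:

```python
conn.loadactionset('astore')
conn.astore.score(
    table=dict(name='ratings', where='_PartInd_ = 0'),   # validation rows
    rstore='factmac_model',
    copyvars='all',
    casout=dict(name='factmac_scored', replace=True)
)
```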

When I run the score action, the output

just shows timing information.

To evaluate the predictions, I'll first

use a DATA step to add the error as a variable

to the factmac_scored table and then SQL

to compute the mean squared error.

Let's first load the dataStep action set

and then use the runCode action to run DATA step functionality

on the CAS server.

I'll use the same name, factmac_scored,

in both the DATA and SET statements

to add the error variable to the table, which

is simply the actual rating minus the predicted rating.
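A sketch, assuming the predicted rating is named P_rating (factmac's usual P_ plus target convention; verify against your table):

```python
conn.loadactionset('datastep')
conn.datastep.runcode(code='''
    data factmac_scored;
        set factmac_scored;
        error = rating - p_rating;   /* actual minus predicted */
    run;
''')
```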

When I run this, you'll see that the table has increased in size

by the one column I added.

Next, I'll query the data set and compute the mean squared error

by averaging the squared error terms,

and I'll also compute the root mean squared error.
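A sketch of that query:

```python
conn.fedsql.execdirect(query='''
    select avg(error * error)       as mse,
           sqrt(avg(error * error)) as rmse
    from factmac_scored
''')
```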

So for this model, the average squared error is about 0.17

and the RMSE is about 0.41.

Again, the resulting error here could

be reduced using more factors and optimization iterations.

As an alternative to using a DATA step

and then SQL to find the MSE, we could instead

use SWAT functionality to mimic open source data

frame manipulation.

First, we'll create a reference to the scored validation data,

called factmac_scored again.

Then, we can add the error and the squared error to the table

as if we are adding a new variable to the data frame.

Using the head function, you can see

that the last three variables have been

added to this reference table.

To find the MSE, I'll just take the mean of the squared error

variable, and take the square root of the MSE to find the RMSE.
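A sketch of this data-frame-style alternative; the assignments below create computed columns on the CAS table reference rather than pulling the data down:

```python
scored = conn.CASTable('factmac_scored')
scored['error'] = scored['rating'] - scored['p_rating']
scored['sq_error'] = scored['error'] * scored['error']

print(scored.head())                 # new columns appear on the right
mse = scored['sq_error'].mean()
rmse = mse ** 0.5
print(mse, rmse)                     # about 0.17 and 0.41
```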

Of course, the values are the same as before;

we just used a different method to find this information.

Next, let's find the average prediction value

for each level of actual ratings to see which levels

are the easiest to predict.

Using SQL, we'll select the rating,

find the frequency of each level of rating

in the validation data, find the average prediction

in each level, and group the results by the rating level.

Finally, I'll sort the table by the rating level from 1 to 5.
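A sketch of the query:

```python
conn.fedsql.execdirect(query='''
    select rating,
           count(*)      as frequency,
           avg(p_rating) as avg_prediction
    from factmac_scored
    group by rating
    order by rating
''')
```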

You can see that the average prediction is closest

to the actual ratings for levels 2 and 3,

which are off, on average, by only about 0.1.

On average, the model is off by about 0.3 for levels 1 and 4.

And on average, the model is actually

closest to a prediction value of 4

for the actual rating level of 5.

Next, I'm going to round the predictions to their nearest

whole rating using a DATA step.

I'll create the variable P_rating_round,

and if a prediction rounds down to zero,

I'll change it to 1.

I'll run this DATA step and print the first five observations

to make sure that the variable was correctly

added to the table.
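A sketch of the rounding step plus the quick check:

```python
conn.datastep.runcode(code='''
    data factmac_scored;
        set factmac_scored;
        P_rating_round = round(p_rating, 1);           /* nearest whole rating */
        if P_rating_round = 0 then P_rating_round = 1; /* floor at 1 */
    run;
''')
conn.table.fetch(table='factmac_scored', to=5)
```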

Here the new variable was rounded up for the first couple

of observations and then down for the third, as expected.

Next, I'll use the rounded predictions

to create a crosstabulation between the predictions

and actual ratings.

I'll use the crossTab action from the simple action

set we loaded earlier.

The row variable will be rating, and the column variable

will be the rounded ratings.

In this case, the crosstab table is copied to the client,

and I can use open source data frame functionality on it.

I'll set the column and row names appropriately

based on the data, and I'll print the matrix.
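A sketch of the crosstab and the client-side relabeling; the layout of the returned matrix (a leading row-label column plus one count column per level) is SWAT's usual convention, so adjust if yours differs:

```python
result = conn.simple.crosstab(table='factmac_scored',
                              row='rating', col='P_rating_round')
mat = result['Crosstab'].iloc[:, 1:]   # drop the row-label column
mat.index = [1, 2, 3, 4, 5]            # actual rating levels
mat.columns = [1, 2, 3, 4, 5]          # rounded predicted levels
print(mat)
```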

So here is the frequency of predictions for each rating

level.

Ideally, we would see large values on the diagonal

to indicate better performance.

However, one thing to note from this table

is that the predicted rating is never

more than one whole unit from the actual rating.

That is, for example, we never predict a value of 4

when the actual value is 2.

The proportion of correct predictions

is simply the sum of the diagonal divided

by the total number of predictions,

which is approximately 0.75.

To print the conditional probabilities

of predicted ratings given the actual ratings,

I'll divide the row values by their row sums.
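A sketch of both computations, reusing mat from the crosstab sketch above:

```python
import numpy as np

accuracy = np.trace(mat.values) / mat.values.sum()   # diagonal over total, ~0.75
print(accuracy)

cond_probs = mat.div(mat.sum(axis=1), axis=0)        # each row sums to 1
print(cond_probs)
```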

So, for example, the probability of predicting

a rating of 1 given that the rating is actually 1

is about 0.72.

The first four diagonal entries are

near a value of three quarters.

However, the model did a poor job

of correctly predicting a rating of 5, possibly due

in part to the fact that there were very few 5 ratings

in the training data.

Now that we've built the factorization machine

and created predictions, we can actually

recommend movies to users.

To do this, I created a function called

useri_top5, which, as its name implies,

finds the top five recommended items for a given user.

The function simply wraps up SQL functionality

in order to query the data.

It searches the table for a given input user, orders

the information in descending order of the rating,

and then returns the top five items.
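A sketch of what such a wrapper might look like; the body assumes the scored table and P_rating column from earlier, and that your FedSQL release supports the LIMIT clause:

```python
def useri_top5(user_id):
    """Top five recommended movie IDs for the given user."""
    query = '''
        select movie_id, p_rating
        from factmac_scored
        where user_id = {}
        order by p_rating desc
        limit 5
    '''.format(int(user_id))
    return conn.fedsql.execdirect(query=query)['Result Set']
```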

Therefore, I can specify any user ID as input

to the function (for example, user ID 1),

and it returns the following movie IDs to recommend.

Finally, end the CAS session with the endSession action.
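In SWAT, that's simply:

```python
conn.endsession()
```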