In this video, we're going to look at

how predictive modeling can be used in educational settings.

And more specifically, we're going to look at the example of how we can use

predictive modeling to detect students who game the system.

So, this video is going to just start with

a brief introduction of predictive modeling and the example of gaming the system.

So, predictive modeling, what is it?

It's a category of machine learning,

which is a field in computer science,

where we study how computers can learn from data.

So, more specifically, predictive modeling is a type of supervised learning,

which means that we know what we want the computer to learn.

And what we do is we give it examples of what

we want it to learn, and we let the computer discover

general rules about when an example is or is not an instance of what we want it to learn.

So, with predictive modeling what we're going to do is we're going to try to

develop models that can infer one specific aspect of the data,

which we call the predicted variable,

and these predictions can be about future events or

about events that are happening in the moment,

but whose values are unknown in our data.

And so, just to give a little bit of vocabulary.

When we talk about predictive modeling, the predicted variable

is also sometimes called the label,

and the predictors that are going to be used to predict that variable

are also sometimes called features.

So, I'm going to be using those terms in this video.

So, there are two main categories

of predictive modeling that we're going to be talking about today.

The first one is regression and the second one is classification.

The idea behind those two types of analysis is very similar.

Their main difference is going to be in

the nature of the variable we're trying to predict.

So, for regression, we're going to look at

predicting variables that are numerical and continuous,

whereas for classification, we're going to look

at predicting variables that are categorical.

So for example, for regression,

something that we might be interested in,

in an educational context, is to

try to predict how long it is going to take for a student to

solve a problem, what a student's score is going to be on

a future test, or how many weeks a student is going to stay engaged in an online course.

On the other hand, when we are doing classification analysis,

we're going to try to predict a categorical label.

So for example, will the student complete the course?

So, we're not trying to predict what the grade of the student is going to be,

we're just trying to predict,

is the student going to complete?

Yes or no. The same thing if we look at:

will the student get the next answer right on this problem?

So, in this video,

we're going to illustrate the process of building a predictive model

using the example of detecting when students game the system,

which is a classification problem.

So, what exactly is a student gaming the system?

Well, it's a type of disengaged behavior in which students try to solve problems

without actually having to learn anything by

abusing the support that the software provides.

So for example, abusing the answers provided by

the system, or systematically trying to guess the answer.

And more specifically, we're going to look at

the problem of students who game intelligent tutoring systems.

An intelligent tutoring system is

a problem-solving environment where

the problem is broken down into smaller steps, and in order to solve the problem,

the student has to solve each of those individual steps.

And usually, that kind of environment, intelligent tutors,

will provide support to the student in the form

of next-step hints about what the student should be doing next,

and also feedback on each step to say whether the step was correct or incorrect.

And the reason why we are interested in studying gaming the system in those environments

is that it's been linked to

poorer learning outcomes and poorer long-term academic success.

So, as I said earlier,

gaming the system is a classification problem.

And in this problem, what we're going to be predicting is a categorical label, which is:

is the student gaming the system or is the student not gaming the system?

So, if we want to build this model,

the first thing we need to do is get examples of

what gaming behavior looks like, so that means we need to collect labels.

So, there are many different ways to collect labels depending

on what type of label you are looking for.

Some of them might come from the learning environment itself.

So, if you're looking at in-software performance,

did the student manage to complete the problem or not?

It could be school records,

if you want to look at grades of the students.

It could be test data.

But, sometimes we also want to predict things that are not in the learning environment.

And in those cases,

we will have to go and collect additional data,

and this can be achieved using surveys,

field observations, or video, audio, and text coding.

There are many different ways to do it.

In the context of gaming the system,

well we don't know when students are gaming the system.

The learning environment does not provide us with information about that,

the learning environment does not know whether a student is gaming the system or not.

And if it did, then we would not need to create a model of it.

So, we need additional data collection.

And one way to do that is to collect data using something we call text replays.

So, text replays, what they are,

is a textual representation

of the actions that the students are doing in the learning environment.

So for example, here we have a text replay which is

a clip of five actions that the student was doing in the tutor.

And so, in order to get examples of gaming behavior,

a human will sit at a computer,

look at the text replay and then make a judgment on

whether this is an example of gaming or an example of not gaming.

And so, if we look at that specific clip, just to give you

an idea, what we can see is that on the first action,

the student gets the correct answer,

and then they move on to the next step that they're supposed to do.

Twenty seconds later, they enter "minutes" as the answer for that step.

And the system tells them that this is an error.

So, four seconds later,

they answer "hour", which is again an error.

And then seven seconds later, they enter "years".

Again, an error. And then,

14 seconds later, "week", which is again an error.

And so what we can see if we look at this behavior is

a very systematic behavior of trying to guess what the answer would

be by entering all the units of time that the student

can think of in order from the smallest one to the largest one.

And so, a human might look at this behavior and say,

"I think that the student is currently trying to game

the system because they're trying to guess the answer."

So, the human will do that for a large number of clips

of student behavior to get a large number of

examples of what gaming and not gaming look like.

And then once we have those labels,

what we can do is,

we can take them and synchronize them with the data of what the student is doing,

and then we can use that to compute what we call "features",

which are the variables that are going to be used to create our predictive model,

and those features are going to summarize the behavior of the student in that example.

So, this is a process that we call "feature engineering."

So, in the context of an intelligent tutor, well,

some of the feature engineering we might do is look at,

in that clip of action,

how many actions were in that specific time window?

What was the average time taken between actions?

How many times did the student ask for help?

How many answers were correct?

How many answers were incorrect?
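As a concrete sketch, features like these could be computed from a clip of logged actions with a few lines of Python. The log format here is made up for the example (each action is a pair of seconds-since-previous-action and an outcome), and the clip mirrors the guessing sequence shown earlier; it's not an actual tutor log.

```python
# Illustrative feature engineering on one clip of tutor actions.
# Each action is (seconds_since_previous_action, outcome), where outcome is
# "correct", "incorrect", or "hint" -- a made-up log format for this sketch.
clip = [(0, "correct"), (20, "incorrect"), (4, "incorrect"),
        (7, "incorrect"), (14, "incorrect")]

def compute_features(clip):
    n_actions = len(clip)
    avg_time = sum(t for t, _ in clip) / n_actions
    return {
        "num_actions": n_actions,
        "avg_seconds_per_action": avg_time,
        "num_hints": sum(1 for _, o in clip if o == "hint"),
        "num_correct": sum(1 for _, o in clip if o == "correct"),
        "num_incorrect": sum(1 for _, o in clip if o == "incorrect"),
    }

features = compute_features(clip)  # summarizes the student's behavior in the clip
```

Each clip of labeled behavior would be summarized this way, giving one feature row per example.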

Then once we've created those features

and computed them for the different examples of gaming behavior we have,

then we can synchronize everything together,

and we get this information about what the student was

doing while they were gaming or not gaming.

So, for example here,

if we take the first row of the table,

then we can see that for the first row,

the student did six actions,

and took three seconds per action on average,

which is fairly quick.

And then, they asked for five hints,

they got one correct answer,

and they didn't get any incorrect answers.

And the human labeled this particular behavior as gaming the system.
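To make that synchronized table concrete, one row of it could be represented like this in Python. The column names are made up for the sketch, and the values match the first-row example just described.

```python
# One row of the synchronized feature table; the values match the first-row
# example (6 actions, 3 s per action, 5 hints, 1 correct, 0 incorrect, gaming).
labeled_rows = [
    {"num_actions": 6, "avg_seconds_per_action": 3,
     "num_hints": 5, "num_correct": 1, "num_incorrect": 0,
     "label": "gaming"},
    # ... one row per labeled clip
]
```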

Now, once we've collected all this data,

what we do is we use machine learning algorithms to try to identify

patterns in the data that might not be easy for a human to pick up on.

And so, for example,

what we might want to do is build something we call a decision tree.

So, what that tree is going to do is

partition the data according to the different values of the features,

and then depending on the different values of the feature,

it's going to make a prediction saying whether the student is gaming or not gaming.

So, if we get an example here of what a decision tree might look like,

we start at the root of the tree,

which is the node at the top.

And so, for each node,

there are going to be two branches,

and it's going to separate one of the features into two partitions.

So, first of all,

here we might have

the first node looking at the average time that the student takes per action.

So, in this example,

we might say, well,

if the student is taking more than three seconds per action,

then the student is not gaming because they're not going fast enough.

But then, if the student is taking three seconds or less per action,

then we're going to look at the number of hints that were requested.

If there's a large number of hints that were requested,

we're going to say the student is gaming the system

because they're asking for a lot of hints and going fast.

Now, if the student is not asking for a lot of hints,

then maybe we want to look at the number of incorrect answers and

say that if the student is going fast and entering a lot of incorrect answers,

they might be gaming the system.

So, this model is not an actual true model of gaming the system,

it's just an example to kind of illustrate the process.

A real model of gaming the system would probably be much more complex.

Now, once we have that model, what we can do is,

we can take data about the behavior of

a student and use it to get a prediction from the model.

So for example, if the student has an average time per action of eight seconds,

asks for hints five times,

and enters three incorrect answers,

then we'll go into the tree and we'll follow

the appropriate branch depending on the values of the features,

and what we're going to see here is that because

the average time per action is greater than three seconds,

then the model is going to say that this is not a gaming behavior.

And then, we could do the same thing for a different example

where the student took three seconds per action on average,

asked for hints six times, and entered zero incorrect answers.

And now we would take the right branch at

first because the time is smaller than or equal to three,

and then we would take the right branch again

because the number of hints is greater than four.

In that case, the model would predict that the student is currently gaming the system.
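The toy tree and the two walkthroughs above can be sketched as a few nested conditions in Python. To be clear, this mirrors the illustrative tree only, not a real gaming detector, and the cutoff for "a lot of incorrect answers" is an assumed value.

```python
def predict_gaming(avg_seconds_per_action, num_hints, num_incorrect):
    """Toy decision tree mirroring the example in the video.

    Not a real gaming model; the incorrect-answer cutoff of 2 is assumed.
    """
    if avg_seconds_per_action > 3:
        return "not gaming"      # going slowly: not fast enough to be gaming
    if num_hints > 4:
        return "gaming"          # fast and asking for many hints
    if num_incorrect > 2:        # assumed cutoff for "a lot" of errors
        return "gaming"          # fast and entering many incorrect answers
    return "not gaming"

# The two worked examples above:
first = predict_gaming(8, 5, 3)    # 8 s/action, slow -> "not gaming"
second = predict_gaming(3, 6, 0)   # fast, many hints -> "gaming"
```

A learned tree would pick the features and thresholds automatically from the labeled data rather than having them written by hand.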

So, this is the process to build the model.

Now, once we have the model,

what we'll want to do is evaluate,

is this a good model?

So, I won't go into details on how to evaluate if it's a good model or not,

but just to give you a general idea,

what we're going to do is we're going to take the model and we're going to

reapply it to data for which we have labels,

and then we're going to compare the labels and the prediction from the model.

Ideally, that's going to be done on data that

is new and was unseen while we were building the model.

So what you might want to do is take your initial data,

use one part of the data to build your model,

and keep another part of the data aside for later to test on.

You can also use a little bit more advanced techniques such as cross-validation,

which I'm not going to describe in this video.
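As a minimal sketch of the holdout idea, assuming the labeled examples are just a Python list, the split could look like this. In practice you would typically reach for a library routine (scikit-learn, for example, provides train/test splitting and cross-validation helpers), but the underlying logic is simply shuffle and hold some out.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the labeled examples and hold a fraction out for evaluation."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the original list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

labeled_clips = list(range(10))     # stand-ins for 10 labeled examples
train, test = train_test_split(labeled_clips)
```

The model is then built on `train` only, and its predictions are compared against the labels in `test`.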

So now, if we go back to our data,

we can add a column in our table to say,

"What did the model predict?"

And so, in this specific example,

if I use the model that I've shown earlier,

what we would get is that out of five examples,

the model would get the right prediction four times,

which means that the model would have an accuracy of 80 percent.

So there are many different ways to evaluate the performance of a model,

many different metrics you can use,

and each one is going to have its own strengths.

So it's up to you to decide which one is most appropriate for your data,

or you can also compute more than one metric.

So, some of them include accuracy, precision, recall,

Cohen's Kappa, and the area under the ROC curve, also known as AUC.
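To illustrate, accuracy, precision, and recall can be computed directly from the human labels and the model's predictions. The five labels and predictions below are made up to match the four-correct-out-of-five example above; in practice, a library such as scikit-learn's metrics module also provides Cohen's Kappa and AUC.

```python
# Made-up labels/predictions matching the 4-correct-out-of-5 example.
human_labels = ["gaming", "not", "gaming", "not", "not"]
predictions  = ["gaming", "not", "not",    "not", "not"]

def accuracy(labels, preds):
    """Fraction of predictions that match the human label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def precision_recall(labels, preds, positive="gaming"):
    """Precision and recall for the positive ("gaming") class."""
    tp = sum(l == positive and p == positive for l, p in zip(labels, preds))
    fp = sum(l != positive and p == positive for l, p in zip(labels, preds))
    fn = sum(l == positive and p != positive for l, p in zip(labels, preds))
    return tp / (tp + fp), tp / (tp + fn)

acc = accuracy(human_labels, predictions)            # 4 of 5 correct -> 0.8
prec, rec = precision_recall(human_labels, predictions)
```

Here the model never falsely accuses a non-gaming student (precision 1.0) but misses half of the true gaming examples (recall 0.5), which shows why accuracy alone can hide important differences.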

And now, just to quickly conclude this video,

what we did here is,

we took a very quick overview of how predictive modeling can be used in education.

We can see that it has many different applications.

For example, when we're studying behavior such as gaming the system,

then we might use our model to study

the relationship between the behavior itself and learning outcomes.

So maybe students who engage in more gaming of the system

don't perform as well on future tests.

Or we can use the model to inform teachers about what's going on

with the students, so that the teacher can look at the report and see,

"Oh, there's a lot of gaming going on with this student,

I'd better go and figure out what's going on.

How can I better support the student?"

Or we could use the model to drive

automated interventions in the system itself to offset negative learning outcomes.

So maybe if we observe that the student is gaming the system pretty often,

then maybe we can propose remedial learning

activities to try to improve learning for that student.