Learn how probability, math, and statistics can be used to help baseball, football and basketball teams improve, player and lineup selection as well as in game strategy.

Loading...

From the course by University of Houston System

Math behind Moneyball

24 ratings

University of Houston System

24 ratings

Learn how probability, math, and statistics can be used to help baseball, football and basketball teams improve, player and lineup selection as well as in game strategy.

From the lesson

Module 1

You will learn how to predict a team’s won loss record from the number of runs, points, or goals scored by a team and its opponents. Then we will introduce you to multiple regression and show how multiple regression is used to evaluate baseball hitters. Excel data tables, VLOOKUP, MATCH, and INDEX functions will be discussed.

- Professor Wayne WinstonVisiting Professor

Bauer College of Business

Okay, let's begin our study of the math behind Moneyball or sports analytics with baseball's Pythagorean Theorem. Now you probably remember the Pythagorean Theorem from high school geometry.

Which was, if you have 3 sides of a triangle, A squared plus B squared equals C squared, if it's a right triangle and C is the hypotenuse. Bill James is given the most credit I guess for coming up with this idea. But the idea is if you know how many runs a team scores in baseball in a season and how many they give up, can predict how many games they'll win? And so, Bill, I don't think knew a lot of advanced math but he was brilliant as we'll see throughout this course at coming up with sort of cool ways to figure out what's really going on even if you don't have a really advanced math knowledge. Like I don't think Bill James knew a lot about regression but he came up with runs created as we'll see soon. And that was a really brilliant way to predict how many runs a team would score based on their hitting statistics. Okay, so what Bill wanted to figure out was, okay, based on the runs for and runs against, how many games would the team win? Or what percentage of their games would they win? And so he said there was some exponent, Ctrl+1 let's me do a superscript. So I can raise this to a power, and then I would divide that by the same thing.

To a power and if I hit Ctrl+1, get rid of this superscript, and then I can say plus runs against to that power.

Okay, now if you divide the numerator and the denominator by runs against, you get runs scored divided by runs against to the alpha which we could say is the scoring ratio. Sorry, the exponent or I could use alpha. So I'd say superscript, exponent, divided by and then, if I take runs against here I get the same thing. Ratio, to the exponent.

And Ctrl+1, and now the nice thing is that I get a plus one because runs against to the alpha divided by run against, I guess these should all be exponents here. We gotta be consistent. Okay, so and this would be plus one.

I'm sorry, that's not going to be, so I'm going to have to go. Let's just do this again. Ratio, Ctrl+1, superscript exponent Ctrl+1.

Plus one which divided by runs against to the exponent goes like that. Okay, so that's an estimate for the fraction of games won. The question is, what exponent works for the given sport? Well again, as Edwards Deming, a great American statistician said, I urge you to read his great book, Out of the Crisis.

He was the guru, helped the Japanese come back from World War II by using statistical quality control. Deming said, Edwards Deming said, In God we trust; to all others need data. Okay.

Okay, so the question is, what exponent would work? Let's try an exponent of three and I can name that cell. We'll see more on names later. If I go to this drop down box, I can type in exp for exponent here.

Now if I go anywhere else in this spreadsheet and click on exp it refers to that cell. And if I use exp in a formula it will refer to that cell. So if I would type =exp. And that just makes life easier. It's like it's dollar signs, so when you copy the formula you don't have to worry about dollar signing. So now we want to figure out for a given exponent, how well do we predict the number of games the team wins. So we have some data here, 1980 through 2006. Okay so we look at the scoring ratio, try and use this formula, get the predictive winner percentage, the actual win percentage and see the absolute error. We don't use the actual error because positive and negative errors would cancel out. At MAD means, we're not angry, it means Mean Absolute Deviation. We want to know what exponent would give us the smallest mean absolute deviation. Well, the way to do that is with what's called an Excel a one way data table. You can vary one parameter and see how one or more formulas changes. So the formula we'll want to vary will be this mean absolute deviation. So all we want to vary will be the exponent. We'll want to see its effect on the mean apps with deviation. Okay, so that's what we're going to do here. So the scoring ratio is runs scored divided by runs allowed.

And a little trick to copy that down, just double-click the left mouse at the little crosshair. See then in the next row, in this row I'm taking what's to the two columns to the left, divided by one column to the left, copy it down and it goes to the same thing. Now the predicted winning percentage would be the ratio to that exponent.

So they should win a little less than half their games and get 48.6%. Copy that down. Now the actual winning percentage is wins divided wins plus losses.

I have two, let [INAUDIBLE]. And then the absolute error would be the absolute value of actual winning percentage minus predicted and then I just average these to get the net.

Equal, type eave, I can double-click on Average, left paren, Ctrl+Shift+Down Arrow lets me go down to the bottom of the column. So I was off by 3.4%. Now does a one way data table work? This is a nice concept we use over and over again in Excel. One way data table, you can change one input, in this case the exponent and look at the effect on multiple outputs. We just have one here.

Okay, our only output will be the math. And so the way you set this up, you list the values of the input going down a column. So let's start with let's say 0.5, 0.6 and if I select those two cells, I can get the crosshair and Excel learns the pattern. We'll go to three. We, for baseball I know we don't have to go any higher, you'll see the smallest value occurs about 1.9 here. Then you go up one row and over one column and you show the formula you want to recalculate, and that formula is the MAD, right there.

So I select the table range, I go over Ctrl+Shift right arrow, Ctrl+Shift down arrow, and I go data analysis. We'll do goal seek in a few videos. Scenario manager I don't care about, but data table.

There is in this way, a one way data table only at column input cell. So the numbers in the first column are the exponents, I want to have those be fed into the exp cell. So I do this and I click OK and it should recompute the math for each case.

Okay, and you can see the smallest mean absolute deviation will occur right here, an exponent of 1.9, which is very close to two. And with different data sets you'll get a different exponent, but it should always be pretty close to two. But that's why it's called the Pythagorean Theorem, because if you put a two in there it's like A squared plus B squared equals C squared. It's ratios squared over ratios squared plus one. Predicts how many games a baseball team will win. Now in the NBA the exponent is near 14 and in the NFL, it's near 2.4. We may even give you a test question to try to figure it out for hockey. And if there's a tie in the NFL or the NHL, I'd count a tie as half a win. When doing this baseball and basketball don't have ties. Of course soccer has tons of ties. So you can look at premier league data and try and figure out what the exponent would be there. But this, you'll see why this is important in the next video because I can show that if you make a team 10 points better, team scores ten more runs, they should win about one more game in major league baseball. So if you can get a hitter that makes you 20 runs better, you expect to win two more games during the next season. And that's the way, basically, that managers and front office people need to think of things. Okay, just one last thing here. I sound like Colombo if you're old enough to remember that show. He'd always say one more thing before he caught the murderer. You knew you were finished if you're the murderer and he said one more thing. But I guess I always say one more thing when I have five more things, but if you go Formulas > Calculation > Options, there are three choices. Automatic is where your spreadsheet probably is, which means everything recalculates automatically when you enter formulas. Okay, now why can that be bad? If you have a humongous spreadsheet and every time you type stuff in one cell you have to wait five or ten minutes for it to recalculate, it's no fun. So we open, put our spreadsheets on manual, and then F9 will recalculate in spreadsheet.

Sorry. The formula is Calculation options, or Automatic Except for Tables. That means everything in your spreadsheet will recalculate when you enter formulas except your data tables which can be slow, and you'll have to hit F9 to have them recalculate, okay? And so that's really useful. Again we would do referee spreadsheets for Mark Cuban of the Mavericks and they'd be very slow, the data tables we'd have thousands of rows of data and so, we would hit F9 to recalculate them. So, if all the numbers look the same in a data table, maybe you have Automatic Except for Tables on. You have to hit F9 to recalculate the spreadsheet. Okay, so in the next video we'll try and understand why ten runs equals one win.

Coursera provides universal access to the world’s best education, partnering with top universities and organizations to offer courses online.