In simple linear regression, we analyze the linear relation between two quantitative variables, one independent and the other dependent. Regression forms the basis for a lot of advanced statistical techniques, so it's really important to understand what regression is. In this video we'll start with correlation, the regression equation, the intercept, and the slope of the regression line.
First, let's see why we need regression analysis. Suppose I want to become rich and famous by posting cat videos on the Internet. To be successful it would help to know what characteristics are associated with popular cat videos. One thing I've noticed is that videos of kittens and young cats seem very popular. To confirm this I could collect information on some existing cat videos and analyze the relation between popularity, measured by number of video views, and age of the cat as mentioned in the video description. Suppose the scatterplot looks like this.
To determine how strongly these two quantitative variables, popularity and cat age, are linearly related, we can look at the correlation coefficient. This is a number between -1 and +1 that expresses how tightly the data fit around an imaginary straight line through the scatterplot. From the correlation coefficient, Pearson's r, we can learn whether the relation is positive or negative. In this case, videos become less popular as cat age increases, so Pearson's r is negative. We can also see that the correlation is relatively strong. Of course, this means it might be a better idea to record my new kitten instead of my older cat, since a younger age is strongly associated with a higher popularity score.
It would be useful to describe the relation more specifically, and to be able to predict an exact popularity score based on a cat's age. But the correlation doesn't give this information. This is why we use linear regression. It describes the relation mathematically through a regression equation.
The equation gives popularity predictions for each cat age. This allows us to do a couple of interesting things. We can use inferential statistics to test if the equation is likely to be an accurate description of the relation in the population. We can also see how closely the predictions approximate the observed data points, in other words, how good our predictions are. We can use the regression equation to identify outliers, data points that deviate strongly from the rest. And finally, we can generate predictions for new cases, for example, to estimate how popular videos of my new kitten will be.
So how does regression work? Well, in regression analysis we distinguish between an independent and a dependent variable. The dependent or outcome or response variable is the variable we want to predict. In this case, video popularity. The independent variable or explanatory variable or predictor is the variable that can be used to predict the response variable. In this case, we think cat age can predict popularity. I'll use the terms response variable and predictor from now on, because they're shorter and less easily confused. The predictor always goes on the x axis. The response variable, in this case video views per thousand, always goes on the y axis.
In many cases it's clear which variable we want to predict, and which is therefore the response variable, like popularity in our example. In some cases, when the causal direction is unclear, it's arbitrary which variable we consider the predictor and which the response variable. In such cases the choice is simply determined by how we choose to frame the research question.
Remember the imaginary line we just drew when we looked at the correlation? This actually is the best fitting straight line through the scatterplot. We call this the regression line. It's described by the regression equation, which gives predicted response variable scores for each value of the predictor. The equation is y hat sub i = a + b times x sub i.
Y hat sub i is the predicted score on the response variable y, popularity, for case i, given its value x sub i on the predictor x, cat age. The predicted score is determined by the intercept a and the regression coefficient, or slope, b. The intercept a is equal to the value of y when x equals zero, where the line crosses the y axis. It determines where the line is placed. The regression coefficient b determines the slope of the line. It determines whether the line goes up or down and how steeply it climbs or falls. In our example it tells us by how much predicted video popularity will decrease if the cat's age increases by one unit, so in this case, one year.
Suppose that in our example a is 44.95 and b is -3. Then the predicted popularity score for a video of a half-year-old cat is 44.95 - 3 times 0.5. This equals 43.45, in thousands of views, or 43,450 video views. Similarly, the predicted score for a two-year-old cat is 44.95 - 3 times 2. This equals 38.95, in thousands of views, or 38,950 video views. As you can see, these are the y values for x is 0.5 and x is 2 on the regression line. These are the predicted scores, which differ from the observed scores.
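
If you'd like to check these two predictions yourself, here is a minimal Python sketch. It only plugs the example values a = 44.95 and b = -3 into the regression equation; the names INTERCEPT_A, SLOPE_B and predict_views_thousands are made up for this illustration and don't come from any library.

```python
# Example values quoted in the lecture; the names are hypothetical.
INTERCEPT_A = 44.95  # predicted views (in thousands) for a cat of age 0
SLOPE_B = -3.0       # change in predicted views (thousands) per extra year of age


def predict_views_thousands(cat_age_years: float) -> float:
    """Return the predicted score y-hat = a + b * x, in thousands of views."""
    return INTERCEPT_A + SLOPE_B * cat_age_years


# Predictions for the two cat ages used in the example.
for age in (0.5, 2.0):
    y_hat = predict_views_thousands(age)
    print(f"age {age} years: {y_hat:.2f} thousand, about {y_hat * 1000:,.0f} views")
```

Calling the function with an age of zero simply returns the intercept, which matches the idea that a is the value of y where the line crosses the y axis.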