Hi, and welcome to the lecture on some background notation. In this first part of the lecture, we're just going to cover some simple matrix derivatives, and then give an example of using those matrix derivatives in a statistical context.

As a first useful fact, consider a function f from R^p to R that is linear, so that f(x) = a transpose x. In this case, the gradient of f, the vector of element-wise derivatives with respect to x, is simply equal to a. Now consider a quadratic form g, also a function from R^p to R, that looks like x transpose A x, where A is a p by p symmetric matrix. In this case, the gradient of g works out to be 2Ax. If we then take the second derivative of g with respect to x, we get the so-called Hessian, the matrix of second partial derivatives: its ij element is the derivative of g taken with respect to x_i and then with respect to x_j. For our quadratic form, the Hessian is simply 2A.

Let's go through a simple example, but one that is very fundamental for the class. Imagine that you have a vector y, an n by 1 vector of data points, and you would like to explain y with X, an n by p collection of explanatory variables. Each row of X is a potential explainer of the corresponding data point of y. And you want to do this in a linear fashion: you multiply X by a p by 1 vector beta to get an explanatory version of y using the rows of X. You're going to accomplish this by minimizing the squared distance between the vector you'd like to explain, y, and the linear combination of explanatory variables, X beta. If we expand this squared distance out, we get y transpose y minus 2 y transpose X beta plus beta transpose X transpose X beta.
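The two gradient facts above can be checked numerically. The following sketch, in NumPy, uses made-up random choices of a, A, and x and compares a central finite-difference gradient against the analytic formulas a and 2Ax:

```python
import numpy as np

# Illustrative random inputs (a, A, x are made-up values for the check)
rng = np.random.default_rng(0)
p = 4
a = rng.normal(size=p)
M = rng.normal(size=(p, p))
A = (M + M.T) / 2                  # symmetrize to get a p x p symmetric matrix
x = rng.normal(size=p)

def num_grad(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# gradient of the linear form a'x is a
assert np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-5)
# gradient of the quadratic form x'Ax is 2Ax (A symmetric)
assert np.allclose(num_grad(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-5)
print("gradient formulas check out")
```

Note that the 2Ax formula relies on A being symmetric; for a general A the gradient would be (A + A transpose) x, which is why we symmetrized M above.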
Using our results above, if we take the derivative of this function with respect to beta, we get negative 2 X transpose y plus 2 X transpose X beta. Setting that equal to 0, we get the solution (X transpose X) inverse X transpose y; let's call that beta hat, our estimator. Let's set aside for the moment the question of whether X transpose X is invertible.

Now, if we take the second derivative of this objective, the only term that involves beta is 2 X transpose X beta, and using the results we just talked about, the second derivative is 2 X transpose X. Having said that, note that X transpose X, and hence 2 X transpose X, is positive semidefinite. We can see that because if we take any vector a, then a transpose X transpose X a is exactly the norm of Xa squared, so it has to be greater than or equal to 0. And when X has full column rank, which is exactly the case in which X transpose X is invertible, that quantity is strictly positive for any nonzero a, so the matrix is positive definite. That means the solution to our first-order condition is in fact a minimum, and beta hat is the minimizer of this least squares problem. So that's an example of using our matrix derivative results to find the solution to a statistically meaningful equation.
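The whole derivation can be sketched numerically. In the following NumPy snippet, the particular X, y, and true beta are made-up illustrative values; it solves the normal equations for beta hat, checks that the gradient negative 2 X transpose y plus 2 X transpose X beta vanishes there, and compares the result against NumPy's built-in least-squares routine:

```python
import numpy as np

# Made-up illustrative data: n observations, p explanatory variables
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal-equations solution: beta_hat solves (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# First-order condition: the gradient -2X'y + 2X'X beta vanishes at beta_hat
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat
assert np.allclose(grad, 0, atol=1e-8)

# Agrees with NumPy's built-in least-squares routine
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_ls)
print("beta_hat:", beta_hat)
```

In practice one solves the linear system (X transpose X) beta = X transpose y rather than explicitly forming the inverse, which is both cheaper and numerically better behaved.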