0:10

Hello. In this lesson, we're going to look at using the NumPy module within Python to explore higher-dimensional data. This will also include the analytical quantification of correlations between different dimensions of a dataset.

So a good way to think about this is: if you have a DataFrame and you want to understand whether two columns are correlated, one technique is, of course, to make a scatter plot and look at it visually. A second technique is to actually compute correlation measures analytically.

Now, this notebook will be very similar to a previous notebook where we introduced NumPy; however, now we will be looking at NumPy in a multi-dimensional setting. Most of the time, we will be focusing on NumPy as a two-dimensional array, which we often think of as a matrix. All of this is contained in the advanced NumPy notebook.

As I said before, this is going to focus on multi-dimensional arrays, and in general we're going to stay focused on a two-dimensional array. But whatever we do will be easily extendable to higher dimensions.

One of the standard ways to create a multi-dimensional or two-dimensional array is to start with a one-dimensional array and then reshape it as needed for whatever analysis we're doing.

So here's an example. We first create a one-dimensional 100-element array, and then we reshape it into a 10 by 10 array, which also has 100 elements. What we do is take the first 10 elements and make that the first row, then the second 10 elements and make that the second row, and so on until we reach the end.

This is demonstrated in this example. You can see we've printed out a 100-element array, and then, when we turn it into a two-dimensional array and print that out, you can see how it is nicely organized.
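The example described above can be sketched like this (a minimal illustration; the notebook's own cell may use different variable names):

```python
import numpy as np

# Build a one-dimensional 100-element array.
a = np.arange(100)

# Reshape it into a 10 x 10 two-dimensional array: the first 10 elements
# become row 0, the next 10 become row 1, and so on.
b = a.reshape(10, 10)

print(b.shape)   # (10, 10)
print(b[0])      # first row: 0 through 9
print(b[1, 0])   # first element of the second row: 10
```

Note that `reshape` requires the total number of elements to stay the same: 100 elements reshape cleanly into 10 by 10, but not into, say, 10 by 9.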

There are also convenience functions for creating special matrices. If you want to create an identity matrix, where the diagonal elements are all one and the off-diagonal elements are zero, we do that with np.eye. The reason it is called eye is that the identity matrix is typically represented by a capital letter I.

We can also use NumPy methods to create diagonal matrices with different diagonal elements, and we can create matrices where the off-diagonal elements take particular values.
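As a quick sketch of these convenience functions (the notebook's cell may differ in the exact values used):

```python
import numpy as np

# 3 x 3 identity matrix: ones on the diagonal, zeros elsewhere.
i = np.eye(3)

# Diagonal matrix with chosen diagonal elements.
d = np.diag([1, 2, 3])

# The k argument shifts the ones off the main diagonal;
# k=1 puts them on the diagonal just above it.
k = np.eye(4, k=1)

print(i)
print(d)
print(k)
```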

Now, one of the more challenging things when dealing with multi-dimensional arrays is indexing or slicing them. The important thing is that we specify the first dimension, followed by a comma, and then the second dimension. If there are additional dimensions, we just use a comma again; you can literally keep doing this until you run out of dimensions.

If only one dimension is specified, by default it refers to the first dimension.

We demonstrate this in the following code cell, where we first build a two-dimensional array that's three by three and print it out; here you can see it: zero, one, two, three, four, five, six, seven, eight. Then we start slicing in the first dimension, in the second dimension, and in both dimensions.

So you can see we slice out the first row, and we can also slice out the first column. How are we doing this? In the first case, we just give the first index, which of course refers to the first row. In the second case, we put a colon, which says select all rows, but we select only the second column; that gives us just the second column. And of course, we can slice out individual elements in different ways.
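The slices just described can be sketched as follows (illustrative; the notebook cell may print different labels):

```python
import numpy as np

# A 3 x 3 array holding 0 through 8, row by row.
c = np.arange(9).reshape(3, 3)

first_row = c[0]       # one index only: refers to the first dimension -> [0, 1, 2]
second_col = c[:, 1]   # colon = all rows, 1 = second column -> [1, 4, 7]
element = c[2, 2]      # an individual element -> 8

print(first_row, second_col, element)
```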

We can also do more complex slicing. In this case, we have a three by three by three array, so it's a three-dimensional array, and this example shows how to slice it.
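A sketch of slicing in three dimensions, extending the same comma-separated pattern (the notebook's example may slice differently):

```python
import numpy as np

# A 3 x 3 x 3 array holding 0 through 26.
d = np.arange(27).reshape(3, 3, 3)

plane = d[0]          # first 3 x 3 "plane"
row = d[0, 1]         # second row of that plane -> [3, 4, 5]
corner = d[2, 2, 2]   # very last element -> 26
down = d[:, 0, 0]     # first element of every plane -> [0, 9, 18]

print(plane)
print(row, corner, down)
```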

We can also do the standard things that we've typically done in NumPy, including boolean masking, where we might want to grab out certain elements, say where the element is greater than four, or where it's evenly divisible by two. This shows how we do that.

We can also perform arithmetic on these masked selections, and we can perform the basic operations that we've done before, such as adding to each element or multiplying each element.
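Here is a minimal sketch of boolean masking and element-wise arithmetic on a two-dimensional array (the notebook's conditions are the ones used here, but the variable names are illustrative):

```python
import numpy as np

b = np.arange(9).reshape(3, 3)

# Boolean masks select elements satisfying a condition.
big = b[b > 4]         # elements greater than four -> [5, 6, 7, 8]
even = b[b % 2 == 0]   # elements evenly divisible by two -> [0, 2, 4, 6, 8]

# Arithmetic applies element-wise, to a selection or to the whole array.
print(big + 10)        # [15 16 17 18]
print(b * 2)           # every element doubled
```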

And of course, we can also apply the summary functions that we've seen before, such as mean, median, variance, and standard deviation, as well as other universal functions like sine, cosine, etc. This makes it very easy to perform complex analysis on these two- or higher-dimensional arrays.
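A brief sketch of these summary and universal functions on a two-dimensional array (values here are illustrative):

```python
import numpy as np

b = np.arange(9).reshape(3, 3)

# Summary functions over the whole array.
print(b.mean(), np.median(b), b.var(), b.std())

# With a multi-dimensional array we can also summarize along one axis,
# e.g. the mean of each column.
print(b.mean(axis=0))   # [3. 4. 5.]

# Universal functions apply element-wise.
print(np.sin(b))
```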

The next thing we're going to look at is masked arrays. We saw these in the one-dimensional case, where we can indicate that an array should be masked, such that operations that might trigger an error condition will instead simply produce a not-a-number value in our output array.
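A minimal sketch of this idea using NumPy's `numpy.ma` masked-array module; the specific operation (masking a zero before dividing) is an assumed example, not necessarily the one in the notebook:

```python
import numpy as np

b = np.arange(9, dtype=float).reshape(3, 3)

# Mask the zero entry so the division below does not trigger an error;
# masked positions simply propagate through the computation.
m = np.ma.masked_equal(b, 0)
result = 1.0 / m

# Filling masked positions with NaN turns them into "not a number"
# in the plain output array.
print(result.filled(np.nan))
```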

Now, the last thing I want to mention is actual correlation measurements. In the scatter plot notebook, we saw visually how to interpret relationships between data, but we can also make an analytic quantification of such a relationship. The two main techniques for doing this are the Pearson correlation coefficient and the Spearman correlation coefficient, and this notebook shows you how to compute them.

First, this is a figure taken from Wikipedia that shows the Pearson correlation coefficient, often written with a lowercase r, for a number of different datasets. You can see that the perfectly linear relationships, whatever their slope, are perfectly correlated: a positively sloped line has a value of r equals one, and a negatively sloped line has r equals minus one. As the scatter increases, the value shrinks from one toward zero on the positive side, and likewise moves from minus one toward zero on the negative side. And of course, if there's no real relationship at all, the correlation coefficient is zero.

The Spearman correlation coefficient is similar; however, it only requires that as x increases, y continues to move in the same direction. So it measures a monotonic relationship, as opposed to the linear relationship of the Pearson correlation. We typically write it as a rho value, and we will use it in other tests, particularly hypothesis testing.

Now, one of the other things that goes along with correlations is the covariance, where we measure the relationship between two datasets: when one variable is increasing, what happens to the other variable? We've measured mean values before, and we've measured variances before, but when we have two-dimensional datasets, we now have to measure the relationship between those two dimensions. How does one variable affect the other? We represent that with the covariance.

We can easily calculate this with the scipy.stats module: we can calculate both the Pearson and the Spearman coefficients, as well as the covariance matrix from within NumPy. That's what this last code cell does.
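A sketch of what such a cell might look like, using `scipy.stats` for the two coefficients and `np.cov` for the covariance matrix; the synthetic data here is purely illustrative:

```python
import numpy as np
from scipy import stats

# Make up two related variables: y is roughly twice x plus noise.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

r, r_pvalue = stats.pearsonr(x, y)          # Pearson r and its p-value
rho, rho_pvalue = stats.spearmanr(x, y)     # Spearman rho and its p-value

cov = np.cov(x, y)                          # 2 x 2 covariance matrix

print(r, rho)
print(cov)
```

Since the relationship is close to linear, both coefficients come out near one; the off-diagonal entries of the covariance matrix give the covariance between x and y.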

Hopefully, this has given you an introduction to two-dimensional or higher-dimensional arrays. They are a little more complicated than one-dimensional arrays, but they enable much more detailed analysis of datasets and provide you with a richer toolset with which to attack your data problems. If you have any questions about this material, please let us know in the course forum. And good luck.
