And then I'll also open the 2009 file.
So now I have both the files - this is 2009, and the other file is 2007.
So in particular, what I'm interested in is in merging the information in the two files
and the common variable across both the files as the very first column,
which is the control variable.
This identifies a particular housing unit.
So we'll match the information in the two files using this common control variable.
So in particular, let's pose a question.
We wish to study the change in monthly utility cost
from the year 2007 to year 2009 for the same household.
So to do that,
what we need to do is we need too merge information from the file 2007 and 2009.
So, if you want, we could keep all of the remaining variables or,
for demonstration purposes, let's delete the remaining variables.
So this is the year 2007 file.
I'll delete all the variables except for the UTILITY variable.
Once again, my interest is in comparing the change in utility cost from 2007 to 2009.
So since this is a 2007 utility cost, monthly household utility cost,
I'll give it a name UTILITY 2007.
What I'm trying to do is I'm trying to merge information across the two files,
year 2007, 2009. So before going to the other file, let me create a column here,
and call it UTILITY 2009.
And in this column, I'll be putting in values:
the monthly utility cost for that particular household for the year 2009.
To populate this column, what I'll do is I'll use a VLOOKUP command,
which will look up the value in this particular file, which is 2007 file,
the particular household, match it up with the household and the year 2009 file,
and then give me the monthly utility value for that particular house,
household, in the year 2009.
So I do a "=VLOOKUP" open parenthesis,
The first value I need to put up is the lookup value,
in this case I'll be matching based on the control variable,
which is a unique identifying label for individual household.
Select that, that's my lookup value. Comma.
Where's my lookup table, and that's my year 2009 file.
So, then I'll access the year 2009 file.
So I select the entire 2009 file.
This is the data file 2009.
That's my lookup table.
I'm going the formula address column there.
Comma.
So, once the VLOOKUP command matches up the very first column,
that is, matches the individual household, I want to read the utility value.
This utility value, column U, is the 21st column from the very first column,
which to the control column.
So I, give in number 21 there, that's the column index number.
So once the VLOOKUP value matches the control variable in the very first column,
I want it to output the 21st value or the 21st column from there.
And the last input is the FALSE input.
Why?
Because I want an exact match.
So I give a FALSE there.
Close parenthesis.
Notice there are appropriate dollar signs inserted there
so that when I copy and paste this formula,
it will appropriately execute the VLOOKUP command.
I'll do an enter.
And, I get a value 229.166667.
So this was the monthly utility bill for this particular household in the year 2009.
To get it for all the households in my dataset,
I simply copy and paste this formula all the way down.
So, it's a large file, and it'll take some time to execute the VLOOKUP command.
So, the computer will executing the VLOOKUP command.
Now it's executed it, and notice, a lot many places it has a NA value, an error value,
or not applicable value.
Why do I get these?
Because, this particular household,
this particular household does not exist in the year 2009.
It exists in 2007; that does not exist for whatever reason
that individual household does not exist in year 2009.
So, since I'm interested in the same household
having two utility values for the year 2009 and 2007,
I'll delete all such households, which do not have two continuous values for 2007 - 2009.
So I'll delete all households with these error values.
However, before I delete this value,
so that computations are a little faster than Excel,
what I'll do is I will copy this column, column C,
and paste it as values,
and let's remove the formulas from there because I've already calculated my numbers.
So I'll copy and then paste it in the same place, but paste it as values.
So once I do this, notice formulas have vanished.
And why have I done it?
Simply so that the subsequent computations are a little bit faster
-- because as long there are formulas, every time you do a change in the worksheet,
Excel recalculates all the sets.
So I have copied and pasted it as values.
So now an easy way to delete all these error values, all these households,
which have these error values, or the missing values, is to filter my data.
So I go to Data, Filter and these filters get attached to my data headers.
We had discussed filtering in the very first course of this specialization
where we had extensively looked at filters, how they operate.
So then I go to UTILITY 2009 and de-select all.
And then I select only the missing values, NA, not applicable values.
It's doing OK.
So it filters out values, all filters or all households, whichever,
error or a not applicable value.
I need to delete all these households.
So easy way to delete that is I'll simply select all these houses.
Shift control down arrow.
And do a delete.
And now, the UTILITY 2009 value, I'll un-filter it, that is, select all.
OK.
So notice all my missing values are households
which have a missing value of utility in the year 2009 have been deleted.
But I'm left with these gaps in my data.
So how do I remove these gaps?
I simply sort this data.
I select all this data, all three columns, go to Data, Sort,
Sort by the very first value which is the control variable
which identifies an individual household.
Doing OK.
And it's sorted.
And now all my data is prepared.
Have I lost some information?
Yes.
Notice my households now in this file are 38,110 households.
Well some others have gotten deleted.
Why?
Because they did not have monthly household utility values for the year 2009.
So this is one way of merging two Excel files on a common variable.
So we demonstrated using one variable,
but you could have kept the other variables also,
or you could do it on some other variable, and so on.
That was one way of merging data across different Excel files
based on some common variable.
As mentioned, this would come handy
when we do some analysis spanning multiple years for the same housing unit.
One other thing that I wish to highlight is that, at many places,
you would see data that does not make sense.
For example, you may see negative values, negative or zero value for H, and so on.
These kind of errors typically occur in data and the reasons could be many.
It could be that people falsely report incomes, or ages,
or it could also be that the data capturing has not been done properly.
Whatever the reason, it may affect the data analysis.
We will deal with these kind of situations as they arise in our capstone project.