And you might measure them in qualitative terms or in quantitative terms.
So usually when we think about variables we think about things like this.
If you look down here at the bottom,
we think about qualitative variables like country of origin, sex, and treatment, and
quantitative variables like height, weight, and blood pressure.
A lot of these measurements are actually derived
from much lower level measurements.
So for example, if you think about blood pressure,
blood pressure is actually calculated from a pressure measurement.
And there are actually a lot of low-level things that go
into calculating that pressure measurement.
And so those low level things are the kinds of things
that we're going to be talking about in raw versus processed data.
So the raw data are the original source of the data.
They're often very hard to use for data analysis because they're complicated,
they're hard to parse, or they're just difficult to work with directly.
Data analysis actually includes the processing or the cleaning of the data.
If you go and obtain a dataset that's actually a raw image file and you process
it, and turn it into a nice data frame that you then use to analyze in R,
that data processing actually is part of the data analysis,
of your data science pipeline.
In fact a huge component of a data scientist's job is performing those sorts
of processing operations.
The raw data may only need to be processed once, but regardless of how often you
process it, you need to keep a record of all the different things you did.
Because it can have a major impact on the downstream analysis.
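One simple way to keep that record is to log each processing step in code as you go. This is just a minimal sketch, not the lecture's own method; the data and step descriptions here are hypothetical stand-ins:

```python
# A minimal sketch of recording every processing step alongside the data.
# The dataset and the steps are made up, for illustration only.
processing_log = []

def log_step(description):
    """Append a human-readable record of one processing step."""
    processing_log.append(description)

raw = list(range(10))  # stand-in for a raw dataset of 10 records
log_step("loaded raw data (10 records)")

# Example cleaning step: drop odd-valued records.
cleaned = [x for x in raw if x % 2 == 0]
log_step("removed odd-valued records")

for step in processing_log:
    print(step)
```

Even a plain list like this means that months later you (or someone reproducing your work) can see exactly how the processed data were derived from the raw data.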
The processed data is data that is ready for analysis.
So the processing of the data might include merging, subsetting, transforming,
or you might go into a file and extract out a part of an image.
You might go into a file and
extract out a little bit of text from a free-form text field.
Or you may do a number of other things.
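To make those operations concrete, here is a small sketch of merging, subsetting, and transforming using pandas. The tables and column names are invented for illustration; they aren't from the lecture:

```python
# Illustrative only: tiny in-memory tables standing in for raw data files.
import pandas as pd

subjects = pd.DataFrame({"id": [1, 2, 3],
                         "sex": ["F", "M", "F"]})
measures = pd.DataFrame({"id": [1, 2, 3],
                         "weight_kg": [55.0, 80.0, 62.0]})

# Merging: combine the two raw tables on a shared key.
merged = subjects.merge(measures, on="id")

# Subsetting: keep only one group of interest.
females = merged[merged["sex"] == "F"]

# Transforming: derive a new variable from an existing one.
females = females.assign(weight_lb=females["weight_kg"] * 2.20462)

print(females)
```

Each of these three steps is exactly the kind of thing you'd want recorded, since any one of them changes what ends up in the processed dataset.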
Depending on the field that you work in, there may be standards for processing.
For example, in the area where I work, genomics,
there are a lot of really standard preprocessing techniques that need to be
applied before you can analyze data.
A critical, critical component is that all steps should be recorded.
I can't state this strongly enough.
Preprocessing often ends up being the most important component of a data analysis in
terms of its effect on the downstream data.
So paying attention to all the steps that you did is critically important if you're
going to be a data scientist who's careful about understanding what's really
happening in the entire data processing pipeline.
So, I'm gonna give you one really quick example of a processing pipeline just to
illustrate what I mean by there being different levels of raw data.
So, this is an Illumina HiSeq machine,
and what this machine can be used to do is to sequence DNA.
And so that sequencing is much, much faster now than it used to be in the past.
When the original Human Genome Project got started, it took almost a decade and
over $1 billion to sequence one human genome.
And that same process can now be performed in about a week for
about $10,000 using a machine like this.