So, what we want to do now is add a new recipe. As you can see, you can have multiple different data sources as part of your flow, but right now we want to explore the data a little bit.
A recipe allows us to chain together a series of transformations on the raw data.
the recipe, and this will load, at the time of this recording,
about 10 megabytes of sample data.
So we had 56,000 or so rows in the source, and in the lower left-hand corner
you'll see that we've loaded about 13,000 of those, just a small sample,
into the web UI of Cloud Dataprep called the transformer view,
and this will give you a quick look at some of
the key data values. It lets you scan the frequent values in histograms,
see how many distinct categories each column has, what
all the different field values are, and
some useful summary statistics like averages,
medians, standard deviations, that kind of thing.
Now, it is just a sample.
It's not part of this lab,
but you can actually explore the different sampling types you can bring in.
So if you didn't want a random sample,
if you wanted a stratified random sample instead,
or otherwise wanted to provide guidance on
what kind of sample data gets brought in, that's a setting within Cloud Dataprep,
but right now we're fine with whatever it brings in as a random sample.
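To make the difference between those sampling types concrete, here is a rough sketch in plain Python. This is illustrative only, not Cloud Dataprep's actual sampling implementation, and the `channelGrouping` field name and row counts are toy stand-ins for the lab's data:

```python
import random
from collections import defaultdict

def random_sample(rows, n, seed=42):
    """Plain random sample: every row is equally likely to be picked."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

def stratified_sample(rows, key, per_group, seed=42):
    """Stratified sample: take up to per_group rows from each category,
    so rare categories are guaranteed some representation."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Toy rows: a heavily skewed column, 90 Referral vs 10 Direct.
rows = [{"channelGrouping": c} for c in ["Referral"] * 90 + ["Direct"] * 10]

print(len(random_sample(rows, 20)))
strat = stratified_sample(rows, "channelGrouping", per_group=5)
print(sorted({r["channelGrouping"] for r in strat}))  # both channels present
```

The point of the stratified option is visible in the toy data: a small plain random sample of a 90/10 split can easily miss the rare category entirely, while the stratified version always includes it.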
So, the first task in this lab is to understand the e-commerce data as
much as we can, even before we start adding in
any transformations, or even before we think about writing that table back out to BigQuery.
So, a couple of questions we want to answer:
how many columns are in the dataset? Keep in mind we're
doing this for the e-commerce dataset now, and
you'll get a really good understanding of how to work within it,
but if somebody gives you a random dataset and you're not familiar with it,
load it into Cloud Dataprep and
see the columns with their frequent values.
The same process we're applying here for exploring
is going to be beneficial for you and your own datasets later on.
So, columns in the dataset: we have 32 columns.
How many rows are loaded in our sample?
You see here, all the way at the bottom, about 13,000.
Now I'm going to ask slightly deeper questions,
the kind that would normally take a SELECT with a
GROUP BY on a particular column in SQL, and doing that 32 times sounds really onerous.
So, that is why we get this nice visual view of the frequent values for the sample.
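What those histograms compute is essentially a per-column frequency count, the same result a `SELECT col, COUNT(*) ... GROUP BY col` would give you. A minimal sketch of the idea in plain Python, with toy rows standing in for the sample (field name and values are illustrative):

```python
from collections import Counter

# Toy sample rows; in the lab, the transformer view does this
# automatically for every one of the 32 columns.
rows = [
    {"channelGrouping": "Referral"},
    {"channelGrouping": "Organic Search"},
    {"channelGrouping": "Referral"},
    {"channelGrouping": "Direct"},
    {"channelGrouping": "Referral"},
]

# Equivalent of: SELECT channelGrouping, COUNT(*) GROUP BY channelGrouping
counts = Counter(row["channelGrouping"] for row in rows)
for value, count in counts.most_common():
    print(value, count)
```

`most_common()` orders the categories by frequency, which is exactly the ordering the histogram bars give you at a glance.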
So, first question is,
what are the most common channel groupings for the dataset?
So you have seven different category values here,
again with the sample.
Looks like we're driving a significant amount of traffic through referrals
and organic search and those are the top two again.
Big asterisk here: this is just our sample of the
data, and your sample could actually be different from mine,
because you could have loaded the sample a different way
or the tool could have chosen other random records,
but by and large, referral traffic and organic search dominate
for this particular date that we've loaded in.
What are the top countries that provide us visitors?
Over here, by and large, for this day, 80 percent are from the United States, and we have India,
the United Kingdom, and a longer tail after that;
same deal for cities.
Now, let's scroll over.
What does the gray bar under total transaction revenue represent?
You've got the data type here; in this particular case,
it's an integer, and
Cloud Dataprep will sometimes, or often, try to assume good intent.
So if you load something in there as a string that it thinks should be an integer,
it will try to auto-convert that for you.
So you have to be a little bit careful with the intelligence
that it brings in, especially when it comes to
data types where you have numbers and strings mixed together.
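One way to picture that "assume good intent" behavior is majority-vote type inference: look at the non-null values and pick the type most of them parse as. This is a loose sketch of the idea, not Cloud Dataprep's actual algorithm:

```python
def infer_type(values):
    """Majority-vote type inference (illustrative sketch):
    call the column an integer if most non-null values parse as one."""
    def looks_int(v):
        try:
            int(v)
            return True
        except (TypeError, ValueError):
            return False

    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return "string"
    ints = sum(looks_int(v) for v in non_null)
    return "integer" if ints / len(non_null) > 0.5 else "string"

# "n/a" would be flagged as a mismatch once the column is typed as integer.
print(infer_type(["12", "40", "n/a", "7", None]))
print(infer_type(["a", "b", "3"]))
```

This is why mixed columns are the risky case: a few stray strings in a mostly numeric column get silently flagged or converted, which is exactly what the warning in the narration is about.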
So total transaction revenue here,
the gray bar represents missing values.
So, it's saying: I see a lot of
integer values here,
so I'm going to call this an integer,
but there are a lot of null or missing values here,
more than half. And what does that actually represent in terms of our dataset?
It means you had a lot of visits that didn't have revenue associated with them;
not everyone that visits your site is going to buy something, unfortunately.
So you can almost back out a rough conversion rate here:
fewer than half of the visits here generated revenue.
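That back-of-the-envelope conversion rate is just one minus the missing-value fraction the gray bar shows. A minimal sketch with toy numbers (the real lab's counts will differ):

```python
# Toy revenue column: None stands in for the missing values
# that the gray bar in the transformer view represents.
revenue = [None, 120, None, None, 80, None, 45, None]

missing = sum(1 for v in revenue if v is None)
missing_fraction = missing / len(revenue)
conversion_rate = 1 - missing_fraction  # visits that produced revenue

print(f"missing: {missing_fraction:.0%}, converted: {conversion_rate:.0%}")
```

With more than half the values missing, the conversion rate lands below 50 percent, matching the "fewer than half of visits generated revenue" reading above.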
Now, let's take a look at some metrics
that would normally require an aggregation function,
but here it'll just be a couple of simple clicks.
So you get time on site,
you get page views,
and you get the session quality dimension, which runs from zero to 100;
the closer to 100, the more likely the session leads to a transaction that converts.
So you can even just visually look at these histograms
here and get a general sense of the distribution.
As you might expect,
time on site is skewed heavily toward
the lower end of the scale,
same thing with page views,
same thing with the session quality dimension,
but let's get a little bit more specific.
If you click on the drop-down for a column,
you can look at the details of that column.
So this is time on site:
if you want things like statistics, in the lower left-hand corner
you can see the average time on site, about 900 seconds,
and you can see the min, max,
and standard deviation, again just very quick at-a-glance values.
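Those at-a-glance statistics are exactly the ones Python's standard library computes. A quick sketch with made-up time-on-site values in seconds (illustrative, not the lab's actual data):

```python
import statistics

# Toy timeOnSite values in seconds for a handful of sessions.
time_on_site = [120, 300, 900, 1500, 600, 60, 2400]

print("mean:  ", statistics.mean(time_on_site))
print("median:", statistics.median(time_on_site))
print("min:   ", min(time_on_site), " max:", max(time_on_site))
print("stdev: ", round(statistics.stdev(time_on_site), 1))
```

Note how the mean (840) sits well above the median (600): a few long sessions pull the average up, which is the same skew the histograms show in the transformer view.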
Now, to go back, we're just going to switch back to the grid view
and do the same thing for page views.
How many pages are folks visiting on average?
Let's go to the column details again,
and according to this, we have about 20 pages.