Hello, my name is Pavel.
And today, we'll talk about the functions that
can be applied to data in a DataFrame.
As you remember, when working with SQL you
can apply functions to columns to get your results.
Unlike the functions you use in plain Python,
SQL functions are applied not to a single value
but to all the values in the selected columns at once.
This happens distributively, on all nodes of the cluster.
There are three types of SQL functions.
The first type is mapping functions.
They are the functions that map one value
in the source into one value in the destination.
Such functions are the majority.
The second type is the so-called
generating functions, which transform each element into several ones.
Actually, there is only one function of that type,
but what a function: it is called explode and allows you to expand arrays into new rows.
For sure we'll talk about it,
but all in good time.
And the third type of function is aggregating functions.
They are used when you need to create a new record from a group of existing ones.
We will talk about them in the next video.
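To make the three types concrete, here is a minimal sketch; the toy DataFrame and its columns are invented just for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A toy DataFrame for illustration only; the real log comes later.
df = spark.createDataFrame(
    [("/page", "Mozilla Linux Android")],
    ["url", "user_agent"],
)

# Mapping: one value in, one value out
df.select(F.length("user_agent")).show()

# Generating: one row becomes several rows
df.select(F.explode(F.split("user_agent", " "))).show()

# Aggregating: a group of rows becomes one row
df.groupBy("url").count().show()
```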
All the functions you work with in SQL are also available in the DataFrame.
There are more than a hundred of them and I will not enumerate them.
Otherwise, you might get bored and fall asleep.
I will just show you
a few examples of using them and give you a link to the documentation.
Let's take access_log and apply some interesting functions to it.
In the log there is an interesting field, user_agent.
This is a string of information that our browser sends to the server about itself.
The string itself had a hard time long ago, when
servers tried to send different web pages to
different browsers and didn't support all of them.
So, browsers had to pretend to be other browsers.
I believe it will be interesting for you to analyze it.
Let's create a DataFrame that points to the access_log.
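A minimal sketch of how such a DataFrame might be created; the path and the JSON format here are assumptions, and your copy of the log may be stored differently:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed location and format: the access log as JSON lines
# with at least the fields `url` and `user_agent`.
log = spark.read.json("/data/access_log")
log.printSchema()
```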
Let's start with something simple.
For example, let's measure the length of the text in
the user_agent field. Spark functions are stored in the pyspark.sql.functions package.
So let's import it.
You can apply functions to the columns with the select method.
You specify the columns and the functions that interest you and execute the query.
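A sketch of such a query, assuming the DataFrame created above is named log:

```python
from pyspark.sql import functions as F

# length() is a mapping function: one value per row
log.select(F.length("user_agent")).show(5)
```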
The result turned out to be correct,
but something went wrong with the name of the column.
Set the new column name through the alias method and execute again.
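Roughly like this; the name ua_length is my own choice:

```python
log.select(
    F.length("user_agent").alias("ua_length")
).show(5)
```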
That's much better. Let's look at the URLs now.
Clearly, something is missing in them.
They are missing the domain name, since
a web server often doesn't know which domain it serves.
To be honest, it is an artificial log that doesn't belong to any real domain.
So, let's prepend vk.com to the URLs.
That is a very popular social network in Russia.
That can be done with the concat function.
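A naive attempt might look like the following; this is the version that fails:

```python
# Fails with an AnalysisException: Spark treats "vk.com"
# as a column name, not as a string to concatenate.
log.select(F.concat("vk.com", "url")).show(5)
```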
Heck! It failed. If you scroll the error message down to the end,
you can see the reason.
Spark is trying to find a column named vk.com in the table, and it fails.
But you know that vk.com is just a string, not a column name.
How do we make this clear to Spark?
The lit function will help us.
It explicitly tells the analyzer that this is a literal value.
Here is the syntax, shown in the sketch below. As a result,
you get links to users' pages, which you can try to look through.
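The output column name full_url is my own:

```python
log.select(
    F.concat(F.lit("vk.com"), "url").alias("full_url")
).show(5, truncate=False)
```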
Most VK pages are restricted by their privacy settings.
Can you find any page with open content?
Now, let's analyze the user_agent.
To begin with, just select the field to take a closer look.
Now, divide each line of the user_agent into words using the split function.
To make the column name less scary,
let's assign it the name list.
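Something like this:

```python
# split() turns the user_agent string into an array of words
log.select(
    F.split("user_agent", " ").alias("list")
).show(5, truncate=False)
```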
You can extract words from the resulting list
by referencing them by index,
or put each word on a separate row using the explode function.
That will help us find Android users in the list.
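A sketch of both options; the filter on "android" is illustrative:

```python
# Option 1: take a word by its index in the array
log.select(
    F.split("user_agent", " ").getItem(0).alias("first_word")
).show(5)

# Option 2: explode puts each word on its own row,
# which lets us look for Android users
words = log.select(F.explode(F.split("user_agent", " ")).alias("word"))
words.filter(F.lower(F.col("word")).contains("android")).show(5)
```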
Wow, it turns out that Android is a kind of Linux. Did you know that?
Also, explode is the first step of word count.
We will finish it when we get to aggregations.
As you know, from the user_agent you can determine the type of
operating system that was used to visit the site.
Let's write such a detector.
It can be implemented using the when function.
All user_agents that contain the android substring belong to Android OS.
Let's add this as our first condition.
The default value can be set using the otherwise method.
Let's check what has happened.
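A sketch of such a detector; the labels android and other are my own:

```python
os_type = F.when(
    F.col("user_agent").contains("Android"), "android"
).otherwise("other")

log.select("user_agent", os_type.alias("os")).show(5, truncate=False)
```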
It is possible to add additional conditions to the when function.
Let's find the users on Windows in the access_log.
They all contain Windows in the user_agent.
You can do this by chaining another when call.
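Roughly like this, with a second when chained in:

```python
os_type = F.when(
    F.col("user_agent").contains("Android"), "android"
).when(
    F.col("user_agent").contains("Windows"), "windows"
).otherwise("other")

log.select("user_agent", os_type.alias("os")) \
   .filter(F.col("os") == "windows") \
   .show(5, truncate=False)
```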
In your homework, you can explore this function and
learn how to detect other operating systems.
In this video, you learned the types of SQL functions and how to apply them.