And this list will transferred over the network from the second node to the first.
And How does this actually happen?
So, the first step is that on the source side, the data are written to disk,
and they are written to local disk, already separated.
They're already ordered by the partition in which they're gonna be.
On the receiving side, on the final operation.
And then, the data are requested over the network
from the output nodes.
And this data is transferred from the source to
the final destination.
So, it's very important to know when you are getting a shuffle.
So first of all, you have to know which operations.
We saw the list of y transformations.
All of them are going to trigger a shuffle.
And it's very important to understand if it's really a necessary of, or
if you can avoid it.
One interesting sample is sometimes in your data processing
you are using groupbykey and so you are getting for the same key,
you are moving all your data to the same partition, the same node.
But sometimes in the latest steps of your
workflow you are going to call a reduction so
at this point you are calling this reduction so for simplicity let's
think for example this is a sum and we're going to sum all of our values.
But there is a better way of doing this which is using reduce by key so
let's see actually how this works in an example with our usual two nodes.
So if you do it first the group by key and then add a use what you're gonna do is
you are going to transfer your data from the in
this case your going to transfer 2 and 8 from the second node to the first node and
then In the reduction phase, you are going to sum the three of them.
Now, with reduceByKey, you can improve
this by doing, first, the sum.
Before transferring over the network and so now you are now,
you are transferring instead of two numbers, you are transferring one only.
And so, the good of reduce by key is that you can perform
a first reduction phase before actually shipping your data through network.
And then of course, you need a second reduction phase.
And to get to the same final results.
So it's very important to understand when you can use reduceByKey instead
of the goByKey.