This video will teach you partitioning in Spark and how to use it effectively to reduce shuffles.
In the previous video, you learned that wide dependencies create shuffles.
Shuffles are data transfers between different executors of a Spark cluster, and in general, shuffles are expensive.
If you open the Spark UI of your application and navigate to the page containing the statistics of the tasks that have been executed, you will notice a column called Shuffle Write Size. This is your instrumentation mechanism to spot a shuffle and to measure how much data it has transferred.
So, what does Spark do to perform a shuffle? To perform a shuffle, Spark has to answer two questions: which executors to send data to, and how to do it. Let's consider a simple example of a groupByKey transformation, assuming you have two executors.
The first executor has keys A and B. The second executor has keys A, B, and C. To perform the groupByKey, Spark has to move all of the values of the same key to exactly one executor.
So, how does Spark know which record to send to which executor? This is done using a partitioner. A partitioner defines how records will be distributed, and thus which records will be processed by each task.
From the programming perspective, a partitioner is just an interface with two methods: numPartitions and getPartition. numPartitions defines the number of partitions in an RDD after partitioning. getPartition defines the mapping from a key to the index of the partition it belongs to.
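To make this concrete, here is a minimal sketch of a custom partitioner. The Partitioner class and its two methods come from the Spark API; the routing rule itself is made up purely for illustration.

  import org.apache.spark.Partitioner

  // A toy partitioner that routes every record to one of two partitions
  // based on the first character of its key (an illustrative rule only).
  class FirstLetterPartitioner extends Partitioner {
    // How many partitions the resulting RDD will have.
    override def numPartitions: Int = 2

    // Maps a key to the zero-based index of its partition.
    override def getPartition(key: Any): Int =
      if (key.toString.startsWith("a")) 0 else 1
  }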
Returning to our example of the groupByKey: the definition of the groupByKey method in the Spark API reference is as follows.
Now we know that groupByKey uses a HashPartitioner. It is the default partitioner in Spark, which simply hashes the keys and sends keys with the same hash value to the same executor.
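As a rough sketch (assuming an existing SparkContext sc and made-up data), you can see this partitioner by calling groupByKey with an explicit HashPartitioner:

  import org.apache.spark.HashPartitioner

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

  // groupByKey with an explicit HashPartitioner; calling groupByKey()
  // without arguments falls back to the same hash-based default.
  val grouped = pairs.groupByKey(new HashPartitioner(4))

  grouped.partitioner  // Some(HashPartitioner with 4 partitions)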
Another example of a partitioner in Spark is the RangePartitioner. Range partitioning assigns records whose keys are in the same range to the same partition. Range partitioning is required for sorting: it ensures that by sorting the records within each partition, you get the entire RDD sorted.
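For instance, sortByKey relies on range partitioning under the hood. Here is a minimal sketch with made-up data, again assuming an existing SparkContext sc:

  import org.apache.spark.RangePartitioner

  val nums = sc.parallelize(Seq((30, "x"), (1, "y"), (15, "z")))

  // RangePartitioner samples the keys to choose partition boundaries,
  // so keys from the same range land in the same partition.
  val byRange = nums.partitionBy(new RangePartitioner(2, nums))

  // Sorting each range-partitioned partition locally
  // yields a globally sorted RDD.
  val sorted = nums.sortByKey()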
As you may have noticed, many transformations do not assign a partitioner to the RDD. In this situation, the default partitioner is created when Spark has to perform a shuffle. But you might also explicitly assign a partitioner to your RDD. Such a partitioner is called a known partitioner.
This definition becomes useful when you start to optimize shuffles.
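A minimal way to see the difference (reusing the SparkContext sc from the sketches above):

  import org.apache.spark.HashPartitioner

  val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
  rdd.partitioner            // None: no known partitioner yet

  // Explicitly assign a partitioner; this triggers one shuffle,
  // but the resulting RDD now carries a known partitioner.
  val partitioned = rdd.partitionBy(new HashPartitioner(8))
  partitioned.partitioner    // Some(HashPartitioner with 8 partitions)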
The first optimization is to co-partition your RDDs. RDDs are co-partitioned if they are partitioned by the same known partitioner, like in this example, where you have two different RDDs which are partitioned by a hash partitioner with the same number of partitions. By using co-partitioned RDDs, you can reduce the volume of the shuffle. This means that to make a partition in the final RDD, Spark has to merge only one partition of each parent at a time.
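Here is a sketch of co-partitioning two RDDs before a join; the RDD names and their contents are made up for illustration:

  import org.apache.spark.HashPartitioner

  val part = new HashPartitioner(8)

  // Both RDDs are partitioned by the same known partitioner,
  // so they are co-partitioned.
  val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(part)
  val orders = sc.parallelize(Seq((1, "book"), (2, "pen"))).partitionBy(part)

  // To build partition i of the result, Spark only has to merge
  // partition i of users with partition i of orders,
  // instead of re-shuffling both inputs.
  val joined = users.join(orders)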
The best possible option is to have co-located RDDs. Partitions of two RDDs are said to be co-located if they are both loaded into memory on the same executor, and this can be achieved if you co-partition your RDDs and then persist them in memory.
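Continuing the sketch above: persisting the co-partitioned RDDs and materializing them keeps their partitions in memory, so matching partitions can end up on the same executors and the join can avoid network transfer altogether (this is the intended effect, not a hard guarantee).

  // Co-partition, then cache and materialize both RDDs.
  users.persist()
  orders.persist()
  users.count()
  orders.count()

  // With co-located partitions, the join can be computed
  // without moving data between executors.
  val colocatedJoin = users.join(orders)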
Well, okay. But what should you do if you assigned a known partitioner to your RDD, but continue to apply new transformations? To optimize your job, you should definitely think about preserving the partitioner. Many transformations have a preservesPartitioning argument, which is set to false by default.
It becomes obvious for the map method, for example, because map might produce not only new values but also new keys. But if you are 100% sure that your keys are not modified, you can set preservesPartitioning to true, or use mapValues, which preserves the partitioner.
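A rough sketch of both options, reusing the partitioned RDD from the earlier sketch:

  // mapValues touches only the values,
  // so the known partitioner is kept automatically.
  val doubled = partitioned.mapValues(_ * 2)
  doubled.partitioner      // still Some(HashPartitioner)

  // mapPartitions drops the partitioner by default; if you are sure
  // the keys are unchanged, you can opt in to preserving it.
  val incremented = partitioned.mapPartitions(
    iter => iter.map { case (k, v) => (k, v + 1) },
    preservesPartitioning = true
  )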
Summing up: you have learned that a partitioner defines how records will be distributed between the executors, and that a partitioner is just an interface which has two methods, numPartitions and getPartition. The default implementation of the getPartition method is to use hashes computed on the keys of an RDD.
You also learned about several partitioner-based optimizations: co-partitioned RDDs, which have the same known partitioner and allow Spark to reduce the volume of the shuffle, and co-located RDDs, which avoid it entirely. If you assigned a known partitioner to the RDD, you should definitely think about preserving it if your keys are static and don't change.