Let us now extend what we have learned about the Message Passing Interface by understanding how we can add threading to it. Recall that in our distributed system we have compute nodes, each with a network interface and a local memory, connected by an interconnect network. And we have learned that Message Passing Interface calls like MPI_Send and MPI_Recv go through the network interface, into the interconnect, so as to achieve the communication that you want.

Now, if you look at real hardware in data centers today, it turns out that the processing part of a node actually has multiple processors, also referred to as processor cores. So when we run a single-threaded MPI process, we are only using one of these processors, and we should, of course, be using all of them to their fullest. One way to do this is to create separate MPI ranks, one per processor, but another way is to add threading. What we want to do is create threads that run on the processors, call them T0, T1, T2, so that we get the benefits of both message passing and multithreading and use the full power of the computer. It is also worth noting that these processors communicate through the same local memory, so these threads can exchange data structures in local memory without any calls to MPI. It is only when you need to exchange data with other nodes that you need to use MPI.

So in this approach to combining MPI with threading, we will have one rank per node, and we can refer to T0 as the master thread. This is the one that starts up MPI just like you have seen before: it calls MPI_Init at the start and MPI_Finalize at the end. But it can also create a number of worker threads, and these threads can execute in parallel using techniques that you would learn in a course on parallelism. The idea is that when the master thread performs operations like MPI_Send, MPI_Recv, or MPI_Reduce, it may be blocked waiting on communication from other nodes, but the other threads can still continue useful work on the node. You basically have one rank per node, so this node could be running MPI rank 0 and another node could be running rank 1, but you have parallelism within each rank.

Now, it is interesting that MPI offers a few different threading modes. One is referred to as funneled, and the idea here is that all MPI calls are performed by one thread. So from MPI's perspective, all the communication occurs on a single thread, like what we called the master thread, and MPI is completely unaware of all the other worker threads. Another mode is called serialized, where you can have at most one MPI call in progress at a time. Here is where you can leverage concepts from a concurrency class: while there are multiple threads, only one thread may be performing an MPI call at any one time. This helps the implementation, because in this mode MPI knows there will not be any contention on MPI resources; it could be that T0 makes an MPI call in one phase, T1 makes an MPI call in another phase, and so on. Finally, there is the most general mode, which is called multiple, where you can have multiple MPI calls in flight at the same time. This gives more flexibility, but it puts more burden on the MPI implementation to take care of any contention.
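To make this concrete, here is a minimal sketch in C of the hybrid pattern, assuming POSIX threads for the workers. The worker function, the NUM_WORKERS count, and the final reduction are hypothetical illustrations, not code from the lecture; what is standard MPI is that the threading mode is requested at startup with MPI_Init_thread (instead of MPI_Init), using the constants MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, or MPI_THREAD_MULTIPLE.

    /* Minimal hybrid MPI + threads sketch: one rank per node, master
     * thread T0 makes all MPI calls (funneled mode), workers compute
     * through shared local memory. Build with: mpicc ... -lpthread
     * NUM_WORKERS and worker() are hypothetical placeholders. */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_WORKERS 3   /* e.g. workers T1, T2, T3 alongside master T0 */

    /* Hypothetical worker: computes on node-local data and, because we
     * requested MPI_THREAD_FUNNELED, never makes an MPI call itself. */
    static void *worker(void *arg) {
        long id = (long)arg;
        printf("worker thread %ld doing local work\n", id);
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided, rank;

        /* Request funneled mode: only the master thread will call MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED) {
            fprintf(stderr, "MPI library lacks the requested thread support\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Master thread T0 creates the worker threads. */
        pthread_t tids[NUM_WORKERS];
        for (long i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tids[i], NULL, worker, (void *)(i + 1));

        /* T0 may now block in communication, e.g. a reduction, while the
         * workers keep the other cores busy. */
        int local = rank, global = 0;
        MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_join(tids[i], NULL);

        MPI_Finalize();
        return 0;
    }

Switching to serialized or multiple mode is then just a matter of requesting MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE in the MPI_Init_thread call; the provided output argument tells you which level the library actually supports.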
There are some caveats here, though. For example, even in the multiple mode, you cannot have two threads waiting on the same event. So if T1 performs an MPI_Wait on a request R1, T2 is not allowed to also wait on that same MPI request; it has to wait on some other request instead. A short sketch of this rule appears at the end of this section.

So now you have it. If you have learned the basic concepts of parallelism and concurrency, and you know how to do message passing in its most fundamental form, you can combine the two: MPI and threading. You can pick the mode you want depending on how you want to leverage the parallelism, and with that, you can exploit the computers available to you in any data center to the fullest.
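As a concrete illustration of that waiting caveat, here is a minimal sketch, assuming a library that provides MPI_THREAD_MULTIPLE; the tags and the self-sends are hypothetical, included only so the program completes. The key point is that each thread posts and waits on its own MPI_Request rather than sharing one.

    /* Each receiver thread owns its request: T1 waits on its R1, T2 on
     * its R2. Having both threads call MPI_Wait on one shared request
     * object would violate the rule described above. */
    #include <mpi.h>
    #include <pthread.h>

    static void *receiver(void *arg) {
        int tag = *(int *)arg;
        int buf;
        MPI_Request req;   /* per-thread request, never shared */
        MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* legal: a unique waiter */
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (provided == MPI_THREAD_MULTIPLE) {
            int tag1 = 1, tag2 = 2;    /* hypothetical message tags */
            pthread_t t1, t2;
            pthread_create(&t1, NULL, receiver, &tag1);
            pthread_create(&t2, NULL, receiver, &tag2);

            /* The master thread sends to its own rank so both receives
             * complete; concurrent MPI calls from several threads are
             * exactly what multiple mode permits. */
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, rank, tag1, MPI_COMM_WORLD);
            MPI_Send(&msg, 1, MPI_INT, rank, tag2, MPI_COMM_WORLD);

            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
        }
        MPI_Finalize();
        return 0;
    }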