Welcome back. We've got a great week planned. We're going to start with a discussion of distributed training, including a couple of different kinds of parallelism. Then we'll turn to high-performance modeling, including high-performance ingestion. Then we're going to turn to one of my favorite topics, knowledge distillation. So I'm really looking forward to this, and I hope you enjoy it. Let's get started.

We'll start with a discussion of distributed training. When you start prototyping, training your model might be a fast and simple task. However, fully training a model can become a very time-consuming task. Datasets in many domains are getting larger and larger, and as the size of the datasets increases, models take longer and longer to train. The same logic applies to larger architectures, and it's not just the time per training step: the number of epochs a model needs also tends to increase. Solving this kind of problem usually requires distributed training. Distributed training allows us to train huge models and, at the same time, speed up training. In this lesson, you'll see how models can be accelerated using data and model parallelism. You'll also look at high-performance modeling techniques like distribution strategies and high-performance ingestion pipelines such as tf.data.

Now, let's look into basic ways to perform distributed training. These include data parallelism and model parallelism. Data parallelism is probably the easier of the two to implement. In this approach, you divide the data into partitions. You copy the complete model to all of the workers, each one operates on a different partition of the data, and the model updates are synchronized across workers. This type of parallelism is model agnostic and can be applied to any neural network architecture. Usually, the scale of data parallelism corresponds to the batch size.

In model parallelism, however, you segment the model into different parts, training concurrently on different workers, with each worker training its part of the model on the same data. Here workers only need to synchronize the shared parameters, usually once for each forward or backward propagation step. You generally use model parallelism when you have a large model which won't fit into memory on your accelerators. Implementing model parallelism is relatively advanced compared to data parallelism, and it's a more complex topic. So for now, let's look at how data parallelism works with an illustration and then revisit model parallelism.

In data parallelism, the data is split into partitions, where the number of partitions is usually the total number of available workers in the compute cluster. To train, you copy the model onto each of these worker nodes, with each worker training on its own subset of the data. This requires each worker to have enough memory to load the entire model, which for larger models can be a problem. Each worker independently computes the errors between its predictions for its training samples and the labeled data. Then each worker performs backprop to update its model based on the errors, and communicates all of its changes to the other workers so that they can update their models. This means that the workers need to synchronize their gradients at the end of each batch to ensure that they are training a consistent model; the sketch after this paragraph shows that idea in miniature. There are two basic styles of distributed training using data parallelism.
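Here is that minimal sketch, simulated in a single process with NumPy. The workers, the linear model, and the dataset are all hypothetical placeholders chosen only to illustrate the mechanics: each worker computes gradients on its own shard of the batch, and the gradients are averaged (an all-reduce) before every copy of the model applies the same update. This is exactly the synchronous style described next.

```python
# A minimal sketch of synchronous data parallelism, simulated in one process.
# The workers, model, and data are hypothetical; real frameworks run this
# across devices or machines and do the all-reduce over a network.
import numpy as np

num_workers = 4
rng = np.random.default_rng(0)

# Toy dataset: y = 3x + noise, with a single shared weight.
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=256)
weights = np.zeros(1)  # every worker holds an identical copy of this

def worker_gradient(w, x_shard, y_shard):
    """Mean-squared-error gradient for one worker's shard of the batch."""
    pred = x_shard @ w
    return 2.0 * x_shard.T @ (pred - y_shard) / len(y_shard)

learning_rate = 0.1
for step in range(100):
    batch_idx = rng.choice(len(X), size=64, replace=False)
    # Split the global batch into one shard per worker (data parallelism).
    shards = np.array_split(batch_idx, num_workers)
    grads = [worker_gradient(weights, X[s], y[s]) for s in shards]
    # "All-reduce": average the per-worker gradients so every replica
    # applies exactly the same update and the copies stay consistent.
    avg_grad = np.mean(grads, axis=0)
    weights -= learning_rate * avg_grad

print("learned weight:", weights)  # should approach 3.0
```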
In synchronous training, each worker trains on its current mini-batch of data, applies its own updates, communicates its updates out to the other workers, and waits to receive and apply all of the updates from the other workers before proceeding to the next mini-batch. The all-reduce algorithm is an example of this. In asynchronous training, all workers are independently training over their mini-batches of data and updating variables asynchronously. Asynchronous training tends to be more efficient, but it can be more difficult to implement. The parameter server algorithm is an example of this. One major disadvantage of asynchronous training is reduced accuracy and slower convergence, which means that more steps are required to converge. Slow convergence may not be a problem, since the speed-up from asynchronous training may be enough to compensate. However, the accuracy loss may be an issue, depending on how much accuracy is lost and the requirements of the application.

To use distributed training, it's important that models become distribute-aware. Fortunately, high-level APIs like Keras and Estimators support distributed training, and you can even create custom training loops to provide more precise control. You'll see that to make these ordinary models capable of performing training or inference in a distributed manner, you need to make them distribute-aware with some small code changes. By doing this, you get the power of training your model on a large number of accelerators, like GPUs or TPUs. In particular, to perform distributed training in TensorFlow, you can make use of the tf.distribute.Strategy class. This class supports several distribution strategies for high-level APIs and also supports training using a custom training loop. The tf.distribute.Strategy class supports the execution of TensorFlow code not only in eager mode but also in graph mode using tf.function. In addition to training models, it's also possible to use this API to perform model evaluation and prediction in a distributed manner on different platforms. It requires a minimal amount of extra code to adapt your models for distributed training, and you can easily switch between different strategies to experiment and find the ones that best fit your needs.

There are many different strategies for performing distributed training with TensorFlow. The most commonly used are the one-device strategy, mirrored strategy, parameter server strategy, multi-worker mirrored strategy, central storage strategy, and TPU strategy. Let's focus on the first three options. For more information on tf.distribute.Strategy, see the links in the resource section.

The one-device strategy will place any variables created in its scope on the specified device. Input distributed through this strategy will be prefetched to the specified device, and any functions called via strategy.run will also be placed on the specified device. Typical usage of this strategy is testing your code with the tf.distribute.Strategy API before switching to other strategies which actually distribute to multiple devices and machines. So this is typically something you'd use in development.

The mirrored strategy supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device, and each variable in the model is mirrored across all the replicas. Together these variables form a single conceptual variable called a MirroredVariable. These variables are kept in sync with each other by applying identical updates. Efficient all-reduce algorithms are used to communicate the variable updates across the devices. All-reduce aggregates tensors across all the devices by adding them up and makes them available on each device. It's a fused algorithm that is very efficient and can reduce the overhead of synchronization significantly.
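As a concrete example, here's a minimal sketch of the mirrored strategy with a Keras model. The model, the random data, and the batch size are hypothetical placeholders; the important part is that the model and its optimizer are created inside strategy.scope(), so their variables become mirrored variables that are kept in sync with all-reduce.

```python
# A minimal sketch of synchronous data parallelism on one machine using
# tf.distribute.MirroredStrategy. The model and data are placeholders.
import tensorflow as tf

# Uses all visible GPUs by default; falls back to a single replica on CPU.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The global batch is split across replicas, so it's common to scale the
# batch size by the number of replicas.
global_batch_size = 64 * strategy.num_replicas_in_sync

with strategy.scope():
    # Variables created in this scope are mirrored across the replicas.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Placeholder random data just to make the sketch runnable end to end.
x_train = tf.random.normal((1024, 784))
y_train = tf.random.uniform((1024,), maxval=10, dtype=tf.int64)

# Keras distributes each batch across the replicas and all-reduces the updates.
model.fit(x_train, y_train, batch_size=global_batch_size, epochs=2)
```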
The parameter server strategy is a common asynchronous data-parallel method to scale up model training on multiple machines. A parameter server training cluster consists of workers and parameter servers. Variables are created on the parameter servers, and they are read and updated by workers in each step. By default, workers read and update these variables independently without synchronizing with each other, which is why parameter server style training is also referred to as asynchronous training.

Typically in synchronous training, the entire cluster of workers would fail if one or more of the workers were to fail, so it's important to consider some form of fault tolerance in cases where workers die or become unstable. This allows you to recover from a failure caused by preempted workers, and it can be done by preserving the training state in a distributed file system. Since all the workers are kept in sync in terms of training epochs and steps, the other workers would need to wait for the failed or preempted worker to restart in order to continue. In the multi-worker mirrored strategy, for example, if a worker gets interrupted, the whole cluster pauses until the interrupted worker is restarted. The other workers will also restart, and the interrupted worker rejoins the cluster. Then there needs to be a way for every worker to pick up its former state, thereby allowing the cluster to get back in sync so that training can proceed smoothly. For example, Keras provides this functionality in the BackupAndRestore callback.
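To tie this together, here's a minimal sketch of that fault-tolerance mechanism with the BackupAndRestore callback under the multi-worker mirrored strategy. The cluster configuration, model, data, and backup path are all hypothetical placeholders; in a real cluster, each worker would run this same program with its own entry in the TF_CONFIG environment variable.

```python
# A minimal sketch of fault tolerance with tf.keras.callbacks.BackupAndRestore
# under MultiWorkerMirroredStrategy. Model, data, and backup path are placeholders.
import tensorflow as tf

# With no TF_CONFIG set, this behaves like a single-worker cluster, which is
# enough to exercise the callback locally.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder random data so the sketch runs end to end.
x = tf.random.normal((512, 10))
y = tf.random.normal((512, 1))

# The callback backs up the training state (weights plus epoch/step counters)
# to the given directory and restores it automatically when an interrupted
# worker restarts and rejoins the cluster.
callbacks = [
    tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup"),  # hypothetical path
]

model.fit(x, y, epochs=5, batch_size=64, callbacks=callbacks)
```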