Remember that we were talking about three possible places to do feature engineering. We said you could do feature engineering within TensorFlow itself, using feature columns, or by wrapping the feature dictionary and adding arbitrary TensorFlow code. This is great because it's efficient: TensorFlow code running on a GPU or a TPU. But why do I say arbitrary TensorFlow code? Because this needs to be code that's executed as part of your model function, as part of your TensorFlow graph. So, you cannot do a query on your corporate database and stick a value in there. Well, you could write a custom TensorFlow op in C++ and call it, but let's ignore that for now. Also, you can only do things that rely on this input value and this input value alone. So, if you want to compute a rolling average, that's hard to do. Later, we look at sequence models where it appears that we are processing a time series, so multiple input values, but the input there is the entire sequence. So, the limitation of doing preprocessing in TensorFlow is that we can only preprocess a single input at a time. TensorFlow models, with sequence models being the exception, tend to be stateless.

In the past two chapters, we also looked at how to do preprocessing or feature creation in Apache Beam on Cloud Dataflow. Dataflow lets us execute arbitrary Python or Java code and allows us to handle multiple input values in a stateful way. For example, you can compute a time-window average, like the average number of bicycles at a traffic intersection over the past hour. However, you will then have to run your prediction code within a pipeline as well, so that you can get the average number of bicycles at the intersection over the past hour. So, this is good for examples like time-window averages, where you need a pipeline in any case. But what if all you want is a min or a max so that you can scale the values, or the vocabulary needed to convert categorical values into numbers? Running a Dataflow pipeline in prediction just to get the min and max seems a bit like overkill.

Enter tf.transform. This is a hybrid of the first two approaches. With TensorFlow Transform, you're limited to TensorFlow methods, but then you also get the efficiency of TensorFlow. You can also use aggregates computed over your entire training dataset, because tf.transform uses Dataflow during training, but only TensorFlow during prediction.

Let's look at how TensorFlow Transform works. TensorFlow Transform is a hybrid of Apache Beam and TensorFlow; it's in between the two. Dataflow preprocessing works only in the context of a pipeline. Think in terms of incoming streaming data, such as IoT (Internet of Things) data or flights data. The Dataflow pipeline might invoke predictions and save those predictions to Bigtable. These predictions are then served to anyone who visits the webpage in the next 60 seconds, at which point a new prediction is available in Bigtable. In other words, when you hear Dataflow, think backend preprocessing for machine learning models. You can use Dataflow for preprocessing that needs to maintain state, such as time windows. For on-the-fly preprocessing for machine learning models, think TensorFlow. You use TensorFlow for preprocessing that is based on the provided input only. So, if you put all the stuff in the dotted box into the TensorFlow graph, then it's quite easy for clients to just invoke a web application and get all the preprocessing handled for them. But what about the in-between things?
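To make the "single input only" idea concrete, here is a minimal sketch of in-graph preprocessing. The feature names (pickup_lat, dropoff_lat, and so on) are hypothetical, just for illustration; the point is that the helper wraps the feature dictionary and can only use TensorFlow ops on that one example.

import tensorflow as tf

def add_engineered(features):
    # Arbitrary TensorFlow code, but it becomes part of the graph:
    # it runs identically at training and prediction time, and it can
    # only look at this single example -- no rolling averages, no
    # lookups against a corporate database.
    lat_diff = features['pickup_lat'] - features['dropoff_lat']
    lon_diff = features['pickup_lon'] - features['dropoff_lon']
    features['euclidean'] = tf.sqrt(lat_diff**2 + lon_diff**2)
    return features

Because this runs inside the graph, clients just send the raw inputs and the computed feature is handled for them.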
For example, you want to scale your inputs based on the min or max value in the dataset. If you want to do this, you need to analyze your data in Dataflow, going over the entire dataset to find the min and max, and then do the transformation in TensorFlow so that you can scale each individual input value. So, that's what tf.transform is about. It's a hybrid of Apache Beam and TensorFlow.

To understand how this works, consider that, in general, preprocessing has two stages. Consider, for example, that you want to scale your raw input data so that gradient descent works better. In order to do that, you will first have to find the minimum and the maximum of the numeric feature over the entire training dataset, and then you will scale every input value by the min and max that were computed on the training dataset. Or consider that you want to find the vocabulary of keys for a categorical variable. Let's say you have a categorical feature that is the manufacturer of a vehicle. You will go through the entire training dataset to find all the possible values of that feature, essentially getting the list of all the manufacturers. Then, if you find 20 different manufacturers in your training dataset, you will one-hot encode the manufacturer column into a vector of length 20.

Do you see what's going on? The first step involves traversing the entire dataset once. We call this the analysis phase. The second step involves on-the-fly transformation of the input data. We call this the transform phase. Which technology, Beam or TensorFlow, is better suited to doing analysis of the training dataset? Which technology, Beam or TensorFlow, is better suited to doing on-the-fly transformation of the input data? Analysis in Beam, transform in TensorFlow.

There are two PTransforms in tf.transform: AnalyzeAndTransformDataset, which is executed in Beam to create the preprocessed training dataset, and TransformDataset, which is executed in Beam to create the evaluation dataset. Remember that computing the min and max, et cetera, the analysis, is done only on the training dataset. We cannot use the evaluation dataset for that. So, the evaluation dataset is scaled using the min and max found in the training data. But what if the max in the evaluation dataset is bigger? Well, this simulates the situation where you deploy your model and then find that a bigger value comes in at prediction time. It's no different. You cannot use the evaluation dataset to compute the min, max, vocabulary, et cetera. You have to deal with it. However, the transformation code that's invoked is executed in TensorFlow at prediction time.

Another way to think about it is that there are two phases. The analysis phase is executed in Beam while creating the training dataset. The transform phase is executed in TensorFlow during prediction, and it is also executed in Beam to create your training and evaluation datasets.
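Here is a minimal sketch of how the two phases look in code, assuming raw_train_data, raw_eval_data, and raw_metadata are PCollections and metadata that an existing Beam pipeline has already produced (those names are placeholders, not part of the course material):

import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam

def preprocessing_fn(inputs):
    # Analysis phase: tf.transform computes the min/max and the vocabulary
    # over the entire training dataset in Beam.
    # Transform phase: the same function is applied to each example,
    # in Beam for training/eval data and in TensorFlow at prediction time.
    return {
        'distance_scaled': tft.scale_to_0_1(inputs['distance']),
        'manufacturer_id': tft.compute_and_apply_vocabulary(inputs['manufacturer']),
    }

with tft_beam.Context(temp_dir='/tmp/tft'):
    # Training data: analyze and transform in Beam.
    (transformed_train, transform_fn) = (
        (raw_train_data, raw_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
    # Evaluation data: transform only, reusing the min/max and vocabulary
    # that were computed on the training dataset.
    transformed_eval = (
        ((raw_eval_data, raw_metadata), transform_fn)
        | tft_beam.TransformDataset())

The transform_fn that comes out of AnalyzeAndTransformDataset is what gets attached to the model graph, so the same transformation code runs in TensorFlow at prediction time.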