Hello. We have recently processed a telecommunications dataset and joined information about the received messages with the Milano grid, but there are some more factors. For example, the number of calls sent and received, and the amount of traffic consumed by a mobile device, which is usually not a problem until your girlfriend finds a site with kittens. As you have already seen, you can pass environment variables to your streaming scripts, and based on these variables you can choose which column to process. However, tabular data is common in distributed file systems. That is why Hadoop developers have provided a special class that you can use in streaming MapReduce applications. This class is called FieldSelectionMapReduce.

FieldSelectionMapReduce has functionality similar to the CLI utility called cut. You can choose which columns from a record should be considered the key, and which columns should be considered the value. For instance, in this example, I choose the first column, which corresponds to a square ID, as the key. The fourth column, with index three because we enumerate from zero, and all columns starting with index five are considered the value. The colon sign is used to separate the key specification from the value specification. On this slide, you can see the output of this script. For records that don't have enough columns, you only see partial data. The good thing about it is that you don't have to cover these edge cases yourself. For all the other records, you see the square ID followed by the fourth column and everything from index five onwards; the fifth column, with index four, is skipped, which is exactly what the value specification says.

Let's combine several MapReduce applications into one. Real-world applications usually consist of several steps; in the community, this is called job chaining. In our example, I would like to chain the field selection application with the map-side join. By the way, having small pieces of functionality is good practice, compared to one monolithic application that can do everything for you. It is much more maintainable, as you can test these components independently. If you are writing MapReduce applications in Java or use Python packages such as Dumbo, Pydoop, Hadoopy, or mrjob, then these frameworks provide job chaining functionality for your convenience. To clarify what happens under the hood, I will show you how you can do it yourself.

Having several MapReduce jobs, you should wait until the first job finishes before executing the consecutive ones. In this script, you can validate the return code of your application in the following way: in case everything went well, the return code is zero by convention; otherwise, it is different. Of course, it is an oversimplification. When you use Java and call job.waitForCompletion, the MapReduce framework polls the status of this job by the application ID every five or so seconds. If you would like to mimic this behavior in a bash script, then first you should find out the application ID. You can do it with the yarn application -list command. As soon as you get the application ID, you can get the status of the job with the yarn application -status command. This slide reflects the status of the running job, and here you can see the status of the same job after it completes.

If you do not know the application ID, but you know the MapReduce application output folder, then you can check whether a special file exists in this folder. This file is called _SUCCESS. This empty file is used to mark the job as successfully finished. It is generated only after all the data from the MapReduce application is stored in HDFS. You can validate that an HDFS file exists with hdfs dfs -test -e path/to/file, where -e means "exists".
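To make this concrete, here is a minimal bash sketch of such a chain: the first job is the field selection step, and the second one stands in for the map-side join step mentioned above. The paths, the job names, the join_mapper.py script, and the location of the streaming jar are all illustrative, and the mapreduce.fieldsel.* property names are taken from the Hadoop 2 FieldSelectionHelper class, so check them against your Hadoop release rather than copying this verbatim.

```bash
#!/usr/bin/env bash
# Sketch only: paths, names and property values are illustrative.
STREAMING_JAR=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar  # location varies by distribution
OUT1=field_selection_output
OUT2=map_side_join_output

# Job 1: field selection. Key is column 0 (square id), value is column 3
# plus every column from index 5 onwards; the default field separator is a tab.
yarn jar "$STREAMING_JAR" \
    -D mapreduce.job.name="field_selection" \
    -D mapreduce.fieldsel.map.output.key.value.fields.spec=0:3,5- \
    -D mapreduce.job.reduces=0 \
    -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -input /data/telecom_activity \
    -output "$OUT1"

# Zero return code means success by convention; anything else means failure.
if [ $? -ne 0 ]; then
    echo "field selection job failed, aborting the chain" >&2
    exit 1
fi

# While the job is running you could also inspect it from another terminal:
#   yarn application -list            (find the application id)
#   yarn application -status <app_id> (check its state)

# Belt and braces: the _SUCCESS marker must exist in the output folder.
hdfs dfs -test -e "$OUT1/_SUCCESS" || { echo "no _SUCCESS marker in $OUT1" >&2; exit 1; }

# Job 2: the map-side join step, reading the output of job 1
# (join_mapper.py is a hypothetical script standing in for the real one).
yarn jar "$STREAMING_JAR" \
    -D mapreduce.job.name="map_side_join" \
    -D mapreduce.job.reduces=0 \
    -files join_mapper.py \
    -mapper "python join_mapper.py" \
    -input "$OUT1" \
    -output "$OUT2"
```

The second job starts only after both checks on the first one pass, which is exactly the chaining behavior described above.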
If you would like to prevent running several instances of the same application simultaneously, then you can build a synchronization mechanism via a PID file. PID is an acronym for process ID, and this shortcut is widely used in Unix-like operating systems. When you spawn a process, you store its process ID in a special file, so that every other application can take a look inside and see whether this process is still alive, a Schrödinger's cat of sorts. So you should put the following code at the top of your script to validate that you don't have any concurrent execution of the same script.

In companies with big clusters, you can have several client nodes. They are also called edge nodes and are used to execute MapReduce applications. It means that storing the PID locally on one machine doesn't prevent executions from other machines. If your script can be executed from several machines, then you should synchronize over a non-local storage. For example, you can store the PID file in HDFS with the hdfs dfs -put command. If this file already exists, you will get an error when you try to overwrite it (there is a short sketch of this approach at the end of this transcript). To overcome the problem of stale lock files, you need to find a way to identify the application ID. Occasionally, it is not easy; for example, see the following Stack Overflow discussion. The other way around, you can wrap a job with a distributed lock in a service such as Oozie, Luigi, Airflow, Azkaban, Voldemort... wrong text. Who put Harry Potter here? Nice one, those are fantasy fans. There are so many of these so-called workflow engines that you can easily find something that suits your requirements. Anyway, you get the idea of how it works.

Let us finally count who is more talkative, northern or southern people. What is your guess? It is not difficult to write your own Python script to sum values, but there is another Java package available for you in streaming scripts. This package is called aggregate. In the mapper output, you need to prefix each key with the type of the values, such as Long, Double, or String, and also with an action, for example sum, min, max, or uniq. Correct and complete examples are DoubleValueSum and StringValueMin (a short sketch of such a job appears at the end of this transcript). Let me add this MapReduce job as the third job to our script, and reveal the secret of talkative people. During the examined period of time, northern people were more talkative than people in the south. The climate is colder there, so they have to talk more to warm themselves.

In this video you have learned that northern people are more talkative than southern ones. To be serious: first, you know how to use the FieldSelectionMapReduce class to process tabular data. Second, you have also learned what job chaining is and how to implement it in streaming scripts. More generally, you know how to break down a solution into multiple MapReduce steps. Third, you can identify whether a MapReduce job has finished successfully by looking at the HDFS output folder structure. And finally, you know how to use the aggregate package in streaming scripts to find the minimum or maximum value, the number of unique values, and so on.
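As promised, here is a small sketch of the HDFS-based lock described earlier. The lock path is hypothetical, and a real script would also need to handle cleanup when it crashes (the stale lock problem mentioned above), so treat it as an illustration of the idea rather than a finished solution.

```bash
#!/usr/bin/env bash
# Sketch only: the lock path is hypothetical.
LOCK=/app/locks/telecom_pipeline.pid

# Write our PID to a local temp file and try to publish it to HDFS.
# hdfs dfs -put refuses to overwrite an existing file, so a non-zero
# return code means another instance most likely holds the lock already.
echo $$ > /tmp/telecom_pipeline.pid.$$
if ! hdfs dfs -put /tmp/telecom_pipeline.pid.$$ "$LOCK" 2>/dev/null; then
    echo "lock $LOCK already exists, another instance may be running" >&2
    rm -f /tmp/telecom_pipeline.pid.$$
    exit 1
fi

# ... run the chained MapReduce jobs here ...

# Release the lock when the work is done.
hdfs dfs -rm "$LOCK"
rm -f /tmp/telecom_pipeline.pid.$$
```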
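And to close, here is a sketch of what the aggregate job from this video might look like in a streaming script. The mapper_sum.py script and the paths are hypothetical; the point is the LongValueSum key prefix emitted by the mapper and the built-in aggregate reducer that performs the summation.

```bash
#!/usr/bin/env bash
# Sketch only: paths and the mapper script are illustrative.
STREAMING_JAR=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar

# mapper_sum.py (hypothetical) must print lines of the form
#   LongValueSum:<square_id><TAB><counter>
# so that the built-in "aggregate" reducer sums the counters per square id.
yarn jar "$STREAMING_JAR" \
    -D mapreduce.job.name="talkative_squares" \
    -files mapper_sum.py \
    -mapper "python mapper_sum.py" \
    -reducer aggregate \
    -input map_side_join_output \
    -output aggregate_output
```

The same pattern covers the other aggregators, for example DoubleValueSum or LongValueMax, which is how you get the minimum, maximum, and unique-value counts mentioned in the summary.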