In this lesson, we want to address the problems related to the complexity of data and the difficulties and risks that data analysis processes may entail. In particular, we will discover that data science is not necessarily a simple path to follow. We will address quality issues related to the context, the data itself, the analysis process, and the results.

The first problem we face is the complexity of reality. As a famous "data scientist", Shakespeare actually, said some time ago, "There are more things in heaven and earth" than we can imagine and try to describe with our theories. This implies that any interpretation of reality requires the collection of raw data, which must then be understood in order to be transformed into meaningful information. From there, we need to further extract recurring patterns to build valuable knowledge, which in turn will probably serve us to reach wisdom as a long-term goal of our life. To tackle this path, the human mind widely applies the modeling process, through which one describes reality in a simplified way. For example, if we want to represent the reality of a city and the behavior of the people living there, we can represent it with a model in which we capture the city, the places within the city, the people present in those places, and the activities they perform in terms of posted photos, relationships with other people, shared content, and so on.

Somehow, we can still borrow Shakespeare's vision: we can see a duality and circularity between the digital world of data and the reality of facts we want to discover and analyze in the physical world. Indeed, we have a digital sphere from which we can gather information, for instance the web, social media, and other sources, to better understand the reality of the physical space below and its complexity. It is important to be aware of this complexity to understand that adequate tools are needed to deal with it.
For example, already in 1979, Joël de Rosnay conceptualized the idea of a tool called the macroscope, a multi-lens tool that allows you to capture and interpret complex realities from many perspectives.

The second problem we face is so-called cognitive bias, that is, observer bias: the fact that any observer modifies and interprets reality in a subjective way, introducing alterations based on his or her own sensitivity. Behavioral science catalogs dozens of different possible biases in people. The important thing is to keep this in mind, as in our role as analysts and managers of data analysis initiatives, we will inevitably be subject to this problem.

The third problem relates to the quality of the data, in terms of accuracy, completeness, uniqueness, availability over time, and consistency of the data itself. This is a widespread problem, to the point that Gartner, for example, estimates that one-third of the world's largest companies may have problems with the quality of their data. The critical aspect is that when we perform data analysis, some results will still be obtained even from incorrect or bad data; the problem is that the corresponding results will also be incorrect. This creates a vicious cycle, which starts from bad data, produces wrong results, and ultimately leads to incorrect decisions and, therefore, wrong business strategies. To overcome this problem, we need to pay a lot of attention to improving the quality of the incoming data using so-called data wrangling techniques: all the processes that transform the raw input data into good-quality, complete data ready to be supplied as input to the next steps of the data analysis pipeline. The steps to be implemented in the data wrangling process include understanding, exploration, transformation, augmentation, and shaping or formatting of the data. This process requires many iterations and is rather tedious.
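As a minimal sketch of what such a wrangling pass can look like in practice, consider the toy venue records below. The field names and cleaning rules are illustrative assumptions, not a general recipe; real pipelines would typically use a library such as pandas.

```python
# Toy data-wrangling pass over raw venue records. Field names and the
# cleaning rules are illustrative assumptions, not a general recipe.

RAW_RECORDS = [
    {"name": " Cafe Roma ", "city": "milan", "checkins": "120"},
    {"name": "Duomo Bar", "city": "MILAN", "checkins": None},   # incomplete
    {"name": "cafe roma", "city": "Milan", "checkins": "120"},  # duplicate
]

def wrangle(records):
    """Turn raw input into clean, uniform, deduplicated records."""
    seen = set()
    clean = []
    for rec in records:
        # Transformation: normalize whitespace and casing.
        name = rec["name"].strip().title()
        city = rec["city"].strip().title()
        # Completeness: drop records missing mandatory values.
        if rec["checkins"] is None:
            continue
        # Uniqueness: deduplicate on a normalized key.
        key = (name.lower(), city.lower())
        if key in seen:
            continue
        seen.add(key)
        # Shaping: emit a uniform schema with typed fields.
        clean.append({"name": name, "city": city, "checkins": int(rec["checkins"])})
    return clean

cleaned = wrangle(RAW_RECORDS)
```

Even this tiny example already needs several of the steps listed above, which gives a feeling for why wrangling real data requires many iterations.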
Unfortunately, despite the prominent role assigned to data science today (it has even been selected by the Harvard Business Review as the sexiest job of the 21st century), a large portion of the data scientist's time often goes into data cleaning operations. As an example, I'd like to mention a data collection project implemented in an Italian city to study interesting venues. When collecting such data from platforms like Foursquare, it turns out that over 50 percent of the recorded locations are dirty, irrelevant, obsolete, or fake. If we implemented any analysis on the data without a preliminary cleaning process, we would obtain totally biased and irrelevant results.

Another problem I would like to address is so-called content or source bias. When we carry out an analysis, we must always keep in mind the question we want to answer, and verify whether the collected data can respond to that question or to another one. For example, when collecting information about the volumes of check-ins in the different places of a city, we must ask ourselves what question we are addressing. Are we describing the dynamics of the city, or the location of its inhabitants, or are we simply describing a very specific population, that is, the users of a particular digital platform, using specific devices, according to a well-defined pattern of behavior? Essentially, this is the problem of the representativeness of the sample with respect to the general population. In many cases, using different data or different sources can bring completely different interpretations of reality. For example, this heat map shows the density of geo-located photos published on Flickr in the city of Milan. The analysis, which collects more than 300,000 images, shows strong density in the most important and beautiful areas of the city. Conversely, if we use a platform like Instagram as a source, what we get is a completely different interpretation of the same city.
The density is now evenly distributed over the entire area. Indeed, this is how people use Instagram in their everyday life. In general, content bias can come from aspects related to the source, the technology or platform used for data collection, the customers' or users' demographics, or their different behaviors.

Another problem is that data can be organized according to different granularities in time and space. When you want to integrate different data sources, it is necessary to make the granularity uniform across sources, through mechanisms like scaling, aggregation, or grid definitions over the data. For example, on a geographical scale, it may be necessary to construct grids, either with a regular structure of fixed-size cells, or with irregular shapes based on business needs. In the example picture on the left, you can see data that combines the presence of people and the position of events; in this case, the two datasets are made uniform through a fixed-size grid on the map. In the example on the right, you can see a customized grid over a city, where the cells represent geographical areas based on the coverage of cellular telephone antennas.

A further problem is the general consistency of the data, a particularly significant issue when we wish to integrate data from different sources. In fact, this is one of the most interesting scenarios, as it is typically from the combination of data that the most significant value emerges. Unfortunately, the complexity of granting consistency across sources is often very high, even in apparently simple situations. For instance, take the problem of understanding whether two described places are actually the same location or two distinct ones. This can become very challenging: both in terms of the textual description and of the geographical position described by GPS coordinates, we would hardly ever have a perfect match.
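A pragmatic heuristic, sketched below with purely illustrative venue records and thresholds (the 3-decimal precision and the 0.8 similarity cut-off are assumptions, not validated rules), is to compare coordinates only up to a few decimal places and to accept partial similarity between names:

```python
from difflib import SequenceMatcher

# Heuristic sketch: decide whether two venue records describe the same
# place. Precision and threshold values are illustrative assumptions.

def same_place(a, b, decimals=3, text_threshold=0.8):
    """Match on rounded GPS coordinates plus partial name similarity."""
    close = (round(a["lat"], decimals) == round(b["lat"], decimals)
             and round(a["lon"], decimals) == round(b["lon"], decimals))
    similar = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()
    ).ratio() >= text_threshold
    return close and similar

# Hypothetical records of the same theater seen from two sources.
v1 = {"name": "Teatro alla Scala", "lat": 45.46736, "lon": 9.18961}
v2 = {"name": "Teatro alla Scala Milano", "lat": 45.46741, "lon": 9.18972}
# A different venue nearby.
v3 = {"name": "Duomo di Milano", "lat": 45.46420, "lon": 9.19180}
```

Here `same_place(v1, v2)` succeeds even though neither the names nor the coordinates match exactly, while `same_place(v1, v3)` fails on both criteria.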
It is therefore a question of finding strategies of partial similarity, based on empirical rules, that can find a good enough solution for making the decision. For example, one can consider only a limited number of decimal places of the GPS coordinates, or a partial similarity of the descriptive text.

The last problem we face goes well beyond the technical aspects; it can in fact be considered a problem also at the market and social level. It concerns the accessibility and availability of data. We can talk today about a so-called data divide, that is, the phenomenon for which the ability to access data becomes strategic for competitiveness. Basically, entities that can access data, or are able to collect enough data to understand reality in a better way, have a competitive advantage over those who fail to do so. Even companies that decide to make their data available to the public usually share only a certain amount of data, while they keep for themselves the part that is most relevant and valuable for the business. For example, even the data in Google Maps describing venues is only partially available for automated use: while the basic elements describing a venue are all accessible to third parties through API access, the most valuable part, that is, the number and dynamics of the presence of people in the venues, is not accessible via API.

To conclude, I would say that the message to collect from this journey through the risks and challenges related to data science is that, if we want to extract the highest value out of the goldmine represented by data in the modern world, we must bear in mind that this will involve a considerable preparation effort in terms of data cleaning and selection. We will also need to explore the best ways to analyze the data, remembering that we will have to face various problems and probably try many different ways before finding the best solution.
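Returning to the granularity problem discussed earlier, the fixed-size grid mechanism for aligning two geo datasets can be sketched as follows. The 0.01-degree cell size and all coordinates are illustrative assumptions; real projects would typically rely on a geospatial library.

```python
# Minimal sketch: aligning two geo datasets on a fixed-size grid.
# The 0.01-degree cell size and all coordinates are illustrative.

CELL = 0.01  # grid cell size in degrees

def cell_of(lat, lon, cell=CELL):
    """Map a GPS point to the index of its fixed-size grid cell."""
    return (int(lat // cell), int(lon // cell))

def counts_per_cell(points):
    """Aggregate (lat, lon) points into per-cell counts."""
    counts = {}
    for p in points:
        key = cell_of(*p)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Two datasets at different granularity: positions of individual people
# and positions of individual events, both snapped to the same grid.
people = [(45.4642, 9.1912), (45.4646, 9.1918), (45.4781, 9.2103)]
events = [(45.4644, 9.1915)]

people_density = counts_per_cell(people)
event_density = counts_per_cell(events)

# After gridding, the two sources share the same spatial granularity
# and can be joined cell by cell.
shared_cells = set(people_density) & set(event_density)
```

Once both sources are expressed in the same cells, density comparisons and joins become straightforward, which is exactly the uniformity that integration requires.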