Hey there, in this video we'll focus on common issues associated with dirty data. These include spelling and other text errors; inconsistent labels, formats, and field lengths; missing data; and duplicates. This will help you recognize problems more quickly and give you the information you need to fix them when you encounter something similar during your own analysis. This is incredibly important in data analytics.

Okay, let's go back to our law office spreadsheet. As a quick refresher, we'll start by checking out the different types of dirty data that it shows. Sometimes, someone might key in a piece of data incorrectly. Other times, they might not keep data formats consistent. It's also common to leave a field blank. That's also called a null, which we learned about earlier. And if someone adds the same piece of data more than once, that creates a duplicate. So let's break that down. Then we'll learn about a few other types of dirty data and strategies for cleaning it.

Misspellings, spelling variations, mixed-up letters, inconsistent punctuation, and typos in general happen when someone types in a piece of data incorrectly. As a data analyst, you'll also deal with different currencies. For example, one dataset could be in US dollars and another in euros, and you don't want to get them mixed up. We want to find these types of errors and fix them, like this. You'll learn more about this soon.

Clean data depends largely on the data integrity rules that an organization follows, such as spelling and punctuation guidelines. For example, a beverage company might ask everyone working in the database to enter data about volume in fluid ounces instead of cups. It's great when an organization has rules like this in place; it really helps minimize the amount of data cleaning required. But it can't eliminate it completely. Like we discussed earlier, there's always the possibility of human error.

The next type of dirty data our spreadsheet shows is inconsistent formatting. Here, something that should be formatted as currency is shown as a percentage. Until this error is fixed, like this, the law office will have no idea how much money this customer paid for its services. We'll learn about different ways to solve this and many other problems soon.

We discussed nulls previously, but as a reminder, nulls are empty fields. This kind of dirty data requires a little more work than just fixing a spelling error or changing a format. In this example, the data analyst would need to research which customer had a consultation on July 4th, 2020. Then, when they find the correct information, they'd have to add it to the spreadsheet.

Another common type of dirty data is a duplicate. Maybe two different people added this appointment on August 13th, not realizing that someone else had already done it. Or maybe the person entering the data hit copy and paste by accident. Whatever the reason, it's the data analyst's job to identify this error and correct it by deleting one of the duplicates.

Okay, now let's continue on to some other types of dirty data. The first has to do with labeling. To understand labeling, imagine trying to get a computer to correctly identify panda bears among images of all different kinds of animals. You'd need to show the computer thousands of images of panda bears, all labeled as panda bears. Any incorrectly labeled picture, like the one here that's labeled just "bear," will cause a problem.
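To make these fixes concrete, here's a minimal sketch of the same cleaning steps written in Python with pandas. Everything in it is hypothetical: the DataFrame, the column names, and the values just stand in for the law office spreadsheet, since the actual data isn't shown here.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the law office data; column names and
# values are illustrative, not the actual spreadsheet from the video.
df = pd.DataFrame({
    "client": ["Smith", "Smyth ", "smith", "Jones", "Jones"],
    "service": ["Consultation", "consultation", "Consultation",
                "Trial prep", "Trial prep"],
    "amount_paid": ["$1,250.00", "1250", "45%", "$900.00", "$900.00"],
    "date": ["2020-07-04", "2020-07-04", "2020-08-13",
             "2020-08-13", "2020-08-13"],
})

# Text errors: trim stray whitespace and standardize capitalization so
# "consultation" and "Consultation" are treated as the same label.
df["client"] = df["client"].str.strip().str.title()
df["service"] = df["service"].str.strip().str.capitalize()

# Inconsistent formatting: strip currency symbols, and blank out values
# that were entered as percentages instead of dollar amounts so they
# can be researched and corrected.
df["amount_paid"] = df["amount_paid"].str.replace(r"[$,]", "", regex=True)
df.loc[df["amount_paid"].str.endswith("%"), "amount_paid"] = np.nan
df["amount_paid"] = pd.to_numeric(df["amount_paid"])

# Nulls: surface empty fields so the analyst knows which records
# still need research before the data can be used.
print(df[df["amount_paid"].isna()])

# Duplicates: remove exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates()
```

In a spreadsheet you'd make these same fixes with find-and-replace, the formatting menu, and a remove-duplicates feature; the code just writes out the same logic step by step.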
The next cause of dirty data is having an inconsistent field length. You learned earlier that a field is a single piece of information from a row or column of a spreadsheet. Field length is a tool for determining how many characters can be keyed into a field, and assigning a certain length to the fields in your spreadsheet is a great way to avoid errors. For instance, if you have a column for someone's birth year, you know the field length is four, because all years are four digits long.

Some spreadsheet applications have a simple way to specify field lengths and make sure users can only enter a certain number of characters into a field. This is part of data validation. Data validation is a tool for checking the accuracy and quality of data before adding or importing it. Data validation is a form of data cleaning, which you'll learn more about soon. But first, you'll get familiar with more techniques for cleaning data. This is a very important part of the data analyst's job, and I look forward to sharing these data cleaning strategies with you.
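Spreadsheet tools usually handle this through a built-in data validation menu, but the same field-length check is easy to express in code. Here's a minimal sketch, assuming the birth years arrive as text in a pandas Series; the column name and sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical column of birth years keyed in by hand.
years = pd.Series(["1984", "99", "2001", "19955"], name="birth_year")

# A birth year should be exactly four digits long; flag anything else
# before it gets added or imported into the main dataset.
valid = (years.str.len() == 4) & years.str.isdigit()
print(years[~valid])  # rows that fail the field-length check
```

Running the check before import, rather than after, is the point of data validation: bad values get caught at the door instead of becoming dirty data you have to clean later.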