This session is about data and the normal distribution. We're going to get introduced to some concepts of the normal distribution and see how we can apply it in different phases of the Six Sigma project. But before we get there, let's see what the measure phase of the Six Sigma project is all about. What happens in the measure phase? The first thing is that you identify variables. You identify the critical to quality characteristics, and you think about how you're going to measure these. Then you assess the measurement systems. The idea there is to make sure that your measurement systems are valid and reliable. They're valid in the sense that they're measuring what they're supposed to be measuring, and they're reliable in the sense that when you use them over and over again they give you consistent results. They should also be sensitive to changes, that's what a measurement should be, and accessible, in the sense that they can be understood by the people who are going to be seeing those measurements on a day-to-day basis, so that they know what's going on in the process. So we go from critical to quality characteristics to measurement systems in the measure phase. In the measure phase, we also go on to establish the current performance on the critical to quality characteristics. Once we've figured out what those critical to quality characteristics are and what the measurements are, we need to establish current performance. Now, to establish current performance we use something called statistical process control. These are control charts that you can have for different types of data, for discrete data, for continuous data. There are many different types of control charts that you can use to establish the inherent capability of a process. Next, within the measure phase of the Six Sigma project, you can also establish the targets for improvement and what those targets should be.
There you would be looking at things like the sigma level of the process, but before you establish the sigma level, you do a process capability analysis. A process capability analysis is to see how well the process is performing in relation to customer expectations, in relation to the voice of the customer: comparing the voice of the customer with the voice of the process, the VOC with the VOP in that sense. Those are the things that happen in the measure phase. Now, let's take a look at different types of data that can be used in the measure phase and then we'll get to distributions of data next. What are the different types of data that we can use in the Six Sigma project and that we need to start thinking about in the measure phase? First is simply verbal data. This could be open-ended comments from people. If you're doing a customer survey, they're telling you something about the product or the service. If you're doing an employee survey, they're telling you something about the experience that they have with their supervisor or working in that company. Here the example that you see is a statement that says, my supervisor respects my opinions. These are open-ended comments that you would have coming out of any kind of interview or survey that you do of the audience that you're interested in getting data from. Next, we get into numeric data. First we have discrete variables, and the way you can think about discrete variables is that they are variables where decimal points do not make sense. Not that they don't matter; they don't make sense. When we think about anything that has two values, say it's available or not available, we think of it as a 0, 1 situation. Something is on time or not on time, it's a 0, 1 situation. There's no 0.5, there's no 0.75. That's the first type of discrete variable, and the data that we're talking about there is attribute data of a binary characteristic.
It is binary in the sense that there are only two possible values for it. If you think about what the underlying distribution is for that kind of data, you may be familiar with this already: for binary data it's the binomial distribution. Two options, yes and no, or good and not good, those are the kinds of data we are talking about there. Next within the categorical data, within attribute data, we have the nominal one. Here we don't really have numbers for the different categories, but we are considering them as different categories. For example, here we have, how do employees commute to work? They either walk, they come by bike, they take the train, or they drive their own car. Those are four different ways of commuting to work for the employees. Now you can give these numbers, you can code them as 1, 2, 3, and 4, but they don't really have any natural ordering. We can't say that one is higher than the other. So you can code these in some way, but the codes are not going to mean anything in terms of a natural ordering. The next category of types of data is ordinal data. Ordinal data is going to have meaning in terms of something being higher than the other. Think of any customer satisfaction survey that you may be familiar with. Those are the things that we get in the mail, or when you go to a restaurant they put one on the table saying, could you fill this out for us, and you may also be getting these as employee satisfaction surveys. Now these surveys have scales that go from extremely dissatisfied to extremely satisfied, or from extremely happy to extremely unhappy, whichever way it's ordered. The point there is that there's going to be some meaning to that ordering: either one is going to mean very good and five is going to mean very bad, or one is going to mean very bad and five is going to mean very good. There's going to be some natural ordering to these categories.
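As a minimal sketch of the nominal versus ordinal distinction just described, the Python below codes the commuting example as nominal data and a 1-to-5 satisfaction scale as ordinal data. The survey responses, the variable names, and the direction of the satisfaction scale are all made up for illustration; they are not from the lecture.

```python
from collections import Counter
from statistics import median

# Nominal: the codes are arbitrary labels with no natural ordering (hypothetical coding).
commute_codes = {1: "walk", 2: "bike", 3: "train", 4: "car"}
commute_responses = [4, 4, 3, 1, 2, 4, 3]  # made-up survey answers

# For nominal data, counting how often each category occurs is meaningful;
# averaging the codes would not be, because any other numbering would do as well.
commute_counts = Counter(commute_codes[code] for code in commute_responses)
most_common_commute = commute_counts.most_common(1)[0][0]

# Ordinal: assumed direction here is 1 = extremely dissatisfied, 5 = extremely satisfied.
satisfaction_responses = [5, 4, 4, 2, 3, 5, 4]  # made-up survey answers

# Because the order carries meaning, a median is a sensible summary.
median_satisfaction = median(satisfaction_responses)

print(most_common_commute)
print(median_satisfaction)
```

Renumbering the commute codes would leave the category counts unchanged, which is one way to see that the numbers themselves carry no ordering.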
But remember, we're still talking about discrete categories. If you think about these three types of data, the binary data, the nominal without natural ordering, and the ordinal with natural ordering, the concept here is that you are taking data that is subjective and you're converting it to objective. You're taking information and you're converting it into objective data using either a binary scale, a nominal scale, or an ordinal scale, so you can express it in terms of numbers. Within discrete variables, we also have something called count data. What is count data? It's as the name suggests. It's counting, for example, the number of defects in a product. If I'm looking at this clicker that I'm holding and I'm saying, how many defects are there in this clicker, I can count the number of defects. If I'm looking at defects in an application form that I get, I'm counting the number of defects and again, it's going to be discrete. I cannot find 2.5 defects. It's going to be either two defects or three defects. That's why it's still a discrete distribution, but what I'm looking at here is a different type of data within discrete data, and it is count data. Now, what are the implications of these different types of data? The underlying frequencies of the data will be different. The underlying statistical distributions that you can use for these types of data are going to be different. That is going to have implications in terms of how you're going to do the analysis. The other implication of these types of data is that some will give you more information than others. Some will be, in that sense, more valuable in terms of data collected than others. Some will also be harder to collect than others. There might be some trade-offs that you're thinking about as to which type of data you should collect. Well, you might be trading off with, here this one is simple to collect.
We're simply asking a yes, no question if you're talking about the binary type of data, but we're not getting much more information than simply whether somebody is happy or unhappy about something. We can move to some more in-depth information if we can move to more of an ordinal scale, which has a survey, a battery of questions, many questions that are scaled on 1-5 or 1-7. Typically we have odd numbers in those scales. There you are capturing a little more information. It's going to take more effort. It's going to cost you more, but you're going to get more information. You can do something with that information. When you're thinking about types of data, you should be thinking about what the costs and benefits of different types of data are. Now, let's take a look at the other kind of data. The opposite of discrete is continuous data. Continuous data is any measurement data. There, we're basically saying that it can theoretically take an infinite number of values. We can say, for example, that if you're talking about temperature, depending on the level of granularity that you want to go into, you can go up to many decimal places when you're talking about it in terms of Fahrenheit or Celsius. When you're talking about the weight of something, depending on the level of granularity that you want to go into, you can be talking about 2.5 pounds, 2.68 pounds, 2.697 pounds. Then you can be thinking about it in terms of ounces if you want to get more specific. That's the idea of continuous data, of measurement data. That's the data that we normally think about when we're thinking about numerical data. It's very useful in that it's a very specific measurement of something. But nevertheless it's a measurement of one characteristic. If I know that a critical to quality characteristic of a service in a restaurant is time, I can be measuring time, but it's only going to give me information about time.
If I know that a critical to quality characteristic in a restaurant is temperature of food, then I can be thinking about measuring temperature of food. It is going to be very specific, but it's going to be only about the temperature of the food. Measurement data gives you much more information, but it's about a specific aspect of a product. Now, within measurement data, you can collect data that is cross-sectional or that is more of a time series. Simply, what we mean here is that we could be looking at things as they are at a point in time or we can be looking at them over time. Is there a trend when we look at time series data? When we look at time series data, there are some implications in terms of what analysis we can do. There will be specific things that we have to account for when we're working with time series data. When we're taking it from the same process over a period of time and we're trying to measure something, or if we're looking at sales over time, over different months or over different weeks, there will be some ways, in fact, of adjusting for the autocorrelation, the obvious relationship that is going to be there when you have many weeks of sales data or many weeks of any process data. There's going to be some relationship between the previous week and the next week. You need to account for that. That's why you need to think about time series data a little bit differently than when you're looking at cross-sectional data. Now, let's take this categorization and apply it to some different types of data that we have over here. Here you have different measurements, different things that are being measured. What I'd like you to do is apply the categorization that we just saw: is it discrete or is it continuous, and, within discrete, which of the different things that we saw, the ordinal, the nominal, the binary, and the count data, applies?
You have paint viscosity, service at a drive-through, on-time arrival or not, number of customer calls abandoned, humidity in a paint shop, and source country for outsourced parts. Apply those categorizations and we'll come back and see if you were able to apply them correctly. We're back to the data types that we saw before the question. Paint viscosity is something that would be a continuous measurement. It would be measurement data. It's something that you might measure in units that can have decimal points. It's continuous measurement data. Service at a drive-through, going from very unsatisfactory to very satisfactory, is categorical data, but it's ordinal. There is meaning to one being better than five. There's an implied hierarchy in those numbers. On-time arrival or not, something was on time or not, is obviously binary. There are only two options there. Number of customer calls abandoned should give you a hint just from the fact that it's a number of calls: it's count data. You're counting the number of calls that were abandoned. Humidity in a paint shop. Again, it's going to be like the viscosity that you saw earlier. It's going to be measurement data. Source country for outsourced parts is going to be categorical, except it's going to be nominal. You're going to put these in different countries and you're going to say that if it's a one, it indicates that it's from the US, if it's a two, it indicates it's from Canada, if it's a three, it indicates it's from Mexico, if it's a four, it indicates that it's from China. There's going to be no implied hierarchy in terms of the numbers that you're using. In fact, you could use any numbers for any of those countries. That's what we mean by it being categorical but nominal data. Here you've seen the application of the different data types.
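The exercise answers above can be written down as a small lookup, which is sometimes a handy way to document the data type of each variable before choosing an analysis. This is only a sketch: the `DataType` enum, the `exercise_answers` dictionary, and the `is_discrete` helper are hypothetical names introduced here for illustration, not something from the lecture.

```python
from enum import Enum


class DataType(Enum):
    """Data types from the lecture (labels are illustrative)."""
    CONTINUOUS = "continuous (measurement)"
    BINARY = "discrete, binary"
    NOMINAL = "discrete, nominal"
    ORDINAL = "discrete, ordinal"
    COUNT = "discrete, count"


# Classification of the six exercise items, as given in the answers above.
exercise_answers = {
    "paint viscosity": DataType.CONTINUOUS,
    "service at a drive-through": DataType.ORDINAL,
    "on-time arrival or not": DataType.BINARY,
    "number of customer calls abandoned": DataType.COUNT,
    "humidity in a paint shop": DataType.CONTINUOUS,
    "source country for outsourced parts": DataType.NOMINAL,
}


def is_discrete(data_type: DataType) -> bool:
    """Everything except measurement data is discrete in this scheme."""
    return data_type is not DataType.CONTINUOUS


for variable, data_type in exercise_answers.items():
    kind = "discrete" if is_discrete(data_type) else "continuous"
    print(f"{variable}: {data_type.value} [{kind}]")
```

Keeping the type next to each variable makes it harder to accidentally average a nominal code or treat a count as if it could take decimal values.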