What do we really mean by data? Put simply, data are pieces of information about individuals organized into variables. By individual, we mean a unit of observation. An observation or unit of observation refers to a particular person or a particular object, any particular unit of observation within your study sample. By a variable we need a particular characteristic of the unit of observation. At the person level we might collect data on Height, Weight, Gender, Race etc. If we're collecting data on a sample of cars, we might measure variables such as Color, Tire Size, Mileage, Model and Number of seats etc. If our sample includes cities we might measure variables such as Population size, Tax Revenue, Energy Consumption, Number of Hospitals and so on. A data set is literally a set of data that's made up of individual observations and variables. Data sets are typically displayed in tables in which rows represent individuals, or units of observation, and columns represent variables. Here is a data set that shows medical records from a survey. In this example, the units of observation are patients, and the variables are Gender, Age, Height, Weight, Smoking, and Race. Each row then gives us all the information about a particular observation. In this case a patient. And each column gives us information about a particular characteristic of all the patients. >> Variables can also be classified into one of two types, Quantitative or Categorical. Quantitive variables take numerical values and represent some kind of measurement. Categorical variables, on the other hand, take category or label values and place an observation or individual into one of several groups. In this example, there are several variables of each type. Age, Weight and Height are Quantitative variables. Race, Gender and Smoking are Categorical variables. >> Notice that the values of the categorical variable Smoking have been coded as zero or one. It's quite common to code the values of a categorical variable as numbers. But you should always remember that these are only codes. Often referred to as Dummy Codes because they have no arithmetic meaning. That is, it doesn't make sense to add them, subtract them, multiply or divide them. Or even compare the magnitude of these values. Finally a unique identifier is a variable that is meant to distinctively define each of the units of observation of your data set. Examples might include serial numbers for data on a particular product, social security numbers for data on an individual person. Or maybe random numbers generated for any type of observation. To help us organize our data, every data set should have a variable that uniquely identifies the observations. This variable is particularly useful if you ever need to merge information across different data sets. In our example, the patient number one through 75 is the unique identifier. >> Relying on data sets, statistics pulls all of the behavioral, physical and social sciences together. It's arguably the one language that we all have in common. While you may think that data is very, very different from discipline to discipline, it really isn't. What you measure is different. And your research question is obviously dramatically different. Whom you observe and whom you collect data from or what you collect data from can be very different. But once you have the data, approaches to analyzing it statistically are often quite similar across many different disciplines. >> Since we will not actually be producing data in this course, your first step will be to choose a data set that offers the opportunity to conduct research on a topic of interest to you.