Before we look at Statistical Analysis and its relation to data analysis, and specifically to data mining, let's first examine what Statistics is. Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of numerical or quantitative data. It's all around us in our day-to-day lives. Whether we're talking about average income, average age, or highest-paid professions, it's all statistics. Today, statistics is being applied across industries for decision-making based on data. For example, researchers use statistics to analyze data from the production of vaccines to ensure safety and efficacy, and companies use statistics to reduce customer churn by gaining greater insight into customer requirements.
Now let's look at what Statistical Analysis is. Statistical Analysis is the application of statistical methods to a sample of data in order to develop an understanding of what that data represents. It includes collecting and scrutinizing every data sample in a set of items from which samples can be drawn. A sample, in Statistics, is a representative selection drawn from a total population, where a population is a discrete group of people or things that can be identified by at least one common characteristic for purposes of data collection and analysis. For example, in a certain use case, the population may be all people in a state who have a driving license, and a sample of this population, that is, a part or subset of it, could be male drivers over the age of 50. Statistical methods are mainly useful to ensure that data is interpreted correctly and that apparent relationships are meaningful and not just happening by chance.
Whenever we collect data from a sample, there are two different types of statistics we can run: Descriptive Statistics, to summarize information about the sample, and Inferential Statistics, to make inferences or generalizations about the broader population.
Descriptive Statistics enables you to present data in a meaningful way, allowing simpler interpretation of the data. Data is described using summary charts, tables, and graphs without any attempt to draw conclusions about the population from which the sample is taken. The objective is to make raw data easier to understand and visualize without drawing conclusions about any hypotheses that were made. For example, say we want to describe the English test scores in a specific class of 25 students. We record the test scores of all students, calculate the summary statistics, and produce a graph. Some of the common measures of Descriptive Statistical Analysis include Central Tendency, Dispersion, and Skewness.
Central Tendency means locating the center of a data sample. Some of the common measures of central tendency include mean, median, and mode. These measures tell you where most values in your dataset fall. So, in the earlier example, the mean score, or the mathematical average, of the class of 25 students would be the sum total of the scores of all 25 students, divided by 25, that is, the number of students. If you order the scores of the 25 students from the smallest to the highest and pick the middle value, that is, the value with 12 values to its left and 12 values to its right, that score would be the median for this dataset. If 12 students have scored less than 75%, and 12 students have scored greater than 75%, then the median is 75. The median is unique for each dataset and, unlike the mean, is largely unaffected by outliers. Mode is the value that occurs most frequently in a set of observations. For example, if the most common score in this group of 25 students is 72%, then that is the mode for this dataset. So, you can see how looking at your dataset through these values can help you get a clearer understanding of it.
Dispersion is the measure of variability in a dataset.
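As a quick aside on the three measures of central tendency just covered, here is a minimal sketch in Python using only the standard library's `statistics` module. The list `scores` is hypothetical data made up for illustration; it was chosen so that 12 students score below 75 and 12 above, and so that 72 is the most common score, to line up with the figures mentioned in the example.

```python
import statistics

# Hypothetical English test scores (in %) for a class of 25 students.
# Invented for illustration: 12 values below 75, 12 above, and 72 most frequent.
scores = [
    55, 60, 62, 65, 68, 70, 72, 72, 72, 73, 74, 74,
    75,
    76, 77, 78, 80, 81, 83, 85, 86, 88, 90, 92, 95,
]

# Mean: sum of all scores divided by the number of students.
mean_score = statistics.mean(scores)

# Median: the middle value once the scores are sorted
# (the 13th of 25, with 12 values on either side).
median_score = statistics.median(scores)

# Mode: the value that occurs most frequently.
mode_score = statistics.mode(scores)

print(f"students: {len(scores)}")
print(f"mean:     {mean_score:.2f}")
print(f"median:   {median_score}")
print(f"mode:     {mode_score}")
```

With this made-up data the median comes out as 75 and the mode as 72, while the mean is a little higher because the top scores sit further from the center than the bottom ones.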
Common measures of statistical dispersion are Variance, Standard Deviation, and Range. Variance measures how far the data points fall from the center, that is, how spread out the values are. When a distribution has lower variability, the values in a dataset are more consistent. However, when the variability is higher, the data points are more dissimilar, and extreme values become more likely. Understanding variability can help you grasp the likelihood of an event happening. Standard deviation, the square root of the variance, tells you how tightly your data is clustered around the mean. And Range gives you the distance between the smallest and largest values in your dataset.
Skewness is the measure of whether the distribution of values is symmetrical around a central value or skewed left or right. Skewed data can affect which types of analyses are valid to perform.
These are some of the basic and most commonly used descriptive statistics tools, but there are other tools as well, for example, using correlation and scatterplots to assess the relationships of paired data.
The second type of statistical analysis is Inferential Statistics. Inferential statistics takes data from a sample to make inferences about the larger population from which the sample was drawn. Using methods of inferential statistics, you can draw generalizations that apply the results of the sample to the population as a whole. Some common methodologies of Inferential Statistics include Hypothesis Testing, Confidence Intervals, and Regression Analysis.
Hypothesis Testing: for example, when studying the effectiveness of a vaccine by comparing outcomes against a control group, hypothesis tests can tell you whether the efficacy observed in the sample is likely to exist in the population as well.
Confidence Intervals incorporate the uncertainty and sampling error to create a range of values that the actual population value is likely to fall within.
Regression Analysis incorporates hypothesis tests that help determine whether the relationships observed in the sample data actually exist in the population rather than just in the sample.
There are various software packages to perform statistical data analysis, such as Statistical Analysis System (or SAS), Statistical Package for the Social Sciences (or SPSS), and StatSoft.
Statistics forms the core of data mining by providing the measures and methodologies necessary for data mining, and by identifying patterns that help distinguish random noise from significant findings. Both data mining, which we will learn more about in this course, and Statistics, as techniques of data analysis, help in better decision-making.
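To tie the dispersion, skewness, and confidence interval ideas from earlier to something concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: `scores` is hypothetical data for a class of 25 students, the skewness is computed with the simple moment-based (Fisher-Pearson) formula, and the 95% confidence interval uses a normal approximation. For a sample this small, a t-distribution would usually be the more appropriate choice, so treat the interval as a rough illustration rather than a recommended method.

```python
import math
import statistics

# Hypothetical test scores (in %) for a class of 25 students, invented for illustration.
scores = [
    55, 60, 62, 65, 68, 70, 72, 72, 72, 73, 74, 74,
    75,
    76, 77, 78, 80, 81, 83, 85, 86, 88, 90, 92, 95,
]

n = len(scores)
mean = statistics.mean(scores)

# --- Dispersion ---
# Sample variance (divides by n - 1) and its square root, the standard deviation.
sample_variance = statistics.variance(scores)
sample_stdev = statistics.stdev(scores)
# Range: distance between the smallest and largest values.
score_range = max(scores) - min(scores)

# --- Skewness ---
# Moment-based coefficient: third central moment over the second central
# moment raised to the power 1.5. Roughly 0 for symmetric data,
# positive for a longer right tail, negative for a longer left tail.
m2 = sum((x - mean) ** 2 for x in scores) / n
m3 = sum((x - mean) ** 3 for x in scores) / n
skewness = m3 / m2 ** 1.5

# --- 95% confidence interval for the population mean ---
# Assumption: normal approximation (z is about 1.96); a t-based interval
# would be somewhat wider for n = 25.
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * sample_stdev / math.sqrt(n)
ci_low, ci_high = mean - margin, mean + margin

print(f"variance: {sample_variance:.2f}")
print(f"std dev:  {sample_stdev:.2f}")
print(f"range:    {score_range}")
print(f"skewness: {skewness:.3f}")
print(f"95% CI for the mean (normal approx.): {ci_low:.2f} to {ci_high:.2f}")
```

The first three values describe only the sample itself, which is the descriptive side. The confidence interval is the inferential step: it treats the 25 scores as a sample and gives a range in which the mean of the wider population is likely to fall, under the stated assumptions.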