The first step toward understanding a data set is to explore it and describe it in summary form. The three main aspects of data description are frequency distributions, measures of central tendency, and measures of variability. These tell us about the shape, center, and spread of the data.
In this article you will learn these essential descriptive statistics and how to use them in data analysis to gain insight into the data and support preliminary conclusions.
What are Descriptive Statistics?
Descriptive statistics simply describe what the data are or what they show. They summarize the characteristics of a data set and let us present the data in a more meaningful way.
Typically, three general types of descriptive statistics are used to describe data:
Frequency distributions.
Measures of central tendency, such as the mean and median.
Measures of variability, such as the variance, standard deviation, and standard error.
Frequency Distribution
A frequency distribution is simply the number of times each value (or range of values) occurs in a data set. Frequency distributions are usually presented in a table or in a graphic such as a histogram or bar chart.
Suppose we have data on 182 roses of different colors. We can present the number (frequency) of roses of each color as shown in Figure 1.
Figure 1: Color of one hundred eighty-two roses. Source: Statistics for the Life Sciences, 4th ed. Myra L. Samuels, Jeffrey A. Witmer, and Andrew Schaffner.
In real research work, the data often include so many different scores that the data points are too close together to be connected by straight lines, so the distribution is drawn as a smooth curve. Such distributions are often negatively skewed (the tail extends to the left) or positively skewed (the tail extends to the right), as shown in Figure 2.
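As a minimal sketch of how such a frequency table can be built, the Python snippet below counts how often each category occurs and converts the counts to relative frequencies. The list of colors is hypothetical and much shorter than the 182 roses behind Figure 1, whose raw data are not reproduced here.

```python
from collections import Counter

# Hypothetical observations, for illustration only
# (not the rose data shown in Figure 1).
colors = ["red", "pink", "white", "red", "pink", "red", "white", "red"]

# Counter tallies how many times each distinct value occurs.
frequency = Counter(colors)

# Relative frequency: each count divided by the total number of observations.
relative_frequency = {
    color: count / len(colors) for color, count in frequency.items()
}

# Print the table, most frequent category first.
for color, count in frequency.most_common():
    print(f"{color:<6} {count:>3} {relative_frequency[color]:.3f}")
```

The same counts could be passed to a plotting library to draw a bar chart like the one in Figure 1.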
Figure 2: A negatively skewed distribution contains extreme low scores, to the left, that have a low frequency. On the other hand, a positively skewed distribution contains extreme high scores, to the right, that have low frequency. Source: Basic Statistics for the Behavioral Sciences, 6th ed. Gary W. Heiman.
Mean and Median
The mean is the sum of all the data values (observations) divided by the number of data values.
The median is the value that falls exactly in the middle of the data set when the values are arranged in numerical order: half of the observations lie above it and half lie below it (Figure 3). When the number of values is even, the median is the average of the two middle values.
Figure 3: Detecting the median
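The two definitions above can be sketched in a few lines of Python. The values are hypothetical and chosen only for illustration; the manual calculation is shown next to the standard library's `statistics.mean` and `statistics.median`, which should give the same results.

```python
import statistics

# Hypothetical data values, for illustration only.
values = [7, 3, 9, 5, 1, 8]

# Mean: the sum of the values divided by how many values there are.
mean = sum(values) / len(values)

# Median: sort the values, then take the middle one
# (or the average of the two middle ones when the count is even).
ordered = sorted(values)
n = len(ordered)
mid = n // 2
if n % 2 == 1:
    median = ordered[mid]
else:
    median = (ordered[mid - 1] + ordered[mid]) / 2

print("mean:", mean, "median:", median)
print("library:", statistics.mean(values), statistics.median(values))
```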
Both the mean and the median measure central tendency. We use them to describe a data set with a single value that represents the center of the data. In a symmetrical (not skewed) distribution, the mean and median are the same (Figure 4).
Figure 4: Position of the mean and median in a symmetrical distribution. Source: Basic Statistics for the Behavioral Sciences, 6th ed. Gary W. Heiman.
When the data contain outliers, you can compare the mean and the median to decide which is the better measure to use, because unusual values (outliers) affect the median less than they affect the mean.
Generally speaking, if your data distribution is asymmetric (skewed), as shown in Figure 5, or contains outliers, the median is a more representative measure of central tendency than the mean. For instance, the median is often used as a measure of central tendency for income or salary data, which are generally highly skewed.
Figure 5: Mode, median, and mean positions in an asymmetric (skewed) distribution.
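To illustrate why the median is preferred for skewed data such as salaries, the sketch below adds a single extreme value to a small, hypothetical list of salaries and compares how far the mean and the median move. The figures are invented for the example.

```python
import statistics

# Hypothetical monthly salaries, for illustration only.
salaries = [2000, 2100, 2200, 2300, 2400]

# The same data with one extreme high value (an outlier) added.
salaries_with_outlier = salaries + [20000]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(salaries_with_outlier)
median_after = statistics.median(salaries_with_outlier)

# The outlier pulls the mean strongly toward the right tail,
# while the median shifts only slightly.
print(f"mean:   {mean_before:.0f} -> {mean_after:.0f}")
print(f"median: {median_before:.0f} -> {median_after:.0f}")
```

In this sketch the mean more than doubles, while the median rises by only 50, so the median stays close to what a typical salary in the list looks like.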
Standard Deviation and Standard Error
The standard deviation (SD) and the standard error (SE) are among the most commonly used measures of variability (dispersion) in research. A measure of variability describes how far the data values are spread from each other and from their mean (Figure 6).
Small variability indicates small differences among the data values; the values lie consistently close to one another.
In summary, a measure of variability indicates how spread out the values are: the smaller the variability, the closer the values are to each other and to the mean.
Figure 6: Illustration of high and low standard deviation.
Both the standard deviation (SD) and the standard error (SE) are closely related to the variance (σ²).
The standard deviation is the square root of the variance.
The standard error is calculated by dividing the standard deviation by the square root of the sample size (n), as shown in the formula:
SE = SD / √n
Researchers occasionally confuse the SD and the SE. However, each has its own meaning:
The SD represents the dispersion of the individual data values and indicates how well the sample mean represents the sample data itself.
The SE describes how precisely the sample mean estimates the true mean of the population.
Keep in mind that, in most research work, we calculate the sample mean because we are interested not only in the mean of this particular sample but also in the mean of the population from which the sample was drawn. Therefore, the SE is an important measure in research.
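The relationship among the variance, the SD, and the SE can be sketched as follows. The sample values are hypothetical, and the sketch assumes the sample variance (dividing by n - 1), which is the usual choice when the data are a sample rather than the whole population; the article itself does not state which divisor is used.

```python
import math
import statistics

# Hypothetical sample, for illustration only.
sample = [4, 8, 6, 5, 7]
n = len(sample)
mean = sum(sample) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1
# (assumption: sample rather than population variance).
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Standard deviation: the square root of the variance.
sd = math.sqrt(variance)

# Standard error of the mean: the SD divided by the square root of n.
se = sd / math.sqrt(n)

print(f"variance={variance:.3f}  SD={sd:.3f}  SE={se:.3f}")

# Cross-check against the standard library's sample statistics.
assert math.isclose(variance, statistics.variance(sample))
assert math.isclose(sd, statistics.stdev(sample))
```

Because the SE divides the SD by √n, it shrinks as the sample grows even when the spread of the individual values stays the same.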
Example: Suppose we have a data set containing the heights of 16 patients (8 males and 8 females) (Table 1). We want to know whether our sample falls within the expected healthy range, given that the expected average height of a healthy population is about 163 cm for women and 176.5 cm for men.
Table 1: Heights of adult males and females (our sample)
Males' height (cm) | Females' height (cm) |
172 | 166 |
168 | 162 |
174 | 154 |
175 | 160 |
165 | 148 |
178 | 155 |
163 | 170 |
172 | 163 |
Now we need to explore the data using basic descriptive statistics to understand them and draw a conclusion from this descriptive summary. Many software packages and programming languages can calculate descriptive statistics easily. Here, we used Microsoft Excel (Table 2).
Table 2: Descriptive statistics for patients’ height.
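For readers who prefer code to a spreadsheet, a comparable summary can be sketched in Python from the heights in Table 1. The helper names `sample_skewness` and `describe` are hypothetical, introduced only for this example. The skewness uses the adjusted Fisher-Pearson formula, which is assumed here to match what Excel reports; treat that as an assumption rather than a statement about how Table 2 was produced.

```python
import math
import statistics

# Heights in cm, copied from Table 1.
males = [172, 168, 174, 175, 165, 178, 163, 172]
females = [166, 162, 154, 160, 148, 155, 170, 163]


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (assumed to match Excel's SKEW)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)


def describe(values):
    """Return the descriptive statistics discussed in this article."""
    sd = statistics.stdev(values)  # sample SD (n - 1 divisor)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": sd,
        "se": sd / math.sqrt(len(values)),
        "skewness": sample_skewness(values),
    }


summary = {"males": describe(males), "females": describe(females)}
for group, stats in summary.items():
    print(group, {name: round(value, 2) for name, value in stats.items()})
```

Run on the Table 1 data, this gives sample means of 170.875 cm for males and 159.75 cm for females, which round to the 171 and 160 quoted in the discussion.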
The descriptive statistics in Table 2 show that the mean and median do not differ markedly for either the males' or the females' data, and that the skewness of both sets of height data is between -0.5 and 0.5, which means that both distributions are approximately symmetric (not skewed). Therefore, both the mean and the median can be used to measure the central tendency of these data sets.
Since the standard deviation (SD) of the males' heights is lower than that of the females' heights, the males' heights are closer to each other (lower variability) than the females'. In addition, the standard error (SE) of the males' data is also lower, so the males' sample mean is a slightly more precise estimate of its population mean than the females' sample mean is.
We can also conclude that the sample means for males (171 cm) and females (160 cm) are slightly lower than the expected average heights of a healthy population, which are 163 cm for women and 176.5 cm for men.
References
Heiman, G. W. (2011). Basic Statistics for the Behavioral Sciences (6th ed.). USA: Cengage Learning.
Mendenhall, W. M., & Sincich, T. L. (2016). Statistics for Engineering and the Sciences Student Solutions Manual (6th ed.). USA: Taylor & Francis Group, LLC.
Samuels, M. L., Witmer, J. A., & Schaffner, A. (2012). Statistics for the Life Sciences (4th ed.). Pearson Education, Inc.
Weiss, N. A., & Weiss, C. A. (2012). Introductory Statistics (9th ed.). Pearson Education, Inc.