Data Summarization

Ankit Gupta
5 min read · Dec 15, 2021

Photo by Sarah Pflug from Burst

Before we can do all kinds of cool stuff with data, we need to get to know it and get a feel for it. We have to understand the story the data is trying to tell us. That is what data summarization achieves, and it is a core part of both descriptive and inferential statistics.

Measure of Central Tendency

Oftentimes we need to know the central value of a dataset before we can form any assumptions about it. A measure of central tendency is a statistical summary that approximates the center point of a given dataset.
To measure the central location of a dataset, we use three distinct measures:
1. Mean
2. Median
3. Mode

Mean or Average : The mean approximates the central value of a dataset by summing all the values in the set and dividing the sum by their count.

For the population mean, µ = (sum of all values in the population) / N, where N is the number of values in the population.
For the sample mean, x̄ = (sum of all values in the sample) / n, where n is the number of values in the sample.

The mean is highly sensitive to outliers, so it should be used cautiously, after looking closely at the dataset through the lens of the business requirement.
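To see how much a single outlier can pull the mean, here is a minimal sketch using Python's built-in `statistics` module. The salary figures are made up purely for illustration.

```python
from statistics import mean

# Hypothetical monthly salaries (in thousands); invented for illustration only.
salaries = [30, 32, 35, 31, 33]

# The same data with one extreme value added.
salaries_with_outlier = salaries + [300]

mean_without = mean(salaries)               # 161 / 5
mean_with = mean(salaries_with_outlier)     # 461 / 6

print("mean without outlier:", mean_without)
print("mean with outlier:   ", round(mean_with, 2))
```

One extreme value more than doubles the mean here, even though five of the six values sit between 30 and 35.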

Median : The median is another way to approximate the central tendency, and it does not involve summing the values. After sorting the observations, the median is the middle value if the count of observations is odd, and the average of the two middle values if the count is even.
It is extremely useful when the dataset has outliers or is skewed.
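A quick sketch of the median with Python's `statistics.median`, which sorts the data for us. For an even number of observations it averages the two middle values, which is the usual convention. The numbers are made up for illustration.

```python
from statistics import median

odd_count = [7, 1, 5, 3, 9]     # sorted: 1, 3, 5, 7, 9 -> middle value
even_count = [7, 1, 5, 3]       # sorted: 1, 3, 5, 7 -> average of 3 and 5

# Hypothetical salaries with one extreme value (same idea as an outlier check).
with_outlier = [30, 32, 35, 31, 33, 300]

median_odd = median(odd_count)
median_even = median(even_count)
median_outlier = median(with_outlier)   # average of 32 and 33

print(median_odd, median_even, median_outlier)
```

Notice that the extreme value of 300 barely matters: the median stays in the low thirties, where most of the data lives.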

Mode : The mode is simply the value that appears most often in a given set. If two modes are present, the data is called bimodal; if there are more than two, it is called multimodal.
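As a small illustration, `statistics.multimode` (Python 3.8+) returns every value tied for the highest count, so it handles the unimodal, bimodal and multimodal cases alike. The lists are invented examples.

```python
from statistics import multimode

unimodal = [1, 2, 2, 3, 4]            # 2 appears most often
bimodal = [1, 1, 2, 3, 3]             # 1 and 3 are tied
multimodal = [1, 1, 2, 2, 3, 3, 4]    # 1, 2 and 3 are tied

modes_uni = multimode(unimodal)
modes_bi = multimode(bimodal)
modes_multi = multimode(multimodal)

print(modes_uni, modes_bi, modes_multi)
```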

Measure of Dispersion

Sample and Population Variance : Population variance tells us how spread out the data points in a population are. It is represented by σ². It is the average of the squared distances between each data point in the population and the population mean.

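Population variance: σ² = Σ (xᵢ - µ)² / N, where µ is the population mean and N is the number of data points in the population.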

Often we are unable to calculate the population variance, because the complete data is unavailable or because collecting it costs too much time and money, so we approximate it with the sample variance. Sample variance tells us how spread out the data points in a sample are. It is represented by s². For the sample variance, we divide the sum of squared distances by n - 1 instead of n, where n is the sample size.

Sample variance: s² = Σ (xᵢ - x̄)² / (n - 1), where x̄ is the sample mean and n is the sample size.
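The difference between the two variances is only the divisor. Here is a minimal sketch that computes both by hand and compares them with `statistics.pvariance` and `statistics.variance`. The data is a made-up example.

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, mean is 5
n = len(data)
center = mean(data)

# Sum of squared distances from the mean.
squared_distances = sum((x - center) ** 2 for x in data)

manual_population_variance = squared_distances / n        # divide by N
manual_sample_variance = squared_distances / (n - 1)      # divide by n - 1

print(manual_population_variance, pvariance(data))
print(manual_sample_variance, variance(data))
```

The sample variance is always a little larger than the population formula applied to the same numbers, because it divides by a smaller count.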

Population Standard Deviation : Standard deviation gives us an idea of how the data is spread around the mean. It is represented by σ. It is the square root of the variance.

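Population standard deviation: σ = √( Σ (xᵢ - µ)² / N ), where µ is the population mean and N is the number of data points in the population.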

Similar to sample variance, we also use the sample standard deviation, represented by s. For the sample standard deviation, we again divide by n - 1 before taking the square root.

Sample standard deviation: s = √( Σ (xᵢ - x̄)² / (n - 1) ), where x̄ is the sample mean and n is the sample size.

Dispersion of data for a normal distribution measured with standard deviation.
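Since the standard deviation is just the square root of the variance, we can check this directly. The sketch below uses `statistics.pstdev` and `statistics.stdev` on a made-up dataset.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values

population_sd = pstdev(data)   # square root of the population variance
sample_sd = stdev(data)        # square root of the sample variance (n - 1)

print("population sd:", population_sd, "check:", sqrt(pvariance(data)))
print("sample sd:    ", sample_sd, "check:", sqrt(variance(data)))
```

Unlike the variance, the standard deviation is in the same units as the data, which makes it easier to read.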

Skewness : Skewness is a measure of a dataset's symmetry, or lack of it. It helps us see whether the data is stretched further on one side of the mean than the other: positive skew means a longer right tail, and negative skew means a longer left tail. It is quite useful for spotting any distortion or asymmetry that makes the dataset deviate from the normal distribution.

Negative and Positive Skewness
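There are several formulas for skewness. As a rough sketch, the helper below uses the moment coefficient (third central moment divided by the variance raised to 1.5), which is one common choice and an assumption on my part. The function name `skewness` and the data are hypothetical.

```python
from statistics import fmean

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population form)."""
    center = fmean(values)
    n = len(values)
    m2 = sum((x - center) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - center) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]      # long tail on the right
left_skewed = [-x for x in right_skewed]      # mirror image, long tail on the left

print(skewness(symmetric))      # zero for perfectly symmetric data
print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
```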

Kurtosis : Kurtosis helps us understand tail extremity. It describes how much of the data sits in the tails of the distribution, which helps us identify outliers.

  • For a normal distribution, kurtosis is 3.
  • High kurtosis indicates that the data is heavy-tailed and might have outliers.
  • Low kurtosis indicates that the data is light-tailed and has fewer outliers.
  • Kurtosis > 3 means a long-tailed distribution, typically with a narrower and taller peak.

Kurtosis
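To make the "3 for a normal distribution" scale concrete, here is a minimal sketch that computes kurtosis as the fourth central moment divided by the squared variance. The function name `kurtosis` and both datasets are hypothetical, chosen so the arithmetic is easy to follow.

```python
from statistics import fmean

def kurtosis(values):
    """Moment-based kurtosis: m4 / m2 ** 2 (a normal distribution gives 3)."""
    center = fmean(values)
    n = len(values)
    m2 = sum((x - center) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - center) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2

light_tailed = [1, 2, 3, 4, 5]                        # evenly spread, no extremes
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]      # mostly central, two extremes

print(kurtosis(light_tailed))   # below 3
print(kurtosis(heavy_tailed))   # above 3
```

Be careful when comparing numbers across tools: some libraries report excess kurtosis (kurtosis minus 3) by default, so a normal distribution shows up as 0 instead of 3.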

Quartiles & IQR :

Quartiles are the values that divide sorted data into quarters. The four quarters are:

  • lowest 25% of data
  • next lowest 25% of data
  • second highest 25% of data
  • highest 25% of data

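The upper quartile (Q3) is the value that separates the third and fourth quarters: it separates the lowest 75% of the data from the top 25%. Likewise, the lower quartile (Q1) separates the lowest 25% from the rest, and the interquartile range (IQR) is the distance between them, Q3 - Q1. The best way to visualize quartiles is with a box plot.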
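Here is a small sketch of quartiles and the IQR using `statistics.quantiles` (Python 3.8+) with `n=4`. The data is made up, and different quartile methods can give slightly different cut points on small datasets, so treat the exact values as one convention among several.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 15]   # illustrative values

# n=4 asks for the three cut points that split the data into four quarters.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1   # interquartile range: spread of the middle 50% of the data

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)
print("IQR:", iqr)
```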

Correlation : Correlation describes the strength and direction of the relationship between variables. It is measured using a numerical value known as the correlation coefficient (r), which ranges from -1 to 1.

  • The closer the correlation coefficient is to 1, the stronger the positive relationship.
  • A value of 0 means no linear relationship at all.
  • A value towards -1 represents a negative relationship between the variables.
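The three cases above can be sketched with a hand-written Pearson correlation coefficient, which I am assuming is the r meant here. The function name `pearson_r` and the tiny datasets are hypothetical.

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance_sum / sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
rises_with_x = [2, 4, 6, 8, 10]     # y goes up as x goes up
falls_with_x = [10, 8, 6, 4, 2]     # y goes down as x goes up
unrelated = [2, 1, 3, 1, 2]         # no linear pattern

print(pearson_r(x, rises_with_x))   # close to 1
print(pearson_r(x, falls_with_x))   # close to -1
print(pearson_r(x, unrelated))      # close to 0
```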

“Correlation is not causation”: just because two things are correlated does not mean that one causes the other.
