The Power of Basic Statistics: Key Principles for Data Analysis

Priyanka Dave
5 min read · May 2, 2023

Table of contents:

  • Types of Variables
  • Measures of Central Tendency
  • 5 Number Summary
  • Measures of Dispersion

Types of Variables:

  1. Qualitative variables:

  • Describe a quality or category rather than a measured quantity, so arithmetic on them is not meaningful.
  • Also called categorical variables.
  • There are 2 main types of qualitative variables:
      1. Nominal: Categories with no inherent order. Example: city name, nationality of a person, yes/no, true/false, bank account type, seasons, etc.
      2. Ordinal: Where order matters. Example: feedback (poor, good, very good, excellent).

  2. Quantitative variables:

  • Have a numerical value associated with them.
  • Can be discrete (counts) or continuous (measurements); in data science they are often loosely called continuous variables.
  • Example: weight, stock price, salary, bank account balance, etc.

In data science, the practical distinction is:

Variables that cannot be used in arithmetic operations are treated as categorical variables. For example, we cannot add two cities.

Variables that can be used in arithmetic operations are treated as continuous variables. For example, we can do arithmetic on salary, account balance, height, weight, etc.
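As a rough illustration of this distinction, the sketch below uses only Python's standard library. The city names, feedback values, and salaries are made-up sample values, and the `rank` mapping is a hypothetical encoding of the feedback scale mentioned above.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample values, for illustration only.
cities = ["Pune", "Delhi", "Pune", "Mumbai"]        # nominal
feedback = ["good", "poor", "excellent", "good"]    # ordinal
salaries = [42000, 55000, 61000, 48000]             # quantitative

# Nominal data: counting how often each category occurs is meaningful,
# but adding two cities together is not.
city_counts = Counter(cities)

# Ordinal data: an explicit ranking lets us sort the categories,
# even though the gaps between ranks are not measured.
rank = {"poor": 0, "good": 1, "very good": 2, "excellent": 3}
sorted_feedback = sorted(feedback, key=rank.get)

# Quantitative data: arithmetic such as the mean is meaningful.
average_salary = mean(salaries)

print(city_counts.most_common(1))
print(sorted_feedback)
print(average_salary)
```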

Measures of Central Tendency:

  1. Mean:

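The mean is the sum of all values divided by the number of values.

Let’s say we have the dataset shown in the original figure (not reproduced here): 10 values that sum to 126.

Here, mean = 126/10 = 12.6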

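  2. Median:

  • To find the median, first sort the dataset in ascending order, then take the middle value (or the average of the two middle values when the number of values is even).
  • Here, the two middle values are both 3, so median = (3 + 3)/2 = 3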

  3. Mode:

  • The value that occurs most often.
  • Here, mode = 2 (as it occurs 3 times in our dataset)

From the mean, median, and mode above, we can clearly see that the mean is pulled upward by the extreme value (100 in our dataset), while the median and mode are not.

So, when extreme values exist in the dataset, the mean is not the right measure of the central point; the median is usually a better choice.
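The effect of an extreme value can be checked with Python's built-in `statistics` module. The original dataset figure is not available, so the list below is a hypothetical stand-in. It was chosen only to agree with the numbers quoted above (10 values, sum 126, mode 2, extreme value 100).

```python
from statistics import mean, median, mode

# Hypothetical dataset: NOT the article's original figure, just 10 values
# consistent with the quoted sum (126), mode (2), and extreme value (100).
data = [1, 2, 2, 2, 3, 3, 4, 4, 5, 100]

print(mean(data))    # 12.6
print(median(data))  # 3.0
print(mode(data))    # 2

# Drop the extreme value and recompute: the mean moves a lot,
# the median does not move at all.
trimmed = [x for x in data if x != 100]
print(round(mean(trimmed), 2))  # 2.89
print(median(trimmed))          # 3
```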

5 Number Summary:

Let’s say we have the data shown in the original figure (not reproduced here).

The 5 number summary tells us a great deal about how the data is distributed.

First, sort the data in ascending order.

Minimum, Median & Maximum
Q1 & Q2

From the images above, the 5 number summary of the given data is:

  1. Minimum = 1
  2. First Quartile (Q1) = 2
  3. Second Quartile (Median) = 3
  4. Third Quartile (Q3) = 4
  5. Maximum = 100
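The summary above can be sketched in a few lines of Python. This assumes the "median of halves" convention for quartiles (Q1 is the median of the lower half, Q3 the median of the upper half). Other conventions exist and can give slightly different quartiles. The helper name `five_number_summary` and the dataset are hypothetical, and the dataset was chosen to match the five values listed above.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention.

    Expects at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values before the median position
    upper = ordered[half + n % 2:]   # values after it (middle value skipped when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical, unsorted data consistent with the summary above.
data = [100, 3, 2, 4, 1, 2, 5, 3, 2, 4]
print(five_number_summary(data))  # (1, 2, 3.0, 4, 100)
```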

Measures Of Dispersion:

Measures of dispersion tell us how spread out the data is (also known as variation or fluctuation).

  1. Range:

Range = Maximum - Minimum

If extreme values exist in the data, the range is not the right measure to use, because it depends only on the two most extreme observations.

  2. Inter Quartile Range (IQR):

IQR = Q3 - Q1

IQR ignores the top 25% and bottom 25% of the data and measures the spread of the middle 50%. So even if extreme values are present, they are not used in the calculation.

As a result, IQR is much less affected by outliers than the range.
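To see the difference in practice, the sketch below compares the range and the IQR with and without an extreme value. It uses `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions. The dataset and the helper name `spread` are assumptions made for illustration.

```python
from statistics import quantiles

def spread(values):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1

# Hypothetical dataset with one extreme value (100).
with_outlier = [1, 2, 2, 2, 3, 3, 4, 4, 5, 100]
without_outlier = with_outlier[:-1]

print(spread(with_outlier))     # (99, 2.0)
print(spread(without_outlier))  # (4, 2.0)
```

In this example, removing the single extreme value shrinks the range from 99 to 4, while the IQR stays at 2.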

  3. Standard Deviation:

  • Standard deviation tells us how far the observations typically are from the mean.
  • A low standard deviation means the data are clustered around the mean, and a high standard deviation means the data are more spread out.
  • Let’s look at an example where we have collected the runs scored by Cricketer A in 10 matches (the table is shown in the original figure).

Here, to calculate the variance, we divided the sum of squared deviations by 9 and not by 10. Why?

If Cricketer A had played only these 10 matches in their whole career, this data would be the population and we should divide by 10. But we are treating it as a sample from a longer career, so we divide by n - 1 = 9. This adjustment (Bessel's correction) compensates for the fact that deviations measured from the sample mean tend to understate the true spread.

So here the standard deviation of 30.60 tells us that the scores typically lie about 30 runs away from the mean. We can also say that the player is not consistent.
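The difference between dividing by n - 1 and by n can be sketched as follows. The original table of runs is not available, so `runs` below is a hypothetical list and will not reproduce the 30.60 figure. The manual calculation is checked against `statistics.stdev` (sample) and `statistics.pstdev` (population).

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Hypothetical runs for 10 matches (NOT the article's original table).
runs = [10, 45, 80, 0, 62, 33, 105, 27, 51, 7]

m = mean(runs)
squared_deviations = sum((x - m) ** 2 for x in runs)

# Sample standard deviation: divide by n - 1 (here 9).
sample_sd = sqrt(squared_deviations / (len(runs) - 1))
# Population standard deviation: divide by n (here 10).
population_sd = sqrt(squared_deviations / len(runs))

# The library functions should agree with the manual calculation.
print(round(sample_sd, 2), round(stdev(runs), 2))
print(round(population_sd, 2), round(pstdev(runs), 2))
```

The sample version is always slightly larger than the population version for the same data, because it divides by a smaller number.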

  4. Coefficient Of Variation (Relative Dispersion):

  • Coefficient Of Variation (CV) is defined as the ratio of the standard deviation to the mean.
  • Let’s say we have data from the last 10 matches of 2 players and we need to decide whom to pick for the world cup.
  • Cricketer A scores an average of 58.7 runs with an S.D. of 12.57
  • Cricketer B scores an average of 64.6 runs with an S.D. of 78.06

Who is better?

  • CV (A) = 12.57/58.7 = 0.21 = 21%
  • CV (B) = 78.06/64.6 = 1.21 = 121%

There is 21% variation (risk) associated with player A, while 121% variation (risk) is associated with player B.

Though player B's average is higher, we will pick player A for the world cup because of the lower risk.
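The comparison can be reproduced with a small helper. The function name `coefficient_of_variation` and the `players` dictionary are illustrative choices; the averages and standard deviations are the ones quoted in the text.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the ratio of standard deviation to mean (mean must be non-zero)."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean

# (average, standard deviation) as quoted in the text.
players = {"A": (58.7, 12.57), "B": (64.6, 78.06)}

cvs = {name: coefficient_of_variation(sd, avg) for name, (avg, sd) in players.items()}

for name, cv in cvs.items():
    print(f"Cricketer {name}: CV = {cv:.2f} ({cv:.0%})")

# The player with the lowest CV is the most consistent relative to their average.
most_consistent = min(cvs, key=cvs.get)
print(most_consistent)  # A
```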

Thank You.
