Skewness Understanding by CHIRAG

 

What is skewness?


In simple words, skewness is a measure of how asymmetric the probability distribution of a random variable is about its mean, that is, how far its shape deviates from the symmetric shape of the normal distribution.

Now, what is a normal distribution here?

Well, the normal distribution is a symmetric, bell-shaped probability distribution without any skewness. The image below shows a symmetrical distribution, which is what a normal distribution looks like.



So far, we’ve looked at the normal distribution, which has zero skewness, through its probability (or frequency) distribution.

Now, let’s understand it in terms of a boxplot, because that is one of the most common ways of looking at a distribution in data science.

The above image is a boxplot of a symmetric distribution. You’ll notice that the distance between Q1 and Q2 is equal to the distance between Q2 and Q3: (Q3-Q2) = (Q2-Q1)

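Let’s jump to the formulas for skewness now:

1) SKEWNESS = (MEAN - MODE) / STANDARD DEVIATION

2) MODE ≈ 3 MEDIAN - 2 MEAN (an empirical relationship for moderately skewed data)

3) SKEWNESS = 3 (MEAN - MEDIAN) / STANDARD DEVIATION

Formula 3 follows from substituting formula 2 into formula 1: MEAN - (3 MEDIAN - 2 MEAN) = 3 (MEAN - MEDIAN).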
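The two mode-based and median-based formulas above can be tried out with Python's standard library. This is only a rough sketch: the function names and the sample values are made up for illustration, and the sample standard deviation (statistics.stdev) is an assumption, since the formulas do not say which standard deviation to use.

```python
import statistics

def pearson_mode_skewness(data):
    """Skewness = (mean - mode) / standard deviation."""
    mean = statistics.mean(data)
    mode = statistics.mode(data)
    sd = statistics.stdev(data)  # assumption: sample standard deviation
    return (mean - mode) / sd

def pearson_median_skewness(data):
    """Skewness = 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)  # assumption: sample standard deviation
    return 3 * (mean - median) / sd

# Hypothetical sample with a long tail on the right
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

print(pearson_mode_skewness(sample))    # positive for this sample
print(pearson_median_skewness(sample))  # positive for this sample
```

The two values will usually differ, because the relationship between mean, median and mode is only approximate; what matters here is that both come out positive for a right-tailed sample.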


Common characteristics of all normal distributions

  1. They are all symmetric.
  2. Mean = Median = Mode
  3. Empirical rule: the empirical rule, also sometimes called the three-sigma or 68-95-99.7 rule, is a statistical rule which states that for normally distributed data, about 68% of observations fall within one standard deviation of the mean, about 95% within two, and about 99.7% (almost all) within three.
  4. The standard normal distribution is a special case of the normal distribution where the mean = 0 and the SD = 1.
  5. The standard normal distribution is also known as the Z-distribution.
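The percentages in the empirical rule can be checked against the standard normal distribution with scipy.stats.norm. This is a small sketch; fraction_within is a helper name invented here for illustration.

```python
from scipy.stats import norm

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return norm.cdf(k) - norm.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

Because any normal distribution can be standardized to the Z-distribution, the same fractions apply whatever the mean and SD are.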
 

Types of Skewness

Positively Skewed or Right-Skewed (Positive Skewness)

  1. Right-skewed distributions occur when the long tail is on the right side of the distribution.
  2. This condition occurs because probabilities taper off more slowly for higher values.
  3. In a positively skewed distribution, the mean of the data is typically greater than the median.
  4. Typically, MEAN > MEDIAN > MODE
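To see the ordering of mean, median and mode on a concrete case, here is a tiny example with made-up values (think of a few ordinary values plus a couple of very large ones pulling the mean to the right). The data is hypothetical and chosen only to illustrate the pattern.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are large
right_skewed = [20, 22, 22, 25, 27, 30, 35, 60, 120]

mean = statistics.mean(right_skewed)
median = statistics.median(right_skewed)  # 27
mode = statistics.mode(right_skewed)      # 22

print(mean, median, mode)
print(mean > median > mode)  # True
```

The two large values (60 and 120) drag the mean up to about 40, while the median and mode stay close to where most of the data sits.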

Right Skewed Box Plot

If a box plot is skewed to the right, the box shifts to the left and the right whisker gets longer. As a result, the mean is greater than the median.


In the boxplot below, you can see that Q2 lies nearer to Q1. This represents a positively skewed distribution.

In terms of quartiles, it can be given by: (Q3-Q2) > (Q2-Q1)

Right Skewed Histogram

A histogram is right skewed if its peak sits toward the left and its long tail stretches out to the right, toward the higher values. That long right tail is what gives the distribution its positive skew.


Negatively Skewed or Left-Skewed (Negative Skewness)

  1. Left-skewed distributions occur when the long tail is on the left side of the distribution.
  2. This condition occurs because probabilities taper off more slowly for lower values.
  3. In a negatively skewed distribution, the mean of the data is typically less than the median.
  4. Typically, MODE > MEDIAN > MEAN

Left Skewed Boxplot

If the bulk of the observations are on the high end of the scale, the boxplot is left skewed. Consequently, the left whisker is longer than the right whisker.

In the boxplot below, you can see that Q2 lies nearer to Q3. This represents a negatively skewed distribution.


In terms of quartiles, it can be given by: (Q2-Q1) > (Q3-Q2)
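The quartile comparison used for boxplots can be sketched in a few lines with numpy.percentile. The function name and the sample lists below are hypothetical, and the quartiles use NumPy's default linear interpolation, which is an assumption (other quartile conventions can give slightly different values).

```python
import numpy as np

def quartile_skew_direction(data):
    """Compare (Q3 - Q2) with (Q2 - Q1) to guess the direction of skew."""
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    upper = q3 - q2  # distance from the median to the upper quartile
    lower = q2 - q1  # distance from the lower quartile to the median
    if np.isclose(upper, lower):
        return "symmetric"
    return "right-skewed" if upper > lower else "left-skewed"

right = [1, 2, 2, 3, 3, 4, 6, 9, 15]   # hypothetical data, long right tail
left = [-v for v in right]              # mirror image, long left tail
even = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # evenly spaced values

print(quartile_skew_direction(right))  # right-skewed
print(quartile_skew_direction(left))   # left-skewed
print(quartile_skew_direction(even))   # symmetric
```

This only looks at the box (the middle 50% of the data), so it is a quick visual-style check rather than a full skewness measure.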

Left Skewed Histogram

A histogram is left skewed if its peak sits toward the right and its long tail stretches out to the left, toward the lower values.

Rule of thumb:
  1. If the skewness is between -0.5 and 0.5, the data are nearly symmetrical.
  2. If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are slightly skewed.
  3. If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data are extremely skewed.
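The rule of thumb can be written as a small helper. This is just an illustrative sketch: describe_skewness is a made-up name, and how to treat values exactly equal to 0.5 or 1 is an assumption, since the rule itself does not say which bucket they belong to.

```python
def describe_skewness(value):
    """Label a skewness value using the rule of thumb above."""
    magnitude = abs(value)
    # Assumption: the boundary values 0.5 and 1 fall into the milder category
    if magnitude <= 0.5:
        return "nearly symmetrical"
    direction = "positive" if value > 0 else "negative"
    if magnitude <= 1:
        return f"slightly skewed ({direction})"
    return f"extremely skewed ({direction})"

print(describe_skewness(0.2))   # nearly symmetrical
print(describe_skewness(0.65))  # slightly skewed (positive)
print(describe_skewness(-1.8))  # extremely skewed (negative)
```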

How Do We Transform Skewed Data?

Skewed data can hurt the predictive performance of some machine learning models, so it often helps to transform skewed data into something closer to a symmetric, normal-like distribution. Here are some of the ways you can transform your skewed data:

  • Power Transformation
  • Log Transformation
  • Exponential Transformation

Note: The selection of transformation depends on the statistical characteristics of the data.
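As a rough illustration of the first two options, here is a sketch that applies a square-root transform (a power transformation with exponent 0.5) and a log transform to a small right-skewed sample and compares the skewness before and after. The sample values are made up, and both transforms assume strictly positive data.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed sample (all values must be positive for log)
values = np.array([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

sqrt_values = np.sqrt(values)  # power transformation with exponent 0.5
log_values = np.log(values)    # log transformation

skew_raw = skew(values)
skew_sqrt = skew(sqrt_values)
skew_log = skew(log_values)

print(f"raw:  {skew_raw:.2f}")
print(f"sqrt: {skew_sqrt:.2f}")
print(f"log:  {skew_log:.2f}")
```

For this sample both transforms pull in the long right tail, and the log transform reduces the skewness more than the square root. On other data the best choice may differ, which is why it is worth checking the skewness after transforming.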

How to calculate Skewness in Python?


First, let’s create a list of numbers:
x = [55, 78, 65, 98, 97, 60, 67, 65, 83, 65]

To calculate the Fisher-Pearson coefficient of skewness, we will need the scipy.stats.skew function:

from scipy.stats import skew
print(skew(x, bias=True))

And we should get: 0.647511295006068

To calculate the adjusted skewness in Python, pass bias=False as an argument to the skew() function:

print(skew(x, bias=False))

And we should get:

0.7678539385891452
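To see what skew() is doing under the hood, the same two numbers can be reproduced by hand with NumPy. This is a sketch under the usual definitions: g1 = m3 / m2^1.5, where m2 and m3 are the second and third central moments, and the adjusted version multiplies g1 by sqrt(n(n-1)) / (n-2). The function name sample_skewness is made up for illustration.

```python
import numpy as np

def sample_skewness(data, bias=True):
    """Fisher-Pearson coefficient of skewness, optionally bias-adjusted."""
    a = np.asarray(data, dtype=float)
    n = a.size
    deviations = a - a.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (biased variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    g1 = m3 / m2 ** 1.5
    if bias:
        return g1
    # Adjusted Fisher-Pearson standardized moment coefficient
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

x = [55, 78, 65, 98, 97, 60, 67, 65, 83, 65]

print(round(sample_skewness(x), 4))              # 0.6475
print(round(sample_skewness(x, bias=False), 4))  # 0.7679
```

Both values agree with the results shown above, and by the rule of thumb this list counts as slightly positively skewed.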
-------------------------------------------------------------------

Now let’s go through some important questions & answers

1) If a positively skewed distribution has a median of 50, which of the following statements is true?

A) Mean is greater than 50
B) Mean is less than 50
C) Mode is less than 50
D) Mode is greater than 50
E) Both A and C
F) Both B and D

2) Which of the following measures of central tendency will always change if a single value in the data changes?
A) Mean
B) Median
C) Mode
D) All of these

3) Which is the best measure of central tendency – Mean, Median, Mode?

4) What is the empirical rule?

5) What are the different measures of Skewness?

=============================================
Thank you 
Stay connected for more articles on DATA SCIENCE....






















