Descriptive Statistics

In descriptive statistics you describe, present, summarize and organize your data.
It gives basic information about the data, which helps you proceed with further data analysis.

Two types of Descriptive statistics:

Measure of Central Tendency
Measure of Spread / Dispersion

1. Measure of Central Tendency

A basic step in exploring your data is getting a “typical value” for each feature (variable).
This is an estimate of where most of the data is located.

Methods to describe the typical value of a feature:

Mean

Trimmed mean

Median

Mode

Mean or Average

The sum of all values divided by the number of values.
N (or n) refers to the total number of records or observations.
In statistics it is capitalized if it is referring to a population and lowercase if it refers to a sample from a population.
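
As a minimal sketch of the definition above, the mean can be computed by hand and checked against Python's standard `statistics` module. The `scores` list is illustrative data, not part of any real data set.

```python
from statistics import mean

scores = [60, 81, 83, 91, 99]  # illustrative sample

# Sum of all values divided by the number of values (n)
n = len(scores)
manual_mean = sum(scores) / n

# The standard library should agree with the manual calculation
library_mean = mean(scores)

print(manual_mean, library_mean)
```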

Trimmed Mean

A trimmed mean (sometimes called a truncated mean) is similar to a mean, but it drops a fixed share of the smallest and largest values, which is where outliers usually sit.
A trimmed mean is described by a percentage. The percentage tells you how much of the data to remove from each end.
For example, with a 5% trimmed mean, the lowest 5% and highest 5% of the data are excluded. The mean is calculated from the remaining 90% of data points.
Example

Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99.
Step 1: Trim the top and bottom 20% from the data. With five values, 20% is one value from each end, so 60 and 99 are removed. That leaves the middle three values: 81, 83, 91.
Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85.
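
The two steps above can be sketched in Python as follows. `trimmed_mean` is a hypothetical helper written for this example, and it assumes the number of values cut from each end is rounded down.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values cut from each end (rounded down)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = [60, 81, 83, 91, 99]
result = trimmed_mean(scores, 0.20)
print(result)  # 85.0
```

With a proportion of 0 the function falls back to the ordinary mean, which makes it easy to compare the two on the same data.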

Median

The median is the middle number in a sorted list of the data.
If there is an odd number of observations, the median is the middle value.
If the number of observations is even, the median is the average of the two middle values.
If n is odd

Median is the value at position (n+1)/2.

If n is even

- Find the value at position n/2
- Find the value at position (n/2) + 1
- Average the two values to get the median.

Example

Find the median for 12, 24, 41, 51, 67, 67, 85, 99.
Step 1: n = 8 is even, so the median is the average of the two middle values (positions 4 and 5): (51 + 67) / 2 = 59.
Find the median for 2, 4, 6, 8, 10.
Step 1: n = 5 is odd, so the median is at position (n+1)/2 = 6/2 = 3. The value at the 3rd position is 6.
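
A small Python sketch of the position rule, checked against `statistics.median`. `median_by_position` is a hypothetical helper for illustration. Python lists are 0-based, so the 1-based positions in the text are shifted by one.

```python
from statistics import median

def median_by_position(values):
    """Median using the odd/even position rule on sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n//2
        return ordered[mid]
    # Even n: average the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2

even_data = [12, 24, 41, 51, 67, 67, 85, 99]
odd_data = [2, 4, 6, 8, 10]

even_median = median_by_position(even_data)  # 59.0
odd_median = median_by_position(odd_data)    # 6
print(even_median, odd_median)
```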

Mode

The mode is the value that appears most often in a data set.
Example: 2, 6, 7, 8, 8
The mode is 8, which occurs twice.
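
As a short sketch, the standard `statistics` module offers `mode` for a single most frequent value and `multimode` (Python 3.8+) for ties. The `tied_data` list is made up to show the tie case.

```python
from statistics import mode, multimode

data = [2, 6, 7, 8, 8]
most_common = mode(data)  # 8 appears twice, more than any other value

# When several values tie for the highest count, multimode returns all of them
tied_data = [1, 1, 2, 2, 3]
all_modes = multimode(tied_data)

print(most_common, all_modes)
```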

Outliers

An outlier is any value that is very distant from the other values in a data set.
The trimmed mean and the median are widely used because they are less influenced by outliers than the mean.

How to detect Outliers?

The most commonly used method to detect outliers is visualization, with plots such as box plots, histograms and scatter plots.
Some analysts also use various rules of thumb to detect outliers. Some of them are:

Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1.
Use capping methods. Any value outside the range of the 5th to 95th percentile can be considered an outlier.
Data points three or more standard deviations away from the mean are considered outliers.
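
The IQR rule of thumb can be sketched as below. `iqr_outliers` is a hypothetical helper, the value 400 is added to the data purely for illustration, and the quartiles come from `statistics.quantiles` with `method="inclusive"` (other quartile conventions would shift the fences slightly).

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [12, 24, 41, 51, 67, 67, 85, 99, 115, 400]  # 400 is an illustrative extreme value
flagged = iqr_outliers(data)
print(flagged)  # [400]
```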

How to Treat outliers?

Deleting Observation
Imputing
Transforming and binning values
Treat Separately.

Note

An outlier can strongly affect the mean of a data set.
The median and the trimmed mean are better choices if a feature has outliers.

2. Measure of Spread / Dispersion / Variability

Location is just one dimension in summarizing a feature.
A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.

Variance
Standard Deviation
Mean Deviation / Mean Absolute Deviation
Range
Percentile
Quartiles

Variance

Variance is the average of the squared distances between each value and the mean.

s^2 = Σ (xi - x̅)^2 / (n - 1)

s^2 = sample variance
xi = the value of one observation
x̅ = the mean value of all observations
n = the number of observations

For a whole population, the sum is divided by N instead of n - 1.
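
A minimal sketch of the variance calculation, done by hand and checked against the standard `statistics` module. The data is illustrative. The sample variance divides by n - 1 and the population variance divides by n.

```python
from statistics import pvariance, variance

data = [2, 6, 7, 8, 9]  # illustrative sample
n = len(data)
x_bar = sum(data) / n  # mean = 6.4

# Squared distance of each observation from the mean
squared_devs = [(x - x_bar) ** 2 for x in data]

sample_var = sum(squared_devs) / (n - 1)  # about 7.3
population_var = sum(squared_devs) / n    # about 5.84

print(sample_var, variance(data))
print(population_var, pvariance(data))
```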

Standard Deviation

Standard deviation is the square root of the variance. It measures the typical distance between each value and the mean, that is, how far the data is spread out from the mean.
A low standard deviation indicates that the data points tend to be close to the mean of the data set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
When we are asked to find the SD of some part of a population, we use the sample standard deviation.
When we have to deal with a whole population, we use the population standard deviation.

Sample: s = sqrt( Σ (xi - x̅)^2 / (n - 1) ), where x̅ is the mean of the sample.
Population: σ = sqrt( Σ (xi - µ)^2 / N ), where µ is the mean of the population.
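
As a brief sketch, the standard `statistics` module provides `stdev` for a sample and `pstdev` for a population. The data is illustrative. On the same values, the sample figure is slightly larger because it divides by n - 1.

```python
import math
from statistics import pstdev, stdev

data = [2, 6, 7, 8, 9]  # illustrative values

sample_sd = stdev(data)       # square root of the sample variance (divides by n - 1)
population_sd = pstdev(data)  # square root of the population variance (divides by n)

print(round(sample_sd, 4), round(population_sd, 4))
```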

Mean Deviation / Mean Absolute Deviation

It is the average of the absolute differences between each value in a set and the mean of that set.
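
A minimal sketch of the mean absolute deviation in plain Python, using illustrative data. The standard `statistics` module has no ready-made function for it, so it is computed directly from the definition.

```python
data = [2, 6, 7, 8, 9]  # illustrative values
center = sum(data) / len(data)  # mean = 6.4

# Average of the absolute distances from the mean
mean_abs_dev = sum(abs(x - center) for x in data) / len(data)

print(mean_abs_dev)
```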

Range

Range is one of the simplest techniques of descriptive statistics. It is the difference between lowest and highest value.
Example : 2, 6, 7, 8, 9

Range = 9 - 2 = 7

Percentile

A percentile is a way to represent the position of a value in a data set: the p-th percentile is the value below which roughly p% of the data falls. To calculate percentiles, the values in the data set should always be in ascending order.
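
Several conventions exist for computing a percentile. The sketch below assumes linear interpolation between the two nearest sorted values. `percentile` is a hypothetical helper written for this example, not a library function.

```python
def percentile(values, p):
    """p-th percentile (0 to 100) using linear interpolation on sorted data."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)  # percentiles need ascending order
    rank = (len(ordered) - 1) * p / 100  # fractional 0-based position
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

data = [12, 24, 41, 51, 67, 67, 85, 99, 115]
p50 = percentile(data, 50)  # 67.0, the median
p90 = percentile(data, 90)  # falls between 99 and 115
print(p50, p90)
```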

Quartiles

Quartiles are the values that split the sorted data into four equal parts.
There are three quartile values.

The first quartile (Q1) is the 25th percentile.
The second quartile (Q2) is the 50th percentile and the third quartile (Q3) is the 75th percentile.
Q2 is the median of the whole data.
Q1 is the median of the lower half of the data.
Q3 is the median of the upper half of the data.
When the number of values is odd, the example below counts the median in both halves. Other conventions leave it out and can give slightly different quartiles.

Example : 12 , 24, 41, 51, 67, 67, 85, 99, 115

Q2 = 67: the 50th percentile of the whole data, which is the median.
Q1 = 41: the 25th percentile of the data.
Q3 = 85: the 75th percentile of the data.

Interquartile range (IQR) = Q3 - Q1 = 85 - 41 = 44
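
The example above can be reproduced with `statistics.quantiles` from the standard library. This sketch assumes `method="inclusive"`, which matches the values in the example. The default method, `"exclusive"`, can give different quartiles for the same data.

```python
from statistics import quantiles

data = [12, 24, 41, 51, 67, 67, 85, 99, 115]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 41.0 67.0 85.0 44.0
```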
