
Descriptive Statistics

  •  In descriptive statistics, you describe, present, summarize, and organize your data.
  •  It gives basic information about the data, which helps you proceed with further analysis.


Two types of Descriptive statistics:
  • Measure of Central Tendency
  • Measure of Spread / Dispersion 

1. Measure of Central Tendency


  • A basic step in exploring your data is getting a “typical value” for each feature (variable).
  • It is an estimate of where most of the data is located.

 Methods to describe the typical value of a feature:

    Mean
    Trimmed mean
    Median
    Mode
  

   Mean or Average

  • The sum of all values divided by the number of values.
  • N (or n) refers to the total number of records or observations. 
  • In statistics, it is capitalized (N) when it refers to a population and lowercase (n) when it refers to a sample from a population.
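
As a minimal sketch of the definition above, the mean can be computed by hand and checked against Python's standard `statistics` module. The `scores` list is just an illustrative sample.

```python
from statistics import mean

# Illustrative sample of five observations (n = 5).
scores = [60, 81, 83, 91, 99]

# Mean = sum of all values divided by the number of values.
manual_mean = sum(scores) / len(scores)

# The standard library computes the same quantity.
library_mean = mean(scores)

print(manual_mean, library_mean)
```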
  

  Trimmed Mean

  •   A trimmed mean (sometimes called a truncated mean) is similar to a mean, but it is calculated after removing a fixed share of the smallest and largest values, which reduces the influence of outliers.
  •  Trimmed means are expressed as percentages. The percentage tells you how much of the data to remove from each end.
  • For example, with a 5% trimmed mean, the lowest 5% and highest 5% of the data are excluded. The mean is calculated from the remaining 90% of data points.
  • Example
    • Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99.
    • Step 1: Trim the top and bottom 20% from the data. With five values, 20% is one value from each end, so 60 and 99 are removed. That leaves the middle three values: 81, 83, 91.
    • Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85.
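
The two steps above can be sketched in Python. The function name `trimmed_mean` and its `proportion` parameter are hypothetical, chosen for illustration. The sketch assumes the number of values to drop per end is rounded down.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of values to drop from each end (rounded down: an assumption).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = [60, 81, 83, 91, 99]
print(trimmed_mean(scores, 0.20))  # 85.0
```

With a proportion of 0, nothing is trimmed and the function returns the ordinary mean.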
      

  Median

  •   The median is the middle number in a sorted list of the data.
  •   If there is an odd number of observations, the median is the middle value.
  •   If there is an even number of observations, the median is the average of the two middle values.
  •   If n is odd
    • The median is the value at position (n+1)/2.
  •   If n is even
    •     - Find the value at position n/2.
    •     - Find the value at position (n/2) + 1.
    •     - Average the two values to get the median.
    
  • Example
    •  Find the median for 12, 24, 41, 51, 67, 67, 85, 99.
    •  Step 1: The number of values is even (n = 8), so the median is the average of the two middle values (positions 4 and 5): (51 + 67) / 2 = 59.
    • Find the median for 2, 4, 6, 8, 10.
    • Step 1: The number of values is odd (n = 5), so the median is at position (n+1)/2 = 6/2 = 3. The value at the 3rd position is 6.
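
A small sketch of the position rules, written out by hand rather than calling a library. The function name `median_by_position` is hypothetical. Python lists are 0-based, so the 1-based positions from the text are shifted by one.

```python
def median_by_position(values):
    """Median using the odd/even position rules on sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: 1-based positions n/2 and n/2 + 1 are indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_position([12, 24, 41, 51, 67, 67, 85, 99]))  # 59.0
print(median_by_position([2, 4, 6, 8, 10]))                  # 6
```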
    

  Mode

  •  The mode is the value that appears most often in a data set.
  •  Example: 2, 6, 7, 8, 8
  •  The mode is 8, which occurs twice.
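
One way to find the mode is to count occurrences with `collections.Counter`. The helper `modes` below is a hypothetical name. It returns a list because a data set can have more than one value tied for the highest count.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

print(modes([2, 6, 7, 8, 8]))  # [8]
```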
    

Outliers


  • An outlier is any value that is very distant from the other values in a data set.
  • The trimmed mean and the median are widely used to reduce the influence of outliers.

 How to detect Outliers?

  • The most commonly used method to detect outliers is visualization, using plots such as a box plot, histogram, or scatter plot.
  • Some analysts also use various rules of thumb to detect outliers. Some of them are:
    • Any value outside the range Q1 - 1.5 x IQR to Q3 + 1.5 x IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1.
    • Use capping methods. Any value outside the range of the 5th to 95th percentile can be considered an outlier.
    • Data points three or more standard deviations away from the mean are considered outliers.
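
Two of these rules of thumb can be sketched with the standard library. The function names, the sample data, and the extreme value 400 are made up for illustration. The quartiles use `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, k=1.5):
    """Values outside Q1 - k*IQR .. Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values at least `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma >= threshold]

# Hypothetical data: the last value is deliberately extreme.
data = [12, 24, 41, 51, 67, 67, 85, 99, 115, 400]
print(iqr_outliers(data))     # [400]
print(zscore_outliers(data))  # []
```

The second result shows a limitation of the standard deviation rule: the extreme value inflates the standard deviation itself, so in a sample this small it stays under three standard deviations and is not flagged.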






  How to Treat outliers?


  • Deleting Observation
  • Imputing
  • Transforming and binning values
  • Treat Separately.
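
A rough sketch of three of these treatments, under stated assumptions: outliers are identified with the 1.5 x IQR rule, imputation uses the median of the remaining values, and the transformation is a natural logarithm (which requires positive values). The data and variable names are hypothetical.

```python
import math
from statistics import median, quantiles

# Hypothetical data: the last value is deliberately extreme.
data = [12, 24, 41, 51, 67, 67, 85, 99, 115, 400]

q1, _, q3 = quantiles(data, n=4, method="inclusive")
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

def is_outlier(value):
    return value < low or value > high

# 1. Deleting observations: drop the flagged values.
deleted = [v for v in data if not is_outlier(v)]

# 2. Imputing: replace flagged values with the median of the rest.
typical = median(deleted)
imputed = [typical if is_outlier(v) else v for v in data]

# 3. Transforming: a log transform compresses large values.
transformed = [math.log(v) for v in data]

print(deleted)
print(imputed)
```

Which treatment is appropriate depends on whether the extreme value is an error or a genuine observation.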



  Note

  • An outlier can strongly affect the mean of a data set.
  • The median and the trimmed mean are better choices if a feature has outliers.
  

2. Measure of Spread / Dispersion / Variability


  • Location is just one dimension in summarizing a feature. 
  • A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out. 

  •  Variance
  •  Standard Deviation
  •  Mean Deviation / Mean Absolute Deviation
  •  Range
  •  Percentile
  •  Quartiles

  

  Variance

  •   Variance is the average of the squared distances between each value and the mean. For a sample, the sum of squared distances is divided by n - 1:

      S^2 = Σ (xi - x{bar})^2 / (n - 1)

  •   S^2    =   sample variance
  •   xi    =   the value of one observation
  •   x{bar} =   the mean value of all observations
  •   n    =   the number of observations
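
Using the symbols defined above, a sample variance can be sketched as follows, assuming the usual n - 1 divisor for a sample. The function name `sample_variance` and the data are illustrative. The result is compared with `statistics.variance`, which also uses n - 1.

```python
from statistics import pvariance, variance

def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [2, 6, 7, 8, 9]
print(sample_variance(data), variance(data))

# For a whole population the divisor is n instead of n - 1.
print(pvariance(data))
```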
  
  

    Standard Deviation

  •  Standard deviation is the square root of the variance. It measures how far, on average, the data is spread out from the mean.
  •  A low standard deviation indicates that the data points tend to be close to the mean of the data set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
  • When we are asked to find the SD of a part of a population, we use the sample standard deviation.
  • When we have to deal with a whole population, we use the population standard deviation.

      s = sqrt( Σ (xi - x̅)^2 / (n - 1) )
      σ = sqrt( Σ (xi - µ)^2 / N )

  • where x̅ is the mean of a sample and n is the sample size.
  • where µ is the mean of a population and N is the population size.
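
The difference between the two versions can be sketched in Python, assuming the conventional divisors (n - 1 for a sample, N for a population). The helper names `sample_sd` and `population_sd` are hypothetical. The standard library equivalents are `statistics.stdev` and `statistics.pstdev`.

```python
import math
from statistics import pstdev, stdev

def sample_sd(values):
    """Square root of the sample variance (divisor n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def population_sd(values):
    """Square root of the population variance (divisor N)."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

data = [2, 6, 7, 8, 9]
print(sample_sd(data), stdev(data))
print(population_sd(data), pstdev(data))
```

For the same data, the sample standard deviation is always slightly larger than the population one because of the smaller divisor.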

      

    Mean Deviation / Mean Absolute Deviation

  •  It is the average of the absolute differences between each value in a set and the mean of that set.
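
A minimal sketch of this definition. The function name `mean_absolute_deviation` and the data are illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    x_bar = sum(values) / len(values)
    return sum(abs(x - x_bar) for x in values) / len(values)

data = [2, 6, 7, 8, 9]
print(mean_absolute_deviation(data))
```

Because the distances are not squared, a single extreme value affects this measure less than it affects the variance.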
    
    

    Range

  • Range is one of the simplest measures in descriptive statistics. It is the difference between the highest and lowest values.
  • Example : 2, 6, 7, 8, 9
    •  Range = 9 - 2 = 7
    
    

    Percentile

  • A percentile represents the position of a value within a data set: the p-th percentile is the value below which roughly p percent of the data falls. To calculate percentiles, the values must first be sorted in ascending order.
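
There are several conventions for computing a percentile. The sketch below assumes linear interpolation between the two nearest sorted values, which is one common choice. The function name `percentile` is hypothetical.

```python
def percentile(values, p):
    """p-th percentile (0..100) using linear interpolation."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)  # percentiles require ascending order
    rank = (len(ordered) - 1) * p / 100  # fractional 0-based position
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

data = [12, 24, 41, 51, 67, 67, 85, 99, 115]
print(percentile(data, 25))  # 41.0
print(percentile(data, 50))  # 67.0
print(percentile(data, 75))  # 85.0
```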
  
  

    Quartiles

  • Quartiles divide the sorted data into four equal parts.
  • There are three quartile values.
    •  The first quartile is the 25th percentile.
    •  The second quartile is the 50th percentile and the third quartile is the 75th percentile.
    •  The second quartile (Q2) is the median of the whole data.
    •   The first quartile (Q1) is the median of the lower half of the data.
    •    The third quartile (Q3) is the median of the upper half of the data.
    •  When n is odd, the median is included in both halves here; other conventions exclude it and can give slightly different quartiles.
  • Example: 12, 24, 41, 51, 67, 67, 85, 99, 115
    • Q2 = 67: the 50th percentile of the whole data, which is the median.
    • Q1 = 41: the 25th percentile of the data (the median of the lower half 12, 24, 41, 51, 67).
    • Q3 = 85: the 75th percentile of the data (the median of the upper half 67, 67, 85, 99, 115).
  • Interquartile range (IQR) = Q3 - Q1 = 85 - 41 = 44
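
The example values can be reproduced with `statistics.quantiles` from the standard library (Python 3.8+). The `method="inclusive"` setting is an assumption chosen because it matches the numbers above. The default `"exclusive"` method gives different quartiles for the same data.

```python
from statistics import quantiles

data = [12, 24, 41, 51, 67, 67, 85, 99, 115]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 41.0 67.0 85.0
print(iqr)         # 44.0
```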
  

