- In descriptive statistics you describe, present, summarize, and organize your data.
- It gives basic information about the data and helps you decide how to proceed with further analysis.
Two types of Descriptive statistics:
- Measure of Central Tendency
- Measure of Spread / Dispersion
1. Measure of Central Tendency
- A basic step in exploring your data is getting a “typical value” for each feature (variable).
- An estimate of where most of the data is located.
Common ways to describe a typical value for a feature:
Mean
Trimmed mean
Median
Mode
Mean or Average
- The sum of all values divided by the number of values.
- N (or n) refers to the total number of records or observations.
- In statistics, N is capitalized when it refers to a population and lowercase (n) when it refers to a sample from a population.
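As a minimal sketch of the definition above, the mean can be computed directly as the sum divided by the count. The list `scores` is illustrative data, not part of any particular data set.

```python
# Illustrative data: five test scores.
scores = [60, 81, 83, 91, 99]

n = len(scores)           # number of observations (a sample, so lowercase n)
mean = sum(scores) / n    # sum of all values divided by the number of values

print(mean)
```

Python's standard library offers `statistics.mean` for the same calculation.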
Trimmed Mean
- A trimmed mean (sometimes called a truncated mean) is similar to a mean, but it drops a fixed share of the smallest and largest values first, which limits the influence of outliers.
- A trimmed mean is described by a percentage, which tells you what share of the data to remove from each end.
- For example, with a 5% trimmed mean, the lowest 5% and highest 5% of the data are excluded. The mean is calculated from the remaining 90% of data points.
- Example
- Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99.
- Step 1: Trim the top and bottom 20% from the data. With five values, that removes one value from each end (60 and 99) and leaves the middle three values: 81, 83, 91.
- Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85.
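The two steps can be sketched in Python as follows. The helper name `trimmed_mean` and the choice to round the number of trimmed points down are assumptions for illustration; libraries may handle fractional counts differently.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from each end.

    Hypothetical helper: the number of points trimmed per side is
    rounded down, which is one possible convention.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # points to drop on each side
    kept = ordered[k:len(ordered) - k]      # middle values that remain
    return sum(kept) / len(kept)

scores = [60, 81, 83, 91, 99]
result = trimmed_mean(scores, 0.20)
print(result)  # 85.0
```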
Median
- The median is the middle number in a sorted list of the data.
- If the number of observations is odd, the median is the middle value.
- If the number of observations is even, the median is the average of the two middle values.
- If n is odd
- Median is the value at position (n+1)/2.
- If n is even
- - Find the value at position n/2
- - Find the value at position (n/2)+1
- - Average the two values to get the median.
- Example
- Find the median for 12, 24, 41, 51, 67, 67, 85, 99.
- Step 1: n = 8 is even, so the median is the average of the two middle values (the 4th and 5th): (51 + 67) / 2 = 59.
- Find the median for 2, 4, 6, 8, 10.
- Step 1: n = 5 is odd, so the median is at position (n+1)/2 = 6/2 = 3. The 3rd value is 6.
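A short sketch of the odd and even cases, checked against the two examples. The function name `median` is our own; the standard library's `statistics.median` gives the same results.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n//2.
        return ordered[mid]
    # Even n: 1-based positions n/2 and n/2 + 1 are 0-based indices mid-1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2

even_data = [12, 24, 41, 51, 67, 67, 85, 99]
odd_data = [2, 4, 6, 8, 10]

print(median(even_data))  # 59.0
print(median(odd_data))   # 6
```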
Mode
- The mode is the value that appears most often in a data set.
- Example: 2, 6, 7, 8, 8
- The mode is 8, which occurs twice.
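One way to find the mode is to count occurrences and keep the value or values with the highest count. Returning a list, so that ties are visible, is a design choice of this sketch.

```python
from collections import Counter

data = [2, 6, 7, 8, 8]

counts = Counter(data)                  # value -> number of occurrences
highest = max(counts.values())          # the largest occurrence count
modes = [value for value, count in counts.items() if count == highest]

print(modes)    # [8]
print(highest)  # 2
```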
Outliers
- An outlier is any value that is very distant from the other values in a data set.
- The trimmed mean and the median are widely used to reduce the influence of outliers.
How to detect Outliers?
- The most commonly used method to detect outliers is visualization, using plots such as box plots, histograms, and scatter plots.
- Some analysts also use rules of thumb to detect outliers. Some of them are:
- Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR.
- Capping methods: any value outside the range of the 5th to 95th percentile can be considered an outlier.
- Data points three or more standard deviations away from the mean are considered outliers.
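As a hedged sketch of the IQR rule of thumb, the snippet below flags values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR. The function name `iqr_outliers`, the sample data, and the `"inclusive"` quartile convention are assumptions; other quartile conventions can move the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" is one of several quartile conventions (assumed here).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Illustrative data; 400 is a made-up extreme value.
data = [12, 24, 41, 51, 67, 67, 85, 99, 115, 400]
print(iqr_outliers(data))  # [400]
```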
How to Treat outliers?
- Deleting observations
- Imputing
- Transforming and binning values
- Treating them separately
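The first three treatments can be sketched as below. The bounds `lower` and `upper` are assumed to come from some detection rule (the numbers here are illustrative), and using the median as the imputed value is only one possible choice.

```python
import math
import statistics

data = [12, 24, 41, 51, 67, 67, 85, 99, 115, 400]
lower, upper = -34.5, 173.5   # assumed bounds from an outlier rule

# Deleting: drop observations outside the bounds.
deleted = [x for x in data if lower <= x <= upper]

# Capping: pull extreme values back to the nearest bound.
capped = [min(max(x, lower), upper) for x in data]

# Imputing: replace outliers with the median of the remaining values.
fill = statistics.median(deleted)
imputed = [x if lower <= x <= upper else fill for x in data]

# Transforming: a log transform compresses large values (requires x > 0).
logged = [math.log(x) for x in data]

print(deleted)
print(capped)
print(imputed)
```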
Note
- An outlier can affect the mean of a data set.
- The median and the trimmed mean are better choices than the mean when a feature has outliers.
2. Measure of Spread / Dispersion / Variability
- Location is just one dimension in summarizing a feature.
- A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.
- Variance
- Standard Deviation
- Mean Deviation / Mean Absolute Deviation
- Range
- Percentile
- Quartiles
Variance
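- Variance is the sum of the squared differences between each observation and the mean, divided by n - 1 for a sample: S^2 = Σ (xi - x{bar})^2 / (n - 1)
- S^2 = sample variance
- xi = the value of one observation
- x{bar} = the mean value of all observations
- n = the number of observations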
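A minimal sketch of the sample variance using the symbols defined above, on illustrative data. It assumes the usual sample formula with n - 1 in the denominator, which is also what the standard library's `statistics.variance` computes.

```python
import statistics

data = [2, 6, 7, 8, 9]              # illustrative observations (the xi values)

n = len(data)                       # number of observations
x_bar = sum(data) / n               # mean of all observations
squared_diffs = [(xi - x_bar) ** 2 for xi in data]
sample_variance = sum(squared_diffs) / (n - 1)   # S^2

print(sample_variance)
print(statistics.variance(data))    # library result for comparison
```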
Standard Deviation
- Standard deviation is the square root of the variance. It measures how far the data values typically lie from the mean, that is, how spread out the data is.
- A low standard deviation indicates that the data points tend to be close to the mean of the data set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
- When we work with a sample drawn from a population, we use the sample standard deviation: s = sqrt( Σ (xi - x̅)^2 / (n - 1) ), where x̅ is the mean of the sample.
- When we work with a whole population, we use the population standard deviation: σ = sqrt( Σ (xi - µ)^2 / N ), where µ is the mean of the population.
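The difference between the sample and population versions can be sketched with the standard library: `statistics.stdev` divides by n - 1 and `statistics.pstdev` divides by N. The data values are illustrative.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]     # illustrative values, mean = 5

mean = sum(data) / len(data)
sum_sq = sum((x - mean) ** 2 for x in data)      # sum of squared deviations

population_sd = math.sqrt(sum_sq / len(data))        # divide by N
sample_sd = math.sqrt(sum_sq / (len(data) - 1))      # divide by n - 1

print(population_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```

The sample version is slightly larger because it divides by a smaller number.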
Mean Deviation / Mean Absolute Deviation
- It is the average of the absolute differences between each value in a set and the mean of that set.
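A short sketch of the mean absolute deviation on illustrative data. The variable names are our own.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]     # illustrative values

mean = sum(data) / len(data)                    # 5.0
abs_diffs = [abs(x - mean) for x in data]       # distance of each value from the mean
mean_abs_deviation = sum(abs_diffs) / len(data)

print(mean_abs_deviation)  # 1.5
```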
Range
- Range is one of the simplest measures in descriptive statistics. It is the difference between the highest and lowest values.
- Example : 2, 6, 7, 8, 9
- Range = 9 - 2 = 7
Percentile
- A percentile expresses the position of a value within a data set: the p-th percentile is the value below which roughly p percent of the observations fall. To calculate percentiles, the values must first be sorted in ascending order.
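There are several conventions for computing percentiles. The sketch below uses the nearest-rank method, chosen here only because it is simple; the helper name `percentile_nearest_rank` is hypothetical, and libraries often interpolate between values instead.

```python
import math

def percentile_nearest_rank(values, p):
    """p-th percentile (0 <= p <= 100) using the nearest-rank convention."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)                     # ascending order is required
    rank = max(1, math.ceil(p / 100 * len(ordered)))   # 1-based rank
    return ordered[rank - 1]

data = [12, 24, 41, 51, 67, 67, 85, 99, 115]
print(percentile_nearest_rank(data, 25))   # 41
print(percentile_nearest_rank(data, 50))   # 67
print(percentile_nearest_rank(data, 75))   # 85
```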
Quartiles
- Quartiles are the values that divide the sorted data into four equal parts.
- There are three quartile values.
- The first quartile is the 25th percentile.
- The second quartile is the 50th percentile and the third quartile is the 75th percentile.
- The second quartile (Q2) is the median of the whole data.
- The first quartile (Q1) is the median of the lower half of the data.
- The third quartile (Q3) is the median of the upper half of the data.
- Example : 12 , 24, 41, 51, 67, 67, 85, 99, 115
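- Q2 = 67: the 50th percentile of the whole data, which is the median.
- Q1 = 41: the 25th percentile, the median of the lower half 12, 24, 41, 51, 67 (with n odd, the median is counted in both halves here).
- Q3 = 85: the 75th percentile, the median of the upper half 67, 67, 85, 99, 115.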
- Interquartile range (IQR) = Q3 - Q1 = 85 - 41 = 44
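The quartile example can be reproduced with the standard library (Python 3.8 or later). Using `method="inclusive"` is an assumption made to match the values in the example; the default `"exclusive"` method gives different quartiles for this data.

```python
import statistics

data = [12, 24, 41, 51, 67, 67, 85, 99, 115]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 41.0 67.0 85.0
print(iqr)         # 44.0
```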