This is part 2 of the series “Developing a machine learning system”.
The goal of every article is to reduce the gap between theoretical machine learning and applied machine learning. If you have been following the Learncodeonline blogs on machine learning in the order in which they were uploaded, you are ready to build a machine learning system from scratch. The goal of this series is to build a working machine learning system that can classify three types of flowers. To keep things simple, we will use one of the most popular toy datasets, the iris flower dataset.
It is recommended to follow this series from part 1.
After reading this article you will know:
- Some fairly advanced statistical tools for EDA.
- Theoretical explanation with code.
- What to infer from the results.
Let’s first cover some statistical measures. For most of them we will use the numpy library, one of the most popular Python libraries for numerical computation. Applying the same method to every flower does not make sense, so we will only apply these operations to the setosa flowers.
# assumes `iris` is the DataFrame loaded earlier in this series
setosa = iris.loc[iris["species"] == "setosa"]
The mean tells us about the central point around which the data points lie. It is a measure of central tendency and is commonly denoted by μ.
import numpy as np
print("Setosa flower petal length mean", np.mean(setosa["petal_length"]))
Expected output : 1.464
The mean has one big disadvantage: it is easily corrupted, i.e. even a single outlier can noticeably shift the mean of the data.
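To make the effect of an outlier concrete, here is a minimal sketch. The values below are hypothetical petal lengths chosen for illustration, not rows from the iris dataset, and the variable names are our own.

```python
import numpy as np

# Hypothetical petal lengths in cm (illustrative, not from the iris dataset).
lengths = np.array([1.4, 1.5, 1.3, 1.6, 1.5])

# The same values plus one mistyped measurement acting as an outlier.
lengths_with_outlier = np.append(lengths, 15.0)

mean_clean = np.mean(lengths)
mean_outlier = np.mean(lengths_with_outlier)

print("Mean without outlier:", round(float(mean_clean), 3))
print("Mean with outlier   :", round(float(mean_outlier), 3))
```

A single bad value more than doubles the mean here, even though the other five measurements are unchanged.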
Sometimes we need to check how far the data points deviate from the mean, i.e. the spread of the data around the mean. In those situations we use the standard deviation as our measure of spread.
import numpy as np
print("Standard deviation of setosa flower : ", np.std(setosa["petal_length"]))
Expected output : 0.1717…..
Observation: the spread around the mean is small, so most of the data points do not deviate much from the central point.
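If it helps to see what the standard deviation actually computes, the sketch below works it out by hand on a few hypothetical values and compares the result with `np.std`. Note that `np.std` divides by n by default (`ddof=0`); passing `ddof=1` gives the sample standard deviation, which divides by n - 1.

```python
import numpy as np

# Hypothetical petal lengths in cm (illustrative, not from the iris dataset).
lengths = np.array([1.4, 1.5, 1.3, 1.6, 1.5])

mean = np.mean(lengths)

# Standard deviation by hand: square root of the average squared deviation.
manual_std = np.sqrt(np.mean((lengths - mean) ** 2))

numpy_std = np.std(lengths)           # ddof=0 by default, divides by n
sample_std = np.std(lengths, ddof=1)  # divides by n - 1 instead

print("Manual std        :", manual_std)
print("np.std (ddof=0)   :", numpy_std)
print("np.std (ddof=1)   :", sample_std)
```

The first two values agree, and the `ddof=1` version is slightly larger because it divides by a smaller number.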
The median conveys similar information to the mean, i.e. it tells us about the central behavior of the data, but its biggest advantage is that it is not easily corrupted by outliers. The syntax for calculating the median is quite simple.
Calculating the median: sort the values and find the middle one. If the number of values is even, take the average of the two middle values.
print("Median of setosa flowers : ",np.median(setosa["petal_length"]))
Expected output : 1.5
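The sort-and-pick-the-middle rule can be written out in a few lines of plain Python. This is only a sketch to show the idea; the function name `median` and the sample values are our own, and in practice `np.median` does the same job.

```python
def median(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical petal lengths in cm (illustrative, not from the iris dataset).
odd_count = [1.4, 1.5, 1.3, 1.6, 1.5]
with_outlier = odd_count + [15.0]  # even count, includes one outlier

median_odd = median(odd_count)
median_outlier = median(with_outlier)

print("Median without outlier:", median_odd)
print("Median with outlier   :", median_outlier)
```

Both medians come out as 1.5, which shows why the median is a more robust measure than the mean when outliers are present.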
Despite the regular use of percentiles, there is no single agreed definition of what a percentile is. The most popular definition is that a percentile shows what percentage of scores fall below a given score. Let’s take an example: if you score 80 marks out of 100 in an exam, that number alone does not say how you did compared to others, but if your score is at the 95th percentile, you scored better than 95% of the people. Using percentiles you can also check the performance of a system or service. For example, if the 90th percentile of the days taken to complete a delivery is 3, then 90% of the orders are delivered within 3 days, which is a very good measure of how your logistics service is working. Percentiles will not reveal much about this flower dataset, but they are still worth looking at.
print("90th percentile of petal length of setosa flowers : ", np.percentile(setosa["petal_length"], 90))
Expected output : 1.7, which means about 90% of the setosa flowers have a petal length of at most 1.7.
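The delivery example can be sketched the same way. The delivery times below are made up for illustration. One thing worth knowing is that `np.percentile` interpolates linearly between neighbouring values by default, so the result does not have to be a value that appears in the data.

```python
import numpy as np

# Hypothetical delivery times in days for ten orders (made-up data).
delivery_days = np.array([1, 1, 2, 2, 2, 2, 3, 3, 3, 5])

# 90th percentile of delivery time, using numpy's default linear interpolation.
p90 = np.percentile(delivery_days, 90)

# Fraction of orders delivered within that many days.
share_within_p90 = np.mean(delivery_days <= p90)

print("90th percentile of delivery days:", p90)
print("Share of orders within that time:", share_within_p90)
```

Here the 90th percentile falls between 3 and 5 days, and nine of the ten orders are delivered within it.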
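The concept of a quantile is simple: it divides the whole data, or the probability distribution of the data, into contiguous intervals of equal probability. To keep this article simple we will use 4-quantiles, also called quartiles. The cut points are:
- 0th percentile: the smallest element
- 25th percentile: the middle element between the smallest element and the median
- 50th percentile: the median
- 75th percentile: the middle element between the median and the largest element
- 100th percentile: the largest element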
Quartiles show the values below which a given share of the data points lie. For example, if the median is 1.7, then we can say that 50% of the points have a petal length less than 1.7.
The extra parameter, which is a list, tells the percentile method to calculate the values at 0, 25, 50 and 100 rather than at one specific scalar value.
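As a rough sketch of passing a list to `np.percentile`, the example below computes all five quartile cut points in one call. The values are hypothetical, and the list `[0, 25, 50, 75, 100]` is our own choice for illustration.

```python
import numpy as np

# Nine hypothetical petal lengths in cm (illustrative, not from the iris dataset).
lengths = np.array([1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7, 1.9])

# Passing a list returns one value per requested percentile.
quartiles = np.percentile(lengths, [0, 25, 50, 75, 100])

for q, value in zip([0, 25, 50, 75, 100], quartiles):
    print(f"{q:>3}th percentile: {value}")
```

The first and last values are simply the minimum and maximum of the data, and the middle value is the median.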
This completes our discussion of EDA. In the next article we will cover how to build a simple working machine learning model for this dataset.