The goal of every single article is to reduce the gap between theoretical machine learning and applied machine learning. If you have been following the Learncodeonline blogs on machine learning in the order in which they were uploaded, you are ready to build a machine learning system from scratch. The goal of this new series of articles is to build a working machine learning system that can classify three types of flowers. For simplicity, we will use one of the most popular toy datasets, the Iris flower dataset. There are a couple of reasons for choosing this dataset:
- It is quite simple to understand for any beginner
- It is one of the most widely used datasets in machine learning courses.
You can download the dataset from here; it is a CSV file.
If you are unfamiliar with any of the following topics, read about them first:
- Linear algebra for machine learning, click here.
- K-nearest neighbor, click here.
- Plots used for exploratory data analysis, click here.
- Normalization and standardization, click here.
- Logistic regression, click here.
We will be working with a number of Python libraries, so I will cover them as we go rather than listing them all at the beginning. This series of articles is divided into three parts:
- Exploratory data analysis
- Developing a working model
- Discussing our overall approach
Exploratory data analysis
This is the most important step when developing a machine learning model, even more important than the model itself.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”
― John W. Tukey
If you don’t understand the data, you simply cannot build a good machine learning model. In this phase we try to understand the data using statistical tools such as plots, mean, and variance. Covering every statistical tool is beyond the scope of this article, and it is not needed to understand this dataset.
Note: All the code is written in Python using the Anaconda environment. The dataset must be present in the same directory as the code.
import pandas as pd

iris = pd.read_csv("iris.csv")
We use pandas to work with our CSV file. It builds a DataFrame from the data file, which makes it very easy to apply operations to the data. This is what I like about Python: the syntax is self-explanatory.
Visualizing our data set.
The head function displays the first five rows of the data frame.
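A minimal sketch of this step, assuming the DataFrame is named iris as in the loading code. The inline rows are illustrative stand-ins in the same column layout, included only so the snippet runs without iris.csv:

```python
import io
import pandas as pd

# Stand-in for iris.csv: a few illustrative rows in the same column layout.
# With the real file, use iris = pd.read_csv("iris.csv") instead.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

first_rows = iris.head()   # first five rows by default
print(first_rows)
print(iris.head(2))        # pass a number to see fewer or more rows
```

Looking at the first few rows is a quick check that the file was parsed with the expected columns and value types.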
Size of data frame,
The expected output is 150 x 5: 150 is the sample size and 5 is the number of columns, that is, four features plus the species label.
To find out which features we are working with,
Expected output: Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object'). Here sepal_length, sepal_width, petal_length, and petal_width are the features, and the species column contains the class labels. The dtype='object' refers to the column names themselves, which are stored as strings, not to the data in the columns.
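A short sketch of the two inspections above (size and column names), assuming the DataFrame is named iris. The three inline rows are illustrative stand-ins so the snippet runs on its own, which is why the shape printed here is 3 x 5 rather than 150 x 5:

```python
import pandas as pd

# Illustrative stand-in with the same five columns as iris.csv (3 rows, not 150).
iris = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

n_rows, n_cols = iris.shape          # shape is a (rows, columns) tuple
print(n_rows, n_cols)

column_names = list(iris.columns)    # column labels as a plain list
# Everything except the label column is a feature.
feature_columns = [name for name in column_names if name != "species"]
print(feature_columns)
```

Separating the label column from the feature columns early makes later steps, such as plotting or model training, less error-prone.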
Number of samples per class
The value_counts function gives us the number of samples per class label. With 150 samples split evenly across three species, the expected output is 50 samples per class.
This is a perfectly balanced dataset, as the number of samples per class is the same. It is an ideal case that you will almost never find in real problems.
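The class count check can be sketched as follows, assuming the label column is named species. The frame below is a stand-in that holds only that column, with the even 150-sample split described above, so the snippet runs without iris.csv:

```python
import pandas as pd

# Stand-in frame holding only the label column, with 50 rows per species.
iris = pd.DataFrame({
    "species": ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
})

species_counts = iris["species"].value_counts()   # rows per class label
print(species_counts)

# A dataset is balanced when every class has the same number of samples.
is_balanced = species_counts.nunique() == 1
print(is_balanced)
```

On an imbalanced dataset the same check would return False, which is a signal to be careful with metrics such as plain accuracy.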
OK, from now on we will be doing some histogram visualization, so make yourself comfortable with plots if you haven’t already.
Visualizing our dataset using pair plots, which are a kind of extension of the regular histogram. The idea is simple: plot every feature against every other feature, with each feature's own distribution on the diagonal. By doing that we can see which features are correlated.
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("darkgrid")
# height sets the size of each panel (older seaborn releases called this parameter size)
sns.pairplot(iris, hue="species", height=3)
plt.show()
Now it's time to analyze this plot:
- Observation 1: Petal length and petal width are extremely important features, as they discriminate between the classes of flowers very well.
- Observation 2: Sepal width is not a useful feature for discriminating between the classes of flowers, as its histograms overlap.
Observations made during exploratory data analysis must be aligned with the goal of the problem; observations that are not aligned with that goal are of no use.
As we observed above, petal length and petal width are good features to use, so let’s visualize their PDFs (probability density functions).
Petal length PDF
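# histplot with kde=True stands in for the deprecated distplot; height was called size in older seaborn releases
sns.FacetGrid(iris, hue="species", height=5).map(sns.histplot, "petal_length", kde=True, stat="density").add_legend()
plt.show()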
- Observation: Setosa flowers have a petal length of less than 2. If you were asked right now to build a binary classifier that identifies setosa flowers, even a simple if condition would be enough to build the model. This is what EDA is all about: we use basic statistical tools to find what is relevant to our problem.
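The "simple if condition" from the observation above could look like the sketch below. The function name is_setosa and the default threshold of 2.0 are illustrative choices taken from the plot reading, not a tuned model:

```python
def is_setosa(petal_length: float, threshold: float = 2.0) -> bool:
    """Rule-based binary classifier: setosa if the petal is shorter than the threshold."""
    return petal_length < threshold


# Hypothetical petal lengths, used only to exercise the rule.
sample_lengths = [1.4, 1.7, 4.7, 5.9]
predictions = [is_setosa(length) for length in sample_lengths]
print(predictions)
```

A rule like this has no training step at all, which shows how much a good feature found during EDA can simplify the modelling problem.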
Petal width PDF
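# histplot with kde=True stands in for the deprecated distplot; height was called size in older seaborn releases
sns.FacetGrid(iris, hue="species", height=5).map(sns.histplot, "petal_width", kde=True, stat="density").add_legend()
plt.show()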
Observation: Setosa flowers have a petal width of less than 0.7, the petal width of versicolor lies roughly between 1.0 and 1.7, and for virginica it lies between 1.4 and 2.5. There is a clear separation between setosa petal width and the others.
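Ranges read off a plot can be checked numerically with a group-by. The sketch below assumes the column names petal_width and species, and uses a small stand-in frame whose values were picked to sit inside the ranges quoted above; run it on the full iris DataFrame to get the real minimum and maximum per species:

```python
import pandas as pd

# Illustrative stand-in rows; replace with the DataFrame loaded from iris.csv.
iris = pd.DataFrame({
    "petal_width": [0.2, 0.4, 0.6, 1.0, 1.3, 1.7, 1.4, 2.0, 2.5],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Smallest and largest petal width within each species.
width_ranges = iris.groupby("species")["petal_width"].agg(["min", "max"])
print(width_ranges)

# Setosa is cleanly separated if its largest value is below every other minimum.
other_min = width_ranges.drop(index="setosa")["min"].min()
setosa_is_separated = width_ranges.loc["setosa", "max"] < other_min
print(setosa_is_separated)
```

The same comparison between versicolor and virginica would come out False here, since their ranges overlap.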
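There are a couple of things we still don’t know, such as what percentage of points have a petal_length of less than 4. For questions like this we use the CDF (cumulative distribution function) rather than the PDF. Make sure you have read all the articles linked at the beginning of this article, as only then will you be able to interpret this.
CDF of petal length
import numpy as np

# Histogram of petal length over every row of the dataset (all three species)
counts, bin_edges = np.histogram(iris['petal_length'], bins=10, density=False)
pdf = counts / counts.sum()   # fraction of points that fall in each bin
cdf = np.cumsum(pdf)          # running total of those fractions
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.legend()
plt.show()
Observation: roughly 40% of the points in the dataset have a petal length of less than 4. Note that the code above uses every row; to ask the same question about a single species such as virginica, filter the rows for that species before computing the histogram.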
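The share of points below a threshold can also be computed directly, without binning. The sketch below is one way to do it; fraction_below is a hypothetical helper name and the sample values are made up for illustration:

```python
import numpy as np


def fraction_below(values, threshold):
    """Empirical CDF at one point: the share of values strictly below threshold."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values < threshold))


# Hypothetical petal lengths, used only to exercise the helper.
sample_lengths = [1.4, 1.5, 3.9, 4.5, 5.1]
share = fraction_below(sample_lengths, 4.0)
print(share)
```

With the real data, passing a column such as iris['petal_length'], or the same column filtered to one species, gives the exact percentage rather than a value read off the curve.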
With this we end our discussion of EDA. There are many more statistical tools that we use on datasets today, but for a beginner it would be overwhelming to showcase all of them in the first article. After reading this article, you should have an idea of what it means when we say “performing exploratory data analysis”. In the next article we will explore more statistical tools such as quartiles and percentiles, and also build a machine learning model from scratch.