This is part 3 of the series “Developing a machine learning system”.
The goal of every article in this series is to reduce the gap between theoretical machine learning and applied machine learning. If you have been following the LearnCodeOnline blogs on machine learning in the sequence in which they were uploaded, you are ready to build a machine learning system from scratch. The goal of this series is to build a working machine learning system that can classify three types of flowers; for simplicity, we will use one of the most popular toy datasets, the iris flower dataset.
The goal of this article is to develop a fully working machine learning system that can predict a class label given an input vector of features. We will take a case-study approach, comparing the performance of two classical machine learning algorithms: k-nearest neighbors (KNN) and logistic regression.
After reading this article you will know:
- How to build a working model using the sklearn library
- How to connect all parts of the system
- How you, as a beginner, can deploy a machine learning system
Before proceeding further, it is recommended that you follow this series from part 1.
Importing all required libraries
It is good practice to import all required libraries at the beginning rather than in the middle of the code.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
The first two libraries have already been discussed in previous parts; the only one left is sklearn. It is by far one of the most popular libraries for applying machine learning algorithms. It consists of pre-built models that we can use on our own data rather than coding from scratch every time we work with a new dataset; in real-world model development it is almost impossible to code everything from scratch.
Reading and understanding how our data is represented
iris = pd.read_csv('iris.csv')
iris.head()
Everything looks right except the representation of the species label: we cannot directly compare our model's prediction with a string label, so we will convert every string label into a vector using one-hot encoding.
Converting class labels to one-hot-encoding
lb = preprocessing.LabelBinarizer()
lb.fit(np.array(['setosa', 'virginica', 'versicolor']))
Y = lb.transform(iris['species'])
This converts our string labels into one-hot-encoded vectors, e.g. [0, 0, 1].
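To see what one-hot encoding does, here is a minimal pure-Python sketch of the idea (illustrative only; this is not how LabelBinarizer is implemented internally):

```python
labels = ['setosa', 'virginica', 'versicolor', 'setosa']
classes = sorted(set(labels))  # ['setosa', 'versicolor', 'virginica']

# Each label becomes a vector with a 1 at its class's index and 0 elsewhere.
one_hot = [[1 if c == lab else 0 for c in classes] for lab in labels]
print(one_hot[0])  # 'setosa'    -> [1, 0, 0]
print(one_hot[1])  # 'virginica' -> [0, 0, 1]
```

Each vector has exactly one 1, which is why the encoding is called one-hot.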
Split train and test data
This is a necessary step: after training the algorithm on a certain amount of data, we need data the algorithm has not seen before to check its performance on real data. We will use the usual split ratio of 70% train and 30% test.
X = iris.drop(['species'], axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
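Conceptually, train_test_split just shuffles the rows and cuts them at the split ratio. A minimal pure-Python sketch of the idea (with stand-in row indices rather than the real iris data):

```python
import random

rows = list(range(10))          # stand-in for 10 data rows
random.seed(0)                  # fixed seed, like random_state=0
random.shuffle(rows)            # shuffle before splitting

split = int(len(rows) * 0.7)    # 70% train, 30% test
train, test = rows[:split], rows[split:]

print(len(train), len(test))    # 7 3
```

Shuffling before the cut is what prevents the test set from being, say, all of one flower species.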
This section is divided into two parts: part 1 covers the implementation of KNN, and part 2 covers the implementation of logistic regression.
cv_score = []
k = []
for i in range(1, 30, 2):
    knn = KNeighborsClassifier(n_neighbors=i)
    scores = cross_val_score(knn, X_train, Y_train, cv=10, scoring='accuracy')
    cv_score.append(scores.mean())
    k.append(i)
best_k = k[cv_score.index(max(cv_score))]
print("best k for knn ", best_k)
Expected output: “best k for knn 5”
The flow of this snippet is quite simple. First we declare two lists:
cv_score: stores the mean cross-validation score of the model at each value of i. The method cross_val_score is sklearn's implementation of k-fold cross-validation (here with 10 folds).
k: stores the different values of k tried, so that we can select the best value of k for KNN.
Then we start a loop to check the model's performance for values of k ranging from 1 to 30 in steps of 2. After the loop completes, we extract the k with the best accuracy score.
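The best-k extraction at the end of the loop is simply an argmax over the stored scores. With made-up scores for illustration:

```python
cv_score = [0.90, 0.96, 0.93]   # hypothetical mean CV accuracies
k = [1, 3, 5]                   # the k values they were measured at

# The index of the highest score picks out the corresponding k.
best_k = k[cv_score.index(max(cv_score))]
print(best_k)  # 3
```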
Building final knn model
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, Y_train)
y_pred = knn.predict(X_test)
print("Accuracy on test data ", accuracy_score(Y_test, y_pred) * 100, "%")
Expected output: “Accuracy on test data 95%”
For the simplicity of this article we will not use any regularization, so there is no need for a cross-validation dataset.
logisticregression = LogisticRegression()
# LogisticRegression expects 1D class labels, so we convert the
# one-hot vectors back to string labels with inverse_transform
logisticregression.fit(X_train, lb.inverse_transform(Y_train))
y_pred = logisticregression.predict(X_test)
print("Accuracy on test data ", accuracy_score(lb.inverse_transform(Y_test), y_pred) * 100, "%")
Expected output: “Accuracy on test data 96%”
As this model does not involve any hyperparameter selection, we can skip cross-validation and jump directly to fitting the model on the data with default parameters and then predicting the labels of the test data.
This ends our model-building procedure. In real-life predictive modeling, getting an accuracy of over 90% is a big achievement.
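On the deployment question raised at the start: one beginner-friendly first step is to serialize the fitted model to disk so a separate script or web service can load it and serve predictions. A minimal sketch using Python's built-in pickle, with a stand-in class here in place of the actual fitted knn (which would be saved and loaded the same way):

```python
import pickle

class TrainedModel:
    """Stand-in for a fitted classifier (e.g. the knn model above)."""
    def predict(self, features):
        # A real model would compute this from the features.
        return "setosa"

model = TrainedModel()

# Save the trained model to disk.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, e.g. inside a prediction service, load it back and predict.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict([5.1, 3.5, 1.4, 0.2]))
```

This save/load pattern means training and serving can live in separate programs: train once, then any number of services can load model.pkl and call predict.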
It is recommended that you implement this code on your own machine, as the only way to learn machine learning is by implementing models and understanding data. Getting good in the field of machine learning requires a great amount of hard work and dedication, as this branch of computer science draws heavily on applied mathematics and on concepts from computer-science algorithms. This was the last article of this series; I hope you enjoyed reading it.