The purpose of this article is to demonstrate some popular featurization techniques used in machine learning. It cannot cover every technique that exists today: there are hundreds for each type of data — text, images, connections, categorical values, sequence data — developed over decades of research in feature engineering.
Definition of Feature
In the context of machine learning, a feature of a data point is a numerical value that describes some property of that data. You might wonder: why only numerical? The answer is simple: mathematical operations such as matrix representation and the dot product can only be applied to numerical values, not to any other form. Even when a feature is not numerical to begin with, we must first convert it into a numerical value; only then can we apply any operation.
In this article we will cover popular featurization methods for the following types of data:
- Categorical data
- Image data
- Text data
- Connection (graph) data
Categorical Data
Categorical features are a very special and widely encountered type of data. Consider this example dataset:
| Index | Weather | Playable |
| ----- | ------- | -------- |
| 1     | cold    | 1        |
| 2     | hot     | 0        |
| 3     | rainy   | 0        |
| 4     | cold    | 1        |
| 5     | rainy   | 0        |
In the dataset above, the weather feature is a categorical feature: it tells us what type of weather it is. As we know, we cannot apply mathematical operations directly to this data, so we have to convert the weather feature into numerical form. You might argue: why not just assign a unique number to each type of weather, say cold = 0, hot = 1, rainy = 2? This is a bad idea, because numbers are comparable by default, so this encoding would impose an ordering on the categories that does not exist in the data. One solution to this problem is one-hot encoding. The procedure works as follows:
- Create a binary vector whose size equals the total number of categories (three in our case), where each position in the vector corresponds to a unique category.
- For each data point, place a 1 at the position corresponding to that point's category and 0 everywhere else.
After applying this procedure, the weather feature of a "cold" data point becomes [1, 0, 0]. This is a form that can be manipulated directly, and it also preserves the nature of the categorical feature. There is one disadvantage: if we have a large number of categories, one-hot encoding produces a large, sparse vector. One hack to avoid such a vector is to replace the category with a related numerical value, for example replacing the weather with the average temperature in that season. But this does not always work: you cannot always relate a category to a meaningful numerical value, and sometimes you do not even have that much information about the category.
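The steps above can be sketched in plain Python, using the weather column from the example dataset (no library assumed):

```python
# One-hot encode the "weather" column from the example dataset above.
weather = ["cold", "hot", "rainy", "cold", "rainy"]

# Fixed ordering of categories; each vector position maps to one category.
categories = sorted(set(weather))          # ['cold', 'hot', 'rainy']
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    # Binary vector: 1 at this category's position, 0 elsewhere.
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

encoded = [one_hot(w) for w in weather]
print(encoded[0])  # [1, 0, 0]  -> "cold"
```

In practice a library routine (e.g. scikit-learn's `OneHotEncoder` or pandas' `get_dummies`) would do the same thing, but the logic is exactly this.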
Image Data
Image featurization is very important, as many crucial machine learning problems involve some variant of object detection or recognition. Many techniques exist today to convert an image into numerical form, but in this article we will cover only two:
- RGB Vector representation
- Histogram of pixel color
RGB vector representation is quite simple. It works like this: store the red, green and blue values of each pixel of the image, then stack them all together into one long vector (paired with the corresponding y label).
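A minimal sketch of this flattening, using a tiny 2×2 "image" made up for illustration:

```python
# Each pixel is an (R, G, B) triple; the image is a grid of pixels.
image = [
    [(255, 0, 0), (0, 255, 0)],      # row 0: red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],  # row 1: blue pixel, white pixel
]

# Stack the R, G, B values of every pixel into a single flat feature vector.
features = [channel for row in image for pixel in row for channel in pixel]
print(len(features))  # 2 pixels x 2 pixels x 3 channels = 12 values
```

For a real photo this vector gets long quickly (a 224×224 image already yields 150,528 values), which is one reason raw RGB vectors are rarely used directly.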
Histogram of pixel colors: the image is represented as a combination of three histograms, one each for red, green and blue. By observing the color intensities in the histograms, we can make rough guesses about what the image might be; for example, low red and green intensity with medium blue intensity might suggest an image of the sky. However, this is not a very popular image featurization, as it does not clearly tell us what the image is about.
You may be asking: what is the best featurization technique? The answer is that there is no such thing; it varies from problem to problem. Moreover, in the deep learning era we rarely do explicit featurization at all, since features are learned by the deep neural network itself. One of the best neural network architectures for object detection is the convolutional neural network (CNN), now the go-to method for nearly all object detection tasks. We will cover CNNs in upcoming articles, but for now let's focus on traditional machine learning practices.
Text Data
Featurizing text data is one of the most difficult types of featurization, as preserving the semantic meaning of a sentence is a big challenge. In this article we will cover the following techniques:
- Bag of words
- tf-idf (term frequency - inverse document frequency)
Bag of words works as follows. Take, for example, customer reviews of products. The bag-of-words representation of each review is a vector in which each position corresponds to a word, and the value at that position is the number of times that word appears in the review. The set of all words (the vocabulary) is determined by all the words in the whole document corpus (the set of all reviews, in our case). There is also a variant called binary bag of words, in which the count is replaced by 1 if the word is present in the review and 0 if it is not.
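Both variants can be sketched over a tiny made-up corpus of reviews:

```python
# Two toy product reviews (the corpus).
reviews = [
    "good product good price",
    "bad product",
]

# Vocabulary: every distinct word in the whole corpus, in a fixed order.
vocab = sorted({word for review in reviews for word in review.split()})
# ['bad', 'good', 'price', 'product']

def bag_of_words(review, binary=False):
    counts = [review.split().count(word) for word in vocab]
    # Binary variant: presence/absence instead of counts.
    return [min(c, 1) for c in counts] if binary else counts

print(bag_of_words(reviews[0]))               # [0, 2, 1, 1]
print(bag_of_words(reviews[0], binary=True))  # [0, 1, 1, 1]
```

Note how "good" gets a count of 2 in the first review but just a 1 in the binary variant.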
Tf-idf: this representation is the product of two quantities, tf and idf. Let's see how each is calculated:
tf(w) = (number of times w occurs in the review) / (total number of words in the review). It always lies between 0 and 1, so you can interpret tf as the probability of finding the word w in a review or sentence.
idf(w) = log((total number of documents in the corpus, i.e. the total number of reviews in our case) / (number of documents in which w is present))
tf-idf(w) = tf(w) × idf(w)
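The formulas above translate directly into code; here is a minimal sketch over the same kind of tiny review corpus (the natural log is used for idf, though the base is just a convention):

```python
import math

reviews = [
    "good product good price",
    "bad product",
]

def tf(word, review):
    # Fraction of the review's words that are `word`.
    words = review.split()
    return words.count(word) / len(words)

def idf(word, corpus):
    # Log of (total documents / documents containing the word).
    n_containing = sum(1 for doc in corpus if word in doc.split())
    return math.log(len(corpus) / n_containing)

def tf_idf(word, review, corpus):
    return tf(word, review) * idf(word, corpus)

# "good" appears twice in the 4-word first review, and in 1 of 2 reviews:
print(tf_idf("good", reviews[0], reviews))  # 0.5 * ln(2) ~ 0.3466
```

Notice that "product" appears in every review, so its idf is log(2/2) = 0 and its tf-idf vanishes: words common to the whole corpus carry no discriminative weight.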
There is also a very powerful technique called Word2vec, but it is part of deep learning, so we will try to cover it in future articles.
Connection (Graph) Data
What better way to represent connections than as a graph, using an adjacency matrix or an adjacency list, the most common representations of any graph? In an adjacency matrix, the entry at row i, column j is 1 if node i is connected to node j and 0 otherwise.
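A minimal sketch of building an adjacency matrix for a small undirected graph of connections (the nodes and edges below are made up for illustration):

```python
# A toy connection network: three people, two friendships.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]

index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# matrix[i][j] = 1 if node i and node j are connected, else 0.
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1  # undirected graph: the matrix is symmetric

for row in matrix:
    print(row)
# [0, 1, 0]
# [1, 0, 1]
# [0, 1, 0]
```

Each row of this matrix can then serve as a feature vector describing that node's connections; for large, sparse networks an adjacency list is the more memory-friendly alternative.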
We will be covering many more interesting machine learning topics in future articles; till then, explore the world of machine learning.