Understanding Logistic Regression

Let’s talk about classification first, since logistic regression is a classification technique. In the context of machine learning, a classification algorithm is one that, given an input, predicts which class (positive or negative) it belongs to. This type of classification is called “binary classification”, as there are only two classes. There is also another type, “one vs. all” (multi-class) classification, but here we will focus on binary classification.

In this article we will cover the geometric derivation of logistic regression. This is not the only way of deriving the technique; we can also derive it using a probabilistic or loss-function interpretation. In all classification techniques the goal is to find the best decision surface, i.e. a hyperplane that separates the data into positive and negative labels. Consider the image below for a complete picture.

Now we know what to use to separate the data: for 2-dimensional data a line is used, in 3 dimensions a plane, and in N dimensions where N > 3 a hyperplane. In this article we will be using

Linear Algebra: to extend our understanding to any dimension.

If you are unsure what these are or how we will use them, don’t worry; after reading this you will be confident about logistic regression. Let’s build up some maths.

Equation of a line: ax + by + c = 0, or more generally dot_product(transpose(W), X) + w0 = 0. Here X, the coordinates of the point, is constant; only W, which is normal to the line, and w0, the intercept term, are variable. W and X are 2-dimensional vectors, so if the vector size becomes 3 the equation stays the same but describes a plane, and for more than 3 dimensions it describes a hyperplane. How does this relate to our problem? Just see the image below.
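As a minimal sketch of this idea (the names W, X and w0 are just illustrative variables, not from any particular library), here is how the same expression dot_product(transpose(W), X) + w0 is evaluated with NumPy, in 2 dimensions or in 100:

import numpy as np

# 2-D case: the line 2x + 3y - 6 = 0 written as W.X + w0 = 0
W = np.array([2.0, 3.0])   # W is normal to the line
w0 = -6.0                  # intercept term
X = np.array([1.5, 1.0])   # a point in the plane

value = np.dot(W, X) + w0  # dot_product(transpose(W), X) + w0
print(value)               # 0.0 -> this point lies exactly on the line

# the identical expression works in any dimension, e.g. 100-D
W100 = np.random.randn(100)
X100 = np.random.randn(100)
value100 = np.dot(W100, X100)  # hyperplane through the origin (w0 = 0)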

Now, we know that X is fixed, so we can only tweak W and w0 to get the best plane. But before that we have to make the following assumptions:

  1. The data is linearly separable
  2. The output labels y[i] are either +1 or -1
  3. W is a unit vector, i.e. ||W|| = 1

So the concept is simple: if the signed perpendicular distance of a point from the decision boundary is positive, it is a positive-class point, and if it is negative, the point belongs to the negative class. Why? Take a look:

Distance = dot_product(transpose(W), X) / ||W||, and we know the dot product is calculated as ||W|| · ||X|| · cos(angle between W and X). The cosine is positive for angles between 0° and 90° and negative for angles between 90° and 180°, so the sign of the distance tells us on which side of the plane the point lies. Now we need a way to check whether the predicted class label is correct. How about this simple quantity: Y[i] * dot_product(transpose(W), X).

If Y[i] * dot_product(transpose(W), X) > 0, the point is correctly classified,

and if Y[i] * dot_product(transpose(W), X) < 0, it is incorrectly classified.
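Here is a quick sketch of this check with NumPy (the data and array names are made up for illustration; the plane is assumed to pass through the origin, so w0 = 0):

import numpy as np

W = np.array([1.0, -1.0])          # normal to the assumed decision plane
X = np.array([[2.0, 1.0],          # a few training points, one per row
              [0.5, 3.0],
              [-1.0, -2.0]])
y = np.array([+1, -1, -1])         # their labels in {+1, -1}

scores = X @ W                     # dot_product(transpose(W), X) for every point
margins = y * scores               # Y[i] * dot_product(transpose(W), X)

print(margins > 0)                 # [ True  True False ]
                                   # True  -> correctly classified
                                   # False -> incorrectly classified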

Our goal becomes to maximize the sum of Y[i] * dot_product(transpose(W), X) over all points, i.e. to correctly classify as many points as possible. But this objective function is prone to outliers: if one point’s distance is very large compared to all the others, then the hyperplane that maximizes this sum will not be the actual best hyperplane. For a geometric understanding, consider the image below.

(Image: because of the outlier, the line labelled “this will be the decision boundary” is pulled away from the line labelled “this should be the decision boundary”.)

We have to find some way to limit the distance so that very large values do not dominate our model. One such trick is the sigmoid function: it grows roughly linearly for small values but squashes large values into the range (0, 1). Another interpretation of using the sigmoid in logistic regression is that it gives us a probabilistic score. Now our objective function becomes

Maximize(sigmoid(Y[i]*dot_product(transpose(W),X)))
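To see the squashing concretely, here is a small sketch in plain NumPy (the margin values are arbitrary examples): the sigmoid barely distinguishes a margin of 10 from a margin of 100, so a single far-away outlier can no longer dominate the objective.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

margins = np.array([-100.0, -10.0, -1.0, 0.0, 1.0, 10.0, 100.0])
print(sigmoid(margins))
# approximately [0.0, 0.0000454, 0.2689, 0.5, 0.7311, 0.99995, 1.0]
# small margins change the value noticeably; huge ones are squashed to ~0 or ~1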

But this is still not the final objective function. We can apply any monotonic function, i.e. one where f(x) increases (or at least does not decrease) whenever x increases, without changing where the maximum occurs; one such function is log(x). By the rules of maxima and minima we know

Maximize(f(x)) = Maximize(g(f(x))), whenever g is a monotonically increasing function.

Now, since log(sigmoid(z)) = log(1 / (1 + exp(-z))) = -log(1 + exp(-z)), our objective function becomes

Maximize(-log(1+exp(-(Y[i]*dot_product(transpose(W),X)))))
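A tiny sketch to confirm that identity numerically (z stands for the margin Y[i] * dot_product(transpose(W), X); the sample values are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 2.0, 5.0])   # some example margins
lhs = np.log(sigmoid(z))                     # log of the sigmoid objective
rhs = -np.log1p(np.exp(-z))                  # -log(1 + exp(-z))
print(np.allclose(lhs, rhs))                 # True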

and this has to be done for all training points in our dataset.

Now by rules of maxima and minima we know that,

Maximize(-f(x)) = minimize(f(x))

Our final objective function becomes ,

Minimize(log(1+exp(-(Y[i]*dot_product(transpose(W),X)))))

If you have seen this objective function before,

Minimize(-(y[i]*log(sigmoid(dot_product(transpose(W),X))) + (1-y[i])*log(1-sigmoid(dot_product(transpose(W),X)))))

then don’t worry, they are both the same; the only differences are the label convention (here y[i] is in {0, 1} instead of {+1, -1}) and how we derived it.
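Here is a short sketch showing the two losses agree numerically for every point, assuming we convert labels via y01 = (y + 1) / 2, which maps -1/+1 to 0/1 (the random data is purely illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=5)                      # z = dot_product(transpose(W), X) per point
y = rng.choice([-1, 1], size=5)             # labels in {+1, -1}
y01 = (y + 1) / 2                           # the same labels rewritten as {0, 1}

geometric_loss = np.log1p(np.exp(-y * z))   # log(1 + exp(-Y[i] * W.X))
cross_entropy = -(y01 * np.log(sigmoid(z)) + (1 - y01) * np.log(1 - sigmoid(z)))

print(np.allclose(geometric_loss, cross_entropy))   # True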

But this objective function can easily overfit the training data, so we need a way to prevent it from overfitting. The answer is regularization. The concept of regularization is simple: just add lambda * dot_product(transpose(W), W) to our loss function,

Minimize(log(1+exp(-(Y[i]*dot_product(transpose(W),X)))) + lambda*dot_product(transpose(W),W))

The term dot_product(transpose(W), W) is the square of the L2 norm of W. We have successfully derived logistic regression from scratch. We will be covering many more topics in machine learning, so stay tuned with learncodeonline.
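To tie everything together, here is a minimal gradient-descent sketch of the final regularized objective. It is only an illustration under simplifying assumptions: the intercept w0 is omitted (the toy data is separable through the origin), and the learning rate, lambda and iteration count are arbitrary choices, not tuned values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lam=0.1, lr=0.1, n_iter=1000):
    # minimize sum_i log(1 + exp(-y_i * w.x_i)) + lam * w.w by gradient descent
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)                 # Y[i] * dot_product(transpose(W), X)
        # derivative of log(1 + exp(-m)) w.r.t. w is -y * x * sigmoid(-m)
        grad = -(X * (y * sigmoid(-margins))[:, None]).sum(axis=0) + 2 * lam * w
        w -= lr * grad / n
    return w

# toy linearly separable data
rng = np.random.default_rng(42)
X_pos = rng.normal(loc=+2.0, size=(50, 2))
X_neg = rng.normal(loc=-2.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 50 + [-1] * 50)

w = fit_logistic(X, y)
preds = np.sign(X @ w)
print("training accuracy:", (preds == y).mean())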
