Classification in Machine Learning Explained

What is Logistic Regression AKA Classification?

Logistic regression is a statistical method that allows us to perform classification. Essentially, the algorithm outputs probabilities which can then be mapped to different classes. The output is usually a discrete variable, such as true or false, a class (cat or dog) or even an ordinal class (bad, medium or good). Logistic regression is very common in machine learning, especially as the building block of neural network nodes. This article explains the difference between the statistics and the machine learning notations for the logistic regression algorithm.

Why is Logistic Regression Important?

With logistic regression we estimate the relationship between an output (dependent variable) and one or more inputs (independent variables) in order to help us make predictions. For example, a doctor might be interested in predicting whether a cancer will recur in a patient given the characteristics for the type of tumor. The output is a yes/no (true/false) discrete variable. Will the cancer return for a patient given a set of variables?


Let’s have a look at the jargon related to logistic regression. Remember the linear regression definition, h(x) = W0 + W1X1 + … + WnXn? Its terms are:

  1. Predictors or Independent Variables or Inputs
  2. Response or Dependent Variable or Output
  3. Residual – The difference between the observed (real) value and the predicted value of the dependent variable.
  4. Coefficients or Weights – These are calculated to determine the line that best “fits” the data. In stats, usually denoted as β. In ML these are usually denoted as W or θ.
  5. Bias or Intercept – It helps offset the effects of missing relevant predictors for the response and helps make the mean of the residuals 0. The intercept or bias acts as the default value for the function, i.e. its value when all independent variables are zero.

We can also vectorize this notation into a neat vector multiplication like so:
h(x) = Wᵀx. Essentially we create a vector with all the weights (W0, W1, … , Wn) and do the same with all the independent variables (X0, X1, … , Xn), where X0 is fixed at 1 so that W0 plays the role of the bias. Now we transpose vector W and multiply it with the X vector (all the independent variables). It looks much cleaner, and numerical libraries can compute it far faster than an explicit loop over the terms.
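
As a minimal sketch of the vectorized notation, the snippet below computes Wᵀx as a dot product and compares it with the expanded sum. The function name hypothesis and the numeric values are made up for illustration, and X0 = 1 is assumed so that W0 acts as the bias.

```python
def hypothesis(weights, inputs):
    """Return W^T x, with X0 = 1 prepended so that W0 acts as the bias."""
    x = [1.0] + list(inputs)
    if len(weights) != len(x):
        raise ValueError("expected one weight per input plus a bias weight")
    # Dot product: multiply matching entries and add them up.
    return sum(w * xi for w, xi in zip(weights, x))


weights = [0.5, 2.0, -1.0]  # W0 (bias), W1, W2: illustrative values
inputs = [3.0, 4.0]         # X1, X2: illustrative values

expanded = 0.5 + 2.0 * 3.0 + (-1.0) * 4.0  # W0 + W1*X1 + W2*X2
vectorized = hypothesis(weights, inputs)

print(expanded, vectorized)
```

Both expressions describe the same number; the vectorized form simply hides the loop over the terms.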

Applying the Sigmoid Function

Let’s transform the linear regression equation so that it serves our purpose of classification better, by restricting the output to a value between 0 and 1 that can be read as a probability, instead of an arbitrary real value. The sigmoid function is very useful here because it takes any real-valued number and transforms it into a value between 0 and 1: σ(z) = 1 / (1 + e^(−z)).
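
The sigmoid can be sketched in a few lines of Python. The split on the sign of z is one common way to avoid overflow for large negative inputs, shown here as an illustration rather than as the only possible implementation.

```python
import math


def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an algebraically equivalent form so that
    # math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative inputs; very large magnitudes round to 0.0 or 1.0 in floating point.
values = [sigmoid(z) for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0)]
print(values)
```

Note that an input of 0 maps to exactly 0.5, which is why 0.5 is a natural threshold later on.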

Combining the sigmoid function with the linear regression equation, where h(x) = P(y=1|x) becomes the probability of y = 1 (y is positive) given x, we get:

h(x) = 1 / (1 + e^(−Wᵀx))

After a bit of algebra (solving the equation for Wᵀx), we finally get to the log-odds, also called the logit:

log( h(x) / (1 − h(x)) ) = Wᵀx

The above equations will generate a value for the probability that the response is equal to 1 given the current data points.
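
The relationship between the probability and the log-odds can be checked numerically. In this sketch, z stands in for Wᵀx and its value is made up; taking the log-odds of the sigmoid output should give z back.

```python
import math


def sigmoid(z):
    """Probability of y = 1 for a given linear score z = W^T x."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Natural log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


z = 0.8                  # illustrative value for W^T x
p = sigmoid(z)           # probability that the response equals 1
recovered = log_odds(p)  # should equal z up to floating-point error

print(p, recovered)
```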

Now that we have generated probabilities, the original concern remains. How do we map them to the classes?

The answer is fairly straightforward. A threshold is chosen, commonly 0.5, and it defines the decision boundary: a probability at or above the threshold is classified as category 1 and a probability below the threshold is classified as category 0.

Let’s take a simple example with 2 independent variables X1 and X2 and suppose the weights are as follows: W0 = -3, W1 = 1 and W2 = 1. With a threshold of 0.5, we would get y = 1 if h(x) >= 0.5. In other words, y = 1 if Wᵀx is greater than or equal to 0, because the sigmoid of 0 is exactly 0.5. Now let’s plug in the values: -3 + X1 + X2 >= 0. If we move the -3 to the other side we get X1 + X2 >= 3, so the decision boundary is the straight line X1 + X2 = 3. Everything below the line is y = 0 and everything on or above it is y = 1.
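
Here is a small sketch of this example in Python. It assumes the weights that match the inequality -3 + X1 + X2 >= 0 (that is, W0 = -3 and W1 = W2 = 1) and a threshold of 0.5; the function names and sample points are made up for illustration.

```python
import math

WEIGHTS = (-3.0, 1.0, 1.0)  # W0, W1, W2, matching -3 + X1 + X2 >= 0


def predict_proba(x1, x2):
    """Return h(x) = sigmoid(W0 + W1*X1 + W2*X2)."""
    z = WEIGHTS[0] + WEIGHTS[1] * x1 + WEIGHTS[2] * x2
    return 1.0 / (1.0 + math.exp(-z))


def predict(x1, x2, threshold=0.5):
    """Classify as 1 when the probability reaches the threshold, else 0."""
    return 1 if predict_proba(x1, x2) >= threshold else 0


# Illustrative points: below the line, exactly on it, and above it.
points = [(1.0, 1.0), (1.0, 2.0), (2.5, 2.5)]
labels = [predict(x1, x2) for x1, x2 in points]
print(labels)
```

The point (1, 2) lies exactly on the line X1 + X2 = 3, where the probability is 0.5, so with a "greater than or equal" rule it is assigned to class 1.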

Optimizing the Decision Boundary

The cost function to determine the coefficients in the Logistic Regression algorithm is the Cross-Entropy as opposed to the Mean Squared Error (MSE) used in Linear Regression.

The Cross-Entropy Function Explained

For a binary label y and a predicted probability ŷ, the cross-entropy function in statistics is defined as:

H(y, ŷ) = −[ y log(ŷ) + (1 − y) log(1 − ŷ) ]

Cost/Error Function for Logistic Regression

J(W) = −(1/m) Σ [ yᵢ log(h(xᵢ)) + (1 − yᵢ) log(1 − h(xᵢ)) ], summed over the training examples i = 1, … , m

Essentially, we want to find the optimal W to minimize the error function J(W). In machine learning we replace ŷ with our hypothesis h(x), and m represents the number of training examples. J is our cost function evaluated on the dataset D given the weights W (or θ).
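
The cost can be sketched as the average binary cross-entropy over the training examples. The function name, the eps clipping (added only to keep log away from 0) and the sample numbers are assumptions made for this illustration.

```python
import math


def cross_entropy_cost(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy over m examples.

    y_true holds the labels (0 or 1) and y_prob the predicted
    probabilities h(x). Probabilities are clipped to [eps, 1 - eps]
    so that math.log never receives 0.
    """
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / m


labels = [1, 0, 1, 0]                  # illustrative labels
mostly_right = [0.9, 0.1, 0.8, 0.2]    # probabilities close to the labels
mostly_wrong = [0.1, 0.9, 0.2, 0.8]    # probabilities far from the labels

low = cross_entropy_cost(labels, mostly_right)
high = cross_entropy_cost(labels, mostly_wrong)
print(low, high)
```

Predictions that agree with the labels give a small cost, while confident wrong predictions are penalized heavily, which is what makes this function a useful target to minimize.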

The cost function is minimized with an algorithm known as gradient descent. You can find a simple explanation in the linear regression post.
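
To make the idea concrete, here is a minimal sketch of batch gradient descent on the cross-entropy cost. It uses the standard gradient, (1/m) Σ (h(xᵢ) − yᵢ) xᵢ. The tiny dataset, learning rate and number of steps are made-up values, and each row starts with X0 = 1 so that the first weight is the bias.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, row):
    """h(x) = sigmoid(W^T x) for one row of inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)))


def cost(weights, rows, labels, eps=1e-12):
    """Average cross-entropy J(W) over the dataset."""
    total = 0.0
    for row, y in zip(rows, labels):
        p = min(max(predict_proba(weights, row), eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(rows)


def gradient_descent(rows, labels, learning_rate=0.1, steps=1000):
    """Repeatedly step against the gradient of J(W), starting from zeros."""
    weights = [0.0] * len(rows[0])
    m = len(rows)
    for _ in range(steps):
        errors = [predict_proba(weights, row) - y for row, y in zip(rows, labels)]
        gradient = [
            sum(e * row[j] for e, row in zip(errors, rows)) / m
            for j in range(len(weights))
        ]
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Illustrative data: each row is [X0 = 1, X1].
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
labels = [0, 0, 1, 1]

initial_cost = cost([0.0, 0.0], rows, labels)
trained = gradient_descent(rows, labels)
final_cost = cost(trained, rows, labels)
print(initial_cost, final_cost, trained)
```

The learning rate controls the size of each step: too large and the cost can bounce around, too small and training takes many more iterations.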

Gradient Descent, Regularization and more

For a deeper understanding of how gradient descent operates and clear examples of linear regression in machine learning with technical explanations, enroll in the “Learn AI with an AI Course” with Audrey Durand.

Machine Learning in Healthcare Series

In this series of articles we explore the use of machine learning in the healthcare industry. Important concepts and algorithms are covered with real applications including open datasets, open code (available on Github) used to run the analyses and the final results and insights.
