Logistic Regression Isn't Regression: It's Classification Through a Probability Trick

logistic-regression classification sigmoid supervised-learning math

The name is confusing. Logistic Regression sounds like it predicts continuous values the way Linear Regression does, but it's used as a classification algorithm. The "regression" part refers to the math underneath: the model fits a linear function to the log-odds of the class. What it does in practice is predict the probability that something belongs to a class, then threshold that probability into a decision.

The problem with using regular linear regression for classification is that it can output any number, including negatives and values above 1. Probabilities have to live between 0 and 1. So Logistic Regression runs the linear output through a function that squashes everything into that range, the sigmoid function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

This produces an S-curve. For very large positive $z$, the output approaches 1. For very large negative $z$, it approaches 0. At $z = 0$, you get exactly 0.5. Here $z$ is the linear combination of the inputs, $z = w \cdot x + b$, and the model learns the weights $w$ and bias $b$ so that $\sigma(z)$ gives the right probability.
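A minimal Python sketch of the sigmoid, to check those three cases numerically. The names `sigmoid` and `predict_proba` are my own for illustration, not from any library, and branching on the sign of $z$ is just one common way to keep `math.exp` from overflowing.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number z into the range 0 to 1."""
    # Branch on the sign so math.exp never receives a large positive
    # argument, which would raise OverflowError for inputs like z = -1000.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_proba(weights, bias, features):
    """Probability of class 1: sigmoid of the linear combination w . x + b."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Walk along the S-curve: near 0 on the left, 0.5 in the middle, near 1 on the right.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  ->  sigmoid(z) = {sigmoid(z):.5f}")
```

Nothing here is learned yet. The weights and bias passed to `predict_proba` are just inputs, so this only shows the "linear output, then squash" step.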

The standard decision rule: if $\sigma(z) \geq 0.5$, predict class 1. If $\sigma(z) < 0.5$, predict class 0. You can adjust this threshold: lower it to catch more positives (higher recall, at the cost of more false positives), or raise it to be more conservative (usually higher precision, at the cost of missed positives).
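The decision rule is small enough to sketch directly. The function name `classify` and the probabilities below are made up for illustration. The point is only to see how the same scores flip between classes as the threshold moves.

```python
def classify(probability: float, threshold: float = 0.5) -> int:
    """Turn a predicted probability into a hard 0/1 decision."""
    return 1 if probability >= threshold else 0

# Made-up predicted probabilities for five examples.
probabilities = [0.10, 0.35, 0.50, 0.65, 0.90]

# A lower threshold labels more examples as class 1, a higher one fewer.
for threshold in (0.3, 0.5, 0.7):
    labels = [classify(p, threshold) for p in probabilities]
    print(f"threshold = {threshold}: {labels}")
```

Whether a given threshold actually improves recall or precision depends on the true labels, which this sketch leaves out.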

What clicked

The sigmoid isn't magical. It's just a mathematical wrapper that takes "any number" and maps it to "a probability." The learning part (finding the right weights) happens via maximum likelihood estimation, not the least-squares fit that linear regression uses.
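To make "maximum likelihood" concrete, here is a rough sketch that fits one weight and one bias by plain gradient ascent on the average log-likelihood. Everything in it is an assumption for illustration: the toy data (hours studied versus pass/fail) is invented, the learning rate and step count are arbitrary, and real libraries typically use more sophisticated optimizers.

```python
import math

def sigmoid(z):
    # Sign branch avoids overflow in math.exp for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def log_likelihood(w, b, xs, ys, eps=1e-12):
    """Sum of log P(y | x) under the model; higher means a better fit."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(xs, ys, learning_rate=0.1, steps=2000):
    """Gradient ascent on the average log-likelihood for a single feature."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # The gradient of the log-likelihood is (label - prediction) * input.
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w += learning_rate * grad_w
        b += learning_rate * grad_b
    return w, b

# Invented toy data: hours studied -> passed (1) or failed (0).
# The classes overlap on purpose so the best weights stay finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"log-likelihood before: {log_likelihood(0.0, 0.0, xs, ys):.3f}")
print(f"log-likelihood after:  {log_likelihood(w, b, xs, ys):.3f}")
```

There are no squared errors anywhere in the update. Each step nudges the weights in the direction that makes the observed labels more probable.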

Still shaky on

The loss function for logistic regression is called binary cross-entropy, not MSE. I know the formula involves log probabilities, but I haven't worked through why MSE fails as a loss function for classification. That's a gap I want to fill.
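For reference, the usual textbook form of binary cross-entropy averages $-[y \log p + (1 - y) \log(1 - p)]$ over the examples, where $y$ is the true label and $p$ is the predicted probability of class 1. A small sketch of that formula follows. The function name and the `eps` clipping are my own choices, and this only computes the loss. It doesn't answer the MSE question.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never happens
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]

# Made-up predictions: one set leans the right way, the other the wrong way.
good_predictions = [0.9, 0.1, 0.8, 0.2]
bad_predictions = [0.1, 0.9, 0.2, 0.8]

print(f"good predictions: {binary_cross_entropy(labels, good_predictions):.4f}")
print(f"bad predictions:  {binary_cross_entropy(labels, bad_predictions):.4f}")
```

Confident wrong answers are punished heavily because $\log p$ heads toward negative infinity as $p$ approaches 0.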

What's next

Decision Trees, the completely different approach where instead of fitting a curve or a line, you split the data with a series of yes/no questions.