In this lecture we studied maximum likelihood inference for linear classifiers. We saw that ordinary least squares regression can be phrased as maximum likelihood inference under the assumption of additive Gaussian noise. We then derived the closed-form solution to the normal equations for small-scale problems and discussed alternative optimization methods, namely gradient descent, for larger-scale settings. We noted several potential issues in directly applying linear regression to classification problems and explored logistic regression as an alternative. Maximum likelihood inference for logistic regression led us to first- and second-order iterative optimization methods for parameter inference. See Chapter 3 of Bishop and Chapter 3 of Hastie for reference.
A few notes on linear and logistic regression:
- Ordinary least squares regression is, in principle, easily solved by inverting the normal equations:
$$
\hat{w} = (X^T X)^{-1} X^T y .
$$
In practice, however, it is often computationally expensive to invert \( X^T X \) for models with many features, even with specialized numerical methods for doing so.
- Gradient descent offers an alternative to solving the normal equations directly, replacing potentially expensive matrix inversion with an iterative method where we update parameters by moving in the direction of steepest increase of the likelihood landscape:
$$
\hat{w} \leftarrow \hat{w} + \eta X^T (y - X\hat{w}),
$$
where \( \eta \) is a tunable step size. Choosing \( \eta \) too small leads to slow convergence, whereas too large a step size may result in undesirable oscillations about the optimum. Intuitively, gradient descent updates each component of \( \hat{w} \) by a sum of the corresponding feature values over all examples, where each example is weighted by the error between its actual and predicted labels (see the NumPy sketch after this list).
- Two high-level issues arise when directly applying ordinary least squares regression to classification problems. First, our model predicts continuous outputs while our training labels are discrete. Second, squared loss is sensitive to outliers and penalizes “obviously correct” predictions for which the predicted value \( \hat{y} \) is much larger than the observed value \( y \) but of the correct sign.
- Logistic regression addresses these issues by modeling the class probabilities \( p(y \mid x) \) directly, using a logistic function to transform predictions to lie in the unit interval:
$$
p(y=1 \mid x, w) = \frac{1}{1 + e^{-w \cdot x}} .
$$
While maximum likelihood inference for logistic regression does not admit a closed-form solution, gradient descent yields the following update:
$$
\hat{w} \leftarrow \hat{w} + \eta X^T (y - p),
$$
where \( p \) denotes the vector of predicted probabilities, with \( p_i = p(y_i = 1 \mid x_i, \hat{w}) \). In smaller-scale settings one can improve on these updates by using second-order methods such as Newton-Raphson that leverage the local curvature of the likelihood landscape to rescale the step at each iteration (a sketch of both the first- and second-order updates appears after this list).
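To make the two approaches to least squares concrete, here is a minimal NumPy sketch (not the lecture code) that solves the normal equations directly and then runs the gradient update above; the synthetic data, step size \( \eta \), and iteration count are illustrative choices only.

```python
import numpy as np

# Synthetic regression data (illustrative; not the dataset used in class).
rng = np.random.RandomState(0)
n, d = 100, 3
X = rng.randn(n, d)
w_true = np.array([2.0, -1.0, 0.5])
y = X.dot(w_true) + 0.1 * rng.randn(n)

# Closed-form solution of the normal equations: w = (X^T X)^{-1} X^T y.
# np.linalg.solve avoids forming the explicit inverse.
w_closed = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# Iterative alternative: w <- w + eta * X^T (y - X w).
eta = 0.005           # step size; too large and the iterates oscillate
w_gd = np.zeros(d)
for _ in range(1000):
    w_gd = w_gd + eta * X.T.dot(y - X.dot(w_gd))

print(w_closed)
print(w_gd)           # both estimates should be close to w_true
```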
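Similarly, here is a minimal sketch of the logistic regression updates, again on synthetic data with illustrative settings for the step size and iteration counts; the Newton-Raphson step uses the standard Hessian \( X^T R X \) with \( R = \mathrm{diag}(p_i(1 - p_i)) \).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary labels drawn from a logistic model (illustrative only).
rng = np.random.RandomState(1)
n, d = 200, 3
X = rng.randn(n, d)
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.rand(n) < sigmoid(X.dot(w_true))).astype(float)

# First-order update from above: w <- w + eta * X^T (y - p).
eta = 0.01
w = np.zeros(d)
for _ in range(2000):
    p = sigmoid(X.dot(w))
    w = w + eta * X.T.dot(y - p)

# Second-order (Newton-Raphson) update: w <- w + (X^T R X)^{-1} X^T (y - p),
# where R = diag(p * (1 - p)) captures the local curvature of the log-likelihood.
w_nr = np.zeros(d)
for _ in range(10):
    p = sigmoid(X.dot(w_nr))
    H = X.T.dot(X * (p * (1 - p))[:, None])
    w_nr = w_nr + np.linalg.solve(H, X.T.dot(y - p))

print(w)
print(w_nr)           # both should roughly recover w_true
```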
Naive Bayes, linear regression, and logistic regression are all generalized linear models, meaning that predictions \( \hat{y} \) are simply transformations of a weighted sum of the features \( x \) for some weights \( w \), i.e. \( \hat{y}(x; w) = f(w \cdot x) \). Linear regression models the response directly, whereas Naive Bayes and logistic regression model the probability of categorical outcomes via the logistic link function. Model fitting amounts to inferring the “best” values for the weights \( w \) from training data; each of these methods quantifies the notion of “best” differently and thus results in different estimates for \( w \). In particular, Naive Bayes estimates each component of the weight vector independently, while linear and logistic regression account for correlations amongst features.
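A small sketch of this unified prediction rule, with the identity and logistic links standing in for the two cases above (the helper name `predict` is ours, not from the lecture code):

```python
import numpy as np

def predict(X, w, link="identity"):
    # Generalized linear prediction: y_hat = f(w . x) for a chosen link f.
    z = X.dot(w)
    if link == "identity":      # linear regression
        return z
    if link == "logistic":      # logistic regression / Naive Bayes posteriors
        return 1.0 / (1.0 + np.exp(-z))
    raise ValueError("unknown link: " + link)
```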
In the second half of class we discussed APIs for accessing data from web services. As an example, we used Python’s urllib2 and json modules to interact with the New York Times Developer API. See the GitHub repository for more details.
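As a rough sketch of the pattern we used (the endpoint and API key below are placeholders, not the actual New York Times URLs; see the repository for the real calls):

```python
import json
import urllib2

# Placeholder endpoint and key -- substitute the values from the course's
# GitHub repository or the NYT Developer documentation.
API_KEY = "YOUR_API_KEY"
url = "http://api.example.com/articles.json?q=statistics&api-key=" + API_KEY

response = urllib2.urlopen(url)       # issue the HTTP GET request
data = json.loads(response.read())    # parse the JSON payload into Python objects

print(data.keys())                    # inspect the top-level fields of the response
```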