$ \newcommand{\CC}{\mathcal{C}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CS}{\mathcal{S}} \newcommand{\RR}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\P}{\mathbb{P}} \newcommand{\In}{\mathbf{1}} \newcommand{\df}{\nabla f} \newcommand{\ddf}{\nabla^2 f} \newcommand{\ra}{\rightarrow} \newcommand{\la}{\leftarrow} \newcommand{\inv}{^{-1}} \newcommand{\maximize}{\mbox{maximize }} \newcommand{\minimize}{\mbox{minimize }} \newcommand{\subto}{\mbox{subject to }} \newcommand{\tr}{^{\intercal}} $
In recent years, a vast amount of data has been collected. This fact affects the optimization problems solved in various fields, from finance to machine learning. A considerable fraction of data-driven large-scale problems arises from parameter estimation models. These estimations require minimizing the divergence of the predicted parameter values from their observed values. Since the data is large, a typical objective function involves a summation of a huge number of functions.
Let us give an example problem from speech recognition. Consider a set of training data consisting of a few milliseconds of speech frames $(x^{(k)}; y^{(k)})$, $k=1, \dots, N$. Here, $x^{(k)} \in \RR^n$ is the feature vector and $y^{(k)} \in \CC$ is the label representing the phonetic state. The goal is to maximize the conditional probability of the correct phonetic state given the observed feature vector.
The mathematical model defines a weight vector $\theta_l \in \RR^n$ for each label $l \in \CC$, and assigns each frame $k$ a probability of having a certain label $j \in \CC$:
$$ \P\left(y^{(k)}=j~|~x^{(k)}; \theta\right) = \frac{\exp{(\theta_j\tr x^{(k)})}}{\sum_{l \in \CC}\exp{(\theta_l\tr x^{(k)})}}. $$For notational convenience, we store $\theta$ as a $|\CC| \times n$ matrix. Then, the scaled log-likelihood function is given by
$$ \frac{1}{N} \sum_{k=1}^{N} \underset{l_k(\theta)}{\underbrace{\sum_{j \in \CC} \In\{y^{(k)} = j\} \log\frac{\exp{(\theta_j\tr x^{(k)})}}{\sum_{l \in \CC}\exp{(\theta_l\tr x^{(k)})}}}}, $$where $\In$ is the indicator function, which takes the value 1 if the statement passed to it is true and 0 otherwise. Then, the function
$$ J(\theta) = -\frac{1}{N} \sum_{k=1}^{N} l_k(\theta) $$is called the sample average approximation. Note the minus sign in front of the function: instead of maximizing the log-likelihood, we shall later minimize its negative. In fact, $J(\theta)$ is an approximation to the expected loss
$$ \E[l(\theta)] = -\int_z l(\theta | z)~p(z)~dz, $$where $p$ is the unknown distribution of the data. Under reasonable assumptions, we have
$$ \P\left(\lim_{N \ra \infty} J(\theta) = \E[l(\theta)]\right) = 1. $$When we minimize the sample average approximation, we obtain the optimal solution
$$ \theta^* = \arg\min_{\theta} J(\theta), $$which is also known as the maximum likelihood estimator. Here is the sequence of approximations that we expect from this learning problem
$$ J(\theta^*) \approx \E[l(\theta^*)] \approx \min_{\theta}\E[l(\theta)]. $$The leftmost value is obtained with the empirical objective function, and the middle value with the true objective function. The rightmost one corresponds to the true optimization problem that we would like to solve. However, this problem is not available to us, as the data distribution is not known.
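Before turning to optimization algorithms, here is a minimal Octave sketch of how the sample average approximation $J(\theta)$ could be evaluated for this model. The function name `saa_obj` and the argument layout ($x$ stored as an $N \times n$ matrix, $y$ as an $N \times 1$ vector of labels from $\{1, \dots, |\CC|\}$, $\theta$ as a $|\CC| \times n$ matrix) are our own choices for illustration; this is distinct from the course's own `saa` helper used below.
% Sketch: evaluate J(theta) = -(1/N) sum_k l_k(theta) for the softmax model.
% Save as saa_obj.m; x is N x n, y is N x 1, theta is C x n.
function J = saa_obj(theta, x, y)
  N = size(x, 1);
  J = 0;
  for k = 1:N
    s = exp(theta*x(k, :)');  % exp(theta_j' * x^(k)) for all labels j
    p = s/sum(s);             % softmax probabilities
    J = J - log(p(y(k)));     % -l_k(theta) = -log P(y^(k) | x^(k); theta)
  end
  J = J/N;                    % sample average
end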
A conventional gradient descent approach for minimizing $J$ sets an initial $\theta^{(1)}$ and then proceeds at iterations $i=1, 2, \dots$ by evaluating the following relation
$$ \theta^{(i+1)} = \theta^{(i)} - \alpha_i \nabla J(\theta^{(i)}) = \theta^{(i)} - \alpha_i \frac{1}{N} \sum_{k=1}^{N} \nabla l_k(\theta^{(i)}), $$where the step length $\alpha_i$ is also known as the learning rate. As we need to go through the entire data set, this method is called batch gradient descent. The learning rate can be computed by line search or set to an appropriate constant. In general, however, the sequence of $\alpha_i$ satisfies the following two conditions:
$$ \alpha_i \underset{i \uparrow \infty}{\ra} 0 ~\mbox{ and } \sum_{i=1}^{\infty} \alpha_i = \infty. $$A simple but poor choice is $\alpha_i = i^{-1}$, as it converges to 0 very fast. Another choice is $\alpha_i = 10^{-4} i^{-0.5}$.
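In Octave, these two schedules can be written as anonymous functions; both satisfy the conditions above, since $\sum_i i^{-1}$ and $\sum_i i^{-1/2}$ both diverge while the terms tend to 0:
alpha1 = @(i) 1/i;            % simple but decays very fast
alpha2 = @(i) 1.0e-4/sqrt(i); % the schedule used in the examples below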
In order to solve this problem with gradient descent, we need to evaluate $\nabla J(\theta)$. It is not difficult to show that
$$ \nabla_{\theta_j} J(\theta) = -\frac{1}{N} \sum_{k=1}^{N} \left(x^{(k)}\left(\In\{y^{(k)}=j\} - \frac{\exp{(\theta_j\tr x^{(k)})}}{\sum_{l \in \CC}\exp{(\theta_l\tr x^{(k)})}}\right)\right). $$Thus, we can apply the gradient descent algorithm for each $j \in \CC$ by
$$ \theta_{j}^{(i+1)} = \theta_j^{(i)} - \alpha_i \nabla_{\theta_j} J(\theta^{(i)}). $$Homework: Derive the formula for $\nabla_{\theta_j} J(\theta)$.
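Once the formula is derived, a direct Octave implementation of the full gradient might look as follows. This is our own sketch with the same data layout as before; the `saa` helper used in the examples below presumably performs a similar computation.
% Sketch: batch gradient of J(theta); x is N x n, y is N x 1, theta is C x n.
function DJ = batch_grad(theta, x, y)
  [N, n] = size(x);
  C = size(theta, 1);
  DJ = zeros(C, n);
  for k = 1:N
    s = exp(theta*x(k, :)');  % exp(theta_j' * x^(k)) for all labels j
    p = s/sum(s);             % softmax probabilities
    for j = 1:C
      % Accumulate -x^(k) (1{y^(k) = j} - P(y^(k) = j | x^(k); theta))
      DJ(j, :) = DJ(j, :) - x(k, :)*((y(k) == j) - p(j));
    end
  end
  DJ = DJ/N;
end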
Now, we generate our synthetic multiclass classification problem to test our batch gradient descent algorithm.
pkg load statistics;
xc = [1, 4; 4, 3; 3, 1]; % Centroids
N = 100; n = 2;
C = 3; % Three labels
x = zeros(N, n);
y = zeros(N, 1);
clf; hold on;
for k = 1:N
  l = randperm(C, 1); % Select one class randomly
  x(k, :) = mvnrnd(xc(l, :), 0.2*eye(2));
  y(k) = l;
  if (l == 1)
    scatter(x(k, 1), x(k, 2), 20, 'b');
  elseif (l == 2)
    scatter(x(k, 1), x(k, 2), 20, 'r');
  else
    scatter(x(k, 1), x(k, 2), 20, 'g');
  end
end
There are three classes in this elementary example. Below, we first set up the sample average approximation function and then solve the resulting model with a simple batch gradient descent.
To run the following code, you need to unzip this file into the same directory.
addpath('./octave');
theta0 = rand(C, n);
theta = theta0;
maxiter = 100;
for i = 1:maxiter
  alpha = 1.0e-4*(1/sqrt(i)); % Learning rate alpha_i = 1e-4/sqrt(i), as above
  DJ = saa(x, y, theta, 'bgd'); % Full (batch) gradient of J at theta
  theta = theta - alpha*DJ;
  if (norm(DJ, 'inf') < 1.0e-4) % Stop when the gradient is nearly zero
    break;
  end
end
theta
We have obtained the parameter $\theta$. For a given point, we can now calculate the probability that it belongs to each of these classes.
xtest = [2; 1];
fprintf('Point (%.2f, %.2f)\n', xtest(1), xtest(2));
denom = 0;
for l = 1:C
  denom = denom + exp(theta(l, :)*xtest);
end
for j = 1:C
  num = exp(theta(j, :)*xtest);
  fprintf('\tbelongs to class %d with probability %f\n', j, num/denom);
end
One iteration of batch gradient descent evaluates the full gradient $\nabla J$, which requires $N \times |\CC|$ function calls. Stochastic gradient descent, on the other hand, simplifies this gradient calculation by sampling only one of the component functions. Suppose the component function $\hat{k}$ is randomly selected; then the iterations proceed with the following relation
$$ \theta^{(i+1)} = \theta^{(i)} - \alpha_i \nabla l_{\hat{k}}(\theta^{(i)}). $$Next, we solve the same problem with stochastic gradient descent. You should notice that the computation time with batch gradient descent is significantly longer than with stochastic gradient descent. This is expected behavior, as stochastic gradient descent requires only $|\CC|$ function calls per iteration.
theta = theta0;
maxiter = 100;
for i = 1:maxiter
  alpha = 1.0e-4*(1/sqrt(i)); % Learning rate alpha_i = 1e-4/sqrt(i), as above
  DJ = saa(x, y, theta, 'sgd'); % Gradient of a single sampled component
  theta = theta - alpha*DJ;
  if (norm(DJ) < 1.0e-4)
    break;
  end
end
theta
This discussion of gradient-based methods is quite rough. For instance, instead of selecting just one $l_k$ randomly, it is possible to select a subset of $\{1, \dots, N\}$. This is known as the minibatch approach; a sketch is given below. All these variants of gradient methods use first-order information. There are also matrix-free approaches that use second-order information. It is also easy to imagine parallel implementations of (mini)batch gradient methods.
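As an illustration, here is a sketch of a minibatch gradient evaluation that samples a subset of size $B$ and averages the component gradients over it. The function name and data layout are our own choices, consistent with the earlier sketches.
% Sketch: minibatch gradient; average the gradient over B sampled components.
function DJ = minibatch_grad(theta, x, y, B)
  N = size(x, 1);
  [C, n] = size(theta);
  S = randperm(N, B);         % random subset of {1, ..., N} of size B
  DJ = zeros(C, n);
  for k = S
    s = exp(theta*x(k, :)');  % softmax numerators for frame k
    p = s/sum(s);
    for j = 1:C
      DJ(j, :) = DJ(j, :) - x(k, :)*((y(k) == j) - p(j));
    end
  end
  DJ = DJ/B;                  % average over the minibatch only
end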
Consider now the binary classification problem with $y^{(k)} \in \{-1, 1\}$. This problem can be modeled as a linear separation problem, where we try to split the points $x^{(k)}$ into two sets with a hyperplane. Formally,
$$ \left. \begin{array}{rcl} \theta\tr x^{(k)} - c \leq -1 & \implies & y^{(k)} = -1 \\ \theta\tr x^{(k)} - c \geq 1 & \implies & y^{(k)} = 1 \end{array} \right\} (\theta\tr x^{(k)} - c)y^{(k)} \geq 1. $$We next define
$$ l_k(\theta) = \max\{0, 1 - (\theta\tr x^{(k)} - c)y^{(k)}\}, $$which is also known as the hinge loss. Then, the objective function becomes
$$ J_\lambda(\theta) = \frac{1}{N} \sum_{k=1}^N l_k(\theta) + \frac{\lambda}{2} \|\theta\|^2. $$The last term above is a regularization term used to avoid overfitting. Overall, this objective function is not differentiable due to the $\max$ function. However, we can use the subgradient method, as the subdifferential of each function $l_k$ is easy to evaluate:
$$ \partial l_k(\theta) = \left\{ \begin{array}{rl} -y^{(k)}x^{(k)}, & \mbox{if } (\theta\tr x^{(k)} - c)y^{(k)} < 1, \\ 0, & \mbox{if } (\theta\tr x^{(k)} - c)y^{(k)} > 1, \\ -\beta y^{(k)}x^{(k)},~\beta\in[0,1], & \mbox{if } (\theta\tr x^{(k)} - c)y^{(k)} = 1. \end{array} \right. $$It is then relatively easy to write the iterations of the stochastic subgradient descent method. Suppose at iteration $i$, the component function $\hat{k}$ is randomly selected. Then, we have
$$ \theta^{(i+1)} = \theta^{(i)} - \alpha_i (g_{\hat{k}} + \lambda \theta^{(i)}), $$where $g_{\hat{k}} \in \partial l_{\hat{k}}(\theta^{(i)})$.
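For concreteness, a single iteration of this method could look as follows in Octave. This is a sketch only: it assumes $y^{(k)} \in \{-1, 1\}$, a column vector theta, and given values for the intercept c, the regularization weight lambda, and the step length alpha_i.
khat = randperm(N, 1);                      % sample one component function
margin = (theta'*x(khat, :)' - c)*y(khat);  % (theta' x^(khat) - c) y^(khat)
if (margin < 1)
  g = -y(khat)*x(khat, :)';                 % subgradient of l_khat at theta
else
  g = zeros(size(theta));                   % valid choice (beta = 0) when margin >= 1
end
theta = theta - alpha_i*(g + lambda*theta); % stochastic subgradient step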
Here is a simple binary classification problem where the points $x^{(k)}$ are in $\RR^2$.
xc = rand(2)*4 + 1; % Centroids
N = 100; n = 2;
C = 2; % Two labels
x = zeros(N, n);
y = zeros(N, 1);
clf; hold on;
for k = 1:N
  l = randperm(C, 1); % Select one class randomly
  x(k, :) = mvnrnd(xc(l, :), 0.2*eye(2));
  y(k) = 2*l - 3; % Map class 1 to -1 and class 2 to +1, as the model requires
  if (l == 1)
    scatter(x(k, 1), x(k, 2), 20, 'b');
  else
    scatter(x(k, 1), x(k, 2), 20, 'r');
  end
end
Homework: Solve this problem with the stochastic subgradient descent method.