Image classification

Classifier

$ y = \textbf{f}_{H} (X, \phi ) \quad X \in \mathbb{R}^{H \times W \times C} \quad y \in \mathbb{R}^{K} $

$ \textbf{f} : \text{Classifier}\\ y : \text{class scores}\\ X : \text{Image data}\\ \phi : \text{Parameter set}\\ H : \text{Hyper-parameter set} $

To solve for $\phi$, define a loss function and minimize it,

$ L(\textbf{f},y') \qquad y' \quad \text{is the target}\\ \phi^{*} = \underset{\phi}{\arg\min} \big(\sum_{i}L(\textbf{f}_{i}, y'_{i}) + \lambda R(\textbf{f}) \big) $

Where

$ R \quad \text{is the regularizer function}\\ \textbf{f} \in \text{Reproducing Kernel Hilbert space} $

Regularizers perform implicit dimensionality reduction by forcing useless parameters towards 0. Useless dimensions in parameter space lead to overfitting, which shows up as a large difference between the empirical and the expected loss.

The L1 regularizer does hard dimensionality reduction: useless parameters can be driven exactly to 0.

The L2 regularizer does soft dimensionality reduction. That is, dimensions may not actually become 0, but end up close to 0.
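A minimal numpy sketch of the two penalty terms; the parameter vector `phi` and weight `lam` below are made-up toy values.

```python
import numpy as np

phi = np.array([0.5, -0.002, 3.0, 0.0001])  # toy parameter vector
lam = 0.1                                   # regularization weight (lambda)

l1_penalty = lam * np.sum(np.abs(phi))  # L1: encourages exact zeros (hard reduction)
l2_penalty = lam * np.sum(phi ** 2)     # L2: shrinks weights towards zero (soft reduction)

print(l1_penalty, l2_penalty)
```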

Solve for $\phi$ using first order gradient descent

$ \phi \leftarrow \phi + \text{update}(\Theta, \nabla_{\phi}L) $

Where,

$ \Theta \quad \text{are the optimization parameters}\\ \nabla_{\phi}L \quad \text{is the gradient of the loss function w.r.t. the parameters (refer to the automatic differentiation notes)} $
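A minimal sketch of the update loop, assuming plain gradient descent on a toy quadratic loss; the learning rate `lr` stands in for the optimization parameters $\Theta$.

```python
import numpy as np

def grad_loss(phi):
    # Toy quadratic loss L = ||phi||^2, so the gradient is 2 * phi.
    return 2.0 * phi

phi = np.array([1.0, -2.0])   # initial parameters
lr = 0.1                      # learning rate, one of the optimization parameters (Theta)

for _ in range(100):
    # update(Theta, grad) = -lr * grad for plain gradient descent
    phi = phi + (-lr * grad_loss(phi))

print(phi)  # approaches the minimizer [0, 0]
```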

Loss functions for classification tasks

Binary hinge loss

$ L = \lvert 1 - yy' \rvert_{+} = \max \big(0, 1-yy' \big) $

Where,

$ y = f(X ; \phi) \in \mathbb{R}\\ y' \in \{-1,1\} \text{ is the target} \\ $

This function is convex but not differentiable (it has a kink at $yy' = 1$)

The product $yy'$ is positive when the prediction and the target have the same sign; the loss is 0 only when $yy' \geq 1$, i.e. the prediction is correct by a margin of at least 1.

$ y = 10 \quad y' = 1 \implies yy' = 10 \geq 1 \quad \text{and} \quad L=0\\ y = -1012 \quad y' = -1 \implies yy' = 1012 \geq 1 \quad \text{and} \quad L=0\\ y = 0.1 \quad y' = 1 \implies yy' = 0.1 > 0 \quad \text{but because of the margin} \quad L=0.9\\ $

For a correct prediction that clears the margin, the loss is 0 no matter how large the classifier output is.
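A small numpy sketch reproducing the worked examples above (the helper name `binary_hinge` is just illustrative):

```python
import numpy as np

def binary_hinge(y, y_target):
    # |1 - y*y'|_+ = max(0, 1 - y*y')
    return np.maximum(0.0, 1.0 - y * y_target)

print(binary_hinge(10.0, 1))      # 0.0 -> correct and outside the margin
print(binary_hinge(-1012.0, -1))  # 0.0 -> correct and outside the margin
print(binary_hinge(0.1, 1))       # 0.9 -> correct sign but inside the margin
```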

Multi class hinge loss

$ L = \sum_{j \neq l} \lvert 1 - (y_{l} - y_{j}) \rvert_{+} = \sum_{j \neq l} \max \big(0, y_{j}-y_{l}+1 \big) $

Where,

$ \textbf{y} = f(X ; \phi) \in \mathbb{R}^{K}\\ K \quad \text{is the number of classes} \\ l \in \{1, \dots , K \} \quad \text{is the target label} $

If the score of the target label is higher than the score of every other class by a margin of at least 1, then the loss is 0
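A small numpy sketch of this loss; `scores` is the vector $\textbf{y} \in \mathbb{R}^{K}$ and `l` is the index of the target label (the score values are made up):

```python
import numpy as np

def multiclass_hinge(scores, l):
    margins = np.maximum(0.0, scores - scores[l] + 1.0)
    margins[l] = 0.0  # exclude j == l from the sum
    return margins.sum()

scores = np.array([2.0, 5.0, 1.0])
print(multiclass_hinge(scores, 1))  # 0.0: target score beats every other class by >= 1
print(multiclass_hinge(scores, 0))  # 4.0: class 1 violates the margin by 4
```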

Binary cross entropy loss

$ P(y'=1 \mid y) = \sigma(y) = \frac{1}{1+e^{-y}}\\ P(y'=0 \mid y) = 1-\sigma(y) = \frac{1}{1+e^{y}}\\ P(y' \mid y) = \sigma(y)^{y'}\big(1-\sigma(y)\big)^{(1-y')} \qquad y' \in \{0,1\} $

To find the optimum parameters, we minimize the negative log likelihood

$ NLL = -\sum_{i=1}^{N}\log P(y'_{i} \mid y_{i}) $

Therefore, loss for each sample is

$ L = -\log P(y'_{i} \mid y_{i}) = -\big(y'_{i}\log(\sigma(y_{i})) + (1-y'_{i})\log(1-\sigma(y_{i}))\big) $

This loss function is convex and differentiable
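A minimal numpy sketch of this loss with $y' \in \{0,1\}$; the small `eps` added inside the logs is an implementation assumption for numerical stability.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def binary_cross_entropy(y, y_target, eps=1e-12):
    p = sigmoid(y)
    return -(y_target * np.log(p + eps) + (1 - y_target) * np.log(1 - p + eps))

print(binary_cross_entropy(5.0, 1))   # small loss: confident and correct
print(binary_cross_entropy(-5.0, 1))  # large loss: confident but wrong
```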

Multi class softmax loss

$ P(y'=l \mid y) = \frac{e^{y_l}}{\sum_{j}e^{y_j}} $

Minimizing negative log likelihood, we get the loss for each sample as

$ L = -\log\Big(\frac{e^{y_l}}{\sum_{j}e^{y_j}}\Big) $

Note

$ -\log(1) = 0\\ -\log_{10}(0.001) = 3\\ $
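A minimal numpy sketch of this loss; shifting the scores by their maximum before exponentiation is an implementation choice for numerical stability (the score values are made up).

```python
import numpy as np

def softmax_loss(scores, l):
    shifted = scores - scores.max()                       # stability shift
    log_probs = shifted - np.log(np.sum(np.exp(shifted))) # log softmax
    return -log_probs[l]

scores = np.array([3.0, 0.5, -1.0])
print(softmax_loss(scores, 0))  # small loss: target class has the highest score
print(softmax_loss(scores, 2))  # large loss: target class has the lowest score
```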

Triplet loss

TODO

Robust classifier

Classifier should be robust to

Translations and rotations
Different lighting conditions
Deformation (different poses or body expressions)
Occlusion (when the object is hidden by another object)
Background clutter (when the object blends into the background)

The training dataset should contain different variants of the same object to make the classifier invariant to these properties

Simple classifier : (K) Nearest neighbour

$ y= \textbf{f}_{K}(x)$ (non-parametric model)

Training : Store all image features and their corresponding labels

Testing : Find the top-K feature(s) in the training set that are closest (according to some distance metric) to the test feature and assign the corresponding label (the majority label when K > 1)

Problem: Test-time complexity is O(N), where N is the size of the training data.
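A minimal sketch of a 1-nearest-neighbour classifier with an L2 distance metric; the class name and toy data are illustrative.

```python
import numpy as np

class NearestNeighbour:
    def train(self, X, labels):
        # "Training" just memorizes the data: O(1)
        self.X, self.labels = X, labels

    def predict(self, x):
        # Test time is O(N): compare against every stored feature
        distances = np.linalg.norm(self.X - x, axis=1)
        return self.labels[np.argmin(distances)]

nn = NearestNeighbour()
nn.train(np.array([[0.0, 0.0], [5.0, 5.0]]), np.array([0, 1]))
print(nn.predict(np.array([4.0, 4.5])))  # 1
```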

Simple Linear classifier

$ y= \textbf{f}_{\gamma}(x, \{\textbf{W},b\}) = \textbf{W}x + b$

$ \gamma : \text{Learning rate}\\ $
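A minimal sketch of the score function $\textbf{W}x + b$; the feature and class dimensions chosen below are arbitrary.

```python
import numpy as np

D, K = 4, 3                       # feature dimension and number of classes (arbitrary)
W = np.random.randn(K, D) * 0.01  # weight matrix (parameters)
b = np.zeros(K)                   # bias vector (parameters)

x = np.random.randn(D)            # image data flattened to a feature vector
y = W @ x + b                     # class scores, shape (K,)
print(y)
```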

Choosing hyperparameters

Choose the hyperparameter settings that work best on the validation set, then evaluate the final model once on the test set. The test-set number is what should go into the report

In case of small datasets, use cross-validation

Why do we need both validation and test set?

It gives more confidence in our chosen hyperparameter settings.

From a coding point of view, our algorithm should not have access to the labels in the test set, whereas it can access validation-set labels to monitor performance
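A hedged sketch of validation-based hyperparameter selection; `train` and `accuracy` are hypothetical helpers standing in for a real training pipeline.

```python
# Pick hyperparameters using the validation set only; the test set is
# touched once, after this search, for the number that goes into the report.
def select_hyperparameter(candidates, train, accuracy, train_set, val_set):
    best_h, best_acc = None, -1.0
    for h in candidates:
        model = train(train_set, h)      # fit with hyperparameter setting h
        acc = accuracy(model, val_set)   # validation labels may be used here
        if acc > best_acc:
            best_h, best_acc = h, acc
    return best_h
```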

Dataset assumptions and traps

It is assumed that all data in a dataset are independent and sampled from the same probability distribution. If you collect data over time and build the test set from data sampled towards the end, your test set might not actually represent the real situation