$ y = \textbf{f}_{H} (X, \phi ) \quad X \in \mathbb{R}^{H \times W \times C} \quad y \in \mathbb{R}^{K} $
$ \textbf{f} : \text{Classifier}\\ y : \text{class scores}\\ X : \text{Image data}\\ \phi : \text{Parameter set}\\ H : \text{Hyper-parameter set} $
To solve for $\phi$, define a loss function and minimize it,
$ L(\textbf{f},y') \qquad y' \quad \text{is the target}\\ \phi^{*} = \operatorname*{argmin}_{\phi} \big(\sum_{i}L(\textbf{f}_{i}, y'_{i}) + \lambda R(\textbf{f}) \big) $
Where
$ R \quad \text{is the regularizer function}\\ \textbf{f} \in \text{a Reproducing Kernel Hilbert Space (RKHS)} $
Regularizers do implicit dimensionality reduction: they force useless parameters toward 0. Useless dimensions in parameter space lead to overfitting, which shows up as a large gap between empirical and expected loss.
The L1 regularizer does hard dimensionality reduction: it drives useless parameters exactly to 0 (sparsity).
The L2 regularizer does soft dimensionality reduction: parameters may not reach 0 exactly, but are shrunk close to it.
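A minimal numpy sketch of the two penalties; `phi` and `lam` are illustrative names for the parameter vector and regularization strength:

```python
import numpy as np

phi = np.array([0.8, -0.3, 0.0, 1.5])  # hypothetical parameter vector
lam = 0.01                             # regularization strength

l1_penalty = lam * np.sum(np.abs(phi))  # encourages exact zeros (hard/sparse)
l2_penalty = lam * np.sum(phi ** 2)     # shrinks weights toward zero (soft)
```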
Solve for $\phi$ using first-order gradient descent,
$ \phi \leftarrow \phi + \text{update}(\Theta, \nabla_{\phi}L) $
Where,
$ \Theta \quad \text{are optimization parameters}\\ \nabla_{\phi}L \quad \text{is the gradient of the loss function w.r.t. the parameters (refer to the automatic differentiation notes)} $
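A minimal sketch of this update rule, with plain gradient descent standing in for `update` and a learning rate as the only optimization parameter (all names are illustrative):

```python
import numpy as np

def gd_step(phi, grad, lr=0.1):
    # phi <- phi - lr * dL/dphi: the update direction is the negative gradient
    return phi - lr * grad

# Example: minimize L(phi) = phi^2, whose gradient is 2*phi.
phi = np.array([5.0])
for _ in range(50):
    phi = gd_step(phi, 2 * phi)
print(phi)  # approaches 0
```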
A common choice for binary classification is the hinge loss,
$ L = \quad \mid 1 - yy' \mid_{+} \quad = \quad \max \big(0, 1-yy' \big) $
Where,
$ y = f(X ; \phi) \in \mathbb{R}\\ y' \in \{-1,1\} \text{ is the target} \\ $
This function is convex but not differentiable (it has a kink at $yy' = 1$)
$yy'$ is positive when the prediction and target have the same sign; the loss is 0 only when, in addition, the prediction clears a margin of 1
$ y = 10 \quad y' = 1 \implies yy' > 0 \quad \text{and} \quad L=0\\ y = -1012 \quad y' = -1 \implies yy' > 0 \quad \text{and} \quad L=0\\ y = 0.1 \quad y' = 1 \implies yy' > 0 \quad \text{but, because of the margin,} \quad L=0.9 $
Once a prediction is correct and beyond the margin, the loss is 0 no matter how large the classifier output is.
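A quick numpy sketch reproducing the three cases above:

```python
import numpy as np

def hinge(y, y_target):
    # max(0, 1 - y * y') with y' in {-1, 1}
    return np.maximum(0.0, 1.0 - y * y_target)

print(hinge(10.0, 1))      # 0.0 : correct and beyond the margin
print(hinge(-1012.0, -1))  # 0.0 : correct and beyond the margin
print(hinge(0.1, 1))       # 0.9 : correct sign but inside the margin
```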
For multiple classes, this generalizes to the multiclass hinge (SVM) loss,
$ L = \quad \sum_{j \neq l} \mid 1 - (y_{l} - y_{j}) \mid_{+} \quad = \quad \sum_{j \neq l} \max \big(0, y_{j}-y_{l}+1 \big) $
Where,
$ \textbf{y} = f(X ; \phi) \in \mathbb{R}^{K}\\ K \quad \text{is the number of classes} \\ l \in \{1, \dots , K \} \quad \text{is the target label} $
If the score of the target label is higher than every other class score by a margin of at least 1, then the loss is 0
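A minimal numpy sketch of the multiclass hinge loss, on a hypothetical score vector:

```python
import numpy as np

def multiclass_hinge(scores, label):
    # sum over j != l of max(0, y_j - y_l + 1)
    margins = np.maximum(0.0, scores - scores[label] + 1.0)
    margins[label] = 0.0  # exclude j == l from the sum
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])       # class scores y
print(multiclass_hinge(scores, label=0))  # 2.9: class 1 violates the margin
```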
For a probabilistic binary classifier (logistic regression), relabel the target as $y' \in \{0,1\}$ so both cases combine into a single Bernoulli likelihood,
$ P(y'=1 \mid y) = \sigma(y) = \frac{1}{1+e^{-y}}\\ P(y'=0 \mid y) = 1-\sigma(y) = \frac{1}{1+e^{y}}\\ P(y' \mid y) = \sigma(y)^{y'}\big(1-\sigma(y)\big)^{(1-y')} $
To find the optimum parameters, we minimize the negative log-likelihood,
$ NLL = -\sum_{i=1}^{N}\log P(y'_{i} \mid y_{i}) $
Therefore, the loss for each sample is
$ L = -\log P(y'_{i} \mid y_{i}) = -\big(y'\log\sigma(y) + (1-y')\log(1-\sigma(y))\big) $
This loss function is convex and differentiable
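A minimal numpy sketch of this per-sample loss; the epsilon guard against $\log 0$ is an implementation detail, not part of the math above:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def logistic_loss(y, y_target, eps=1e-12):
    # -(y' log(sigma(y)) + (1 - y') log(1 - sigma(y))) with y' in {0, 1}
    p = sigmoid(y)
    return -(y_target * np.log(p + eps) + (1 - y_target) * np.log(1 - p + eps))

print(logistic_loss(2.0, 1))  # small loss: confident and correct
print(logistic_loss(2.0, 0))  # large loss: confident but wrong
```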
For $K$ classes, the softmax function gives the class probabilities,
$ P(y'=l \mid \textbf{y}) = \frac{e^{y_l}}{\sum_{j}e^{y_j}} $
Minimizing the negative log-likelihood, we get the loss for each sample as
$ L = -\log\Big(\frac{e^{y_l}}{\sum_{j}e^{y_j}}\Big) $
Note (base-10 logs shown for intuition; the loss itself uses the natural log):
$ -\log_{10}(1) = 0\\ -\log_{10}(0.001) = 3 $
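A minimal numpy sketch of the softmax loss; the max-subtraction is a standard numerical-stability trick, not part of the formula above:

```python
import numpy as np

def softmax_loss(scores, label):
    shifted = scores - scores.max()  # stability: exp of large scores overflows
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]         # -log P(y' = label | y)

scores = np.array([3.2, 5.1, -1.7])
print(softmax_loss(scores, label=0))  # ~2.04 nats
```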
TODO
The classifier should be robust to:
- Translations and rotations
- Different lighting conditions
- Deformation (different poses or body expressions)
- Occlusion (when the object is hidden by another object)
- Background clutter (when objects blend into the background)
The training dataset should contain different variants of the same object to make the classifier invariant to these properties, e.g. via augmentation as sketched below.
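A minimal sketch of generating such variants; illustrative flips and brightness jitter only, real pipelines also rotate, crop, etc.:

```python
import numpy as np

def augment(image, rng):
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]          # horizontal flip
    out = out * rng.uniform(0.8, 1.2)  # brightness jitter (lighting variation)
    return out

rng = np.random.default_rng(0)
image = np.ones((32, 32, 3))           # toy H x W x C image
variant = augment(image, rng)
```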
$ y = \textbf{f}_{K}(x) $ (non-parametric model; the neighbor count $K$ is its hyperparameter)
Training: store all image features and their corresponding labels
Testing: find the top-$K$ feature(s) in the training set closest to the test feature (under some distance metric) and assign the corresponding (majority) label
Problem: test-time complexity is $O(N)$, where $N$ is the size of the training data.
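A minimal 1-nearest-neighbor sketch on toy data, making the $O(N)$ test-time scan explicit:

```python
import numpy as np

train_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # stored features
train_y = np.array([0, 0, 1])                              # stored labels

def predict_1nn(x):
    # O(N): compute the L2 distance to every stored training feature
    dists = np.linalg.norm(train_X - x, axis=1)
    return train_y[np.argmin(dists)]

print(predict_1nn(np.array([4.0, 4.5])))  # 1
```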
$ y= \textbf{f}_{\gamma}(x, \{\textbf{W},b\}) = \textbf{W}x + b$
$ \gamma : \text{Learning rate (a hyperparameter)} $
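A minimal sketch of the linear classifier's forward pass; the shapes are illustrative, e.g. a flattened 32x32x3 image and $K = 3$ classes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3072)        # flattened 32x32x3 image
W = rng.standard_normal((3, 3072))   # one row of weights per class
b = np.zeros(3)

scores = W @ x + b                   # y = Wx + b, one score per class
print(scores.shape)                  # (3,)
```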
Choose the hyperparameter settings that work best on the validation set; the resulting performance on the test set is the number that should go into the report
In case of small datasets, use cross-validation.
It gives more confidence in the chosen hyperparameter settings; see the sketch at the end of this section.
From a coding point of view, the algorithm should not have access to the labels in the test set, whereas it can access validation-set labels to monitor performance
It is assumed that all data in the datasets are independent and sampled from the same probability distribution (i.i.d.). If you collect data over time and assign the test set the data sampled towards the end, the test set might not represent the real situation
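A minimal sketch of k-fold cross-validation in plain numpy; `train_and_score` is a hypothetical stand-in for training a model and returning its validation accuracy:

```python
import numpy as np

def cross_validate(X, y, hyperparam, train_and_score, k=5):
    # Shuffle before splitting, consistent with the i.i.d. assumption above.
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(train_and_score(X[train], y[train], X[val], y[val], hyperparam))
    return np.mean(scores)  # average validation score across the k folds
```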