What are different meanings for the “quality” of a learning algorithm?: The quality of a learning algorithm can have many meanings: for example, the overall accuracy on unseen data, the kinds of errors it makes (precision vs. recall), the quality of the probabilities it outputs (loss functions), how much better it is than random guessing (Cohen’s Kappa), or, for regression, how close the predictions are to the actual values.
How do we define the overall accuracy of a classifier?: We can use a labeled test set to calculate the accuracy of a classifier. Overall accuracy is defined as the fraction of objects in the test set that were correctly classified.
What is k-fold cross validation? When do we need it?: k-fold cross validation: split the data into k subsets of equal size, called folds. Iteratively use k-1 folds for training and the remaining one for testing, so that every fold is used for testing exactly once. We need it for the train-and-test paradigm when the amount of labeled data is limited, because it lets us use all of the data for both training and testing.
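A minimal sketch of k-fold cross validation, assuming scikit-learn; the Iris data and the decision tree are placeholders, not part of the original notes. Each fold’s score is the accuracy defined above: the fraction of its objects that were classified correctly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 folds: each fold is used once for testing, the other 4 for training.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=5, scoring="accuracy")

print("per-fold accuracy:", scores)      # fraction of correctly classified objects per fold
print("mean accuracy:", scores.mean())
```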
Explain the bias-variance tradeoff (in your own words): Bias represents the rate of systematic errors, and variance represents the random errors and sensitivity to noise. Overfitted models reduce the bias, but they are very sensitive to new data and therefore have high variance. Underfitted models make more systematic errors and therefore have high bias. There is a tradeoff between these two: you cannot completely minimize one without increasing the other. For example, we can grow a very deep tree to reduce the bias, but it will be overfitted and will have a high error rate on new data. We can construct a shallow tree instead, but then we will make systematic errors and predict the wrong thing most of the time.
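As a rough illustration (the synthetic dataset and the two depths are arbitrary assumptions), comparing a very shallow and an unrestricted decision tree on a train/test split shows the tradeoff:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, None):  # very shallow (underfit) vs. unlimited depth (overfit)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")

# The deep tree fits the training data almost perfectly (low bias) but loses
# accuracy on the test data (high variance); the shallow tree is the opposite.
```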
What are the different error types of a classifier?: There are two types of errors: false positives and false negatives. The classifier accuracy does not distinguish between them, but we are often more interested in one type of error than in the other. There are two measures that use these error types: precision and recall. Precision: if we classify something as positive, how certain can we be? It is the fraction of true positives among all predicted positives. Recall: how good are we at detecting positives? It is the fraction of true positives among all actual positives. A conservative classifier gives few yes answers with high certainty: high precision, low recall. An optimistic classifier gives many yes answers with lower certainty: low precision, high recall.
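A small sketch of both measures, assuming scikit-learn; the labels below are invented for illustration:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual classes (hypothetical)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # predicted classes (hypothetical)

# precision = true positives / all predicted positives
# recall    = true positives / all actual positives
print("precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 3/4 = 0.75
```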
Explain Cohen’s Kappa coefficient and the F-score: The F-score measures classifier accuracy using precision and recall. There are different variants: for example, the $F_1$ score is the harmonic mean of precision and recall, and with the weighted $F_\beta$ score we can adjust the relative importance of recall and precision. The F-score does not take true negatives into account.
Cohen’s Kappa coefficient measures how much better our classifier is than random guessing. We can also use it to compare the results of different classifiers.
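A hedged sketch of both measures with scikit-learn, reusing the invented labels from above:

```python
from sklearn.metrics import f1_score, fbeta_score, cohen_kappa_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical actual classes
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # hypothetical predicted classes

print("F1   :", f1_score(y_true, y_pred))              # harmonic mean of precision and recall
print("F2   :", fbeta_score(y_true, y_pred, beta=2))   # beta > 1 weights recall higher
print("kappa:", cohen_kappa_score(y_true, y_pred))     # agreement beyond random guessing
```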
What advantage do you see in using the informational loss instead of the quadratic loss?: If the classifier does not return a single class with 100% certainty, but rather a probability for each possible class, we can use loss functions to evaluate it. Before, we used the 0-1 loss function for classifiers that return a definite class for each object: the loss is 0 for a correct prediction and 1 otherwise.
Quadratic loss: the loss is the sum of squared differences between the predicted class probabilities and the true probabilities (1 for the actual class, 0 for all other classes). The classifier has to minimize this function. The quadratic loss has an upper limit, $Q \le 2$.
Informational loss: another measure of loss for classifiers that return probabilities is the informational loss, $-\log(p_c)$, where $c$ is the actual class of the object. If we output a probability of 1.0 for the actual class, we make no mistake and the informational loss is 0. The further the predicted probability moves away from the true class, the faster the informational loss grows.
The probabilities of the classes that are not the actual class do not matter in the calculation of the informational loss. The informational loss penalizes large errors more heavily. The quadratic loss function uses the probabilities of all classes, which involves unnecessary computations.
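A minimal NumPy sketch of both losses for a single object; the predicted probabilities are made up:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # predicted class probabilities (hypothetical)
a = np.array([1.0, 0.0, 0.0])   # actual class encoded as one-hot; class 0 is correct

# Quadratic loss: sum of squared differences over all classes (bounded by 2).
quadratic_loss = np.sum((p - a) ** 2)

# Informational loss: -log of the probability assigned to the actual class;
# the other class probabilities do not enter the formula.
informational_loss = -np.log2(p[0])

print("quadratic loss:    ", quadratic_loss)       # 0.09 + 0.04 + 0.01 = 0.14
print("informational loss:", informational_loss)   # -log2(0.7) ≈ 0.51
```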
What ways do you know to measure the quality of a classifier? Which one would you prefer in which situation?: It depends, among other things, on how balanced the data is and which errors we care about.
If the data set is highly imbalanced, i.e. it has many more objects of one class than of the other classes, then the standard accuracy is probably not meaningful: a classifier can always dummy-predict the majority class and still show a good accuracy rate (see the sketch after this list).
Do we care about false positives as much as about false negatives? If not, we can use the F-score and choose or adjust the weights by setting $\beta$.
If we want to penalize large errors more heavily, then we can use the informational loss.
We can use Cohen’s Kappa to compare different classifiers with each other.
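A sketch of the imbalanced-data pitfall from above, assuming scikit-learn and a synthetic dataset: a dummy classifier that always predicts the majority class gets high accuracy but a useless F-score.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Roughly 95% of the objects belong to class 0, 5% to class 1.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = dummy.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))  # looks good, around 0.95
print("F1      :", f1_score(y_test, y_pred))        # 0.0 — never detects the minority class
```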
What is a lift chart? How different is it from the ROC curve?:
Lift chart: we have some action, for example an advertisement, and we know the response rate of this advertisement, the conversion rate. We then use data-science tools to find a fraction of people who would respond to this advertisement with higher probability. The difference between the response rates is called the lift. We can build a visualisation that compares the number of responses to the fraction of the original sample contacted; this visualisation is called a lift chart. The lift chart can be used to perform a cost/benefit analysis.
ROC curve: ROC stands for Receiver Operating Characteristic. It uses the true positive rate and the false positive rate, and we create a 2D chart that plots these two rates against each other as the decision threshold varies. The area under this curve is often used as a measure of algorithm quality. Unlike the lift chart, which compares the number of responses to the fraction of the sample contacted, the ROC curve works directly with the true and false positive rates.
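A hedged sketch of the ROC curve and its area with scikit-learn; the synthetic data and the logistic regression model are placeholder assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points of the 2D ROC chart
print("AUC:", roc_auc_score(y_test, scores))      # area under the curve
```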
How can we evaluate the result of a regression model?: We can use the mean squared error. This measure is sensitive to outliers. We can also take its square root, use the mean of the absolute errors without squaring, or calculate the relative squared error. We can also calculate the R-squared coefficient, which shows how much better the predictions are compared to always predicting the average. And we can also use the correlation coefficient between the actual and predicted values.
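A minimal sketch of these regression measures, assuming scikit-learn and NumPy; the actual and predicted values are invented:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (hypothetical)
y_pred = np.array([2.5, 5.0, 3.0, 8.0])   # predicted values (hypothetical)

mse  = mean_squared_error(y_true, y_pred)      # sensitive to outliers
rmse = np.sqrt(mse)                            # square root, same units as the target
mae  = mean_absolute_error(y_true, y_pred)     # errors without the square
r2   = r2_score(y_true, y_pred)                # improvement over predicting the average
corr = np.corrcoef(y_true, y_pred)[0, 1]       # correlation of actual vs. predicted

print(mse, rmse, mae, r2, corr)
```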