What is a Linear Model?: Linear Models: Each object is described as a vector of attribute values, and we want to predict the class of a new object. The dividing line is given by a function $f_w(x) = w_1 * x + w_0$, and we want to learn the parameters $w_1$ and $w_0$. We could do this with OLS (ordinary least squares), but at large scale it is very expensive because it requires inverting a large matrix. Alternatively, we can train the model via gradient descent. This method is sensitive to outliers, and not all problems are linearly separable.
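As a minimal sketch of the OLS fit mentioned above (the toy x/y values are invented for illustration; `np.linalg.lstsq` is used so the matrix is not inverted explicitly):

```python
import numpy as np

# Toy 1-D data: attribute values and their (noisy) targets -- invented values.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

# Design matrix with a bias column, so the model is f_w(x) = w1 * x + w0.
X = np.column_stack([x, np.ones_like(x)])

# Closed-form OLS solution via least squares.
(w1, w0), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"f_w(x) = {w1:.2f} * x + {w0:.2f}")
```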
What are Support Vector Machines?: Support Vector Machines: Not all separating lines produced by a linear classifier are good, for example because of the strong influence of outliers. An SVM tries to find the separating line with the maximal margin; the training points lying on the margin are called support vectors. However, this approach requires much more training time than a simple linear classifier.
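A minimal sketch of training such a maximum-margin classifier, assuming scikit-learn is available; the toy data and the parameter choice `C=1.0` are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D training data with two classes.
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel: find the separating line with the maximal margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)
print("prediction for [5, 5]:", clf.predict([[5, 5]]))
```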
What are Decision Trees?: Learn a function that is represented as a tree. Inner nodes are attribute splits, leaves are class labels. The tree can be represented as a set of IF-ELSE rules. We want to find splits that lead to homogeneous partitions, but the tree has to stay simple to avoid overfitting.
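To illustrate the IF-ELSE view of a tree, here is a small hypothetical two-level tree written as plain Python conditionals (the attributes, thresholds, and class labels are invented for illustration):

```python
def classify(wind_speed, visibility):
    """Hypothetical decision tree predicting whether a flight is delayed."""
    if wind_speed > 30:          # first split: attribute wind_speed
        return "delayed"
    else:
        if visibility < 2:       # second split: attribute visibility
            return "delayed"
        else:
            return "on time"

print(classify(wind_speed=35, visibility=5))  # -> "delayed"
```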
What is a Bayesian Classifier?: Bayesian Classifiers: based on probabilities. Based on the likelihood of the observed data and on prior knowledge, we estimate the probabilities of the classes. A nice property is that if we get new data, we can update the existing model incrementally. We compute the probability of each class given an object and then choose the class with the highest probability. The class probabilities can be estimated from the data.
The probability that object o has class $c_j$ is the likelihood of observing object o in class $c_j$, multiplied by the prior probability of this class and normalized by the probability of observing the object: $p(c_j|o) = \frac{p(o|c_j) \cdot p(c_j)}{p(o)}$. The value $p(o)$ is the same for all classes and can simply be estimated from the data.
Object o is assigned the class that maximizes the product of the likelihood of observing this object in that class and the prior probability of that class, i.e. $c(o) = \arg\max_{c_j} p(o|c_j) \cdot p(c_j)$. $p(c_j)$ can be calculated as the number of objects of class $c_j$ divided by the overall number of objects. The estimation of $p(o|c_j)$ - the likelihood of observing object o in class $c_j$ - is the hardest part.
We have to estimate the conditional probabilities $p(o|c_j)$ to learn the model. We can estimate these parameters via maximum likelihood estimation. A problem arises if the objects have too many attributes: we would need a huge amount of data to fit the model. We can either reduce the dimensionality of the data with PCA, or assume statistical independence between the attributes.
A Bayes classifier with assumed independence between the attributes is called a Naive Bayes classifier. The likelihood can then be calculated as the product of the likelihoods of the single attributes: $p(o|c_j) = \prod_i p(o_i|c_j)$.
If an attribute is categorical, we can estimate $p(o_i|c_j)$ by counting; if the attribute is continuous, we fit a Normal distribution to this attribute and then calculate $p(o_i|c_j)$ from its probability density function. Both can be computed easily.
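A compact sketch of this per-attribute estimation for a Naive Bayes classifier with one categorical and one continuous attribute (the data values, attribute names, and class labels are assumptions made only for illustration):

```python
import numpy as np
from collections import Counter

# Toy training data: (weather, wind_speed, class) -- invented values.
data = [("sunny", 10.0, "on time"), ("rainy", 25.0, "delayed"),
        ("sunny", 12.0, "on time"), ("rainy", 30.0, "delayed"),
        ("sunny", 28.0, "delayed"), ("rainy", 11.0, "on time")]

classes = {c for *_, c in data}
prior = {c: sum(1 for *_, cc in data if cc == c) / len(data) for c in classes}

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def likelihood(weather, wind, c):
    rows = [(w, v) for w, v, cc in data if cc == c]
    # Categorical attribute: estimate p(weather | c) by counting.
    p_weather = Counter(w for w, _ in rows)[weather] / len(rows)
    # Continuous attribute: fit a Normal distribution and use its density.
    winds = np.array([v for _, v in rows])
    p_wind = gaussian_pdf(wind, winds.mean(), winds.std(ddof=1))
    return p_weather * p_wind

def classify(weather, wind):
    # Naive Bayes decision rule: argmax over p(c) * prod_i p(o_i | c).
    return max(classes, key=lambda c: prior[c] * likelihood(weather, wind, c))

print(classify("rainy", 27.0))  # -> "delayed" on this toy data
```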
If we cannot assume that the attributes are independent, we use a non-naive Bayes classifier. In that case we can model $p(o|c_j)$ with a multivariate Gaussian or with a Gaussian mixture model. This approach is much more expensive than the Naive Bayes classifier.
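A minimal sketch of the multivariate-Gaussian variant, fitting one full-covariance Gaussian per class with `scipy.stats.multivariate_normal` (the toy data and equal class priors are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy 2-D data per class (invented values).
X0 = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 2.5], [2.2, 2.1]])  # class 0
X1 = np.array([[6.0, 7.0], [7.5, 6.5], [6.8, 7.2], [7.1, 6.9]])  # class 1

# One multivariate Gaussian per class: mean vector and full covariance matrix.
g0 = multivariate_normal(mean=X0.mean(axis=0), cov=np.cov(X0.T))
g1 = multivariate_normal(mean=X1.mean(axis=0), cov=np.cov(X1.T))
prior0 = prior1 = 0.5  # equal class priors assumed in this toy example

x = np.array([2.0, 2.0])
scores = {0: prior0 * g0.pdf(x), 1: prior1 * g1.pdf(x)}
print(max(scores, key=scores.get))  # -> 0
```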
You have to add some dummy counts for all attribute values, to avoid the case $p(o_i|c_j) = 0$, which would otherwise eliminate the whole product.
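One common way to add such dummy counts is Laplace smoothing (named here as the standard textbook technique, which I assume is what is meant), where every attribute value gets one extra count:

$$p(o_i|c_j) = \frac{N_{ij} + 1}{N_j + m}$$

where $N_{ij}$ is the number of objects of class $c_j$ with attribute value $o_i$, $N_j$ the number of objects of class $c_j$, and $m$ the number of distinct values of this attribute.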
Pros and Cons of all classifiers:
Regression vs Classification: Regression is different from classification. Classification predicts categorical class labels, while regression models continuous functions.
Explain the k-NN classifier. Can we also use it for regression?: For example, if we want to predict the flight delay for a given wind speed, we can look up the flight delay values of the k nearest neighbours of the query object and compute the prediction as the mean of these values.
If we have, for example, two-dimensional data and know the value of one attribute, we can predict the corresponding value of the second attribute of this object. To do this, we find the k nearest objects according to the known attribute and compute the mean of their second-attribute values. This value is our prediction.
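A minimal sketch of this k-NN regression idea with plain NumPy (the value of k, the toy wind-speed/delay data, and the query point are illustrative assumptions):

```python
import numpy as np

# Toy training data: known attribute (wind speed) and target (delay in minutes).
wind_speed = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
delay      = np.array([2.0,  5.0,  9.0, 15.0, 22.0, 35.0])

def knn_regress(query, k=3):
    # Indices of the k nearest neighbours in the known attribute.
    nearest = np.argsort(np.abs(wind_speed - query))[:k]
    # Prediction = mean of the neighbours' target values.
    return delay[nearest].mean()

print(knn_regress(18.0))  # mean of the delays at wind speeds 15, 20, 25
```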
How can we learn the weights in a linear model?: We can do it with gradient descent. We define the training error as some squared-difference function between the expected values and the computed values. We want to find the weights that minimize this training error function. We update the weights iteratively, based on their individual contribution to the training error, i.e. the gradient. We iterate until some maximum number of iterations is reached. Gradient descent converges because the training error function is convex.
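A short sketch of this procedure for the 1-D model $f_w(x) = w_1 * x + w_0$ with squared error (the learning rate, iteration count, and toy data are illustrative assumptions):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

w1, w0 = 0.0, 0.0          # initial weights
lr, max_iter = 0.01, 2000  # learning rate and maximum number of iterations

for _ in range(max_iter):
    pred = w1 * x + w0
    error = pred - y
    # Gradient of the mean squared training error w.r.t. each weight.
    grad_w1 = 2 * np.mean(error * x)
    grad_w0 = 2 * np.mean(error)
    # Update each weight against its contribution to the error.
    w1 -= lr * grad_w1
    w0 -= lr * grad_w0

print(f"f_w(x) = {w1:.2f} * x + {w0:.2f}")
```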
How do we build a decision tree?: We want to approximate a discrete target function. Building the tree consists of two different phases: tree construction and tree pruning. For tree construction we initialize an empty tree, then recursively choose attributes and create splits on these attributes. This phase serves to increase the classification accuracy on the training data.
For tree pruning we identify and remove branches and nodes that were caused by noise and outliers. Pruning may occur during construction or afterwards; it is needed to avoid overfitting.
Attributes and split positions are selected based on heuristics and different statistics. We repeat these steps until we reach some condition to stop the partitioning - we ran out of samples, all samples of the partition belong to the same class, the partition has fallen below some minimal size, etc.
How can we decide which attribute to split on and where to split?: We have a set of objects and want to split it into partitions. We need some measure that helps us decide on which attribute to split. We want the resulting partitions to be as homogeneous as possible, i.e. to minimize their impurity. We have seen three common ways to do this.
Information Gain: Calculate the entropy of the original partition. Then, for each attribute, calculate the entropy values of all partitions induced by this attribute. The difference between the entropy of the original partition and the weighted average of the entropies of the induced partitions is the information gain: $IG(T, A) = H(T) - \sum_i \frac{|T_i|}{|T|} H(T_i)$. Then split on the attribute that gives us the largest information gain.
Gini Index: One minus the sum of the squared class probabilities, $gini(T) = 1 - \sum_j p_j^2$. A high Gini value means that the classes are roughly equally probable in this partition, so this attribute is a bad choice for a split. A small value means that one class dominates, and we want to use this split.
Misclassification Error: One minus the probability of the most frequent class in this partition, $error(T) = 1 - \max_j p_j$. It behaves like a linear version of the Gini index.
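A small sketch computing all three impurity measures for a single partition, given its class labels (the label values are only illustrative):

```python
import numpy as np
from collections import Counter

def impurities(labels):
    """Entropy, Gini index and misclassification error of one partition."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()             # class probabilities in the partition
    entropy = -np.sum(p * np.log2(p))     # H(T)
    gini = 1.0 - np.sum(p ** 2)           # 1 - sum_j p_j^2
    misclass = 1.0 - p.max()              # 1 - max_j p_j
    return entropy, gini, misclass

print(impurities(["a", "a", "a", "b"]))   # fairly pure partition
print(impurities(["a", "b", "a", "b"]))   # maximally impure partition
```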
How can we decide where to split if we know the attribute?: So, we know the attribute that will be used for the split. We can sort the data by this attribute. Then we can test some or all of the n-1 possible split positions and pick the one with, for example, the best information gain.
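A sketch of this search over candidate split positions for one numeric attribute, scoring each of the n-1 splits by information gain (the toy values and the midpoint choice of threshold are assumptions):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    h_total, n = entropy(labels), len(labels)
    best = (None, -np.inf)
    for i in range(1, n):                  # the n-1 possible split positions
        left, right = labels[:i], labels[i:]
        h_split = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = h_total - h_split           # information gain of this split
        if gain > best[1]:
            # Threshold = midpoint between the two neighbouring sorted values.
            best = (float((values[i - 1] + values[i]) / 2), float(gain))
    return best

wind = [5, 10, 15, 20, 25, 30]
delayed = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(wind, delayed))  # -> (17.5, 1.0)
```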
Explain different ways to avoid overfitting when building a tree: There are two approaches to avoid overfitting.
Postpruning: Prune after tree construction. Incrementally remove branches bottom-up to obtain a set of progressively pruned trees. Then use a separate data set to pick the best pruned tree from the set of candidates. Use some coefficient to prefer smaller trees.
Prepruning: Prune the tree during construction. Maintain a global goodness measure and do not split a partition if this would cause the goodness measure to fall below some threshold. For example, require a minimum number of objects in a leaf, usually much larger than one, or set a minimal fraction of the majority class in a leaf. The main problem of prepruning is picking the right threshold; therefore prepruning is harder than postpruning.
Explain which measures we can use for postpruning: We can use a measure that is calculated as a weighted sum of the tree size and the classification error.
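One common form of such a measure is the cost-complexity criterion (stated here in the standard textbook formulation, which I assume matches the lecture):

$$cost_\alpha(T) = error(T) + \alpha \cdot |T|$$

where $error(T)$ is the classification error of tree $T$, $|T|$ its number of leaves, and $\alpha \geq 0$ a coefficient that controls how strongly smaller trees are preferred.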
Explain the ID3 algorithm: ID3 builds the decision tree top-down: start with all training objects in the root, choose the attribute with the highest information gain, split on it, and recurse on the resulting partitions until a stopping condition is reached (see the construction and split-selection steps above).
How can we build or use a tree while considering that different mistakes may have different costs?: There are two different types of mistakes: false positives and false negatives. They are not equally bad - a false positive can have much larger negative consequences than a false negative, but it depends on the context. A first possible solution is to give the classifier more negative samples in the learning phase; this works for any classifier. If we do not have many training objects, we can duplicate the important samples or use sample weights. Another possibility is to use the conditional risk.
We can use the conditional risk after classification, in the evaluation of the results. We calculate the probability of each class for the object and multiply it with positive coefficients for correct decisions and negative coefficients for incorrect ones. The negative coefficients have to be larger in magnitude than the positive ones, in order to penalize some kinds of errors more strongly.
Here x is some data point, e.g. a customer with certain attributes, and $c_j$ is a class, for example grant a loan or reject a loan.
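Written out in the usual textbook notation (which I assume matches the lecture slides), the conditional risk of deciding class $c_j$ for data point x is the expected cost over the possible true classes:

$$R(c_j \mid x) = \sum_i \lambda(c_j, c_i) \, p(c_i \mid x)$$

where $\lambda(c_j, c_i)$ is the cost (or, with opposite sign, the gain) of deciding $c_j$ when the true class is $c_i$; the classifier then chooses the class with the lowest conditional risk, or equivalently the highest expected gain.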
Explain the difference between a Bayes Classifier and a Naive Bayes Classifier. What changes in the decision rule?: We assume that the objects are given as d-dimensional vectors of attribute values, $o = (o_1, ..., o_d)$. The Naive Bayes classifier assumes that the attribute values $o_i$ are conditionally independent given the class. We can then calculate the class-conditional probability from the data as the product of the per-attribute probabilities, so the decision rule changes from $\arg\max_{c_j} p(c_j) \cdot p(o|c_j)$ to $\arg\max_{c_j} p(c_j) \cdot \prod_{i=1}^{d} p(o_i|c_j)$.