What is an Outlier according to Hawkins?: “An outlier is an observation which deviates very much from the other observations. It was likely generated by a different mechanism.”
Why should Outlier Detection be considered unsupervised?: If we have some labeled outliers, then it is more of a classification task than outlier detection. If in the future a new outlier appears that was never seen before, we will not be able to detect it, even with a model that was trained on some labeled outliers. Outliers are very rare and unpredictable.
Do you see a relationship between Clustering and Outlier Detection?: Yes, they are antipodes in some sense. We can use the output of clustering algorithms and consider the noise objects that lie outside the clusters as outliers.
Describe an exemplary Outlier Detection task: For example, we have an online game and data about the players: win rates, success values, things like that. We can use this data to detect outliers. These outliers can be exceptionally good gamers that we may want to invite to professional teams, or they can represent players that are using cheats or exploits, whom we will investigate and maybe ban.
What types of outliers do you know?: There are three types of outliers. Global outliers deviate from the majority of the data; most outlier detection methods deal with global outliers. Local outliers depend on the context: “The temperature is 28°C today, is that exceptional?” For December in Karlsruhe, yes; for June, it is not exceptional. Collective outliers are abnormal bursts of density in the observations.
Which Statistical approaches do you know?: We assume that some underlying model generates the data. We can consider as outliers the objects that would be generated with low probability, in the tail of a normal distribution, for example. We can use a Z-score: subtract the mean from the value and divide by the standard deviation. If the Z-score is low (0 for the mean value itself), we have no outlier; a Z-score of 1.0 indicates a value that is one standard deviation from the mean. A Z-score larger than 2 can be interpreted as an outlier.
Therefore, the Z-score z = (x − μ) / σ tells us how many standard deviations lie between the mean and the value.
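A minimal sketch of this in Python/NumPy; the data values and the threshold of 2 are just illustrative choices:

```python
import numpy as np

# Made-up measurements; 14.7 is the planted outlier.
values = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 14.7, 9.95])

mean = values.mean()
sigma = values.std()

# Z-score: how many standard deviations each value lies from the mean.
z_scores = (values - mean) / sigma

# Rule of thumb: |z| > 2 is suspicious.
outliers = values[np.abs(z_scores) > 2]
print(z_scores)
print("Outliers:", outliers)  # only 14.7 is flagged
```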
Outlier detection can also be a side product of clustering: all objects that the clustering algorithm labels as noise can be considered outliers.
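For example, with scikit-learn's DBSCAN the noise label -1 gives us the outliers directly; the toy data and the eps/min_samples values here are just guesses for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense clusters plus one isolated point.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    [[10.0, 10.0]],
])

# eps and min_samples chosen for this toy data; they need tuning in practice.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# DBSCAN marks noise with label -1; we treat those points as outliers.
outliers = X[labels == -1]
print(outliers)
```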
Describe the limitations of statistical methods: It can be very difficult to estimate the parameters of the distribution when we have multidimensional data. The estimation of mean and variance is itself affected by the outliers, a chicken-and-egg problem. And in many cases, we simply do not know the distribution.
Which Distance-based approaches do you know?: Distance-based approaches can be used to overcome the main limitations of statistical methods: they make no assumptions about the data distribution. If the neighbourhood of an object o with some given radius contains fewer objects than a given fraction of the dataset, then we can consider o an outlier. For two given parameters, radius and fraction, each point in the database is either an outlier or not.
We can also drop the fixed radius parameter and instead compute the radius of the k-neighbourhood for each point o. Then we can sort the list by the value of this radius and read off the top outliers.
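A sketch of this k-neighbourhood-radius ranking with scikit-learn's NearestNeighbors; the data and the choice of k are made up:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0], [7.0, -5.0]]])

k = 5
# k+1 because every point is its own nearest neighbour with distance 0.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# k-NN distance = radius of the k-neighbourhood of each point.
knn_dist = distances[:, k]

# Rank by decreasing k-NN distance; the top entries are the outlier candidates.
ranking = np.argsort(-knn_dist)
for idx in ranking[:3]:
    print(idx, knn_dist[idx])
```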
Is it preferable to have an outlier ranking or rather a yes/no decision?: It is better to have an outlier ranking. Yes/no decisions can miss outliers that are located near a dense region, for example. It is better to have a sorted list of outliers with their scores, and then decide where to cut this list.
Which Density-based approaches do you know?: Outlierness is relative to the neighbours: the outlierness of o depends on its own density and on the density of its neighbours.
There is a popular density-based approach called LOF, the Local Outlier Factor. It is a measure that quantifies the outlierness based on the density around the objects. Basic idea: compare the radius of the k-neighbourhood of o to the radii of the k-neighbourhoods of its neighbours. We calculate a LOF score for each object in the data set; the output list contains all objects sorted by decreasing LOF score. If LOF(o) is high, o is probably an outlier. k is a given parameter, like MinPts in DBSCAN.
LOF creates a ranked list of outliers; the threshold must be set empirically. The parameter k is also hard to set. I think it was mentioned that a good start is 2 times the dimension, but the best value varies from one data set to another.
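A short example with scikit-learn's LocalOutlierFactor; the data and n_neighbors=20 are made up for illustration. Note that scikit-learn stores a negative LOF score:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(100, 2)), [[5.0, 5.0]]])

# n_neighbors corresponds to the parameter k; it must be tuned per data set.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# scikit-learn reports the *negative* LOF score: close to -1 means inlier,
# strongly negative means outlier; we flip the sign to get the usual score.
scores = -lof.negative_outlier_factor_

ranking = np.argsort(-scores)
for idx in ranking[:3]:
    print(idx, scores[idx])
```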
Which Neural-based approaches do you know?:
Autoencoders: These are neural networks that encode the input data and then decode it again. We want to learn the mapping from the input layer to the bottleneck layer, and the mapping from the bottleneck layer to the output layer. With just these 3 layers we call the network a shallow network; otherwise we speak of deep learning. The bottleneck layer has fewer neurons than the number of dimensions in the input and output data.
We can train them and then feed in the real data. Normal data will be encoded and decoded correctly, while for outliers the decoding results will differ noticeably from the input.
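A minimal sketch of this idea in PyTorch, assuming made-up 8-dimensional toy data; the layer sizes and training settings are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 200 "normal" 8-dimensional points plus 3 planted anomalies.
normal = torch.randn(200, 8) * 0.1 + 1.0
anomalies = torch.randn(3, 8) * 0.1 - 3.0

# Shallow autoencoder: input -> bottleneck -> output,
# with the bottleneck smaller than the input dimensionality.
model = nn.Sequential(
    nn.Linear(8, 3),   # encoder
    nn.ReLU(),
    nn.Linear(3, 8),   # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Train on the (mostly) normal data only.
for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(normal), normal)
    loss.backward()
    optimizer.step()

# Score new points by reconstruction error: normal points are
# reconstructed well, outliers are not.
with torch.no_grad():
    test = torch.cat([normal[:3], anomalies])
    errors = ((model(test) - test) ** 2).mean(dim=1)
print(errors)  # the last three errors should be much larger
```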
Self-Organizing Maps: A projection of n-dimensional data onto a (usually 2D) grid of neurons. Each neuron contains an n-dimensional weight vector. For each data point we adjust the weights stored in the neurons: first we find the neuron closest to our point, and then adjust the weights of all neighbouring neurons in the map. We move the weights in the direction of the input data, proportional to the distance: if the weights are near the data point, we move them a little bit; if they are far away, we make a big move.
We can use them for outlier detection. After training, the Self-Organizing Map is adjusted to the majority of the data objects. Outliers tend to be further away from their nearest neurons than inliers.
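A from-scratch sketch of this procedure in NumPy; the grid size and the learning-rate/radius schedules are arbitrary choices for illustration, not canonical settings. The distance of each point to its nearest neuron (quantization error) serves as the outlier score:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(200, 3)), [[8.0, 8.0, 8.0]]])  # one outlier

# A 5x5 SOM: each grid cell holds a 3-dimensional weight vector.
grid = 5
weights = rng.normal(size=(grid, grid, 3))
coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                              indexing="ij"), axis=-1)

epochs = 20
for epoch in range(epochs):
    lr = 0.5 * (1 - epoch / epochs)                    # decaying learning rate
    radius = grid / 2 * (1 - epoch / epochs) + 1e-9    # decaying neighbourhood
    for x in X[rng.permutation(len(X))]:
        # Best-matching unit: the neuron closest to the input vector.
        dist = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(dist.argmin(), dist.shape)
        # Move the BMU and its grid neighbours towards the input,
        # weighted by their distance to the BMU on the map.
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        weights += lr * influence[..., None] * (x - weights)

# Outlier score: quantization error, i.e. distance to the nearest neuron.
scores = np.array([np.linalg.norm(weights - x, axis=-1).min() for x in X])
print(np.argsort(-scores)[:3])  # the planted outlier should rank on top
```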
Restricted Boltzmann Machines: A stochastic neural network that learns a probability distribution over a data set.
How can one use Neural Networks to detect outliers?: We want to learn a model that can encode and decode normal data well; the abnormal objects will not be reconstructed correctly. We feed the data into the neural network. It encodes the data and then decodes it. Most points will be decoded correctly, but the reconstruction of some points will not correspond to the original point data. These points are the outliers.
Why is it harder to find outliers in high-dimensional spaces?: There are two problems with high-dimensional data. First, outliers can simply hide in subspaces: points that are normal core objects in some dimensions can be complete outliers in others. For example, take the question “Find the outlier students at KIT”. What is an outlier student? A student with 1.0 in all exams? A student that makes 20k a month? A student that is two and a half meters tall? We have to look only at the interesting subspaces and apply the basic data mining algorithms in each of them.
The second problem is the so-called curse of dimensionality. If we have many dimensions, the traditional distance functions become useless, because all data points tend to be equally far away from each other in the sparse space. We have to use more robust distance functions, for example angle-based ones.
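A quick demonstration of this distance concentration with random uniform data: the ratio between the nearest and the farthest distance to a query point approaches 1 as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(4)

for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(500, d))      # 500 random points in d dimensions
    q = rng.uniform(size=d)             # random query point
    dists = np.linalg.norm(X - q, axis=1)
    # Ratio of nearest to farthest distance; it approaches 1 as d grows,
    # so "near" and "far" lose their meaning.
    print(d, dists.min() / dists.max())
```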
How does Angle-Based outlier detection work?: For each object o, calculate the angle xoy for every pair of other points x and y. Then calculate the variance of these angles for each object in the database. Outliers will likely have a small variance of angles, because an outlier lies outside the data and sees all other points in roughly one direction, while inliers are surrounded by the data and have a large variance of angles.
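A naive O(n³) sketch of this angle-variance idea on toy data; note that the full ABOD measure additionally weights the angles by the distances to x and y, which is omitted here:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(size=(40, 2)), [[6.0, 6.0]]])  # one planted outlier

def angle_variance(o, points):
    # Variance of the angles xoy over all pairs (x, y) of other points.
    others = [p for p in points if not np.array_equal(p, o)]
    angles = []
    for x, y in combinations(others, 2):
        a, b = x - o, y - o
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.var(angles)

scores = np.array([angle_variance(o, X) for o in X])
# Small angle variance -> all other points lie in roughly one direction
# -> likely an outlier.
print(np.argsort(scores)[:3])
```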
Trivial vs Non-trivial outlier: o is a trivial outlier in some space if there exists a subspace of this space such that o is an outlier in this subspace. A trivial outlier is an outlier that can already be seen, for example, in a 1-D or 2-D representation of the dataset.
In this example, Chibs and Blockt Herta are the trivial outliers.
Strong vs Weak outlier: o is a strong outlier in some space if no outlier exists in any subspace of this space. This means that o cannot be seen as an outlier in any lower dimension.
If o is a strong outlier, then o is also a non-trivial outlier.