Which types of clustering algorithms do you know?: Partitioning (representative-based) methods such as k-Means, k-Medoids, k-Modes and k-Median; model-based (probabilistic) methods such as Expectation Maximization with a mixture of Gaussians; and density-based methods such as DBSCAN.
What is clustering?: Grouping a database of objects into clusters so that objects in the same cluster are similar and objects in different clusters are dissimilar. Clustering is a form of unsupervised classification: the classes are not given in advance, we have to find them and assign objects to them.
Explain k-Means Clustering algorithm: Choose k initial means (for example k random data points). Then iterate two steps: assign every object to its closest mean, and recompute each mean as the average of the objects assigned to it. Stop when the assignments (or the means) no longer change. The result is a local optimum of the sum of squared distances between objects and their cluster means.
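A minimal sketch of this iteration (Lloyd's algorithm) in Python with NumPy; the function and variable names are illustrative, not taken from any particular library:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # pick k distinct data points as initial means
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: nearest mean for every point
        dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update step: recompute each mean from its assigned points
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):   # converged
            break
        means = new_means
    return labels, means
```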
What is k-Medoid, Mode, Median?:
k-Medoids: the representatives are not the means of the point cloud, but actual data points lying in the middle of their cluster.
k-Modes: representatives are defined by the most frequent values. We compute the mode (the most common value) in each dimension independently and form a mode vector. k-Modes works like k-Means for categorical data.
k-Median: artificial representative objects in the middle. We compute the median of each dimension independently: the median of all values on the x-axis, the median of all values on the y-axis, and so on, and combine them into one representative point.
Comparison: k-Means needs numeric data and is sensitive to outliers because it uses artificial mean vectors; k-Median is more robust to outliers but its representative is still an artificial point; k-Medoids works with any distance function and uses actual data points as representatives, which makes it more robust but also more expensive; k-Modes is the counterpart for categorical data.
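A small NumPy illustration of how the different representatives are computed (the data and values are purely illustrative):

```python
import numpy as np
from collections import Counter

points = np.array([[1.0, 2.0], [2.0, 9.0], [4.0, 3.0]])

# k-Median style representative: per-dimension median (an artificial point)
median_rep = np.median(points, axis=0)                      # [2.0, 3.0]

# k-Medoid style representative: the actual point with minimal total distance to the others
dists = np.linalg.norm(points[:, None] - points[None, :], axis=2).sum(axis=1)
medoid_rep = points[dists.argmin()]

# k-Modes style representative for categorical data: most frequent value per attribute
categorical = [("red", "S"), ("red", "M"), ("blue", "M")]
mode_rep = tuple(Counter(col).most_common(1)[0][0] for col in zip(*categorical))  # ("red", "M")
```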
How can we find the k-Medoids? What is the complexity of k-Medoid?: We can use the PAM algorithm (Partitioning Around Medoids) to find the medoids.
Select k objects as initial medoids.
Assign each non-medoid to the closest medoid.
Compute the current quality of the clustering. The quality is the sum of distances from all non-medoid objects to the medoid of their cluster (over all clusters). Computing this quality takes n operations for n data points, i.e. O(n).
For each pair of a medoid and a non-medoid, swap their roles and compute the quality of the resulting clustering. This gives k·(n−k) candidate swaps; evaluating one candidate costs O(n), so evaluating all candidates costs O(k·(n−k)·n), roughly O(n^2) for a fixed k. Select the swap that minimizes the total distance, perform it on the clustering, and repeat this step until no swap improves the quality.
High complexity: for t iterations roughly $O(t \cdot k \cdot (n-k) \cdot n)$, which is about $O(t \cdot n^2)$ when k is small compared to n.
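A naive PAM sketch in Python/NumPy following the steps above (names are illustrative; no optimizations, so the cost of scanning all swaps is visible directly):

```python
import numpy as np

def clustering_cost(X, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best = clustering_cost(X, medoids)
    for _ in range(max_iter):
        best_swap = None
        # try every (medoid, non-medoid) swap and remember the best one
        for i in range(len(medoids)):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o
                cost = clustering_cost(X, candidate)
                if cost < best:
                    best, best_swap = cost, candidate
        if best_swap is None:        # no improving swap left: converged
            break
        medoids = best_swap
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels
```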
How can we improve the performance of k-Medoids?: There are different modifications of the PAM algorithm. We can, for example, run PAM on a small representative sample: the complexity stays the same, but with a smaller n (this is the idea behind CLARA). If the sample is too small, however, the result can differ considerably from running PAM on the full data.
Alternatively, we can introduce randomized search into PAM (as in CLARANS). PAM repeats its loop until convergence, but we can set a maximal number of iterations; PAM evaluates all possible swaps, but we can evaluate only a limited number of random (medoid, non-medoid) pairs.
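A hedged sketch of the randomized variant: instead of scanning all k·(n−k) swaps, only a fixed number of random pairs is tested per iteration. It reuses the hypothetical `clustering_cost` helper from the PAM sketch above; the names and the `max_neighbours` parameter are illustrative:

```python
import numpy as np

def randomized_pam_step(X, medoids, rng, max_neighbours=50):
    """One improvement attempt: test a limited number of random swaps."""
    current = clustering_cost(X, medoids)
    for _ in range(max_neighbours):
        i = rng.integers(len(medoids))          # position of the medoid to replace
        o = rng.integers(len(X))                # random candidate point
        if o in medoids:
            continue
        candidate = medoids.copy()
        candidate[i] = o
        if clustering_cost(X, candidate) < current:
            return candidate, True              # accept the first improving swap
    return medoids, False                       # no improvement found this round
```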
What can we do if we do not know parameter k upfront?: Usually we do not know k. The simplest approach is to perform many clusterings, for k = 1 up to some large k, and compare them. The problem is that we need a quality function to evaluate the clustering results; the Silhouette Coefficient is one way to do this.
Explain the Silhouette coefficient, write down the formula!: We want similar objects in one cluster and dissimilar objects in different clusters; the Silhouette Coefficient combines these two aspects. The clustering is given and we want to evaluate it, so we compute two values for each object o.
a(o) is the average distance between object o and all other objects in its own cluster.
b(o) is the average distance between object o and all objects in the second-closest cluster (the nearest cluster that o does not belong to).
The silhouette of object o is then $s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$, with $s(o) = 0$ if o is the only object in its cluster.
The clustering is good for object o when a(o) is small and b(o) is large; in that case s(o) is close to 1. Values near 0 mean o lies between two clusters, and negative values mean o is probably assigned to the wrong cluster.
The Silhouette Coefficient of a clustering is the average of the silhouette values s(o) over all objects (their sum normalized by the number of objects).
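A sketch that implements the formula directly with NumPy and uses it to pick k; the names are illustrative, it assumes the `k_means` sketch from above and some data matrix `X` (scikit-learn's `silhouette_score` implements the same measure):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient s(o) = (b(o) - a(o)) / max(a(o), b(o))."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        if same.sum() == 1:                                     # o is alone in its cluster
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)                 # average to own cluster, excluding o
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# pick k by maximizing the silhouette over several clusterings (assumes k_means and X from above)
best_k = max(range(2, 10), key=lambda k: silhouette(X, k_means(X, k)[0]))
```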
Explain Expectation Maximization procedure: We assume that the clusters are generated by a probability distribution: we choose a model and then try to learn its parameters. This is done with the Expectation Maximization (EM) procedure: start with some initial parameter values and compute the likelihood of the data under them, then iteratively improve the parameters until some quality threshold is reached.
We usually use the log-likelihood instead of the likelihood, because the sum of logarithms is numerically more stable than the product of many small probabilities.
A popular choice is a mixture of Gaussian distributions, where each Gaussian component corresponds to one cluster. The mixture can be built from univariate Gaussians with only two parameters each (mean and variance) or from multivariate Gaussians with a mean vector and a covariance matrix per component.
In that case we have to find the means, covariance matrices and mixing coefficients that maximize the likelihood of observing the given data.
EM gives more informative results than k-Means: each object receives a probabilistic (soft) cluster membership instead of a hard assignment based only on distances, and the covariance matrices allow non-spherical clusters.
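A minimal sketch fitting such a Gaussian mixture with EM via scikit-learn (assuming scikit-learn is available; the toy data is purely illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: two Gaussian blobs in 2D
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(gmm.weights_)              # mixing coefficients
print(gmm.means_)                # one mean vector per Gaussian / cluster
print(gmm.covariances_)          # one covariance matrix per Gaussian
print(gmm.score(X))              # average log-likelihood of the data under the model
print(gmm.predict_proba(X)[:3])  # soft (probabilistic) cluster memberships
```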
Explain DBSCAN algorithm: We want to connect points with high local density into clusters. The local density of a point q is defined by two parameters:
$\epsilon$: radius of the neighbourhood of point q
$MinPts$: minimum number of points in that neighbourhood
q is called a core object if its $\epsilon$-neighbourhood contains at least $MinPts$ points.
DBSCAN: $\epsilon$ and $MinPts$ are given.
For each not yet classified object o: if o is a core object, start a new cluster and add all objects that are density-reachable from o (expanding the cluster over further core objects); otherwise mark o as noise for now (it may later turn out to be a border object of some cluster).
Complexity: O(n · cost of one neighbourhood query); a naive query scans all points in O(n), with a spatial index it can be much cheaper.
Results are not 100% deterministic: the order in which objects are visited matters, because a border object reachable from several clusters is assigned to whichever cluster claims it first.
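A naive DBSCAN sketch in Python/NumPy following the description above (brute-force neighbour search, so each query is O(n); function and label conventions are illustrative):

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    labels = np.full(n, -2)          # -2 = not yet classified, -1 = noise
    cluster_id = 0
    for o in range(n):
        if labels[o] != -2:
            continue
        neighbours = np.flatnonzero(D[o] <= eps)
        if len(neighbours) < min_pts:        # not a core object (may become a border object later)
            labels[o] = -1
            continue
        labels[o] = cluster_id               # o is a core object: start a new cluster and expand it
        queue = deque(neighbours)
        while queue:
            q = queue.popleft()
            if labels[q] == -1:              # former noise becomes a border object
                labels[q] = cluster_id
            if labels[q] != -2:
                continue
            labels[q] = cluster_id
            q_neigh = np.flatnonzero(D[q] <= eps)
            if len(q_neigh) >= min_pts:      # q is itself a core object: keep expanding
                queue.extend(q_neigh)
        cluster_id += 1
    return labels
```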
What is the difference between a core/border/noise object?:
A core object has at least $MinPts$ points in its $\epsilon$-neighbourhood.
A border object is not a core object itself, but lies in the $\epsilon$-neighbourhood of a core object and therefore belongs to that cluster.
Noise objects are neither core nor border objects; they lie outside all clusters.
How can we work around the sensitivity of DBSCAN's parameters?: The choice of $\epsilon$ and $MinPts$ determines how sensitive DBSCAN is. There is a heuristic for choosing them: compute the distance to the k-nearest neighbour for every object, i.e. for a fixed $MinPts = k$ determine the $\epsilon$ each object would need individually. Then plot these k-distances for all objects, sorted in decreasing order, and choose $\epsilon$ at the first "valley" (knee) of this k-distance plot; objects with larger k-distances are treated as noise.
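A sketch of this k-distance heuristic with NumPy and Matplotlib (assuming Matplotlib is available; the knee is read off visually):

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k):
    """Plot the distance to the k-th nearest neighbour for every object, sorted decreasingly."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # column k of the row-sorted distance matrix is the k-th nearest neighbour
    # (column 0 is the distance of each point to itself)
    k_dist = np.sort(D, axis=1)[:, k]
    k_dist = np.sort(k_dist)[::-1]           # decreasing order
    plt.plot(k_dist)
    plt.xlabel("objects sorted by k-distance")
    plt.ylabel(f"{k}-distance")
    plt.show()
    # choose eps at the first 'valley' (knee) of this curve and set MinPts = k
```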