How can we categorize data by type?: There are two common types of data: categorical data, which stores non-numerical values, and numerical data. Categorical data has two subtypes. The first is nominal data, which stores categories without any order, for example Red, Yellow and Green. The second is ordinal data, whose categories lie on some scale, for example small, normal and large. Numerical data also has two subtypes: discrete data and continuous data. Discrete data stores values from some discretized scale, for example natural numbers from one to ten. Continuous data stores values from some interval of real numbers.
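A small illustration in Python (using pandas, with made-up example values) of the four subtypes:

import pandas as pd

# Made-up example values illustrating the four subtypes.
df = pd.DataFrame({
    "color": ["Red", "Yellow", "Green"],    # categorical, nominal: no order
    "size": ["small", "normal", "large"],   # categorical, ordinal: ordered scale
    "count": [1, 7, 10],                    # numerical, discrete: integers 1..10
    "height": [1.62, 1.754, 1.9001],        # numerical, continuous: real values
})
df["color"] = pd.Categorical(df["color"])
df["size"] = pd.Categorical(df["size"], categories=["small", "normal", "large"], ordered=True)
print(df.dtypes)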
Consider aggregation function X: is it distributive/algebraic/holistic/self-maintainable?: A distributive function is one whose computation can be distributed: partial results are computed on partitions of the data and then combined, e.g. MAX, MIN, SUM, COUNT. An algebraic function is not distributive itself, but can be computed from a fixed set of distributive aggregates; for example AVERAGE is SUM divided by COUNT. Holistic functions cannot be computed from any fixed set of distributive aggregates, for example MEDIAN; a full database scan is required. Self-maintainable functions are functions that we can update incrementally on single inserts/deletes/updates, for example COUNT, SUM, AVERAGE. The functions MAX, MIN, MEDIAN are not self-maintainable (MAX and MIN already fail under deletions): for example, if the current minimum is 10 and I delete a single row with value 10, I cannot update the minimum without a database lookup or specialized data structures.
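A minimal Python sketch (with made-up partitioned data) of why SUM and COUNT are distributive and AVERAGE is algebraic:

# Hypothetical data split across three partitions (e.g. nodes of a cluster).
partitions = [[4, 8, 15], [16, 23], [42]]

# Distributive: compute partial results per partition, then combine them.
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

# Algebraic: AVERAGE is not distributive itself, but is derived from
# the two distributive aggregates SUM and COUNT.
average = total_sum / total_count
print(total_sum, total_count, average)   # 108 6 18.0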
Do you see a relationship between distributive/algebraic/holistic and self-maintainable?: Yes, for example, holistic functions are guaranteed not to be self-maintainable.
How high is the entropy when all the classes are equally likely? Write down a general formula for this case.: Entropy is the negative sum over all class probabilities multiplied by their logarithms, H = -sum_i p_i * log2(p_i). It measures how random the given set of values is. The case when all n classes are equally likely (p_i = 1/n) maximizes the entropy: H = -sum_i (1/n) * log2(1/n) = log2(n).
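A quick numeric check of the formula in Python (assuming base-2 logarithms):

import math

def entropy(probs):
    # H = -sum(p * log2(p)) over all classes with p > 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
uniform = [1 / n] * n
print(entropy(uniform), math.log2(n))   # both are 2.0: uniform classes maximize entropy
print(entropy([0.7, 0.1, 0.1, 0.1]))    # skewed distribution -> lower entropy (about 1.36)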
What tools do you know to quantify the strength of the relationship between two random variables?: The first is the covariance. The problem with covariance is that it is not normalized, so its magnitude depends on the scale of the variables. Therefore it is more convenient to use the correlation, which is the covariance normalized by the standard deviations of the two variables, corr(X, Y) = cov(X, Y) / (sigma_X * sigma_Y), and always lies in [-1, 1].
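A minimal NumPy sketch (with synthetic data) showing that covariance is scale-dependent while correlation is not:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)   # y depends on x

# Covariance depends on the scale of the variables...
print(np.cov(x, y)[0, 1])
print(np.cov(10 * x, 10 * y)[0, 1])            # 100x larger, same relationship

# ...while correlation is normalized to [-1, 1].
print(np.corrcoef(x, y)[0, 1])
print(np.corrcoef(10 * x, 10 * y)[0, 1])       # unchanged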
Which statistical tests do you know? What can they do? (Differences between each of them): Chi-Squared Test, Kolmogorov-Smirnov Test and the Wilcoxon-Mann-Whitney test.
Chi-Squared Test can be used for categorical data. It determines whether two variables X1 and X2 are statistically dependent. We compute the chi-squared statistic from the contingency table and compare it against the chi-squared distribution, which has a single parameter, the degrees of freedom. If the statistic is greater than the threshold for the chosen confidence level, the variables are statistically dependent. But if it is smaller than this threshold, we cannot say anything about the independence of the variables: the test can only prove dependence, not independence. Categorical data; hypothesis to reject: the variables are statistically independent.
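A sketch using scipy.stats.chi2_contingency on a hypothetical contingency table:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = categories of X1, columns = categories of X2.
observed = np.array([[30, 10],
                     [10, 30]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, dof, p_value)
# Small p-value (large chi2 relative to the threshold for this dof) -> reject the
# hypothesis that X1 and X2 are independent; a large p-value proves nothing.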
Kolmogorov-Smirnov Test can be used for numerical data. It checks the largest deviation between the empirical distribution of the given data and some expected distribution. If this deviation is greater than some threshold, we can say with a given confidence that the sample was not generated by this distribution. Numerical data, continuous or discrete with many values. Hypothesis to reject: a given sample was drawn from a given distribution (one-sample case) OR two given samples have the same distribution (two-sample case).
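A sketch using scipy.stats.kstest and ks_2samp on synthetic samples:

import numpy as np
from scipy.stats import kstest, ks_2samp

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=500)
uniform_sample = rng.uniform(-1, 1, size=500)

# One-sample case: was the sample drawn from a standard normal distribution?
print(kstest(normal_sample, "norm"))    # large p-value: cannot reject
print(kstest(uniform_sample, "norm"))   # tiny p-value: reject

# Two-sample case: do both samples come from the same distribution?
print(ks_2samp(normal_sample, uniform_sample))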
Wilcoxon-Mann-Whitney Test can be used for data that we can sort, therefore it can be applied to ordinal or numerical data. It checks with which probability two datasets were generated by the same distribution: sort all pooled data points, assign them ranks, sum up the ranks per dataset and compare the rank sums. Sortable data (ordinal or numerical); hypothesis to reject: the two samples come from the same distribution.
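SciPy exposes this two-sample rank test as scipy.stats.mannwhitneyu; a minimal sketch on synthetic samples:

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, size=200)
b = rng.normal(loc=0.5, size=200)   # same shape, shifted location

# Rank-based test on the pooled, sorted data; works for anything sortable
# (ordinal or numerical) and needs no normality assumption.
stat, p_value = mannwhitneyu(a, b)
print(stat, p_value)                # small p-value -> reject "same distribution"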
What is data reduction?: The data volume is often too large. We would like to use a more compact representation that contains the essential data from the dataset and would give us the same analysis results. There are three main types of data reduction.
Numerosity reduction: we reduce the number of objects, i.e. the number of rows used, for example with some sampling or clustering technique (a sampling sketch follows after the three types).
Dimensionality reduction: reduce the number of attributes. We can either select a subset of the attributes and use only them, or transform the attributes into new synthetic attributes with some dimensionality reduction algorithm, like PCA.
Discretization: reduce the number of values per attribute. Instead of keeping super-precise values as 8-byte floats, we can assign simple enums or categories that require only a few bits. For example, instead of using floats for height we can use the categories small, medium and large, or simply round it to some integer (a binning sketch follows below).
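To illustrate numerosity reduction by simple random sampling, a minimal NumPy sketch with synthetic data:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100_000)

# Simple random sampling: keep only 1% of the rows.
sample = rng.choice(data, size=1_000, replace=False)

# The reduced representation preserves the essential statistics.
print(data.mean(), data.std())
print(sample.mean(), sample.std())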
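To illustrate discretization by fixed binning, a minimal NumPy sketch with made-up height values:

import numpy as np

heights_cm = np.array([152.3, 161.7, 170.2, 178.9, 184.5, 199.1])

# Map precise floats to three coarse categories via fixed bin edges.
bins = [165, 180]                        # boundaries between small/medium/large
labels = np.array(["small", "medium", "large"])
print(labels[np.digitize(heights_cm, bins)])
# ['small' 'small' 'medium' 'medium' 'large' 'large']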
Explain how PCA works!: We want to reduce the given data vectors with k dimensions to a smaller number of orthogonal vectors, called the principal components. These principal components describe the data with some controlled error, but with fewer dimensions. We compute the covariance matrix and perform an eigenvalue decomposition, obtaining all eigenvectors and their corresponding eigenvalues; the eigenvalues act like weights of the eigenvectors. Then we sort the eigenvectors by decreasing eigenvalue, take the first c eigenvectors and transform the data into the new data space. Drawbacks: PCA attributes are difficult to interpret and the computation has high complexity.
Beforehand we have to demean the data and divide it by the standard deviation (standardization).
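A minimal from-scratch PCA sketch in NumPy following these steps (synthetic data; the function name pca is just illustrative):

import numpy as np

def pca(X, c):
    # 1. Standardize: subtract the mean and divide by the standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized attributes.
    cov = np.cov(Z, rowvar=False)
    # 3. Eigenvalue decomposition (symmetric matrix -> eigh).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by decreasing eigenvalue (their "weights").
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:c]]
    # 5. Project the data onto the first c principal components.
    return Z @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # correlated attribute
print(pca(X, 2).shape)                           # (100, 2)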
How can we discretize data? How can we find the best merge/split?: We want to reduce the number of values for some attribute by dividing the attribute's range into intervals and assigning some ordinal values to them. For example, we can discretize the interval [0, 100] to [small, medium, large]. We can empirically assign different ordinal values to different float ranges, or discretize the data by entropy: compute the entropy of the current data set, then compute the entropy of all possible splits, and choose the split with the best difference between the original entropy and the weighted sum of the partitions' entropies. This difference is called information gain. Apply recursively until some threshold is reached, for example a maximum number of intervals.
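A minimal Python sketch of the entropy-based split search described above (made-up values and class labels; best_split is a hypothetical helper):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    # Try every boundary between two distinct sorted values and keep the one
    # with the highest information gain (entropy drop after splitting).
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if best is None or gain > best[0]:
            best = (gain, threshold)
    return best

values = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_split(values, labels))   # gain 1.0 at threshold 5.0, which separates the classes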