Introduction

What is Data Science?: Extracting knowledge from raw data.
What were the Data Science topics?: Fundamentals, data indexes, different classifiers, evaluation methods, association rules, clustering, outlier detection, and modern approaches in Data Science.
What is the KDD process?: Knowledge Discovery in Databases. It is the process of extracting knowledge from massive data repositories.
Describe the stages of the KDD process: We start with a huge amount of raw data and go through the following stages.
1. Integration and cleaning: integrate the data from different sources and clean it. For example, join all required tables and eliminate noise from the strings. My name can be written in three different ways, and we want to count all Valeries as one name.
2. Selection and projection: select the data that is relevant for the task. For example, select all rows with last year's sales data and pick the required attributes.
3. Transformation: bring the data into a form that is convenient for data mining algorithms. We may normalize the data from one interval to another, discretize it, or reduce its dimensionality.
4. Data mining: all the material from this lecture, that is clustering, classification, finding association rules, and detecting outliers.
5. Evaluation and interpretation: at the end we evaluate the results and interpret them.
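The first stages can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two sources, the field names (`name`, `year`, `sales`), the value of `LAST_YEAR`, and all numbers are made up, and a simple count per name stands in for the actual data mining step.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are made up).
source_a = [
    {"name": "Valerie", "year": 2023, "sales": 100.0},
    {"name": "valerie ", "year": 2023, "sales": 300.0},
    {"name": "Bob", "year": 2022, "sales": 50.0},
]
source_b = [
    {"name": "VALERIE", "year": 2023, "sales": 200.0},
    {"name": "bob", "year": 2023, "sales": 150.0},
]
LAST_YEAR = 2023  # assumption for this example

# 1. Integration and cleaning: merge the sources and unify name spellings.
records = [dict(r, name=r["name"].strip().lower()) for r in source_a + source_b]

# 2. Selection and projection: keep last year's rows and two attributes.
selected = [(r["name"], r["sales"]) for r in records if r["year"] == LAST_YEAR]

# 3. Transformation: min-max normalization of sales to the interval [0, 1].
values = [sales for _, sales in selected]
low, high = min(values), max(values)
normalized = [(name, (sales - low) / (high - low)) for name, sales in selected]

# 4. Stand-in for data mining: count the records per cleaned name.
name_counts = Counter(name for name, _ in normalized)

print(name_counts)
print(normalized)
```

After cleaning, the three spellings of the name fall into one group, which is the point of the cleaning step.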
What is CRISP?: CRISP-DM (Cross-Industry Standard Process for Data Mining) is a kind of implementation of KDD that is used in industry, extended with an explicit business understanding phase.
Explain the principle of the One Rule classifier:

The One Rule (OneR) classifier can be used for categorical, labeled data. The algorithm is very simple: it counts the yes/no labels for each distinct categorical value and builds the rules from these counts.
1. For each attribute:
  1. Count the number of yes/no labels for each distinct categorical value.
  2. Find the most frequent class (yes/no) for each categorical value.
  3. Count the errors for each categorical value, i.e. the rows that do not belong to the most frequent class.
  4. Calculate the total error rate of the attribute.
2. Use the attribute with the minimal total error rate for predictions.
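The steps above can be sketched as follows. This is a minimal sketch, not a reference implementation: the function name `one_rule`, the toy table, and its attribute names (`outlook`, `windy`, `play`) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def one_rule(rows, attributes, label):
    """Return (best_attribute, rules, error_rate) for a list of dict rows."""
    best = None
    for attr in attributes:
        # Step 1.1: count the class labels for each distinct attribute value.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[label]] += 1
        # Step 1.2: each value predicts its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Steps 1.3 and 1.4: errors are the rows outside the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        error_rate = errors / len(rows)
        # Step 2: remember the attribute with the lowest total error rate.
        if best is None or error_rate < best[2]:
            best = (attr, rules, error_rate)
    return best

# Hypothetical toy table (values are made up for this example).
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
]

best_attr, rules, error_rate = one_rule(rows, ["outlook", "windy"], "play")
print(best_attr, rules, error_rate)
```

On this table `outlook` makes one error in seven rows and `windy` makes two, so `outlook` is chosen as the single rule attribute.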
What is overfitting?: We can achieve 100% accuracy on the training set, for example by building a decision tree that contains a single element in each leaf partition. But such a classifier has memorized the training data, including its noise, instead of the underlying pattern, so it predicts very inaccurately on new data.
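A small constructed example can make this visible. As an assumption for this sketch, a memorizing 1-nearest-neighbour predictor stands in for the fully grown decision tree (it is a different technique, chosen because it also reproduces every training label), and the data are made up: the intended pattern is "yes" for x >= 5, with two deliberately noisy training labels.

```python
# Made-up 1-D data: the intended pattern is "yes" for x >= 5.
# The labels at x = 3 and x = 7 are deliberate noise.
train = [(1, "no"), (2, "no"), (3, "yes"), (4, "no"),
         (6, "yes"), (7, "no"), (8, "yes"), (9, "yes")]
# New points, labeled by the intended pattern and placed near the noisy ones.
test = [(1.4, "no"), (2.9, "no"), (3.2, "no"),
        (6.8, "yes"), (7.3, "yes"), (8.6, "yes")]

def accuracy(predict, data):
    """Fraction of points in data whose label is predicted correctly."""
    return sum(predict(x) == y for x, y in data) / len(data)

def memorizer(x):
    """Return the label of the closest training point (memorizes the noise)."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label

def simple_rule(x):
    """A single threshold that ignores the two noisy labels."""
    return "yes" if x >= 5 else "no"

print("memorizer  :", accuracy(memorizer, train), accuracy(memorizer, test))
print("simple rule:", accuracy(simple_rule, train), accuracy(simple_rule, test))
```

The memorizer is perfect on the training set but wrong on most of the new points, while the simpler rule accepts some training errors and generalizes better. The test points were chosen to expose this effect, so the numbers illustrate the idea rather than measure anything.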
What is the difference between supervised/unsupervised learning?: In supervised learning we have labeled data and try to learn a classifier that assigns the correct labels to the data. The typical task is classification. Unsupervised learning is learning from data without labels. Typical tasks are clustering, outlier detection, and association rules.