What is an index?: Indexes are data structures that store some ordered information about the data set. There are needed to speed up the queries. With index we can win the query execution time for some memory, that is needed to store the index. We have also to maintain these indexes.
Which types of indexes do you know?: There are several data structures used for the indexes in databases: the Inverted Lists, B-Trees, and the spatial indexes as R-Trees and Kd-Trees
Inverted lists: For each non-primary column create a list, which uses this column as primary key.
B-Tree: use some index attribute to construct the binary search tree, that contains the page numbers at leaves. Nodes represent the interval borders.
k-d-Tree: split the space recursively, make a balanced binary tree with the split results. Nodes are split lines, leaves are the parts of the space.
R-Tree: similar to BVH from computer graphics. R stays for rectangle. We calculate the bounding volumes for each objects, and construct recursively the hierarchy of these volumes, based on the minimizing the volumes of the bounding.
Why can’t we answer spatial queries by simply constructing one B tree per dimension?: We cant compute the distances from the data point to our quiery point in all dimensions, only in the one dimension of this index. It can’t be used to compute the distances in the multidimensionally space.
Explain what is the R tree and its elements?: R-Trees are Rectangle-Trees. The elements of these trees are the bounding volumes. We build the tree, that minimizes the volume of this bounding volumes. This volumes may overlap. Construction is a little bit difficult, but can be represented as multiple insertions one after the other. We insert element in the leaf that requires the least entlargement. If maximum capacity reached - split this volume.
How do we search for the nearest neighbor with the R tree?: Insert the Tree Root in the priority queue. We take a first element from the queue. If this node is a leaf, then we insert all points from this node in the queue, the priorities are the distances from this points to the query point. Small distance means high priority. If this node is not a leaf, then insert all child nodes, sorted by distance, in priority queue. Whole tree will be traversed.
Note: learn this algorithm!
How can we deal with volumetric objects?: we can caculate a rectangle/cube etc. with few equations for every volumetric object. It could save us the difficult equations of the intersection of vector with complex figures
Is there any problem with the overlaps in R trees?: yes, we have to traverse multiple paths in the R-Tree, if the query point is located in the overlap region.
What are the differences between B trees, k-d trees and R trees?: k-d Trees split the space, it is useful when we work with data points. k-d Trees ane not balanced. R-Trees creates a structure of volumetric objects. It can be useful when we have not the data points, but some objects, or if the data is clustered, and the structure can ignore much free space. B-trees are very good to efficiently store one-dimenstional data. R-Trees and B-Trees are balanced, but the balancing of R-Trees can be difficult.
How can we insert a new object in the R tree? (procedure): Insert into the leaf rectangle/page that requires the least enlargement. If the maximum capacity is reached, then split the rectangle.
What queries can be handled by spatial indexes?: calculation of the nearest neighbor, or the k-nearest neighborhood.
What do we understand under “Lazy Learning”?: predict the values without building any model, without learning phase.
What are pros and cons of the k-NN classifier?: Attributes can have the same influence, works well with dynamic data, results are interpretable. It is sensitive to noise, and k is not easy to set.