• What is an index?: Indexes are data structures that store ordered information about the data set. They are needed to speed up queries: with an index we trade some memory, which is needed to store the index, for shorter query execution time. The indexes also have to be maintained when the data changes.

  • Which types of indexes do you know?: Several data structures are used for indexes in databases: inverted lists, B-Trees, and spatial indexes such as R-Trees and k-d-Trees.

    1. Inverted lists: for each non-primary column, create a list that uses this column as its primary key.

    2. B-Tree: use an index attribute to construct a balanced search tree whose leaves contain the page numbers. The inner nodes store the interval borders (separator keys).

    3. k-d-Tree: split the space recursively and build a binary tree from the split results. Inner nodes are the split lines (hyperplanes); leaves are the resulting regions of the space.

    4. R-Tree: similar to bounding volume hierarchies (BVH) from computer graphics; the R stands for rectangle. We compute a bounding rectangle for each object and recursively construct a hierarchy of these volumes, grouping them so that the volume of the bounding rectangles is minimized.

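
    The k-d-tree construction above can be sketched in a few lines of Python (a minimal illustration with median splits; the function and field names are my own, not from the lecture):

```python
def build_kdtree(points, depth=0):
    """Recursively split the point set, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])              # split dimension at this level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                     # median point defines the split line
    return {
        "point": points[mid],                  # inner node: position of the split
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["point"])   # → (7, 2): the median in x becomes the root split
```

    Built with median splits over a static point set the tree comes out balanced; incremental insertions can unbalance it.
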
  • Why can’t we answer spatial queries by simply constructing one B tree per dimension?: We can’t compute the distance from a data point to our query point across all dimensions; each B-Tree orders the data in only its own dimension, so it cannot be used to compute distances in the multidimensional space.

  • Explain what is the R tree and its elements?: R-Trees are rectangle trees. The elements of these trees are bounding rectangles (bounding volumes), and these volumes may overlap. We build the tree so that the volume of the bounding rectangles is minimized. Construction is a little difficult, but it can be viewed as multiple insertions one after the other: insert each element into the leaf that requires the least enlargement; if the maximum capacity is reached, split that rectangle.

  • How do we search for the nearest neighbor with the R tree?: Insert the tree root into a priority queue, then repeatedly take the first element from the queue. If it is a leaf, insert all points of this node into the queue, with the distance from each point to the query point as its priority (a small distance means high priority). If it is an inner node, insert all its child nodes into the priority queue, again keyed by distance. The first data point to reach the head of the queue is the nearest neighbor; only in the worst case is the whole tree traversed.

    Note: learn this algorithm!

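
    The algorithm can be sketched with a binary heap as the priority queue (an illustrative Python version; the node layout with "kind"/"rect"/"children" dicts is my own simplification, not the lecture's data structure):

```python
import heapq
import itertools
import math

def mindist(q, rect):
    """Smallest possible distance from query point q to a rectangle
    rect = ((xmin, ymin), (xmax, ymax)): clamp q into the rectangle."""
    lo, hi = rect
    return math.dist(q, [min(max(qi, l), h) for qi, l, h in zip(q, lo, hi)])

def nearest(root, q):
    """Best-first search: always expand the queue entry closest to q.
    The first data point popped from the queue is the nearest neighbor."""
    tie = itertools.count()                  # tie-breaker so dicts are never compared
    heap = [(0.0, next(tie), root)]
    while heap:
        dist, _, entry = heapq.heappop(heap)
        if entry["kind"] == "point":
            return entry["point"], dist      # no unexplored entry can be closer
        for child in entry["children"]:
            d = (math.dist(q, child["point"]) if child["kind"] == "point"
                 else mindist(q, child["rect"]))
            heapq.heappush(heap, (d, next(tie), child))

leaf1 = {"kind": "node", "rect": ((0, 0), (2, 2)),
         "children": [{"kind": "point", "point": (1, 1)},
                      {"kind": "point", "point": (2, 2)}]}
leaf2 = {"kind": "node", "rect": ((5, 5), (7, 7)),
         "children": [{"kind": "point", "point": (5, 6)},
                      {"kind": "point", "point": (7, 7)}]}
root = {"kind": "node", "rect": ((0, 0), (7, 7)), "children": [leaf1, leaf2]}
print(nearest(root, (6, 6)))   # → ((5, 6), 1.0)
```

    Because a far node is only expanded if its rectangle could still contain a closer point, the search usually prunes large parts of the tree.
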
  • How can we deal with volumetric objects?: We can calculate a bounding rectangle/cube etc. with a few comparisons for every volumetric object. This saves us from evaluating the difficult intersection equations between a vector and complex figures: the cheap box test filters out most objects first.
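
    Computing such an axis-aligned bounding box needs only the per-axis minima and maxima (a short sketch; the triangle vertices are made up for illustration):

```python
def bounding_box(vertices):
    """Axis-aligned bounding box of an object: one min and one max per axis."""
    xs, ys = zip(*vertices)
    return (min(xs), min(ys)), (max(xs), max(ys))

# hypothetical triangle: the cheap box is tested before any exact intersection math
print(bounding_box([(1, 5), (4, 2), (6, 9)]))   # → ((1, 2), (6, 9))
```
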

  • Is there any problem with the overlaps in R trees?: Yes: if the query point is located in an overlap region, we have to traverse multiple paths in the R-Tree.

  • What are the differences between B trees, k-d trees and R trees?: k-d-Trees split the space, which is useful when we work with data points; k-d-Trees are not balanced. R-Trees create a structure of volumetric objects, which is useful when we have objects rather than data points, or when the data is clustered, since the structure can skip large empty regions. B-Trees are very good at efficiently storing one-dimensional data. R-Trees and B-Trees are balanced, but balancing R-Trees can be difficult.

  • How can we insert a new object in the R tree? (procedure): Insert into the leaf rectangle/page that requires the least enlargement. If the maximum capacity is reached, then split the rectangle.
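
    The "least enlargement" criterion can be sketched as follows (illustrative Python; the function names are my own):

```python
def area(rect):
    """Area of rect = ((xmin, ymin), (xmax, ymax))."""
    (xmin, ymin), (xmax, ymax) = rect
    return (xmax - xmin) * (ymax - ymin)

def enlarge(rect, point):
    """Smallest rectangle covering both rect and the new point."""
    (xmin, ymin), (xmax, ymax) = rect
    x, y = point
    return (min(xmin, x), min(ymin, y)), (max(xmax, x), max(ymax, y))

def choose_leaf(leaves, point):
    """Pick the leaf rectangle whose area grows least when the point is added."""
    return min(leaves, key=lambda r: area(enlarge(r, point)) - area(r))

leaves = [((0, 0), (2, 2)), ((5, 5), (7, 7))]
print(choose_leaf(leaves, (3, 3)))   # → ((0, 0), (2, 2)): growth 5 vs. 12
```
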

  • What queries can be handled by spatial indexes?: spatial queries such as range (window) queries, the nearest neighbor, or the k-nearest neighbors.

  • What do we understand under “Lazy Learning”?: predicting values without building any model, i.e. without a learning phase; the work is deferred until a query arrives (k-NN is an example).

  • What are pros and cons of the k-NN classifier?: Pros: all attributes can have the same influence, it works well with dynamic data, and the results are interpretable. Cons: it is sensitive to noise, and k is not easy to set.
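
    A minimal k-NN classifier, illustrating the "no learning phase" point (the training data here is made up):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Lazy learning: no model is built; at query time we look at the
    k closest training samples and take a majority vote."""
    neighbors = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B")]
print(knn_predict(train, (2, 2)))   # → A
```
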