Lecture 11: Orthogonal Range Searching

Reading: Chapter 5 in the 4M's.

Range Queries: 

We shift our focus from algorithm problems to data structures for the
next few lectures. In general, we consider questions of the following
form: given a collection of objects, preprocess them (storing the
results in a data structure of some variety) so that queries of a
particular form can be answered efficiently. Generally we measure a
data structure in terms of two quantities: the time needed to answer
a query and the amount of space needed by the data structure.

Often there is a tradeoff between these two quantities, but most of
the structures that we will be interested in will have either linear
or near linear space. Preprocessing time is an issue of secondary
importance, but most of the algorithms we will consider will have
either linear or O(n log n) preprocessing time.

In a range query we are given a set P of points and a region R in
space (e.g., a rectangle, polygon, halfspace, or disk) and are asked
to list (or count, or compute some accumulating function of) the
subset of P lying within the region. To get the best possible
performance, the design of the data structure is tailored to the
particular type of region, but there are some data structures that
can be used for a wide variety of regions.

An important concept behind all geometric range searching is that the
number of subsets that can be formed by simple geometric ranges is
much smaller than the number of all possible subsets of P (called the
power set of P).
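
To make this concrete, here is a small brute-force Python sketch (an
illustrative addition to the notes; the four sample points and all
names are invented). It enumerates every subset of a point set that
some axis-aligned rectangle can carve out, and compares the count
with the size of the power set:

    # Hypothetical point set; the coordinates are made up for illustration.
    pts = [(1, 1), (4, 2), (2, 4), (5, 5)]

    def rect_subsets(points):
        # Any rectangle can be shrunk until each side passes through a
        # point coordinate without changing which points it contains, so
        # it suffices to try rectangles with sides at point coordinates.
        xs = sorted({p[0] for p in points})
        ys = sorted({p[1] for p in points})
        out = set()
        for x0 in xs:
            for x1 in xs:
                for y0 in ys:
                    for y1 in ys:
                        out.add(frozenset(p for p in points
                                          if x0 <= p[0] <= x1
                                          and y0 <= p[1] <= y1))
        return out

    print(len(rect_subsets(pts)), "of", 2 ** len(pts), "possible subsets")

On the four sample points this prints a count strictly smaller than
16: for instance, any rectangle containing both (1, 1) and (5, 5)
must also contain (4, 2) and (2, 4).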

  (Cool fact that you will not be tested on, and is only marginally
   relevant: there is something called the Vapnik-Chervonenkis
   dimension of a "shape," which asks: what is the maximum number N
   such that you can place N points in the plane and separate any
   possible subset of those points using (scaled or rotated versions
   of) that shape? For example, if you put 3 points in the plane, you
   can select any subset of those three points using an appropriately
   placed half-plane, but you can't do the same thing for four
   points.)

We can define any range search problem abstractly as follows. Given a
particular class of ranges, a range space is a pair (P, R) consisting
of the points P and the collection R of all subsets of P that can be
formed by ranges of this class. For example, the following figure
shows the range space assuming rectangular ranges for a set of points
in the plane. In particular, note that the sets {1, 4} and {1, 2, 4}
cannot be formed by rectangular ranges.

Figure 42: Rectangular range space.

Today we consider orthogonal rectangular range queries, that is,
ranges defined by rectangles whose sides are aligned with the
coordinate axes. One of the nice things about rectangular ranges is
that they can be decomposed into a collection of 1-dimensional
searches.

One dimensional range queries:

Before considering how to solve general range queries, let us
consider how to answer 1-dimensional range queries, or interval
queries. We are given a set of points P = {p(1), p(2), ..., p(n)} on
the line, and given an interval [x(lo), x(hi)], we wish to report all
the points lying within the interval. Our goal is a data structure
that can answer these queries in O(log n + k) time, where k is the
number of points reported (an output sensitive result). [Why is it
important that this be an output sensitive result?]

Naive Solutions:

What if we don't preprocess our input at all? Then for a query
[x(lo), x(hi)] we can take O(n) time to check whether each point p(i)
lies in [x(lo), x(hi)], and report or count it if it does.

Ok, so sort the points first. Apply binary search to find the first
point of P that is greater than or equal to x(lo) and the last point
that is less than or equal to x(hi), and then list all the points
between them. Thus, with preprocessing consisting of sorting the
points, range reporting queries can be answered in O(log n + k) time,
and with minor modifications range counting queries can be answered
in O(log n) time. However, this will not generalize to higher
dimensions. (That is, what is "sorting the points" when the points
are not one-dimensional?)
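
To see the sorted-array solution concretely, here is a minimal Python
sketch using the standard bisect module (an illustrative addition;
the function names are invented):

    import bisect

    def preprocess(points):
        # Preprocessing is just sorting: O(n log n).
        return sorted(points)

    def range_report(pts, x_lo, x_hi):
        # Report the points in [x_lo, x_hi] in O(log n + k) time.
        lo = bisect.bisect_left(pts, x_lo)    # first index with pts[i] >= x_lo
        hi = bisect.bisect_right(pts, x_hi)   # first index with pts[i] > x_hi
        return pts[lo:hi]

    def range_count(pts, x_lo, x_hi):
        # Count the points in [x_lo, x_hi] in O(log n) time.
        return bisect.bisect_right(pts, x_hi) - bisect.bisect_left(pts, x_lo)

For example, range_report(preprocess([3, 1, 7, 5]), 2, 6) returns
[3, 5].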
A basic approach to solving almost all range queries is to represent
P as a collection of canonical subsets {S(1), S(2), ..., S(k)}, each
S(i) a subset of P (where k is generally a function of n and the type
of ranges), such that the answer to any query can be expressed as the
disjoint union of a small number of canonical subsets. Note that the
canonical subsets may generally overlap each other, although the ones
used to answer any particular query will be disjoint. The trick to
solving a range searching problem is to identify a collection of
canonical subsets having these properties, and then to find the
proper subsets for a given range.

Suppose that the points of P are sorted in increasing order and then
stored in the leaves of a balanced binary tree. Each internal node of
the tree is labeled with the largest key appearing in its left child.
We can associate each node of this tree (implicitly or explicitly)
with the subset of points stored in the leaves that are descendants
of this node. This gives rise to O(n) canonical subsets. For now,
these canonical subsets will not be stored explicitly as part of the
data structure, but this will change later when we talk about range
trees.

We claim that the canonical subsets corresponding to any range can be
identified in O(log n) time from this structure. Given any interval
[x(lo), x(hi)], we search the tree to find the rightmost leaf u whose
key is less than or equal to x(lo) and the leftmost leaf v whose key
is greater than or equal to x(hi). Clearly all the leaves strictly
between u and v, together possibly with u and v themselves,
constitute the points that lie within the range. If key(u) = x(lo)
then we include u's canonical (single point) subset, and if
key(v) = x(hi) then we do the same for v. To form the remaining
canonical subsets, we take the subsets of all the maximal subtrees
lying between u and v.

Figure 43: Canonical sets for interval queries.

Here is how to compute these subtrees. The search paths to u and v
may generally share some common subpath, starting at the root of the
tree. Once the paths diverge, as we follow the left path to u,
whenever the path goes to the left child of some node, we add the
canonical subset associated with its right child. Similarly, as we
follow the right path to v, whenever the path goes to the right
child, we add the canonical subset associated with its left child.

To answer a range reporting query we simply traverse these canonical
subtrees, reporting the points of their leaves. Each subtree can be
traversed in time proportional to its number of leaves. To answer a
range counting query we store the total number of points in each
subtree (as part of the preprocessing) and then sum these counts over
all the canonical subtrees.

Since the search paths have length O(log n), there are O(log n)
canonical subsets in all. Thus a range counting query can be answered
in O(log n) time. For reporting queries, since the leaves of each
subtree can be listed in time proportional to the number of leaves in
the subtree (a basic fact about binary trees), the total search time
is O(log n + k), where k is the number of points reported. Thus,
1-dimensional range counting queries can be answered in O(log n) time
(and reporting queries in O(log n + k) time), using O(n) storage.

This concept of finding maximal subtrees that are contained within
the range is fundamental to all range search data structures. The
only question is how to organize the tree and how to locate the
desired sets.
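
The preceding discussion can be condensed into a short sketch. The
following Python code (an illustrative addition; all names invented)
identifies the canonical subsets for an interval query, representing
each canonical subset implicitly as a slice pts[i:j] of the sorted
point array. It is phrased as a single top-down recursion rather than
by explicitly tracing the search paths to u and v, but it finds the
same O(log n) maximal subtrees:

    def canonical_subsets(pts, x_lo, x_hi):
        # pts is sorted.  The implicit tree node over pts[i:j] has
        # children over the two halves pts[i:mid] and pts[mid:j].
        pieces = []
        def visit(i, j):
            if i >= j or pts[j - 1] < x_lo or pts[i] > x_hi:
                return                    # subtree misses the range entirely
            if x_lo <= pts[i] and pts[j - 1] <= x_hi:
                pieces.append((i, j))     # maximal subtree inside the range
                return
            mid = (i + j) // 2            # descend as in the balanced tree
            visit(i, mid)
            visit(mid, j)
        visit(0, len(pts))
        return pieces

A counting query sums j - i over the returned pieces; a reporting
query concatenates the slices pts[i:j], for O(log n + k) total time.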
Let us see next how we can extend this to higher-dimensional range
queries.

Kd trees:

The natural question is how to extend 1-dimensional range searching
to higher dimensions. First we will consider kd-trees. This data
structure is easy to implement and quite practical and useful for
many different types of searching problems (nearest neighbor
searching, for example). However it is not the asymptotically most
efficient solution for orthogonal range searching, as we will see
later.

Our terminology is a bit nonstandard. The data structure was designed
by Jon Bentley. In his notation, these were called k-d trees, short
for "k-dimensional trees". The value k was the dimension, and thus
there are 2-d trees, 3-d trees, and so on. However, over time the
term kd-tree has become so standard that we simply drop the pretense
of using k as a variable.

The idea behind a kd-tree is to extend the notion of a
one-dimensional tree, but to alternate between using the x and y
coordinates to split on. In general dimension, the kd-tree cycles
among the various possible splitting dimensions, but you might
imagine many variations where the choice of the splitting dimension
depends on the structure of the data set.

Each internal node of the kd-tree is associated with two quantities,
a splitting dimension (either x or y) and a splitting value s, and it
has two subtrees. Since the splitting dimension alternates between x
and y, some implementations do not store it explicitly, but keep
track of it while traversing the tree. If the splitting dimension is
x, then all points whose x coordinates are less than or equal to s
are stored in the left subtree and points whose x coordinates are
greater than or equal to s are stored in the right subtree. (If a
point's coordinate is equal to s, then we reserve the right to store
it on either side. This is done to allow us to balance the number of
points in the left and right subtrees.) When a single point (or more
generally a small constant number of points) remains, we store it in
a leaf node. Note that in our definition the data points are stored
only in the leaves of the tree; the internal nodes contain only
splitting information.

How is the splitting value chosen? To guarantee that the tree is
balanced, the most common method is to let the splitting value be the
median coordinate along the splitting dimension. If there are an even
number of points in the subtree, we may take either the upper or
lower median, or we may simply take the midpoint between these two
points. As a result, the tree will have O(log n) height.

How is the splitting dimension chosen? As mentioned earlier, one
simple strategy is to cycle through the dimensions one by one.
Another is to look at the data points and select the dimension along
which they have the greatest spread, defined to be the difference
between the largest and smallest coordinates. Bentley referred to the
tree resulting from this splitting rule as the optimized kd-tree. In
the example below, we simply alternate splitting dimensions, and if
there are an odd number of points, the median value goes with the
left (or lower) subtree.
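
Before the example, here is one way the construction might look in
Python, continuing the illustrative sketches above (all names
invented). For simplicity it re-sorts the points at every level, so
it takes O(n log^2 n) time rather than the O(n log n) achievable with
the presorting trick described below:

    class KDNode:
        # An internal node stores a splitting dimension (0 = x, 1 = y)
        # and a splitting value; a leaf stores a single point instead.
        def __init__(self, dim=None, split=None, left=None, right=None,
                     point=None):
            self.dim, self.split = dim, split
            self.left, self.right = left, right
            self.point = point

    def build_kdtree(points, dim=0):
        if len(points) == 1:
            return KDNode(point=points[0])     # leaf: store the point
        pts = sorted(points, key=lambda p: p[dim])
        m = (len(pts) + 1) // 2    # odd count: median goes to the left
        split = pts[m - 1][dim]    # splitting value s
        return KDNode(dim=dim, split=split,
                      left=build_kdtree(pts[:m], 1 - dim),
                      right=build_kdtree(pts[m:], 1 - dim))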
Figure 44: A kd-tree and the associated spatial subdivision.

A kd-tree naturally defines a subdivision of space, where each
splitting node can be thought of as introducing a vertical or
horizontal splitting line. Points lying to one side of the line are
stored in the left subtree, and the other points are stored in the
right subtree. This subdivides space into regions, called cells,
which are (possibly unbounded) rectangles. In fact this is a special
case of a more general class of data structures, called binary space
partition (or BSP) trees, in which the splitting lines may be
oriented in any direction. In the case of BSP trees, the cells are
convex polygons. Of course, both data structures can be generalized
to higher dimensions, where the splitting is done using
(d-1)-dimensional hyperplanes.

It is possible to build a kd-tree in O(n log n) time by a simple
recursive procedure. The most costly step of the process is
determining the median coordinate. One trick for speeding up
processing is to presort the points into two cross-referenced lists,
one sorted by x and the other sorted by y. Using these lists, it is
an easy matter to find the median at each step in constant time, and
the two lists can then be split in O(n) time, where n is the number
of remaining points. This leads to a recurrence of the form
T(n) = 2T(n/2) + n, which solves to O(n log n).

Searching the kd-tree:

To answer an orthogonal range query we proceed as follows. Let R
denote the query rectangle. Assume that the current splitting line is
vertical (the horizontal case is similar), and let v denote the
current node in the tree. Each node of the tree is associated with a
rectangular region of space (its cell). The search proceeds as
follows. If v is a leaf, then we check whether the point(s) stored in
v lie in R, and if so we report/count them. If v is an internal node,
then we first consider the left subtree. If its region lies entirely
within R, then we call a different routine to enumerate all the
points in this subtree (or, for a counting query, we return a
precomputed count of the number of points in this subtree). If the
left child's region partially overlaps R, then we search it
recursively. (If the left child's region does not overlap R at all,
then we ignore it.) We do the same for the right child of v.
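
Here is a sketch of the search in Python, building on the
hypothetical KDNode/build_kdtree code above. Query and cell
rectangles are represented as [[xlo, xhi], [ylo, yhi]] pairs, and the
root's cell is the whole plane:

    from math import inf

    def in_rect(R, p):                 # is point p inside rectangle R?
        return all(R[d][0] <= p[d] <= R[d][1] for d in (0, 1))

    def contains(R, cell):             # does R contain the cell entirely?
        return all(R[d][0] <= cell[d][0] and cell[d][1] <= R[d][1]
                   for d in (0, 1))

    def overlaps(R, cell):             # do R and the cell intersect?
        return all(R[d][0] <= cell[d][1] and cell[d][0] <= R[d][1]
                   for d in (0, 1))

    def report_subtree(node, out):     # enumerate a subtree in O(m) time
        if node.point is not None:
            out.append(node.point)
        else:
            report_subtree(node.left, out)
            report_subtree(node.right, out)

    def range_search(node, R, cell, out):
        if node.point is not None:     # leaf: test its point directly
            if in_rect(R, node.point):
                out.append(node.point)
            return
        d = node.dim
        for side in (0, 1):            # 0 = left/lower, 1 = right/upper
            child = node.left if side == 0 else node.right
            ccell = [list(cell[0]), list(cell[1])]
            ccell[d][1 - side] = node.split   # clip the cell at the split
            if contains(R, ccell):
                report_subtree(child, out)    # cell entirely inside R
            elif overlaps(R, ccell):
                range_search(child, R, ccell, out)  # partial overlap

For example, after tree = build_kdtree([(2, 3), (5, 4), (9, 6),
(4, 7), (8, 1), (7, 2)]), calling range_search(tree, [[3, 8], [1, 5]],
[[-inf, inf], [-inf, inf]], out) with out = [] leaves exactly the
points (5, 4), (7, 2), and (8, 1) in out.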
The figure below illustrates the nodes that would be visited for the
given query range. The leaf nodes with heavy outlines are the points
that are finally reported.

Figure 45: Nodes visited in answering a range query.

How many nodes does this method visit altogether? We claim that the
total number is O(sqrt(n) + k), where k is the number of points
reported.

Theorem: The time needed to answer an orthogonal range query using a
kd-tree with n points is O(sqrt(n) + k), where k is the number of
reported points.

Proof: First observe that for each subtree with m points, we can
enumerate the points of the subtree in O(m) time by a simple
traversal. Thus it suffices to bound the number of nodes visited in
the search (not counting subtree enumeration).

To count the number of nodes visited, first observe that if we visit
a node, then its associated region must intersect one of the four
sides of the query rectangle. (Otherwise it lies entirely inside or
entirely outside the range, and so is not visited.) We will count,
for each of the four sides separately, the nodes whose regions
intersect that side, and multiply the result by four. Rather than
consider the line segment defining a side of the query rectangle, we
consider the number of nodes visited if we had applied an orthogonal
range query to the entire line on which this side lies. This can only
overestimate the number of nodes visited.

Because the kd-tree processes even and odd levels differently, it
will be important to analyze two levels at a time (or generally d
levels at a time in dimension d). Let Q(n) denote the query time for
visiting a node v that has n points descended from it, and consider
any horizontal or vertical line that intersects the region associated
with v. The key observation is that such a line can intersect at most
two of the four subregions associated with the grandchildren of v.
(The other two must lie entirely on one side of the line, and in
either case the algorithm will not make a recursive call on them.)
Since each child has half as many points (because we split at the
median), the number of points in each grandchild is roughly n/4. Thus
we make at most two recursive calls on subtrees of size n/4. This
gives the following recurrence:

    Q(n) = 1               if n = 1,
    Q(n) = 2Q(n/4) + 1     otherwise.

It is an easy process to expand this recurrence:

    Q(n) = 2Q(n/4) + 1 = 4Q(n/16) + 2 + 1 = ... = 2^i Q(n/4^i) + (2^i - 1),

and the base case is reached when i = log_4 n, giving
Q(n) = O(2^(log_4 n)) = O(n^(log_4 2)) = O(sqrt(n)). (This also
follows from the Master Theorem in CLR.) Including the factor of
four, we still have an O(sqrt(n)) bound on the total number of nodes
visited, and adding the enumeration time, the total time is
O(sqrt(n) + k), as desired.

Other spatial data structures links:

A Quad-Tree: http://www.cs.umd.edu/~brabec/quadtree/index.html
A Binary Space Partition Tree: http://symbolcraft.com/graphics/bsp/index.html