Lecture 12: More Orthogonal Range Searching

Reading: Chapter 5 in the 4M's.

Orthogonal Range Trees: 

We saw that kd trees could be used to answer orthogonal range queries
in the plane in O(sqrt(n) + k) time.

Now we consider a better data structure, called orthogonal range
trees.

An orthogonal range tree is a data structure which, in all dimensions
d >= 2, uses O(n (log n)^(d-1)) space, and can answer orthogonal
rectangular range queries in O((log n)^(d-1)+k) time, where k is the
number of points reported. Preprocessing time is the same as the space
bound. Thus, in the plane, we can answer range queries in time O(logn)
and space O(n log n). We will present the data structure in two parts,
the first is a version that can answer queries in O((log n)^2) time in
the plane, and then we will show how to improve this in order to strip
off a factor of log n from the query time.

The data structure is based on the concept of a multi level search
tree. In this method, a complex search is decomposed into a constant
number of simpler range searches. We cascade a number of search
structures for simple ranges together to answer the complex range
query.  In this case we will reduce a d dimensional range search to a
series of 1 dimensional range searches.

Suppose you have a query which can be stated as the intersection of a
small number of simpler queries. For example, a range query in the
plane can be stated as two range queries: Find all the points whose x
coordinates are in the range [x(lo), x(hi)] and all the points whose y
coordinates are in the range [y(lo), y(hi)]. Let us consider how to do
this for 2 dimensional range queries, and then consider how to
generalize the process. First, build a range tree for the first range
query, which in this case is just a 1 dimensional range tree for the x
range. Recall that this is just a balanced binary tree on these points
sorted by x coordinates. Also recall that each node of this binary
tree is implicitly associated with a canonical subset of points. These
are the points lying in the subtree rooted at this node. The answer to
any 1 dimensional range query can be represented as the disjoint union
of a small collection of m = O(log n) canonical subsets, {S(1), S(2),
... S(m)}, where each subset corresponds to a node in the
search. This constitutes the first level of the search tree. For the
second level, for each node v in this x range tree, we build an
auxiliary tree, each of which is a y coordinate range tree, which
contains all the points in the canonical subset associated with v.

Thus the data structure consists of a x-range tree, such that each
node points to auxiliary y range tree. This notion of a tree of trees
is basic to solving range queries by leveling. (For example, for d
dimensional range trees, we will have d levels of trees.)

To answer a query, we determine the canonical sets that satisfy the
first query (there will be O(logn) of them). Each of these sets is
just represented as a node in the range tree. We know that that these
sets are disjoint, and that every point in these sets lies within the
range of x coordinates. Thus to answer the query, we just need to find
out which points from each canonical subset lies within the range of y
coordinates. To do this, for each canonical subset, we access the
auxiliary tree for this node, and perform a 1 dimensional range search
on the y range. This process is illustrated in the following figure.



Figure 46: Range tree search.

What is the query time for a range tree? Recall that it takes O(logn)
time to locate the nodes representing the canonical subsets for the 1
dimensional range query. For each, we invoke a 1 dimensional range
search. Thus there O(logn) canonical sets, each invoking an O(log n)
range search, for a total time of O((log n)^2). As before, listing the
elements of these sets can be performed in additional k time by just
traversing the trees. Counting queries can be answered by precomputing
the subtree sizes, and then just adding them up.


The space used by the data structure is O(n log n) in the plane (and
O(n log (d-1) n) in dimension d). The reason comes by summing the
sizes of the two data structures. The tree for the x coordinates
requires only O(n) storage. But we claim that the total storage in all
the auxiliary trees is O(n log n). We want to count the total sizes of
all these trees. 

The number of nodes in a tree is proportional to the number of leaves,
and hence the number of points stored in this tree. Rather than count
the number of points in each tree separately, instead let us count the
number of trees in which each point appears. This will give the same
total. Observe that a point appears in the auxiliary trees of each of
its ancestors. Since the tree is balanced, each point has O(log n)
ancestors, and hence each point appears in O(logn) auxiliary trees.

It follows that the total size of all the auxiliary trees is O(n log n).
By the way, observe that because the auxiliary trees are just 1
dimensional trees, we could just store them as a sorted array.

We claim that it is possible to construct a 2 dimensional range tree
in O(n log n) time. Constructing the 1 dimensional range tree for the
x coordinates is easy to do in O(n log n) time.  However, we need to
be careful in constructing the auxiliary trees, because if we were to
sort each list of y coordinates separately, the running time would be
O(n (logn)^2). Instead, the trick is to construct the auxiliary trees
in a bottom up manner. The leaves, which contain a single point are
trivially sorted. Then we simply merge the two sorted lists for each
child to form the sorted list for the parent. Since sorted lists can
be merged in linear time, the set of all auxiliary trees can be
constructed in time that is linear in their total since, or O(n log
n). Once the lists have been sorted, then building a tree from the
sorted list can be done in linear time.


Summarizing, here is the basic idea to this (and many other query
problems based on leveled searches). Let (S, R(1)) denote the range
space, consisting of points S and range sets R(1). Construct a data
structure, which represents S by a collection of canonical sets,
{S(1), S(2), ...  S(m)}.  For each canonical subset, construct a data
structure for answering the second type of range query (and so
on). The main property of the canonical subsets is that, for any range
query, we can efficiently determine a small number of canonical sets
whose disjoint union is equal to the answer to the query (in our case
this was O(logn) subsets). Furthermore, this collection of canonical
subsets can be determined efficiently (in our case in O(logn)
time). 

To answer a range query, we solve the first range query, resulting in
a collection of canonical subsets whose union is the answer to this
query. We then invoke the second range query problem on each of these
subsets, and so on. Finally we take the union of all the answers to
all these queries.


Fractional Cascading: Can we improve on the O((log n)^2) query time?
We would like to reduce the query time to O(log n). As we descend the
search the x interval tree, for each node we visit, we need to search
the corresponding y interval tree. It is this combination that leads
to the squaring of the logarithms. If we could search each y interval
in O(1) time, then we could eliminate this second log factor. The
trick to doing this is used in a number of places in computational
geometry, and is generally a nice idea to remember.

We are repeatedly searching different lists, but always with the same
key. The idea is to merge all the different lists into a single
massive list, do one search in this list in O(logn) time, and then use
the information about the location of the key to answer all the
remaining queries in O(1) time each. This is a simplification of a
more general search technique called fractional cascading.

In our case, the massive list on which we will do one search is the
entire set of points, sorted by y coordinate. In particular, rather
than store these points in a balanced binary tree, let us assume that
they are just stored as sorted arrays. (The idea works for either
trees or arrays, but the arrays are a little easier to visualize.)
Call these the auxiliary lists. We will do one (expensive) search on
the auxiliary list for the root, which takes O(logn) time. However,
after this, we claim that we can keep track of the position of the y
range in each auxiliary list in constant time as we descend the tree
of x coordinates.


Here is how we do this. Let v be an arbitrary internal node in the
range tree of x coordinates, and let v(L) and v(R) be its left and
right children. Let A v be the sorted auxiliary list for v and let
A(L) and A(R) be the sorted auxiliary lists for its respective
children. Observe that A(v)is the disjoint union of AL and AR
(assuming no duplicate y coordinates). For each element in A(v) , we
store two pointers, one to the item of equal or larger value in A(L) and
the other to the item of equal or larger value in A(R). (If there is no
larger item, the pointer is null.) Observe that once we know the
position of an item in A(v), then we can determine its position in
either A(L) or A(R) in O(1) additional time.

Here is a quick illustration of the general idea. Let v denote a node
of the x tree, and let v(L)and v(R) denote its left and right
children. Suppose that (in bottom to top order) the associated nodes
within this range are: , and
suppose that in v(L)we store the points  and in v(R)
we store . This is shown below. For each point in
the auxiliary list for v, we store a pointer to the lists v(L) and
v(R), to the position this element would be inserted in the other list
(assuming sorted by y values). That is, we store a pointer to the
largest element whose y value is less than or equal to this point.

At the root of the tree, we need to perform a binary search against
all the y values to determine which points lie within this interval,
for all subsequent levels, once we know where the y interval falls
with respect to the order points here, we can drop down to the next
level in O(1) time.


Figure 47: Cascaded search in range trees.

Thus (as with fractional cascading) the running time is O(2 log n),
rather than O( (log n)^2). It turns out that this trick can only be
applied to the last level of the search structure, because all other
levels need the full tree search to compute canonical sets.

Theorem: Given a set of n points in R d , orthogonal rectangular range
queries can be answered in O(log (d-1) n + k) time, from a data
structure of size O(n log (d-1) n) which can be constructed in
O(n log (d-1) n) time.


This course is modeled on the Computational Geometry Course taught by Dave Mount, at the University of Maryland. These notes are modifications of his Lecture Notes, which are copyrighted as follows: Copyright, David M. Mount, 2000, Dept. of Computer Science, University of Maryland, College Park, MD, 20742. These lecture notes were prepared by David Mount for the course CMSC 754, Computational Geometry, at the University of Maryland, College Park. Permission to use, copy, modify, and distribute these notes for educational purposes and without fee is hereby granted, provided that this copyright notice appear in all copies.