Lecture 11: Orthogonal Range Searching
Reading: Chapter 5 in the 4M's.
Range Queries:
We shift our focus from algorithm problems to data structures for the
next few lectures. In general, we consider the question, given a
collection of objects, preprocess them (storing the results in a data
structure of some variety) so that queries of a particular form can be
answered efficiently. Generally we measure data structures in terms of
two quantities, the time needed to answer a query and the amount of
space needed by the data structure.
Often there is a tradeoff between these two quantities, but most of
the structures that we will be interested in will have either linear
or near linear space. Preprocessing time is an issue of secondary
importance, but most of the algorithms we will consider will have
either linear or O(n log n) preprocessing time.
In a range queries we are given a set P of points and region R in
space (e.g., a rectangle, polygon, halfspace, or disk) and are asked
list (or count or compute some accumulation function of) the subset of
P lying within the region. To get the best possible performance, the
design of the data structure is tailored to the particular type of
region, but there are some data structures that can be used for a wide
variety of regions.
An important concept behind all geometric range searching is that the
subsets that can be formed by simple geometric ranges is much smaller
than the set of possible subsets (called the power set) of P.
(Cool fact that you will not be tested on, and is only marginally
relevant: There is something called the vapnik-chernovenkis
dimension of a "shape" which asks: what is the maximum number N
such that you can place N points in the plane and seperate any
possible subset of those points using (scaled or rotated versions
of ) that shape? ie, if you put 3 points on the plane, you can
choose any subset of those three points using an appropriately
placed half-plane. But you can't do the same thing for four
points.)
We can define any range search problem abstractly as follows. Given a
particular class of ranges, a range space is a pair (P, R) consisting
of the points P and the collection R of all subsets of P that be
formed by ranges of this class. For example, the following figure
shows the range space assuming rectangular ranges for a set of points
in the plane. In particular, note that the sets {1, 4} and {1, 2, 4}
cannot be formed by rectangular ranges.
Figure 42: Rectangular range space.
Today we consider orthogonal rectangular range queries, that is,
ranges defined by rectangles whose sides are aligned with the
coordinate axes. One of the nice things about rectangular ranges is
that they can be decomposed into a collection 1 dimensional searches.
One dimensional range queries:
Before consider how to solve general range queries, let us consider
how to answer 1 dimensional range queries, or interval queries.
We are given a set of points P = {p(1), p(2), ... p(n)} on the line,
and given an interval [x(lo), x(hi)], report all the points lying
within the interval.
Our goal is to find a data structure that can answer these queries in
O(log(n) + k) time, where k is the number of points reported (an
output sensitive result). [Why is it important that this be an output
sensitive result?]
Naive Solutions:
What if we don't pre-process our input at all?
Then for an input [x(lo), x(hi)] we can take O(n) time to check to see if
each point p(i) is in [x(lo), x(hi)], and report it or count it if it is.
Ok ok, so sort the points first.
apply binary search to find the first point of P that is greater
than or equal to x(lo) , and less than or equal to x(hi), and then
list all the points between.
So, range counting queries can be answered in O(log n) time, with
minor modifications, with pre-processing consisting of sorting the
sort the points. However, this will not generalize to higher
dimensions. (ie, what is "sorting the points" when the points are not
one-dimensional?).
A basic approach to solving almost all range queries is to represent P
as a collection of canonical subsets {S(1), S(2), ... S(k)}, each S(i)
is a subset of the set S (where k is generally a function of n and the
type of ranges), such that the answer to any query can be expressed as
a disjoint union of a small number of canonical subsets. Note that
these subsets may generally overlap each other. The trick to solving a
range searching problem is to identify a collection of canonical
subsets having these properties, and then finding the proper subsets
for a given range.
Suppose that the points of P are sorted in increasing order and then
stored in the leaves of a balanced binary tree. Each internal node of
the tree is labeled with the largest key appearing in its left
child. We can associate each node of this tree (implicitly or
explicitly) with the subset of points stored in the leaves that are
descendents of this node. This gives rise to the O(n) canonical
subsets. For now, these canonical subsets will not be stored
explicitly as part of the data structure, but this will change later
when we talk about range trees.
We claim that the canonical subsets corresponding to any range can be
identified in O(log n) time from this structure. Given any interval
[x(lo), x(hi)], we search the tree to find the leftmost leaf u whose
key is greater than or equal to x(lo) and the rightmost leaf v whose
key is greater than or equal to x(hi). Clearly all the leaves between
u and v, together possibly with u and v, constitute the points that
lie within the range. If key(u) = x(lo) then we include u's canonical
(single point) subset and if key(v) = x(hi) then we do the same for
v. To form the remaining canonical subsets, we take the subsets of all
the maximal subtrees lying between u and v.
Figure 43: Canonical sets for interval queries.
Here is how to compute these subtrees. The search paths to u and v may
generally share some common subpath, starting at the root of the
tree. Once the paths diverge, as we follow the left path to u,
whenever the path goes to the left child of some node, we add the
canonical subset associated with its right child. Similarly, as we
follow the right path to v, whenever the path goes to the right child,
we add the canonical subset associated with its left child.
To answer a range reporting query we simply traverse these canonical
subtrees, reporting the points of their leaves. Each tree can be
traversed in time proportional to the number of leaves in each
subtree. To answer a range counting query we store the total number of
points in each subtree (as part of the preprocessing) and then sum all
of these over all the canonical subtrees. Since the search paths are
of length O(log n), it follows that there are O(log n) total canonical
subsets. Thus the range counting query can be answered in O(log n)
time. For reporting queries, since the leaves of each subtree can be
listed in time that is proportional to the number of leaves in the
tree (a basic fact about binary trees), it follows that the total time
in the search is O(log n + k), where k is the number of points
reported.
Thus, 1 dimensional range queries can be answered in O(logn) time,
using O(n) storage. This concept of finding maximal subtrees that are
contained within the range is fundamental to all range search data
structures. The only question is how to organize the tree and how to
locate the desired sets. Let see next how can we extend this to higher
dimensional range queries.
Kd trees:
The natural question is how to extend 1 dimensional range searching to
higher dimen sions. First we will consider kd trees. This data
structure is easy to implement and quite practical and useful for many
different types of searching problems (nearest neighbor search ing for
example). However it is not the asymptotically most efficient solution
for the orthogonal range searching, as we will see later.
Our terminology is a bit nonstandard. The data structure was designed
by Jon Bentley. In his notation, these were called k d trees, short
for ``k dimensional trees''. The value k was the dimension, and thus
there are 2 d trees, 3 d trees, and so on. However, over time, the
term kd-tree has become so standard that we simply drop the pretense
of using k as a variable.
The idea behind a kd tree is to extend the notion of a one dimension
tree, but alternate in using the x or y coordinates to split on. In
general dimension, the kd-tree cycles among the various possible
splitting dimensions, but you might imagine many variations where the
choice of the splitting dimension depends on the structure of the data
set.
Each internal node of the kd-tree is associated with two quantities, a
splitting dimension (either x or y), and a splitting value s. It has
two subtrees. Since the splitting dimension alternates between x and
y, some implementations do not store this explicitly, but keep track
of it while traversing the tree. If the splitting dimension is x, then
all points whose x coordinates are less than or equal to s are stored
in the left subtree and points whose x coordinates are greater than or
equal to s are stored in the right subtree. (If a point's coordinate
is equal to s, then we reserve the right to store it on either
side. This is done to allow us to balance the number of points in the
left and right subtrees.) When a single point (or more generally a
small constant number of points) remains, we store it in a leaf
node. Note that in our definition the data points are stored only in
the leaves of the tree. The internal nodes only contain splitting
information.
How is the splitting value chosen? To guarantee that the tree is
balanced, the most common method is to let the splitting value be the
median splitting coordinate. If there are an even number of points in
the subtree, we may take either the upper or lower median, or we may
simply take the midpoint between these two points. As a result, the
tree will have O(log n) height. How is the splitting dimension chosen?
As mentioned earlier one simple strategy is to cycle through the
dimensions one by one. Another is to look at the data points, and
select the dimension along which they have the greatest spread,
defined to be the difference between the largest and smallest
coordinates. Bentley refered to the tree resulting from this splitting
rule to be the optimized kd-tree. In the example below, we simply
alternate splitting dimensions. If there are a odd number of points,
the median value goes with the left (or lower) subtree.
Figure 44: A kd tree and the associated spatial subdivision.
A kd-tree naturally defines a subdivision of space, where each
splitting node can be thought of as introducing a vertical or
horizontal splitting line. Points lying to one side of the line are
stored in the left subtree and the other points are stored in the
right subtree. This subdivides space into regions, called cells, which
are (possibly unbounded) rectangles. In fact this is a special case of
a more general class of data structures, called binary space partition
or BSP trees, in which the splitting lines may be oriented in any
direction. In the case of BSP trees, the cells are convex polygons. Of
course, both data structures can be generalized to higher dimensions,
where splitting is done using d-1 dimensional hyperplanes.
It is possible to build a kd tree in O(n log n) time by a simple
recursive procedure. The most costly step of the process is
determining the median coordinate. One trick for speeding up
processing time is to presort the points into two cross referenced
lists, one sorted by x and the other sorted by y. Using these lists,
it is an easy matter to find the median at each step in constant
time. The two lists can then be split in O(n) time each, where n is
the number of remaining points. This leads to a recurrence of the form
T (n) = 2 T(n/2) + n, which solves to O(n log n).
KD-tree construction applet
Searching the kd tree:
To answer an orthogonal range query we proceed as follows. Let R
denote the query rectangle. Assume that the current splitting line is
vertical. (The horizontal case is similar.) Let v denote the current
node in the tree. Each node of the tree is associated with a
rectangular region of space (its cell). The search proceeds as
follows.
If v is a leaf, then we check the point(s) stored in v as to
whether it lies in R. If so we report/count them.
If v is an internal node, then we first consider the left subtree.
If its region lies entirely within R then we call a different
routine to enumerate all the points in this subtree (or for a
counting query we return a precomputed count of the number of points
in this subtree).
If the left child's region partially overlaps R then we search it
recursively. (If the left child's region does not overlap R at all,
then we ignore it.) We do the same for the right child of v. The
figure below illustrates the nodes that would be visited for the
given query range. The leaf nodes with heavy outlines are the points
that are finally reported.
Figure 45: A kd tree and the associated spatial subdivision.
How many nodes does this method visit altogether?
We claim that the total number is O( sqrt(n) + k), where k is the
number of points reported.
Theorem:
The times needed to answer an orthogonal range quep ry using
kd tree with n points is O(sqrt(n) + k) where k is the number of
reported points.
Proof:
First observe that for each subtree with m points, we can enumerate
the points of this subtree in O(m) time, by a simple traversal. Thus,
it suffices to bound the number of nodes visited in the search (not
counting subtree enumeration).
To count the number of nodes visited, first observe that if we visit a
node, then its associated region must intersect one of the four sides
of the rectangle range. (Otherwise it lies entirely inside or outside,
and so is not visited.) We will do this separately for each of the
four sides, and multiply the result by four.
Rather than consider the line segment defining a side of the range
rectangle, we consider the number of nodes visited if we had applied
an orthogonal range query to the entire line on which this side
lies. This will only overestimate the number of nodes visited.
Because the kd-tree processes even and odd levels differently, it
will be important to analyze two levels at a time (or generally d
levels at a time for dimension d). Let Q(n) denote the query time for
visiting a node v that has n points descended from it. Consider any
orthogonal line that intersects the region associated with v. The key
observation is that any horizontal or vertical line can intersect at
most two of the four subregions associated with the grandchildren of
v. (The other two will either be entirely contained within the
halfspace or lie entirely outside the halfspace. In either case the
algorithm will not make a recursive call on them.) Since each child
has half as many points (because we split at the median), the number
of points in each grandchild is roughly n/4. Thus we make at most two
recursive calls on subtrees of size n/4. This gives the following
recurrence:
Q(n) =
1 if n = 1,
2Q(n/4) + 1 otherwise.
It is an easy process to expand this recursion, and derive the fact
that it solves to O(n * (log base 4 of 2 )) = O(sqrt(n)). (This also
follows from the Master Theorem in CLR.) Including the factor of four,
we still have an O(sqrt(n)) bound on the total number of nodes
visited, and adding the enumeration time the total time is O(sqrt(n) + k)
as desired.
Other spatial data structures links.
A Quad-Tree
http://www.cs.umd.edu/~brabec/quadtree/index.html
A Binary Space Partition Tree
http://symbolcraft.com/graphics/bsp/index.html