Studio 10: Building, Using, and Augmenting Binary Search Trees

Today, we'll be looking at binary search trees. After some initial review and practice, we'll think a little bit about how to show that a tree is balanced, then investigate tree augmentation -- a strategy to keep track of more information than just keys as you dynamically modify a tree.

Part A: Warm-Up

Let's start by building ourselves a tree to play with.

  • Insert the following keys into an initially empty binary search tree (BST):

10, 4, 15, 8, 12, 9, 5, 3, 11, 13, 16, 1, 17

Sketch the resulting tree and check with your TA to make sure it is correct.

Question A1: what is the resulting tree?
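
If you'd like to double-check your sketch by scripting the insertions, here is a minimal sketch of standard (unbalanced) BST insertion in Python. Nothing in this studio requires Python; the Node class and insert() function below are hypothetical names chosen for illustration only.

    class Node:
        """A BST node holding a key and pointers to its two children."""
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def insert(root, key):
        """Insert key into the BST rooted at root; return the (possibly new) root.

        Duplicate keys are ignored, matching the usual convention for a set of keys.
        """
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        elif key > root.key:
            root.right = insert(root.right, key)
        return root

    # Build the warm-up tree from Part A.
    keys = [10, 4, 15, 8, 12, 9, 5, 3, 11, 13, 16, 1, 17]
    root = None
    for k in keys:
        root = insert(root, k)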

A little aside: how can you write down a tree in a text file? Rather than engage in ASCII art, you can unambiguously translate a binary tree $T$ to a linearized representation $R(T)$ as follows:

  • An empty tree is written as "-".
  • A single node with key $x$ is written as "x".
  • A general tree whose root has key $x$ and left and right subtrees $T_1, T_2$ is written as $(x R(T_1) R(T_2))$.

So, a tree that looks like this figure:

[figure: the example tree, with root 5, a left subtree rooted at 3 (children 2 and 4, where 2 has a left child 1), and a right subtree rooted at 9 (left child 6)]

is linearized as $(5 (3 (2 1 -) 4) (9 6 -))$.
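
As a sanity check on this notation, here is a hedged Python sketch of the linearization, assuming nodes carry key, left, and right attributes as in the earlier sketch and that an empty subtree is represented by None; linearize() is just an illustrative name.

    def linearize(node):
        """Return the linearized string R(T) of the subtree rooted at node."""
        if node is None:
            return "-"
        if node.left is None and node.right is None:
            return str(node.key)
        return f"({node.key} {linearize(node.left)} {linearize(node.right)})"

Called on the root of the tree pictured above, this returns the string "(5 (3 (2 1 -) 4) (9 6 -))".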

  • Using the BST successor algorithm, compute succ(10) and succ(9) on the tree you just built.

Question A2: what keys are associated with the nodes that your traversal passes when computing succ(10)? When computing succ(9)?

  • Now delete the following keys, in order, from the tree:

3, 17, 4, 10

Question A3: What trees do you obtain after each deletion?

If you want to check your answers after trying to do the work yourself, you may find this BST visualizer helpful.
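
If you also want a concrete reference for how deletion is usually implemented (you will want one again when you modify remove() in Part C), here is one common version as a hedged Python sketch over the same hypothetical Node class; when a node has two children, its key is replaced by its successor's key, the minimum of its right subtree.

    def remove(root, key):
        """Delete key from the BST rooted at root and return the new root."""
        if root is None:
            return None                       # key not present
        if key < root.key:
            root.left = remove(root.left, key)
        elif key > root.key:
            root.right = remove(root.right, key)
        else:
            # Found the node to delete.
            if root.left is None:
                return root.right             # zero children, or only a right child
            if root.right is None:
                return root.left              # only a left child
            # Two children: copy in the successor's key, then delete the successor.
            succ = root.right
            while succ.left is not None:
                succ = succ.left
            root.key = succ.key
            root.right = remove(root.right, succ.key)
        return root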

Part B: Creating a Balanced Tree from Scratch

Next week, we will spend a lot of time thinking about how to have binary search trees that are balanced, that is, trees whose height is at most logarithmic in the number of nodes they contain. Balanced trees are challenging to maintain under arbitrary insertions and deletions, but they aren't very hard to create if you are building a new tree from scratch.

Suppose we have the integers $[1..n]$, and we want to insert all of these values into a binary search tree.

Question B1: in what order could you insert these values to maximize the height of the resulting tree?

The typical approach to getting a balanced tree is to find some property $P$ such that, if $P$ holds for every subtree of a tree, then the tree is balanced. We'll see some properties of this kind next week that are used for trees that support dynamic insertion and deletion. For now, let's consider the following simpler property $P$: the numbers of nodes to the left and right of the root node differ by at most 1.
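
To make property $P$ concrete, here is a small hedged Python sketch that tests whether every subtree of a given tree satisfies it, again assuming the hypothetical Node class from Part A.

    def satisfies_p_everywhere(node):
        """Return (ok, size), where ok says whether every subtree rooted at node
        satisfies property P, and size is the number of nodes in that subtree."""
        if node is None:
            return True, 0
        left_ok, left_size = satisfies_p_everywhere(node.left)
        right_ok, right_size = satisfies_p_everywhere(node.right)
        ok = left_ok and right_ok and abs(left_size - right_size) <= 1
        return ok, left_size + right_size + 1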

We claim that, if every subtree of a binary tree $T$ with $n > 0$ nodes satisfies property $P$, then the height of $T$ is at most $\log n$. Let's prove this claim inductively on $n$.

Question B2: give one or more suitable base cases for the proof. Why might you want to specify more than one base case?

In the inductive case, we assume that $T$ has a root node and (nonempty) left and right subtrees $T_1$ and $T_2$. If every subtree of $T$ satisfies property $P$, then in particular, $P$ holds for the whole tree and for subtrees $T_1$ and $T_2$. Let's suppose (without loss of generality) that $T_1$ has at most as many nodes as $T_2$.

Question B3: let $n_1$ and $n_2$ be the sizes (numbers of nodes) of subtrees $T_1$ and $T_2$, respectively. How are the sizes of these two trees related to each other? How are their sizes related to their heights?

Question B4: given the facts about subtree sizes and heights you observed in B3, can you prove that the height of $T$ is at most $\log n$?

I'd suggest checking the desired inequality separately for each possible value of $n_2$.

Question B5: is the claim if-and-only-if? That is, does every balanced tree satisfy property $P$? Why or why not?

If we believe the claim, then to build a balanced BST, it is sufficient to build one in which every subtree satisfies property $P$.

Question B6: which value from the input set $[1..n]$ would you insert first so that, if the tree is a valid BST, then property $P$ holds at the root of the tree?

We can extend our intuition from B6 to build the tree in such a way that every subtree satisfies property $P$. The basic idea is to recursively choose insertion orders for the two subsets of keys that remain after we insert the value that belongs at the root of the tree.

Question B7: sketch some pseudocode that performs the $n$ insertions needed to construct the BST in such a way that every subtree satisfies property $P$.
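
For comparison after you have attempted B7 yourself, here is one hedged Python sketch of the recursive idea described above. It reuses the hypothetical insert() from the Part A sketch and assumes the keys are given in sorted order (for the input $[1..n]$ they already are).

    def insert_balanced(root, sorted_keys):
        """Insert sorted_keys into the BST so that every subtree satisfies P.

        The middle key of each range is inserted before either half, so it becomes
        the root of the corresponding subtree.
        """
        if not sorted_keys:
            return root
        mid = len(sorted_keys) // 2
        root = insert(root, sorted_keys[mid])
        root = insert_balanced(root, sorted_keys[:mid])
        root = insert_balanced(root, sorted_keys[mid + 1:])
        return root

    # Example: a balanced BST on the keys 1..13.
    balanced_root = insert_balanced(None, list(range(1, 14)))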

Question B8: to conclude this section, how long does it take to build a balanced binary search tree from scratch by your procedure in B7, given an unsorted array of $n$ keys?

Part C: Augmenting a BST With Statistics

For many algorithms that use BSTs, we want to measure certain statistics of parts of a tree. For example, we might want to know, for a certain subtree of a BST $T$, how many nodes it contains or how tall it is. However, computing these statistics for a subtree directly may be very expensive.

Question C1: given a BST $T$, specifically a pointer to its root node, give an efficient recursive algorithm to compute the total number of nodes in $T$.

Question C2: how long does your algorithm take to compute this quantity?

At any time, we can compute statistics for each subtree of a BST and store them in the subtree's root node, where they will be available for future needs. We say that the BST is augmented to hold these statistics.
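
Concretely, augmentation just means that each node carries one extra data member per statistic. A hedged sketch, extending the hypothetical Node class from Part A with a size field (the helper subtree_size() is likewise just an illustrative name):

    class AugmentedNode:
        """A BST node augmented with the size of the subtree rooted at it."""
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None
            self.size = 1     # a freshly created node is a subtree of one node

    def subtree_size(node):
        """Size stored at node, treating an empty subtree as size 0."""
        return 0 if node is None else node.size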

If, as in Part B, we built our BST all at once and did not change it thereafter, we could compute the desired statistics once on the completed tree and use them forever after. But real BSTs are not static! They support an arbitrary sequence of insertions and deletions, and all the other BST operations (e.g. find() and max()) must run efficiently and must yield a correct result if called at any point during this sequence. We would like these nice properties to be true for statistical queries about the tree as well. Hence, the challenge is to maintain the desired statistics for each subtree of the BST as nodes are inserted and deleted.

Let's suppose that each node $x$ of a BST is augmented with a data member size, which is the number of nodes in the subtree rooted at $x$. Suppose further that we cannot modify nodes' stored sizes until after the insertion actually completes. For example, we might have to check whether the key is already present in the tree before we can actually insert it, so as to prevent duplicates.

Question C3: give a one-line formula for computing the size of a node $r$ with left and right subtrees $T_1$ and $T_2$, given the sizes stored at the roots of $T_1$ and $T_2$.

Question C4: write pseudocode for a recursive insert() procedure that inserts a key into a BST, then updates any necessary size fields so that all such fields are correct.
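
If you would like to compare against one possible answer after attempting C4 yourself, here is a hedged Python sketch that uses the hypothetical AugmentedNode class sketched above. It respects the constraint described earlier, touching size fields only on the way back up, and only when a node was actually added.

    def insert_with_size(root, key):
        """Insert key into an augmented BST; return (new_root, inserted)."""
        if root is None:
            return AugmentedNode(key), True
        if key < root.key:
            root.left, inserted = insert_with_size(root.left, key)
        elif key > root.key:
            root.right, inserted = insert_with_size(root.right, key)
        else:
            inserted = False              # duplicate key: the tree is unchanged
        if inserted:
            root.size += 1                # one more node now lives in this subtree
        return root, inserted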

Question C5: now explain how you would modify the remove() operation for a BST to update the stored sizes in the tree after a node is removed.

Be sure to separately address the cases of removing nodes with 0, 1, and 2 children.

Other statistics can be maintained similarly to the size. For example, suppose we want to maintain the height of each node $x$ in the tree, that is, the maximum number of edges on any path between $x$ and a leaf in the subtree rooted at $x$.

Question C6: give a one-line formula for computing the height of a node $r$ with left and right subtrees $T_1$ and $T_2$, given the heights stored at the roots of $T_1$ and $T_2$. How is this formula similar to the one you developed in C3?

Question C7: using your result from C6, how would you modify your procedures in C4 and C5 to update the height fields on insertion and deletion?

What should the height be for a single leaf? For an empty subtree?

Part D: Using Subtree Statistics

We maintain subtree statistics in a BST not because they are inherently interesting but because they help us do useful things with the tree. Next week, we'll see that maintaining the height of each subtree is useful for efficiently balancing the tree under dynamic insertion and deletion. Today, we'll look at a use for subtree sizes.

A nice use for a BST with subtree sizes is computing the median element of a dynamic, ordered collection. The median of a collection of $n$ distinct elements is the element $x$ for which there are exactly $\lfloor n/2 \rfloor$ elements in the collection less than $x$. For example, the median of $\{3, 7, 9, 12\}$ is 9, because exactly $\lfloor 4/2 \rfloor = 2$ of its elements are less than 9. As usual, we can do any sequence of insertions and deletions from the collection, and we want to be able to efficiently return its median at any point during this sequence.

Suppose we maintain our collection as a BST with subtree sizes.

Question D1: how can you tell in constant time whether the median is at the root of the BST, to the left of the root, or to the right of the root?

Question D2: using your answer to D1, sketch a recursive algorithm for locating the median element in a BST with subtree sizes. Your algorithm should run in worst-case time proportional to the height of the tree.

It might help to generalize your algorithm from finding the median to finding an element $x$ such that there are exactly $m$ smaller elements in the collection, for any $m < n$. We call this the element of rank $m$ in the collection.
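
For reference once you have worked through D1 and D2, here is one hedged Python sketch of rank selection over the augmented nodes from Part C (an empty subtree counts as size 0); select_rank() and median() are illustrative names only.

    def select_rank(node, m):
        """Return the node whose key has exactly m smaller keys in this subtree,
        or None if there is no such node."""
        if node is None:
            return None
        left_size = node.left.size if node.left is not None else 0
        if m == left_size:
            return node
        if m < left_size:
            return select_rank(node.left, m)
        return select_rank(node.right, m - left_size - 1)

    def median(root):
        """Return the node holding the median, i.e. the element of rank floor(n/2)."""
        return None if root is None else select_rank(root, root.size // 2)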

Because we can efficiently maintain the statistics needed to compute the median, we can efficiently query it at any time. Of course, the cost here depends on the height of the tree, and we'd rather it depend only logarithmically on the size of (number of nodes in) the tree. We'll achieve that next week by guaranteeing, as we did in Part B, that the height of the tree is always logarithmic in its size.

A more advanced example of maintaining and using per-subtree statistics to good effect is the interval tree. An interval tree stores a collection of intervals on the real line. It efficiently supports queries such as, "Is a point $p$ inside any interval in this collection?" and "Which intervals in the collection overlap a given interval $[s,e)$?"

The multidimensional analog of an interval tree, called a k-d tree, can be used to test whether a point falls in a collection of (possibly overlapping) rectilinear volumes in space. For example, a 2-D tree can test containment in a collection of rectangles.