Posted by: alyssaleone | July 29, 2011

Sections 7.1 – 7.3

Chapter 7 explores divide-and-conquer algorithms, which are exactly what they sound like: breaking a problem into smaller problems, then finding a solution to the original problem using the solutions to the smaller problems.

Secion 7.1 revisits the Sorting problem from chapter 2, when we looked at an algorithm requiring quadratic time.  With the new divide-and-conquer approach, we will be able to find an algorithm requiring O(n\log{n}) time.  It looks like there are two keys to this divide-and-conquer sorting algorithm: first, if we’re given two sorted lists of integers it is relatively easy (O(n)) to combine the two into one sorted list; second, a single-element list is, by definition, already sorted. Therefore it makes sense to construct a recursive algorithm which breaks down a list of integers into single-element lists, then take these sorted lists and repeatedly combine them to form the desired sorted list.

The algorithms themselves are fairly straightforward, and I can follow the first explanation of the O(n\log{n}) running time, but given the recurrence relation on page 230 I don’t think I could have come up with O(n\log{n}) on my own.  The second explanation is more intuitive, but the semantics in the last paragraph of section 7.1 seemed inconsistent.  He uses the term “recursion tree” to refer to the entire graph (which is not a mathematical tree, based on my understanding) in Figure 7.1, but goes on to state that it has 3 levels, which would indicate that either the Divide or the Conquer portion is the actual recursion tree.   Also, it looks like he uses “top level” (line 4) to refer to the level that is, strictly speaking, the second level from the bottom.  For those more familiar with computer science, is this typical?

In 7.2, the goal is to reduce the space complexity of the sequence alignment algorithm.  It is important to note that the algorithm presented will only compute the score of an alignment, not the alignment itself.  The implication seems to be that if we need the actual alignment, the space complexity remains O(mn).

Previously, in order to find the longest path in an edit graph, we needed the entire backtracking matrix, which again requires O(mn) space.  In order to do this, we need to come up with a way to find the longest path in the edit graph without using a backtracking matrix.  The new technique depends on the fact that we can find the middle vertex of the longest path without actually knowing the longest path – all we need to do is find the length of the longest path from the source to each vertex in the middle column (prefix(i), where i indexes each vertex in the middle column), and the length of the longest path from each vertex in the middle column to the sink (suffix(i)).  Then, since the length of the longest path from source to sink passing through vertex i in the middle column is length(i) = prefix(i) + suffix(i), maximizing length(i) over all i gives the length of the longest path from source to sink and, more importantly, determines the middle vertex in this path.

It took me a while to understand why this reduces the space requirement – i.e., why no backtracking matrix is required – but I think the key is that in finding this middle vertex, all we need are the lengths of two longest paths.  Since we don’t need the actual paths, we don’t need a backtracking matrix.

It’s easy to see that continuing this process – repeatedly finding middle vertices – will eventually give us each vertex in the longest path, hence the path itself.  Since this process must terminate, and each stage requires time equal to the rectangles to the left and right of the current middle column, the time required to compute the longest path is a partial sum of the geometric series with common ration 1/2 and first term equal to the area of the edit graph, nm, hence is less than 2nm.  So this algorithm requires O(mn) time.  We’re also told that it requires O(n) space, since to compute the alignment scores in a column we only use the scores from the preceding column, so we only need 2n values.

Section 7.3 revisits the Longest Common Subsequence problem and develops an algorithm with O(\frac{n^2}{\log{n}}) running time, using block alignment and the Four-Russians technique.  As a side note, I looked up an article on the Four-Russians technique to find out why it’s called that – it turns out that the technique comes from a paper by four authors, but only one is Russian…

Block alignment of two DNA sequences – in our case, both are assumed to have length n – is simply dividing the grid formed by the sequences into subgrids of size t x t, where t divides n.  A block path is one which enters and leaves each square at the corners.  The block alignment problem is to find the longest block path through the graph.  To do this, we must find the alignment score for each of the squares, which means solving \frac{n^2}{t^2} alignment problems, then piecing together the alignment scores to find the longest path.  Correct me if I’m wrong, but I think this ends up requiring O(n^2 + \frac{n^2}{t^2}) time. However, we’re looking for an algorithm with subquadratic time.

This is where the Four-Russians technique comes in, and in order to apply it we must have t \approxeq \log{n}.  The key lies in constructing 4^t \times 4^t minialignments and storing each alignment score in a lookup table – if we assume t=\frac{\log{n}}{4} (which, since it’s a scalar multiple I think that for running time this counts as t \approxeq \log{n}), this lookup table will have only n entries.  The running time calculations make use of t \approxeq \log{n}:

Computing the score for each t x t minialignment takes O(t^2)=O((\log{n})^2) time, so with n entries it takes O(n(\log{n})^2) running time.  Apparently, though, the overall running time is dominated by the \frac{n^2}{t^2} accesses (could someone explain why there’s that number of accesses?), and each access takes O(\log{n}) time, so the overall running time is O(\log{n}\cdot\frac{n^2}{t^2})=O(\log{n}\cdot\frac{n^2}{(\log{n})^2})=O(\frac{n^2}{\log{n}}).

Advertisement

Responses

  1. Alyssa,

    I think they were being a little loose with the word “tree” which is what caused confusion. In the Divide and Conquer method one actually uses two trees (and they are trees in the mathematical sense).

    First, in the Divide stage, you start with a single input (which is the root of the tree), and then break it into subcases (which are the branches from that root), then each of these subcases is further broken up (giving the next level of branches), and this is repeated until you get down to however small you need to be to solve the problem. The end result is a tree with the root being our initial input, and the leaves being the small subcases which are the ones actually solved. In Figure 7.1, the root of the Divide tree is at the top, and the leaves are the single boxes in the middle.

    There is a second Conquer tree. This has the leaves being the solved subcases, and the branches at each level merge together the small solutions of the previous level into a single larger solution. The root then is the termination of this process which is the final single solution. In figure 7.1 the Conquer tree starts with the leaves in the middle and ends with the root at the bottom.

    In terms of levels, it depends on if you think of figure 7.1 as one big tree (which wouldn’t be a mathematical tree), or two trees which are attached at the leaves. And then which way is up in each of the trees.

    There were several points in these sections which reminded me of our calculations from last year. In particular, we ran into problems with running out of memory long before running out of time, and some of our workarounds were rather like what they suggest here. For example, deleting intermediate diagrams once they were no longer needed. Or the 4-Russians method of pre-computing the needed small cases (in our case, the basic diagrams) so that we could call on them as needed later on.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Categories

Follow

Get every new post delivered to your Inbox.