In section 8.6, the book introduces the concept of sequencing by hybridization. It describes the spectrum, or l-mer composition, of a DNA sequence. For example, when s= TATGGTGC and l=3, the spectrum is {TAT, ATG, TGG, GGT, GTG, TGC}. The sequencing by hybridization problem is the task of constructing the DNA sequence that produces the given spectrum. Of course, the spectrum would generally not be given in the correct order.
The book mentions that a linear time algorithm exists to solve this problem. It then compares SBH to using DNA arrays as mentioned in the previous section. A main difference is that in SBH, the target DNA sequence is unknown while with DNA arrays, the target is either known or “almost” known. Of course, there are problems with the real life application of the algorithm, since it may be difficult to distinguish between perfect matches and highly stable mismatches.
In section 8.7, the book discusses SBH as a Hamiltonian path problem. The idea is to construct a directed graph by introducing a vertex for every l-mer in the spectrum and then connecting two vertices if they overlap. Two l-mers p and q overlap if the first l – 1 letters of p coincide with the last l – 1 letters of q. However, with complicated examples, there may be more than one Hamiltonian path, each corresponding with a different possible reconstruction.
In section 8.8, the book discusses SBH as an Eulerian path problem. This technique is more fruitful than the Hamiltonian path problem, since it can be solved using a linear time algorithm. This time, a graph is constructed whose edges, not vertices, correspond to l-mers from the spectrum. Then a path must be found that visits every edge exactly once. The book thoroughly describes how to produce such a graph on page 273. Finally, after proving that a connected graph is Eulerian if and only if each of its vertices is balanced, it describes the algorithm that is used to find an Eulerian cycle.
Rhyker,
The thing which most struck me is that although the Hamiltonian and Eulerian path problems seem more or less equally difficult, in fact the first is very hard (Ie. it’s NP-Hard) while the second is easy. As 8.7 and 8.8 explain, if you encode the problem in the most direct way you end up with a Hamiltonian path problem, but by being a little more clever at the start, you can instead turn it into a Eulerian problem. Which, in a way, is the moral of the whole subject: try to be clever enough at the beginning to save yourself work at the end.
And, I guess, that just because a problem _looks_ to be hard, it could be that the problem is hard, or just that you’re not being clever enough!
By: jonkujawa on August 16, 2011
at 5:24 pm