Section 8.9 delves further into the complications of fragment assembly and how those complications are currently addressed. Errors in reads, and not knowing whether a read comes from the target strand or its complement, both cause trouble, but the major problem is that current reads (500 to 700 base pairs) are not long enough to deal with repeats in DNA. The authors state that longer reads would solve this problem, but sequencing technology has not yet advanced that far.
They illustrate this problem quite well using the example of the frog puzzle (though I’m probably biased since I worked that puzzle many, many times growing up) – since all of the frogs are repeated, you can’t know which one goes where. Likewise, if you have several 500 bp reads that all happen to be within a 1000 bp repeat, you have no way of knowing which read goes with which copy.
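To make the ambiguity concrete, here is a minimal sketch using a made-up toy genome and exact string matching. The names `GENOME`, `REPEAT`, and `find_occurrences` are my own, not the book's, and real assemblers work with overlaps between error-prone reads rather than exact lookups against a known genome.

```python
def find_occurrences(genome, read):
    """Return every 0-based start position where read matches genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Toy genome (an assumption for illustration): two copies of one repeat,
# separated and flanked by unique sequence.
REPEAT = "ACGTACGGTC"
GENOME = "TTGA" + REPEAT + "CCATG" + REPEAT + "GGAT"

read_in_repeat = "GTACGG"  # lies entirely inside the repeat
read_spanning = "TGAACG"   # overlaps the unique left flank and the repeat

print(find_occurrences(GENOME, read_in_repeat))  # [6, 21]
print(find_occurrences(GENOME, read_spanning))   # [1]
```

The read that sits entirely inside the repeat matches both copies equally well, while the read that reaches into unique sequence has only one possible home.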
The original method, BAC-by-BAC sequencing, simplified the computation by reducing the number of repeats, but complicated the sequencing project. The current method, sequencing with mate-pair reads, attempts to ensure that each read has some kind of unique identifying information associated with it. To do this, they use inserts of a given length (longer than the read and longer than most repeats) and sequence both ends of each insert. This makes it almost certain that at least one of the two ends will contain a unique, non-repeated portion of the DNA. The authors then describe how the DNA sequence is obtained from these paired fragments.
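A rough sketch of how the known insert length can resolve a read that falls inside a repeat. Everything here is a simplifying assumption of mine: the toy genome, the function names, exact matching, and the idea that both mates are reported on the same strand (in practice the second mate comes from the opposite strand and insert lengths are only approximately known).

```python
def find_occurrences(genome, read):
    """Return every 0-based start position where read matches genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def place_mate_pair(genome, left, right, insert_len, tolerance=0):
    """Return (left_pos, right_pos) placements consistent with the insert length.

    The span is measured from the start of the left read to the end of the
    right read, and must be within `tolerance` of `insert_len`.
    """
    placements = []
    for lp in find_occurrences(genome, left):
        for rp in find_occurrences(genome, right):
            span = rp + len(right) - lp
            if rp > lp and abs(span - insert_len) <= tolerance:
                placements.append((lp, rp))
    return placements


# Toy genome (hypothetical): two copies of one repeat with unique flanks.
REPEAT = "ACGTACGGTC"
GENOME = "TTGA" + REPEAT + "CCATG" + REPEAT + "GGAT"

left_mate = "TGAACG"   # touches unique sequence, so it occurs once
right_mate = "GTACGG"  # inside the repeat, so it occurs twice

print(place_mate_pair(GENOME, left_mate, right_mate, insert_len=26))  # [(1, 21)]
```

On its own the right mate could belong to either copy of the repeat, but only the second copy sits at the right distance from the uniquely placed left mate.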
Section 8.10 on protein sequencing and identification leans more toward being a biological overview of the subject and its challenges. It alludes to some computational problems, but does not delve into any. Several applications of protein sequencing and identification are given, as well as an intro to mass spectrometry.
Section 8.11 introduces the peptide sequencing problem. The idea behind sequencing peptides seems to be to break the peptide into two parts (fragment it) and record the masses and ion types: in the fragmenting process peptides can lose molecules such as ammonia or water, and these partial peptides with pieces missing are called ion types. If $\Delta = \{\delta_1, \ldots, \delta_k\}$ is a set of numbers representing the masses of possible chemical groups removed during fragmentation, then $\Delta$ is called the set of ion types.
To sequence the peptides, two spectra are compared: given a known peptide, the theoretical spectrum is the set of masses obtained by subtracting all ion types from the masses of all partial peptides. Conversely, when the peptide is not known, the experimental spectrum is the set of numbers obtained by mass spectrometry. The goal of the peptide sequencing problem, then, is to find a peptide whose theoretical spectrum is the best match to some experimental spectrum.
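Here is a small sketch of that definition, under several assumptions of my own: integer (nominal) residue masses for a handful of amino acids, partial peptides taken to be the proper prefixes and suffixes, a toy set of ion types `{0, 18}` (no loss and loss of water), and "best match" scored by counting shared peaks. The function names and the short candidate list are hypothetical; the real problem searches over all peptides rather than a fixed list.

```python
# Nominal residue masses in daltons for a few amino acids.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def partial_peptide_masses(peptide):
    """Masses of all proper prefixes and suffixes of the peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    result = []
    prefix = 0
    for m in masses[:-1]:
        prefix += m
        result.append(prefix)          # prefix p1..pi
        result.append(total - prefix)  # matching suffix p(i+1)..pn
    return result


def theoretical_spectrum(peptide, ion_types):
    """Subtract every ion-type offset from every partial peptide mass."""
    return {m - delta
            for m in partial_peptide_masses(peptide)
            for delta in ion_types}


def shared_peaks(experimental, theoretical, tolerance=0):
    """Count experimental peaks that fall within tolerance of a theoretical mass."""
    return sum(1 for peak in experimental
               if any(abs(peak - t) <= tolerance for t in theoretical))


def best_match(candidates, experimental, ion_types):
    """Pick the candidate whose theoretical spectrum shares the most peaks."""
    return max(candidates,
               key=lambda p: shared_peaks(experimental,
                                          theoretical_spectrum(p, ion_types)))


ION_TYPES = {0, 18}                       # toy offsets: no loss, water loss
experimental = [57, 128, 140, 87, 200]    # made-up spectrum; 200 is noise

print(sorted(theoretical_spectrum("GAS", ION_TYPES)))
# [39, 57, 69, 87, 110, 128, 140, 158]
print(best_match(["GSA", "AGS", "GAS"], experimental, ION_TYPES))  # GAS
```

One thing this toy version makes obvious: a peptide and its reversal have identical prefix and suffix masses under this definition, so the shared peak count alone cannot tell them apart.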