Sorry for the delay. I’ve been busier than expected these past few days, and I found I had way more to say about this than expected.
The overarching goal of these 3 sections is to find a way to measure how ‘similar’ two strings of either nucleotides or of amino acids is, continuing the ideas from the previous sections. The real mathematical difficulty here is to capture the notion of similarity with a mathematical definition which is both robust and efficiently computable.
In sections 6.4 and 6.5 we considered adding ‘gaps’ to the Hamming distance to get a better notion of the distance between two strings. In 6.6, we consider a generalization of this method. Let δ be a (k+1) x (k+1) matrix, called the scoring matrix, where k is the number of letters in the alphabet we are using (whether they be nucleotides or amino acids). We’ll abuse notation a bit and write δ(x,y) for the entries of δ, where x and y are either letters or the gap character ‘-’.
The idea is that we’ll parse through the strings as we did before, either diagonally or treating the character as a gap, and at each step we add δ(x,y) to a running total, where x and y are characters extracted from the respective strings, or the gap. For a particular way to parse the strings, we get a number, which I’ll call the similarity of the two strings, and then we minimize that number over all possible paths. Algorithmically, there’s a nice way to do this, which I won’t discuss since it’s rather prominently placed in the chapter.
While this certainly is better than any of the previous methods, and is probably in a form that is usable as is, it isn’t terribly satisfying to me. At the very least, using it to predict the probability of a particular mutation occurring over a long period of time is certainly not very accurate, so there’s a question as to whether this represents the actual “biological similarity” of the sequences (if such a thing actually exists). To explain better, I’d need to draw an analogy to statistical mechanics, which I expect won’t be familiar to most people here. If there’s any interest in this analogy (which I think is pretty interesting, but extremely tangential) then I can post it in the comments. In the analogy, the similarity we have come up with corresponds to the limit that the ‘temperature’ (analogous to mutation rate) goes to 0. On the other hand, my idea is really of only theoretical interest, since it doesn’t lend itself to an algorithmic solution.
A special case of this solution is the case where δ has diagonal elements 1 in all but the ‘-’ row and column, -μ off-diagonal elements in these rows and columns, and -σ in the last row and column, for real μ, σ (typically nonnegative). This doesn’t seem to actually speed up the algorithm significantly, but it certainly reduces the amount of memory required. It’s useful for DNA sequence comparisons, but less so for protein sequences because these can often change quite a lot and still be qualitatively very similar.
An observation I made when reading is that, if the diagonal elements (other δ(-,-), which is unimportant) are all equal and positive, and if δ is symmetric, and all the off-diagonal elements are less than the diagonal elements, then given the set of strings of fixed length n, we can do an affine transformation on the similarity function to make it into a metric space. If we allow the affine function do also be a function of the lengths of the two strings (affine in each of these parameters as well) then we can make the set of all strings into a metric space. The assumptions all seem somewhat reasonable, but I’m no expert in biology. I don’t know if this is useful, but metric spaces are nice enough things that it’s plausible that it has some application.
Section 6.7 deals with the problem of comparing protein sequences. They first explain an algorithm that one might think up to compute good values of δ. The big problem with this “algorithm” is that it needs δ to compute δ. So if we don’t already know what δ should be, the algorithm won’t actually give us δ. We might expect that this means it’s useless, but in fact in many cases we can start with a guess, and if it’s good enough the algorithm will make it better. We can think of the algorithm as a continuous function
, and study the dynamical system generated by repeated application of
. Our desired δ is a fixed point of
. If it is attracting on some “large” subset of
then the algorithm applied iteratively will converge to δ if we start with a member of this “large” subset. There are several interesting analytical issues here (most notably proving that the fixed point is attracting), but this is digressing a lot. I did try to prove this for about 5 minutes, but wasn’t successful.
Section 6.7 also gives an alternate way to get a good δ, which is essentially based on statistical analysis of protein sequences. It should be fairly straightforward to understand, and I don’t have much to say about it.
Section 6.8 discusses the problem of finding substrings of two given strings which are very similar. In many cases, we can’t expect the strings to be similar overall, but certain parts should be very similar. This is called the local alignment problem (as opposed to the global problem). The problem has the same complexity as the global problem. I honestly didn’t find that too surprising, but apparently the authors did. It only requires a small modification of the method from section 6.6, so I’ll leave out the in-depth description since everyone will read it in the book.
In all honesty, this solution isn’t terribly mathematically elegant to me, because I don’t see any reason to expect that the similarities of pairs of strings of different lengths are comparable the way we’ve defined it. If anyone sees a good argument as to why they should be, that’d be interesting.
These algorithms are probably robust enough to work for biological data, but I expect more specialized algorithms will outperform them most of the time.