Besides the tedium of using emacs over a dialup, we kept running into
an annoying complication in bioinformatics: sequence numberings.
Suppose you have the sequence CCAAAC. What is the
subsequence range matching AAA? There are three common answers:
- biological numbering - 3 to 5
Position 1 is the first character in the sequence, and the range
i to j starts at i and includes j.
This is favored by biologists and most of the rest of the world. All
of the common formats, including GenBank, SWISS-PROT, or PDB, use this
range definition.
- computational biology numbering - 3 to 6
Position 1 is the first character in the sequence, and the range
i to j starts at i and goes up to but does
not include j.
Once you start working with biological numbering for a bit you realize
there are a lot of +1 and -1 terms. Forgetting the term creates a
fencepost error
in programming vernacular. For example, suppose a PROSITE pattern
matches residues 5 to 10; how long is the match? It's
10-5+1==6 not 5 which is the first thought many have.
Suppose one block of an alignment goes from 10 to 15 and the next goes
from 20 to 25; how long is the gap? It's 20-15-1==4 and not
5.
These fencepost errors are reduced by using a range definition which
does not include the last number in the range. If the PROSITE goes
from residues 5 to 10 then the range is given as 5 to 11, and the
length is 11-5==6, which is the obvious answer. In the
alignment example, the two blocks are given as (10,16) and (20, 26)
and the gap size between them is 20-16==4.
I called this computational biology numbering because I
couldn't think of a better name. It's used in the Life Sciences CORBA
spec and a few other projects. The people who use it tend to be
scientists (chemists or biologists) who become software developers or
programmers who are told they need to have sequence numbers start at
1.
- computer science numbering - 2 to 5
Position 0 is the first character in the sequence, and the range
i to j starts at i and goes up to but does
not include j.
Some tricky problems remain in computational biology numbering.
Suppose you have an alignment from a protein to a DNA sequence; no
gaps, and the first residues of the protein is aligned to the first
three bases in the DNA. What's the range for the three DNA bases
which correspond to the fourth protein residue?
In biological numbering, the protein is at position 4 and the DNA
range is 10 to 12. The equation for the DNA start position is
3*n-2 and for the end position is 3*n. Not to hard
to figure out, but it does take some thought.
In computational biology numbering the range is 10 to 13, and the
formulas are 3*n-2 and 3*n+1. Still requires a bit
of thought, which increases the odds of making an error.
The problem is because the numberings start from 1. Suppose it starts
from 0. This is what most computer scientists and some mathematicians
and physicists use, even to the point of labelling the first chapter
in a book "Chapter 0." These are the sort of people who get a thrill
that European elevators use 0 for the ground floor and are dismayed
that US elevators use 1 instead.
Using computer science numbering, the protein residue is at position 3
(because position 0 is the first character), and the corresponding DNA
range is 9 to 12. The formula is 3*x to 3*(x+1) and
very easy to remember.
In fact, the equation 3*n+2 used for the start position of
the other two numbering can be rewritten 3*(n-1)+1, which
basically moves the protein's 1 numbering to 0 numbering, multiplies
by 3, and moves back to the DNA's 1 numbering.
You can probably guess from my explanation that I prefer the computer
science numbering, at least for the innards of a computer program. It
makes the math so much easier, and since most software is written by
non-biologists, it means I can call them directly instead of doing an
extra conversion step. But when I need to count things manually, I
agree with the biologists and count things starting at 1 and including
the last number in the range.
The right solution is to write software which uses computer science
numbering internally does conversion from other numbering systems in
the I/O layers -- including the GUIs for the biologists. Anything
else causes problems.
I was helping Bill with a simple programming task - convert from one
alignment format to another. BLAST uses biological numbering.
Biopython's BLAST parsing code keeps the numbers as it finds them.
But the output format doesn't document which of the three numberings
it uses. The only way to figure it out is to load the file into a
viewer and do a manual check. It used the computational biology
ordering.
There's one more wrinkle, complementary sequences. One strand of DNA
is the reverse complement of the others (so ATCG is the revese
complement of CGAT). Suppose the protein is aligned to the negative
strand DNA; how is the range on the negative strand represented? Are
the start and end positions swapped? Are they in forward ordering but
with a flag to say "these are on the reverse strand"? Are the
positions recomputed from the other end (so x goes to N-x, where N is
the length of the sequence)?
Three numbering systems... Three different ways to handle reverse
complements... There are nine different ways to represent ranges! No
wonder it's so annoying. And it won't be going away any time soon.