The Artima Developer Community
Sponsored Link

Python Buzz Forum
Sequence Numberings

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Andrew Dalke

Posts: 291
Nickname: dalke
Registered: Sep, 2003

Andrew Dalke is a consultant and software developer in computational chemistry and biology.
Sequence Numberings Posted: Sep 30, 2003 12:54 AM
Reply to this message Reply

This post originated from an RSS feed registered with Python Buzz by Andrew Dalke.
Original Post: Sequence Numberings
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Latest Python Buzz Posts
Latest Python Buzz Posts by Andrew Dalke
Latest Posts From Andrew Dalke's writings

Advertisement
Spent some time helping Bill with one of his projects. He needed a bit of coding done to make BLAST output into a form importable into UCSC's genome browser. Wouldn't have been too hard except he has no local network, a windows machine w/o Python installed, and only dialup access to the rest of the world, including where the data was. I must have spent half my time waiting for the cursor to catch up to my typing.

Besides the tedium of using emacs over a dialup, we kept running into an annoying complication in bioinformatics: sequence numberings. Suppose you have the sequence CCAAAC. What is the subsequence range matching AAA? There are three common answers:

  • biological numbering - 3 to 5
    Position 1 is the first character in the sequence, and the range i to j starts at i and includes j.

    This is favored by biologists and most of the rest of the world. All of the common formats, including GenBank, SWISS-PROT, or PDB, use this range definition.

  • computational biology numbering - 3 to 6
    Position 1 is the first character in the sequence, and the range i to j starts at i and goes up to but does not include j.

    Once you start working with biological numbering for a bit you realize there are a lot of +1 and -1 terms. Forgetting the term creates a fencepost error in programming vernacular. For example, suppose a PROSITE pattern matches residues 5 to 10; how long is the match? It's 10-5+1==6 not 5 which is the first thought many have. Suppose one block of an alignment goes from 10 to 15 and the next goes from 20 to 25; how long is the gap? It's 20-15-1==4 and not 5.

    These fencepost errors are reduced by using a range definition which does not include the last number in the range. If the PROSITE goes from residues 5 to 10 then the range is given as 5 to 11, and the length is 11-5==6, which is the obvious answer. In the alignment example, the two blocks are given as (10,16) and (20, 26) and the gap size between them is 20-16==4.

    I called this computational biology numbering because I couldn't think of a better name. It's used in the Life Sciences CORBA spec and a few other projects. The people who use it tend to be scientists (chemists or biologists) who become software developers or programmers who are told they need to have sequence numbers start at 1.

  • computer science numbering - 2 to 5
    Position 0 is the first character in the sequence, and the range i to j starts at i and goes up to but does not include j.

    Some tricky problems remain in computational biology numbering. Suppose you have an alignment from a protein to a DNA sequence; no gaps, and the first residues of the protein is aligned to the first three bases in the DNA. What's the range for the three DNA bases which correspond to the fourth protein residue?

    In biological numbering, the protein is at position 4 and the DNA range is 10 to 12. The equation for the DNA start position is 3*n-2 and for the end position is 3*n. Not to hard to figure out, but it does take some thought.

    In computational biology numbering the range is 10 to 13, and the formulas are 3*n-2 and 3*n+1. Still requires a bit of thought, which increases the odds of making an error.

    The problem is because the numberings start from 1. Suppose it starts from 0. This is what most computer scientists and some mathematicians and physicists use, even to the point of labelling the first chapter in a book "Chapter 0." These are the sort of people who get a thrill that European elevators use 0 for the ground floor and are dismayed that US elevators use 1 instead.

    Using computer science numbering, the protein residue is at position 3 (because position 0 is the first character), and the corresponding DNA range is 9 to 12. The formula is 3*x to 3*(x+1) and very easy to remember.

    In fact, the equation 3*n+2 used for the start position of the other two numbering can be rewritten 3*(n-1)+1, which basically moves the protein's 1 numbering to 0 numbering, multiplies by 3, and moves back to the DNA's 1 numbering.

  • You can probably guess from my explanation that I prefer the computer science numbering, at least for the innards of a computer program. It makes the math so much easier, and since most software is written by non-biologists, it means I can call them directly instead of doing an extra conversion step. But when I need to count things manually, I agree with the biologists and count things starting at 1 and including the last number in the range.

    The right solution is to write software which uses computer science numbering internally does conversion from other numbering systems in the I/O layers -- including the GUIs for the biologists. Anything else causes problems.

    I was helping Bill with a simple programming task - convert from one alignment format to another. BLAST uses biological numbering. Biopython's BLAST parsing code keeps the numbers as it finds them. But the output format doesn't document which of the three numberings it uses. The only way to figure it out is to load the file into a viewer and do a manual check. It used the computational biology ordering.

    There's one more wrinkle, complementary sequences. One strand of DNA is the reverse complement of the others (so ATCG is the revese complement of CGAT). Suppose the protein is aligned to the negative strand DNA; how is the range on the negative strand represented? Are the start and end positions swapped? Are they in forward ordering but with a flag to say "these are on the reverse strand"? Are the positions recomputed from the other end (so x goes to N-x, where N is the length of the sequence)?

    Three numbering systems... Three different ways to handle reverse complements... There are nine different ways to represent ranges! No wonder it's so annoying. And it won't be going away any time soon.

Read: Sequence Numberings

Topic: PythonCard and Boa Constructor not ready yet Previous Topic   Next Topic Topic: Hello from my new server

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use