The Artima Developer Community
The C++ Source | C++ Community News | Discuss | Print | Email | First Page | Previous | Next
Sponsored Link

The C++ Source
Introducing the Catenator
by Adam Sanitt
September 30, 2005

Page 1 of 3  >>

Advertisement

Summary
In this article Adam introduces a very sophisticated and useful data structure for efficient string processing, while at the same time revealing some interesting features of C++.
This article presents a new container for arbitrary lists of strings that are made from tree-like combinations of lists of shorter strings. The new container provides a highly compact representation allowing efficient searching through the list. It also raises intriguing questions about the philosophy of C++, the values of semantic and structural conformance and the uses of policy and trait clases. The full source code for the Catenator can be found here.

Introductory Dream Sequence

Your C++ crossword-solving code trembles on the cusp of perfection. It wends its way through a complex search tree designed to unearth the correct twelve-letter solution. It first emerges with a list of 1000 four-letter strings, each of which might be the first four letters of the solution. It then retrieves a further list of 1000 three-letter strings—candidates for the last three letters of the solution. You combine these two lists with another list of 1000 five-letter strings to be inserted in the middle of the existing candidates. You press the button to find matching strings and.....Boom! Twisted shards of beige plastic lie scattered around the room; a thin lick of flame rises from the ghastly remnants of your computer case and an acrid smell of ozone fills the room. Or at least, that's what it feels like. You have fallen victim to a combinatorial explosion. vector<string> is a stalwart ally, but it wilts in the face of a list of one billion twelve-letter strings. Is there any way to deal effectively with a search space of this size? And can we maintain the intuitive semantics of the STL containers while dealing with a list of one billion strings?

We need an entity that presents a container-like interface but is capable of handling arbitrarily large lists of strings made up of combinations of smaller lists [1]. And it has to do it quickly and without using inordinate amounts of memory. And the lists can be added at the beginning or the end or, in fact, anywhere in the middle. And, just in case that's too easy, let's templatize it on the type of entity contained and any traits and policy we can think of.

The Catenator in Action

The Catenator is designed to cope with frighteningly large numbers of sequences, so any toy example is, almost by definition, otiose. With that caveat in mind, here are two examples of the Catenator in action. The first is a programme to solve cryptic crosswords. Afficionados will know that cryptic crosswords, which involve puns, word play, anagrams and manipulation of letters have traditionally been seen as beyond the powers of computers. Assuming that we are either too bold or too stupid to worry about that, let us look at a typical clue:
Voice contains Spring growth (8)
Contrary to surface appearance, this clue instructs us to put a synonym of 'spring' inside a synonym of 'voice' to produce a synonym of 'growth'. A computer programme might generate a list of suitable synonyms of 'voice' and 'spring' and then combine them into a longer list of all possible combinations. For instance, here are the 18 four-letter synonyms of 'voice' produced by the programme's synonym generator:
aver, call, chat, hint, jive, roar, sing, show, song, talk, tell, tone, vent, view, will, wish, word, yell
And here are the 21 four-letter synonyms of 'spring':
bolt, buck, burn, come, flow, flee, free, grow, gush, head, jump, leap, loom, lope, rill, rise, root, skip, stem, trip, well
These can be combined into three separate lists of eight-letter sequences (one each for insertion after the first, second and third letter of the synonym of 'voice'). Even this toy example produces a total of 11340 candidate eight-letter strings. What we want to know is whether any item in these lists matches any of our 15 eight-letter synonyms for 'growth':
addition, amassing, boosting, dilation, dividend, earnings, hoarding, increase, maturing, offshoot, parasite, progress, spurting, swelling, uprising
If so, we will have solved the clue. The following code implements this clue-solving algorithm:
std::vector<std::string> * asyns = synonymizer.get_by_len("spring", 4);
std::vector<std::string> * bsyns = synonymizer.get_by_len("voice", 4);
std::vector<std::string> * csyns = synonymizer.get_by_len("growth", 8);

typedef Catenator<std::string> StringCat_type;

std::vector<StringCat_type *> Catenatorlist;
StringCat_type Catenatorb(bsyns);
StringCat_type Catenatora(asyns);

for (std::size_t i = 1; i < 4 ; ++i) {
    Catenatorlist.push_back( new StringCat_type(&Catenatorb, &Catenatora, i) );
}

for (std::size_t j  = 0; j < csyns.size(); ++j) {
    for (std::size_t k = 0; k < Catenatorlist.size(); ++k) {
        std::vector<std::string> * solutions = Catenatorlist[k]->match((*csyns)[j]);
        if (!solutions->empty()) {
            std::cout << "Solution to clue is: " << (*solutions)[0] << "\n";
        }
        delete solutions;
    }
}  
The programme should print the following [2]: Solution to clue is: swelling The second example is a programme to explore gene sequencing. Amino acids are represented by three-letter sequences of the letters 'A', 'C', 'G' and 'T'. Combinations of these letter sequences make different genes which encode different proteins. Genes can be spliced and new sequences inserted in the middle, or joined together end-to-end. A programme might start with a list of letter sequences which are combined using a certain procedure (or at random) to form a large number of long letter sequences, each of which is a gene. This might be a cheap way to simulate the production of genes using conventional experimentation or an exploration of presumed activity 'in the wild'. The experimenter would then like to know whether particular proteins are encoded by these genes: that is, whether a particular sequence of letters appears in a particular place in any of these genes. This could be implemented using Catenators as follows:
std::size_t max_depth = 20
std::size_t sample_size = 10

typedef Catenator<std::string, WildCardStringTraits> CatenatorGene_type;

std::vector<std::string> amino_acids;
amino_acids.push_back("ACT")
amino_acids.push_back("GAT")

... etc ...

CatenatorGene_type * gene = new( CatenatorGene_type );

for (std::size_t i = 0; i < max_depth; ++i) {
    CatenatorGene_type * swapgene;
    std::vector<std::string> sample = select_random_sample(amino_acids.begin(), amino_acids.end(), sample_size)
    CatenatorGene_type catsample(sample);
    switch (i % 3 ) { // alternately append, prepend and insert
    case 0:
        swapgene = new CatenatorGene_type(gene, &catsample);
        break;
    case 1:
        swapgene = new CatenatorGene_type(&catsample, gene);
        break;
    case 2:
        swapgene = new CatenatorGene_type(gene, &catsample, get_random_insert(sample_size));
        break;
    } 
    swap( swapgene, gene);
    delete swapgene;
}

std::vector<std::string> * matches = gene->match("???????????ACAATTGGTATG???? ... etc ... ?????");
for (std::size_t j = 0; j < matches->size(); ++j) {
    std::cout << "Matching gene: " << (*matches)[j] << "\n";
}
Even this toy example features a list of 1020 genes, each of which is 60 characters long. However, the size of this list represented as a Catenator will be approximately 600 characters and, crucially, the matching algorithm performs in line with the 600 character size, not the 60 * 1020 list size (although as you add more wild cards, performance steadily degrades) [3].

Page 1 of 3  >>

The C++ Source | C++ Community News | Discuss | Print | Email | First Page | Previous | Next

Sponsored Links



Google
  Web Artima.com   
Copyright © 1996-2014 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use - Advertise with Us