|
|
|
Sponsored Link •
|
|
Advertisement
|
vector<string> is a
stalwart ally, but it wilts in the face of a list of one billion twelve-letter strings. Is there any way to deal effectively
with a search space of this size? And can we maintain the intuitive semantics of the STL containers while dealing
with a list of one billion strings?
We need an entity that presents a container-like interface but is capable of handling arbitrarily large lists of strings made up of combinations of smaller lists [1]. And it has to do it quickly and without using inordinate amounts of memory. And the lists can be added at the beginning or the end or, in fact, anywhere in the middle. And, just in case that's too easy, let's templatize it on the type of entity contained and any traits and policy we can think of.
Voice contains Spring growth (8)Contrary to surface appearance, this clue instructs us to put a synonym of 'spring' inside a synonym of 'voice' to produce a synonym of 'growth'. A computer programme might generate a list of suitable synonyms of 'voice' and 'spring' and then combine them into a longer list of all possible combinations. For instance, here are the 18 four-letter synonyms of 'voice' produced by the programme's synonym generator:
aver, call, chat, hint, jive, roar, sing, show, song, talk, tell, tone, vent, view, will, wish, word, yellAnd here are the 21 four-letter synonyms of 'spring':
bolt, buck, burn, come, flow, flee, free, grow, gush, head, jump, leap, loom, lope, rill, rise, root, skip, stem, trip, wellThese can be combined into three separate lists of eight-letter sequences (one each for insertion after the first, second and third letter of the synonym of 'voice'). Even this toy example produces a total of 11340 candidate eight-letter strings. What we want to know is whether any item in these lists matches any of our 15 eight-letter synonyms for 'growth':
addition, amassing, boosting, dilation, dividend, earnings, hoarding, increase, maturing, offshoot, parasite, progress, spurting, swelling, uprisingIf so, we will have solved the clue. The following code implements this clue-solving algorithm:
std::vector<std::string> * asyns = synonymizer.get_by_len("spring", 4);
std::vector<std::string> * bsyns = synonymizer.get_by_len("voice", 4);
std::vector<std::string> * csyns = synonymizer.get_by_len("growth", 8);
typedef Catenator<std::string> StringCat_type;
std::vector<StringCat_type *> Catenatorlist;
StringCat_type Catenatorb(bsyns);
StringCat_type Catenatora(asyns);
for (std::size_t i = 1; i < 4 ; ++i) {
Catenatorlist.push_back( new StringCat_type(&Catenatorb, &Catenatora, i) );
}
for (std::size_t j = 0; j < csyns.size(); ++j) {
for (std::size_t k = 0; k < Catenatorlist.size(); ++k) {
std::vector<std::string> * solutions = Catenatorlist[k]->match((*csyns)[j]);
if (!solutions->empty()) {
std::cout << "Solution to clue is: " << (*solutions)[0] << "\n";
}
delete solutions;
}
}
The programme should print the following [2]:
Solution to clue is: swelling
The second example is a programme to explore gene sequencing. Amino acids are represented by three-letter
sequences of the letters 'A', 'C', 'G' and 'T'. Combinations of these letter sequences make different genes which
encode different proteins. Genes can be spliced and new sequences inserted in the middle, or joined together
end-to-end. A programme might start with a list of letter sequences which are combined using a certain procedure
(or at random) to form a large number of long letter sequences, each of which is a gene. This might be a cheap
way to simulate the production of genes using conventional experimentation or an exploration of presumed activity
'in the wild'. The experimenter would then like to know whether particular proteins are encoded by these genes:
that is, whether a particular sequence of letters appears in a particular place in any of these genes. This could be
implemented using Catenators as follows:
std::size_t max_depth = 20
std::size_t sample_size = 10
typedef Catenator<std::string, WildCardStringTraits> CatenatorGene_type;
std::vector<std::string> amino_acids;
amino_acids.push_back("ACT")
amino_acids.push_back("GAT")
... etc ...
CatenatorGene_type * gene = new( CatenatorGene_type );
for (std::size_t i = 0; i < max_depth; ++i) {
CatenatorGene_type * swapgene;
std::vector<std::string> sample = select_random_sample(amino_acids.begin(), amino_acids.end(), sample_size)
CatenatorGene_type catsample(sample);
switch (i % 3 ) { // alternately append, prepend and insert
case 0:
swapgene = new CatenatorGene_type(gene, &catsample);
break;
case 1:
swapgene = new CatenatorGene_type(&catsample, gene);
break;
case 2:
swapgene = new CatenatorGene_type(gene, &catsample, get_random_insert(sample_size));
break;
}
swap( swapgene, gene);
delete swapgene;
}
std::vector<std::string> * matches = gene->match("???????????ACAATTGGTATG???? ... etc ... ?????");
for (std::size_t j = 0; j < matches->size(); ++j) {
std::cout << "Matching gene: " << (*matches)[j] << "\n";
}
Even this toy example features a list of 1020 genes, each of which is 60 characters long. However, the
size of this list represented as a Catenator will be approximately 600 characters and, crucially, the matching
algorithm performs in line with the 600 character size, not the 60 * 1020 list size (although as you add
more wild cards, performance steadily degrades) [3].
|
Sponsored Links
|