The Artima Developer Community
Sponsored Link

The C++ Source
Introducing the Catenator
by Adam Sanitt
September 30, 2005

<<  Page 2 of 3  >>

Advertisement

Implementation

As the examples show, most of the effort is in creating the concatenations of sub-strings. Doing this compactly and efficiently is the first half of the problem. We have seen that the na�ve solution does not work:
std::vector<std::string> * concatenate(
       const std::vector<std::string> & first,
       const std::vector<std::string> & second)
{
    result = new std::vector<std::string>;
    for (std::size_t i = 0; i < first.size(); ++i) {
        for (std::size_t j = 0; j < second.size(); ++j) {
            //blows up when vector gets too large
            result->push_back( first[i] + second[j] );
        }
    }
    return result;
}
Our mistake is to confuse interface and implementation. We want a nice, tidy vector-like object and so we flatten our list of strings into a nice, tidy vector. Unfortunately, this tramples over the inherent structure in the data. Instead, we should work with the structure—we will deal with interfaces later [4]. Many complex computing problems are reduced to simple computing problems by an appropriate choice of data structure—once the data structure is fixed; the code flows effortlessly [5]. The list of sub-strings is generated by search through a problem space—usually this will involve traversal of a tree-like structure. Let's take a look at a typical string that is constructed from the sub-strings resulting from this type of traversal:
This tree was assembled by taking a two-letter string ('AD') and putting another two-letter string inside it after position one ('BC') to make a four-letter string ('ABCD') . Then another two-letter string ('EI') was appended to make a six-letter string ('ABCDEI') and a two-letter string ('FG') was inserted after position five to make an eight-letter string ('ABCDEFGI') and finally a one-letter string ('H') was inserted after position seven to make a nine-letter string ('ABCDEFGHI'). The final insertion was accomplished by appending the one-letter string to the previous insertion. Although, for simplicity, we have only shown one possible string in each position, the Catenator actually contains a list of strings in each position—this example is not just a tree of five sub-strings, but a tree of five lists of sub-strings.

The data structure underneath the flat representation of the string is a directed acyclic graph, but it is not the usual type of tree. Note that it contains both 'horizontal arrows' representing appending (or prepending) of different lists of strings and 'vertical arrows' representing insertion. Also, there may be a number of insertions in a particular list of sub-strings, up to a maximum of one at each possible insertion point. Therefore, we should expect our algorithms to contain a combination of iteration as well as the recursion more normally associated with trees.

Information about where in the flat representation of the string each individual letter is located is not stored 'locally'. In other words, a particular list of sub-strings contained at some level in the hierarchy cannot know where in the ultimate string it will end up—it only knows where to put the sub-strings that it contains. The 'AD' segment knows that it contains another two-letter segment after the first letter, and so the letter 'D' will appear three slots after the letter 'A', but it does not know where in the ultimate string its complete four-letter section will be slotted. This is a necessary trade-off: it means that sub-trees are also complete, well-formed trees themselves and we can use the same techniques to manipulate parts of trees and entire trees. In other words, the 'EI' segment above is itself a five-letter Catenator consisting of a two-letter list with another three letters inserted at position one.

The resulting data structure is a sequence of lists of sub-strings, one after the other (represented as a vector of vectors of strings ordered by index), together with a map for each sequence showing where in the sub-string other trees are inserted (represented as a map of insertion indices to pointers to sub-trees):

typedef std::vector<value_type> vect_type;
typedef std::map<std::size_t, class_type *> cqmap_type;
typedef typename std::pair<vect_type *, cqmap_type *> cqpair_type;
std::vector<cqpair_type> c_; //private member variable
Creation of a Catenator now involves only filing the lists of strings in the appropriate place. The expense in time and space of iterating through the lists and combining them disappears [6]. Appending and prepending new lists is straightforward. Insertion of a new list of strings in an existing Catenator is slightly trickier—we must make sure that insertion at the end of a string is actually represented as appending to a string (and that insertion at the beginning is prepending) and that two insertions in the same place are combined into a single insertion. The insertion constructor takes two Catenators, one to be inserted inside the other, and the position of the insertion as arguments:
Catenator(const Catenator& cqfirst, const Catenator& cqsecond, std::size_t inpos) : s_(0), l_(0) 

To construct the new Catenator, we first advance through each constituent Catenator within the first Catenator until we reach the one into which the insertion is to be made, copying the constituents as we go along:

//copy prior to insert, and find insert point
std::size_t i = 0;
while (inpos > cqfirst.length(i)) {
    inpos -= cqfirst.length(i);
    cqfirst.copypart(this, i); 
}

Once this is completed, inpos will contain the position within the i'th Catenator in which to make the insertion. This is straightforward if the insertion occurs at the beginning:

if (inpos == 0) { // insert before next item
    for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
        cqsecond.copypart(this, k);
    }
    cqfirst.copypart(this, i);
}
Or at the end:
f (inpos == cqfirst.length(i)) { //insert after next item
    cqfirst.copypart(this, i);
    for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
         cqsecond.copypart(this, k);
    }
}
But if the insertion occurs in the middle of this Catenator, it might occur in the middle of an insertion within that Catenator. Also, we do not know how many characters are in the top level of this Catenator, and how many are buried in insertions. Just because the insertion is at character number five, for example, it does not mean that the required insertion index is five: there might be a three letter insertion at character one, so that an insertion at character five is accomplished with an index of two. And to cap it all, because there is only one insertion allowed at each index, we need to cater separately not only for insertions in the middle of other insertions, but also insertions at the beginning or end of other insertions. In short, all the following are possible:
  1. A brand new insertion;
  2. An insertion at the beginning of another insertion, which means prepending the new part to the existing part;
  3. An insertion at the end of another insertion, which means appending the new part to the existing part; or
  4. An insertion in the middle of another insertion, which means inserting the new part in the existing part (which we handle with a recursive call to the insertion constructor).

The solution is to step through each insertion until we reach the insertion index, and then take the appropriate action:

typename cqmap_type::iterator mpos = cqfirst.c_[i].second->begin();
while ((mpos != cqfirst.c_[i].second->end()) && (inpos >= mpos->first)) {
  if (inpos > mpos->first + mpos->second->length()) { //after end of this insertion
    inpos -= mpos->second->length();
    mpos++;
  } else if (inpos == mpos->first) { //at beginning of this insertion
    Catenator * prepcq = new Catenator(&cqsecond, mpos->second);
    cqfirst.copypartinsert(this, i, prepcq, mpos->first);
    inpos = 0;
  } else if (inpos == mpos->first + mpos->second->length()) { //at end of this insertion
    Catenator * prepcq = new Catenator(mpos->second, &cqsecond);
    cqfirst.copypartinsert(this, i, prepcq, mpos->first);
    inpos = 0;
  } else { //in middle of this insertion
    Catenator * newcq = new Catenator(*(mpos->second), cqsecond, inpos - mpos->first);
    cqfirst.copypartinsert(this, i, newcq, mpos->first);
    inpos = 0;
  }
}
if (inpos > 0) { //need brand new insertion
    cqfirst.copypartinsert(this, i, cqsecond.clone(), inpos);
}

Finally, the constituent Catenators following the insertion are copied into the new Catenator:

//copy items after insertion
for ( ++i ; i < cqfirst.c_.size(); ++i) {
  cqfirst.copypart(this, i);
}

Putting all these pieces together, we obtain the complete insertion constructor:

//constructor inserting cqsecond in cqfirst at position inpos
Catenator(const Catenator& cqfirst, const Catenator& cqsecond, std::size_t inpos) : s_(0), l_(0) 
{
    //copy prior to insert, and find insert point
    std::size_t i = 0;
    while (inpos > cqfirst.length(i)) {
        inpos -= cqfirst.length(i);
        cqfirst.copypart(this, i); 
        if (++i == cqfirst.c_.size())
            throw CatenatorException("Out of range in catenator insert");
    }
    //perform insertion
    if (inpos == 0) { // insert before next item
        for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
            cqsecond.copypart(this, k);
        }
        cqfirst.copypart(this, i);
    } else if (inpos == cqfirst.length(i)) { //insert after next item
        cqfirst.copypart(this, i);
        for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
            cqsecond.copypart(this, k);
        }
    } else { //insert in middle of next item
        typename cqmap_type::iterator mpos = cqfirst.c_[i].second->begin();
        while ((mpos != cqfirst.c_[i].second->end()) && (inpos >= mpos->first)) {
            //not before next insertion
            if (inpos > mpos->first + mpos->second->length()) {
                //after end of this insertion
                inpos -= mpos->second->length();
                mpos++;
            } else if (inpos == mpos->first) { //at beginning of this insertion
                Catenator * prepcq = new Catenator(&cqsecond, mpos->second);
                cqfirst.copypartinsert(this, i, prepcq, mpos->first);
                inpos = 0;
            } else if (inpos == mpos->first + mpos->second->length()) {
                //at end of this insertion
                Catenator * prepcq = new Catenator(mpos->second, &cqsecond);
                cqfirst.copypartinsert(this, i, prepcq, mpos->first);
                inpos = 0;
            } else {
                //in middle of this insertion
                Catenator * newcq = new Catenator(*(mpos->second), cqsecond, inpos - mpos->first);
                cqfirst.copypartinsert(this, i, newcq, mpos->first);
                inpos = 0;
            }
        }
        if (inpos > 0) { //need brand new insertion
            cqfirst.copypartinsert(this, i, cqsecond.clone(), inpos);
        }
    }
    //copy items after insertion
    for ( ++i ; i < cqfirst.c_.size(); ++i) {
        cqfirst.copypart(this, i);
    }
  }
The second half of the problem is to present an intuitive interface into this daunting data structure. Our intuition is to treat the structure as a vector: a flat list of strings [7]. The interface to the Catenator connives in this pretence. The indexing operator uses a combination of iteration, recursion and modular arithmetic to assemble the indexed string on demand:
value_type operator[](std::size_t ind) const {
    value_type t = traits_type::createempty();
    for (std::size_t i = 0; i < c_.size(); ++i) {
        std::size_t nextind = ind % c_[i].first->size();
        value_type newt((*c_[i].first)[nextind]);
        ind -= nextind;
        ind /= c_[i].first->size();
        if (c_[i].second) {
            typename cqmap_type::const_reverse_iterator it(c_[i].second->rbegin());
            while (!(it == (typename cqmap_type::const_reverse_iterator) c_[i].second->rend())) {
                std::size_t mind = ind % it->second->size();
                value_type mt = (*it->second)[mind];
                ind -= mind;
                ind /= it->second->size();
                newt = traits_type::insert(newt, mt, it->first);
                it++;
            }
        }
        t = traits_type::join(t, newt);
    }
    if (ind > 0) throw CatenatorException("Catenator buffer overflow.");
        return t;
}
Obviously, we incur the cost of concatenating sub-strings when we access an individual member of the Catenator—but only for the particular one that we need. The cost of iterating through every member of a Catenator using the array operator:
for (std::size_t j = 0; j < cq.size(); ++j) {
    std::cout << cq[j];
}
Is the same as the cost of creating the entire flat vector—plus a small additional constant cost for each instance [8]. However, it shouldn't crash your computer: the cost in space of storing the entire vector is still averted.

Together with the array operator, associated methods such as 'size' and 'length' are provided. Mutable private member variables are used to speed up these member functions without jeopardizing their const-ness.

As the examples demonstrate, we usually create a Catenator because we are looking for something in it—not because we are equally interested in every single member of the impossibly long sequence of strings it contains. The most elegant use of a Catenator is to create the entire list, search for something in it and extract all the useful parts—all without ever incurring the cost of creating or even iterating through its member strings. For this purpose, a search routine is provided to find all strings that match according to the criteria set out by the traits class. The public routine merely despatches the request to a private helper class:

t_type * mightmatch(value_type tofind) {
    if (length() != traits_type::length(tofind)) return 0;
    return findpart(tofind, tofind);
}
The findpart routine divides and conquers. It loops through the individual constituent Catenators. For each one, it assembles two sets of strings from the original search string: firstly, the part that corresponds to the constituent Catenator itself, and, secondly, the parts that correspond to each insertion within that Catenator. For instance, if the search string was "abcdef" and the constituent Catenator was two letters long with a four-letter insertion at position one, the top level string to match would be "af" and the single insertion string to match would be "bcde". The heavy lifting for this division is outsourced to the topmatcher routine:
vect_type * accum = new(vect_type);
vect_type * newaccum = new(vect_type);
for (std::size_t i = 0; i < c_.size(); ++i) { //for each top-level constituent T
    std::map<std::size_t, value_type> inserts;
    value_type confind = topmatcher(thisfind, i, inserts); 
Then we actually need to get the list of matches for these candidates. Once again, we rely on the good offices of two other routines, couldmatch and addinsertmatches, to parcel up the results into maybes:
vect_type * maybes = couldmatch(i, confind, allfind); //top level match only
maybes = addinsertmatches(maybes, inserts, i, allfind); //put insert matches into maybes
Now we just have to combine our results for this constituent Catenator with the previous constituents. This requires us to combine our list of matches so far with the current list of matches to create a new list which the next iteration of the loop will use as the list of matches so far. There are many ways to implement this using an extravagant amount of memory, cohorts of obscure pointers and pointer arithmetic or inefficient insertions at the beginning of sequences. The findpart routine simply uses two accumulating vectors which it then swaps as it goes along. The two vectors have to be allocated on the heap and explicitly deleted when no longer needed. That portion looks like this (full code appears further down):
    if (i == 0) {
        newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
    } else {
        for (std::size_t n = 0; n < accum->size(); ++n) {
            for (std::size_t m = 0; m < maybes->size(); ++m) {
                newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
            }
        }
    }
    delete maybes;
    swap(accum, newaccum);
    newaccum->clear();
}
delete newaccum;
return accum;
And finally, here is the whole findpart routine:
//
//top level helper routine for search
//
private:
vect_type * findpart(value_type thisfind, value_type allfind) {
    vect_type * accum = new(vect_type);
    vect_type * newaccum = new(vect_type);
    for (std::size_t i = 0; i < c_.size(); ++i) { //for each top-level constituent T
        std::map<std::size_t, value_type> inserts;
        value_type confind = topmatcher(thisfind, i, inserts); 
        //get top level to match and fill map of inserts to match
        vect_type * maybes = couldmatch(i, confind, allfind); //top level match only
        maybes = addinsertmatches(maybes, inserts, i, allfind); //put insert matches into maybes
        if (i == 0) {
            newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
        } else {
            for (std::size_t n = 0; n < accum->size(); ++n) {
                for (std::size_t m = 0; m < maybes->size(); ++m) {
                    newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
                }
            }
        }
        delete maybes;
        swap(accum, newaccum);
        newaccum->clear();
    }
    delete newaccum;
    return accum;
}    
The topmatcher routine runs into the familiar problem of working out where an insertion point is actually located within a particular Catenator. As usual, the solution is to step through the Catenator, keeping track of the index and slicing up the search string as we go along: adding the part before the insert to the top-level search string and the part within the insert to the map of insertion search strings. The different slices create a string for the top-level match and a string for matching each insertion:
//
//top level constituent finding helper routine
//
value_type topmatcher(value_type thisfind, std::size_t index, std::map<std::size_t, value_type>& inserts) {
    std::size_t offset = 0;
    for (std::size_t j = 0; j < index; ++j) {
        offset += length(j);
    }
    value_type confind = traits_type::createempty();
    std::size_t orig = 0;
    for (typename cqmap_type::const_iterator it = c_[index].second->begin();
      it != c_[index].second->end(); ++it) {
        //for each insert in that constituent
        std::size_t gap = it->first - orig;
        confind = traits_type::join(confind, traits_type::substr(thisfind, offset, gap));
        //add next bit to match string
        inserts[it->first] = traits_type::substr(thisfind, offset + gap,
                                it->second->length()); //get matching bit for insert
      offset += gap + it->second->length(); //advance index into match string past insert
      orig = it->first;
    }
    confind = traits_type::join(confind, traits_type::substr(thisfind, offset, 
                                traits_type::length((*(c_[index].first))[0]) - orig)); //add final match part
    return confind;
}
Finally, addinsertmatches takes the top-level search string and the list of inserts and finds all the possible matches, which it then combines into a single list. It calls findpart recursively to discover the list of matching strings for each insert. To avoid painful index juggling when combining the list of results with the lists of matching insertions, it uses reverse_iterator rather than iterator, adding each list of insertions backwards starting at the end. Once again, it uses heap allocation and swap to handle accumulation of lists within a loop:
//
//add insert matches to top level matches, helper routine
//
vect_type * addinsertmatches(vect_type * maybes, std::map<std::size_t, value_type>& inserts,
                             std::size_t index, value_type allfind) {
    for (typename std::map<std::size_t, value_type>::reverse_iterator mrit =
      inserts.rbegin(); mrit != inserts.rend(); ++mrit) { //combine with matches for inserts
        vect_type * maybeaccum = new(vect_type);
        vect_type * nextins = 
          (*(c_[index].second))[mrit->first]->findpart(mrit->second, allfind);
        for (std::size_t b = 0; b < maybes->size(); ++b) {
            for (std::size_t c = 0; c < nextins->size(); ++c) {
                maybeaccum->push_back(traits_type::insert((*maybes)[b], (*nextins)[c], mrit->first)); 
            }
      }
      delete nextins;
      swap(maybeaccum, maybes);
      delete maybeaccum;
    }
    return maybes;
}
Finally, here is the complete picture:
//
//interface routine, dispatches to private helper
//parameter is concatenation of individual T units 
//that could include wildcards depending on 'match' 
//traits routine
//
public:
vect_t * mightmatch(value_t tofind) {
    if (length() != traits_t::length(tofind)) return 0;
    return findpart(tofind, tofind);
}
//
//top level helper routine for search
//
private:
vect_type * findpart(value_type thisfind, value_type allfind) {
    vect_type * accum = new(vect_type);
    vect_type * newaccum = new(vect_type);
    for (std::size_t i = 0; i < c_.size(); ++i) { //for each top-level constituent T
        std::map<std::size_t, value_type> inserts;
        value_type confind = topmatcher(thisfind, i, inserts); 
        //get top level to match and fill map of inserts to match
        vect_type * maybes = couldmatch(i, confind, allfind); //top level match only
        maybes = addinsertmatches(maybes, inserts, i, allfind); //put insert matches into maybes
        if (i == 0) {
            newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
        } else {
            for (std::size_t n = 0; n < accum->size(); ++n) {
                for (std::size_t m = 0; m < maybes->size(); ++m) {
                    newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
                }
            }
        }
        delete maybes;
        swap(accum, newaccum);
        newaccum->clear();
    }
    delete newaccum;
    return accum;
}    
//
//top level constituent finding helper routine
//
value_type topmatcher(value_type thisfind, std::size_t index, std::map<std::size_t,
                      value_type>& inserts) {
    std::size_t offset = 0;
    for (std::size_t j = 0; j < index; ++j) {
        offset += length(j);
    }
    value_type confind = traits_type::createempty();
    std::size_t orig = 0;
    for (typename cqmap_type::const_iterator it = c_[index].second->begin(); 
      it != c_[index].second->end(); ++it) { //for each insert in that constituent
        std::size_t gap = it->first - orig;
        confind = traits_type::join(confind, traits_type::substr(thisfind, offset, gap));
        //add next bit to match string
        inserts[it->first] = traits_type::substr(thisfind, offset + gap,
                                it->second->length()); //get matching bit for insert
        offset += gap + it->second->length(); //advance index into match string past insert
        orig = it->first;
    }
    confind = traits_type::join( confind, traits_type::substr(thisfind, offset, 
              traits_type::length((*(c_[index].first))[0]) - orig)); //add final match part
    return confind;
}
//
//add insert matches to top level matches, helper routine
//
vect_type * addinsertmatches(vect_type * maybes, std::map<std::size_t, value_type>& inserts,
                                std::size_t index, value_type allfind) {
    for (typename std::map<std::size_t, value_type>::reverse_iterator mrit =
      inserts.rbegin(); mrit != inserts.rend(); ++mrit) { //combine with matches for inserts
        vect_type * maybeaccum = new(vect_type);
        vect_type * nextins = 
          (*(c_[index].second))[mrit->first]->findpart(mrit->second, allfind);
        for (std::size_t b = 0; b < maybes->size(); ++b) {
            for (std::size_t c = 0; c < nextins->size(); ++c) {
                maybeaccum->push_back(traits_type::insert((*maybes)[b], (*nextins)[c], mrit->first)); 
            }
        }
        delete nextins;
        swap(maybeaccum, maybes);
        delete maybeaccum;
    }
    return maybes;
}

All these gymnastics are fortunately concealed within the private implementation of the Catenator. The user has a container that looks and feels like a vector, but handles the combinations of sub-strings without complaining. Problem solved; everyone can go home.

<<  Page 2 of 3  >>


Sponsored Links



Google
  Web Artima.com   
Copyright © 1996-2014 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use - Advertise with Us