|
|
|
Sponsored Link •
|
|
Advertisement
|
std::vector<std::string> * concatenate(
const std::vector<std::string> & first,
const std::vector<std::string> & second)
{
result = new std::vector<std::string>;
for (std::size_t i = 0; i < first.size(); ++i) {
for (std::size_t j = 0; j < second.size(); ++j) {
//blows up when vector gets too large
result->push_back( first[i] + second[j] );
}
}
return result;
}
Our mistake is to confuse interface and implementation. We want a nice, tidy vector-like object and so we flatten
our list of strings into a nice, tidy vector. Unfortunately, this tramples over the inherent structure in the data.
Instead, we should work with the structure—we will deal with interfaces later [4]. Many complex computing
problems are reduced to simple computing problems by an appropriate choice of data structure—once the
data structure is fixed; the code flows effortlessly [5]. The list of sub-strings is generated by search through a
problem space—usually this will involve traversal of a tree-like structure. Let's take a look at a typical string
that is constructed from the sub-strings resulting from this type of traversal:
The data structure underneath the flat representation of the string is a directed acyclic graph, but it is not the usual type of tree. Note that it contains both 'horizontal arrows' representing appending (or prepending) of different lists of strings and 'vertical arrows' representing insertion. Also, there may be a number of insertions in a particular list of sub-strings, up to a maximum of one at each possible insertion point. Therefore, we should expect our algorithms to contain a combination of iteration as well as the recursion more normally associated with trees.
Information about where in the flat representation of the string each individual letter is located is not stored 'locally'. In other words, a particular list of sub-strings contained at some level in the hierarchy cannot know where in the ultimate string it will end up—it only knows where to put the sub-strings that it contains. The 'AD' segment knows that it contains another two-letter segment after the first letter, and so the letter 'D' will appear three slots after the letter 'A', but it does not know where in the ultimate string its complete four-letter section will be slotted. This is a necessary trade-off: it means that sub-trees are also complete, well-formed trees themselves and we can use the same techniques to manipulate parts of trees and entire trees. In other words, the 'EI' segment above is itself a five-letter Catenator consisting of a two-letter list with another three letters inserted at position one.
The resulting data structure is a sequence of lists of sub-strings, one after the other (represented as a vector of vectors of strings ordered by index), together with a map for each sequence showing where in the sub-string other trees are inserted (represented as a map of insertion indices to pointers to sub-trees):
typedef std::vector<value_type> vect_type; typedef std::map<std::size_t, class_type *> cqmap_type; typedef typename std::pair<vect_type *, cqmap_type *> cqpair_type; std::vector<cqpair_type> c_; //private member variableCreation of a Catenator now involves only filing the lists of strings in the appropriate place. The expense in time and space of iterating through the lists and combining them disappears [6]. Appending and prepending new lists is straightforward. Insertion of a new list of strings in an existing Catenator is slightly trickier—we must make sure that insertion at the end of a string is actually represented as appending to a string (and that insertion at the beginning is prepending) and that two insertions in the same place are combined into a single insertion. The insertion constructor takes two Catenators, one to be inserted inside the other, and the position of the insertion as arguments:
Catenator(const Catenator& cqfirst, const Catenator& cqsecond, std::size_t inpos) : s_(0), l_(0)
To construct the new Catenator, we first advance through each constituent Catenator within the first Catenator until we reach the one into which the insertion is to be made, copying the constituents as we go along:
//copy prior to insert, and find insert point
std::size_t i = 0;
while (inpos > cqfirst.length(i)) {
inpos -= cqfirst.length(i);
cqfirst.copypart(this, i);
}
Once this is completed, inpos will contain the position within the i'th
Catenator in which to make the insertion. This is straightforward if the insertion occurs at the beginning:
if (inpos == 0) { // insert before next item
for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
cqsecond.copypart(this, k);
}
cqfirst.copypart(this, i);
}
Or at the end:
f (inpos == cqfirst.length(i)) { //insert after next item
cqfirst.copypart(this, i);
for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
cqsecond.copypart(this, k);
}
}
But if the insertion occurs in the middle of this Catenator, it might occur in the middle of an insertion within that
Catenator. Also, we do not know how many characters are in the top level of this Catenator, and how many are
buried in insertions. Just because the insertion is at character number five, for example, it does not mean that the
required insertion index is five: there might be a three letter insertion at character one, so that an insertion at
character five is accomplished with an index of two. And to cap it all, because there is only one insertion allowed
at each index, we need to cater separately not only for insertions in the middle of other insertions, but also
insertions at the beginning or end of other insertions. In short, all the following are possible:
The solution is to step through each insertion until we reach the insertion index, and then take the appropriate action:
typename cqmap_type::iterator mpos = cqfirst.c_[i].second->begin();
while ((mpos != cqfirst.c_[i].second->end()) && (inpos >= mpos->first)) {
if (inpos > mpos->first + mpos->second->length()) { //after end of this insertion
inpos -= mpos->second->length();
mpos++;
} else if (inpos == mpos->first) { //at beginning of this insertion
Catenator * prepcq = new Catenator(&cqsecond, mpos->second);
cqfirst.copypartinsert(this, i, prepcq, mpos->first);
inpos = 0;
} else if (inpos == mpos->first + mpos->second->length()) { //at end of this insertion
Catenator * prepcq = new Catenator(mpos->second, &cqsecond);
cqfirst.copypartinsert(this, i, prepcq, mpos->first);
inpos = 0;
} else { //in middle of this insertion
Catenator * newcq = new Catenator(*(mpos->second), cqsecond, inpos - mpos->first);
cqfirst.copypartinsert(this, i, newcq, mpos->first);
inpos = 0;
}
}
if (inpos > 0) { //need brand new insertion
cqfirst.copypartinsert(this, i, cqsecond.clone(), inpos);
}
Finally, the constituent Catenators following the insertion are copied into the new Catenator:
//copy items after insertion
for ( ++i ; i < cqfirst.c_.size(); ++i) {
cqfirst.copypart(this, i);
}
Putting all these pieces together, we obtain the complete insertion constructor:
//constructor inserting cqsecond in cqfirst at position inpos
Catenator(const Catenator& cqfirst, const Catenator& cqsecond, std::size_t inpos) : s_(0), l_(0)
{
//copy prior to insert, and find insert point
std::size_t i = 0;
while (inpos > cqfirst.length(i)) {
inpos -= cqfirst.length(i);
cqfirst.copypart(this, i);
if (++i == cqfirst.c_.size())
throw CatenatorException("Out of range in catenator insert");
}
//perform insertion
if (inpos == 0) { // insert before next item
for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
cqsecond.copypart(this, k);
}
cqfirst.copypart(this, i);
} else if (inpos == cqfirst.length(i)) { //insert after next item
cqfirst.copypart(this, i);
for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
cqsecond.copypart(this, k);
}
} else { //insert in middle of next item
typename cqmap_type::iterator mpos = cqfirst.c_[i].second->begin();
while ((mpos != cqfirst.c_[i].second->end()) && (inpos >= mpos->first)) {
//not before next insertion
if (inpos > mpos->first + mpos->second->length()) {
//after end of this insertion
inpos -= mpos->second->length();
mpos++;
} else if (inpos == mpos->first) { //at beginning of this insertion
Catenator * prepcq = new Catenator(&cqsecond, mpos->second);
cqfirst.copypartinsert(this, i, prepcq, mpos->first);
inpos = 0;
} else if (inpos == mpos->first + mpos->second->length()) {
//at end of this insertion
Catenator * prepcq = new Catenator(mpos->second, &cqsecond);
cqfirst.copypartinsert(this, i, prepcq, mpos->first);
inpos = 0;
} else {
//in middle of this insertion
Catenator * newcq = new Catenator(*(mpos->second), cqsecond, inpos - mpos->first);
cqfirst.copypartinsert(this, i, newcq, mpos->first);
inpos = 0;
}
}
if (inpos > 0) { //need brand new insertion
cqfirst.copypartinsert(this, i, cqsecond.clone(), inpos);
}
}
//copy items after insertion
for ( ++i ; i < cqfirst.c_.size(); ++i) {
cqfirst.copypart(this, i);
}
}
The second half of the problem is to present an intuitive interface into this daunting data structure. Our intuition is
to treat the structure as a vector: a flat list of strings [7]. The interface to the Catenator connives in this pretence.
The indexing operator uses a combination of iteration, recursion and modular arithmetic to assemble the
indexed string on demand:
value_type operator[](std::size_t ind) const {
value_type t = traits_type::createempty();
for (std::size_t i = 0; i < c_.size(); ++i) {
std::size_t nextind = ind % c_[i].first->size();
value_type newt((*c_[i].first)[nextind]);
ind -= nextind;
ind /= c_[i].first->size();
if (c_[i].second) {
typename cqmap_type::const_reverse_iterator it(c_[i].second->rbegin());
while (!(it == (typename cqmap_type::const_reverse_iterator) c_[i].second->rend())) {
std::size_t mind = ind % it->second->size();
value_type mt = (*it->second)[mind];
ind -= mind;
ind /= it->second->size();
newt = traits_type::insert(newt, mt, it->first);
it++;
}
}
t = traits_type::join(t, newt);
}
if (ind > 0) throw CatenatorException("Catenator buffer overflow.");
return t;
}
Obviously, we incur the cost of concatenating sub-strings when we access an individual member of the
Catenator—but only for the particular one that we need. The cost of iterating through every member of a
Catenator using the array operator:
for (std::size_t j = 0; j < cq.size(); ++j) {
std::cout << cq[j];
}
Is the same as the cost of creating the entire flat vector—plus a small additional constant cost for each
instance [8]. However, it shouldn't crash your computer: the cost in space of storing the entire vector is still
averted.
Together with the array operator, associated methods such as 'size' and 'length' are provided. Mutable private member variables are used to speed up these member functions without jeopardizing their const-ness.
As the examples demonstrate, we usually create a Catenator because we are looking for something in it—not because we are equally interested in every single member of the impossibly long sequence of strings it contains. The most elegant use of a Catenator is to create the entire list, search for something in it and extract all the useful parts—all without ever incurring the cost of creating or even iterating through its member strings. For this purpose, a search routine is provided to find all strings that match according to the criteria set out by the traits class. The public routine merely despatches the request to a private helper class:
t_type * mightmatch(value_type tofind) {
if (length() != traits_type::length(tofind)) return 0;
return findpart(tofind, tofind);
}
The findpart routine divides and conquers. It loops through the individual constituent Catenators.
For each one, it assembles two sets of strings from the original search string: firstly, the part that corresponds to
the constituent Catenator itself, and, secondly, the parts that correspond to each insertion within that Catenator.
For instance, if the search string was "abcdef" and the constituent Catenator was two letters long with a four-letter
insertion at position one, the top level string to match would be "af" and the single insertion string to match would
be "bcde". The heavy lifting for this division is outsourced to the topmatcher routine:
vect_type * accum = new(vect_type);
vect_type * newaccum = new(vect_type);
for (std::size_t i = 0; i < c_.size(); ++i) { //for each top-level constituent T
std::map<std::size_t, value_type> inserts;
value_type confind = topmatcher(thisfind, i, inserts);
Then we actually need to get the list of matches for these candidates. Once again, we rely on the good offices of
two other routines, couldmatch and addinsertmatches, to parcel up the results into maybes:
vect_type * maybes = couldmatch(i, confind, allfind); //top level match only maybes = addinsertmatches(maybes, inserts, i, allfind); //put insert matches into maybesNow we just have to combine our results for this constituent Catenator with the previous constituents. This requires us to combine our list of matches so far with the current list of matches to create a new list which the next iteration of the loop will use as the list of matches so far. There are many ways to implement this using an extravagant amount of memory, cohorts of obscure pointers and pointer arithmetic or inefficient insertions at the beginning of sequences. The
findpart routine simply uses two accumulating vectors which it then
swaps as it goes along. The two vectors have to be allocated on the heap and explicitly deleted when no longer
needed. That portion looks like this (full code appears further down):
if (i == 0) {
newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
} else {
for (std::size_t n = 0; n < accum->size(); ++n) {
for (std::size_t m = 0; m < maybes->size(); ++m) {
newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
}
}
}
delete maybes;
swap(accum, newaccum);
newaccum->clear();
}
delete newaccum;
return accum;
And finally, here is the whole findpart routine:
//
//top level helper routine for search
//
private:
vect_type * findpart(value_type thisfind, value_type allfind) {
vect_type * accum = new(vect_type);
vect_type * newaccum = new(vect_type);
for (std::size_t i = 0; i < c_.size(); ++i) { //for each top-level constituent T
std::map<std::size_t, value_type> inserts;
value_type confind = topmatcher(thisfind, i, inserts);
//get top level to match and fill map of inserts to match
vect_type * maybes = couldmatch(i, confind, allfind); //top level match only
maybes = addinsertmatches(maybes, inserts, i, allfind); //put insert matches into maybes
if (i == 0) {
newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
} else {
for (std::size_t n = 0; n < accum->size(); ++n) {
for (std::size_t m = 0; m < maybes->size(); ++m) {
newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
}
}
}
delete maybes;
swap(accum, newaccum);
newaccum->clear();
}
delete newaccum;
return accum;
}
The topmatcher routine runs into the familiar problem of working out where an insertion point is
actually located within a particular Catenator. As usual, the solution is to step through the Catenator, keeping
track of the index and slicing up the search string as we go along: adding the part before the insert to the top-level
search string and the part within the insert to the map of insertion search strings. The different slices create a
string for the top-level match and a string for matching each insertion:
//
//top level constituent finding helper routine
//
value_type topmatcher(value_type thisfind, std::size_t index, std::map<std::size_t, value_type>& inserts) {
std::size_t offset = 0;
for (std::size_t j = 0; j < index; ++j) {
offset += length(j);
}
value_type confind = traits_type::createempty();
std::size_t orig = 0;
for (typename cqmap_type::const_iterator it = c_[index].second->begin();
it != c_[index].second->end(); ++it) {
//for each insert in that constituent
std::size_t gap = it->first - orig;
confind = traits_type::join(confind, traits_type::substr(thisfind, offset, gap));
//add next bit to match string
inserts[it->first] = traits_type::substr(thisfind, offset + gap,
it->second->length()); //get matching bit for insert
offset += gap + it->second->length(); //advance index into match string past insert
orig = it->first;
}
confind = traits_type::join(confind, traits_type::substr(thisfind, offset,
traits_type::length((*(c_[index].first))[0]) - orig)); //add final match part
return confind;
}
Finally, addinsertmatches takes the top-level search string and the list of inserts and finds all the
possible matches, which it then combines into a single list. It calls findpart recursively to discover
the list of matching strings for each insert. To avoid painful index juggling when combining the list of results with
the lists of matching insertions, it uses reverse_iterator rather than iterator,
adding each list of insertions backwards starting at the end. Once again, it uses heap allocation and
swap to handle accumulation of lists within a loop:
//
//add insert matches to top level matches, helper routine
//
vect_type * addinsertmatches(vect_type * maybes, std::map<std::size_t, value_type>& inserts,
std::size_t index, value_type allfind) {
for (typename std::map<std::size_t, value_type>::reverse_iterator mrit =
inserts.rbegin(); mrit != inserts.rend(); ++mrit) { //combine with matches for inserts
vect_type * maybeaccum = new(vect_type);
vect_type * nextins =
(*(c_[index].second))[mrit->first]->findpart(mrit->second, allfind);
for (std::size_t b = 0; b < maybes->size(); ++b) {
for (std::size_t c = 0; c < nextins->size(); ++c) {
maybeaccum->push_back(traits_type::insert((*maybes)[b], (*nextins)[c], mrit->first));
}
}
delete nextins;
swap(maybeaccum, maybes);
delete maybeaccum;
}
return maybes;
}
Finally, here is the complete picture:
//
//interface routine, dispatches to private helper
//parameter is concatenation of individual T units
//that could include wildcards depending on 'match'
//traits routine
//
public:
vect_t * mightmatch(value_t tofind) {
if (length() != traits_t::length(tofind)) return 0;
return findpart(tofind, tofind);
}
//
//top level helper routine for search
//
private:
vect_type * findpart(value_type thisfind, value_type allfind) {
vect_type * accum = new(vect_type);
vect_type * newaccum = new(vect_type);
for (std::size_t i = 0; i < c_.size(); ++i) { //for each top-level constituent T
std::map<std::size_t, value_type> inserts;
value_type confind = topmatcher(thisfind, i, inserts);
//get top level to match and fill map of inserts to match
vect_type * maybes = couldmatch(i, confind, allfind); //top level match only
maybes = addinsertmatches(maybes, inserts, i, allfind); //put insert matches into maybes
if (i == 0) {
newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
} else {
for (std::size_t n = 0; n < accum->size(); ++n) {
for (std::size_t m = 0; m < maybes->size(); ++m) {
newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
}
}
}
delete maybes;
swap(accum, newaccum);
newaccum->clear();
}
delete newaccum;
return accum;
}
//
//top level constituent finding helper routine
//
value_type topmatcher(value_type thisfind, std::size_t index, std::map<std::size_t,
value_type>& inserts) {
std::size_t offset = 0;
for (std::size_t j = 0; j < index; ++j) {
offset += length(j);
}
value_type confind = traits_type::createempty();
std::size_t orig = 0;
for (typename cqmap_type::const_iterator it = c_[index].second->begin();
it != c_[index].second->end(); ++it) { //for each insert in that constituent
std::size_t gap = it->first - orig;
confind = traits_type::join(confind, traits_type::substr(thisfind, offset, gap));
//add next bit to match string
inserts[it->first] = traits_type::substr(thisfind, offset + gap,
it->second->length()); //get matching bit for insert
offset += gap + it->second->length(); //advance index into match string past insert
orig = it->first;
}
confind = traits_type::join( confind, traits_type::substr(thisfind, offset,
traits_type::length((*(c_[index].first))[0]) - orig)); //add final match part
return confind;
}
//
//add insert matches to top level matches, helper routine
//
vect_type * addinsertmatches(vect_type * maybes, std::map<std::size_t, value_type>& inserts,
std::size_t index, value_type allfind) {
for (typename std::map<std::size_t, value_type>::reverse_iterator mrit =
inserts.rbegin(); mrit != inserts.rend(); ++mrit) { //combine with matches for inserts
vect_type * maybeaccum = new(vect_type);
vect_type * nextins =
(*(c_[index].second))[mrit->first]->findpart(mrit->second, allfind);
for (std::size_t b = 0; b < maybes->size(); ++b) {
for (std::size_t c = 0; c < nextins->size(); ++c) {
maybeaccum->push_back(traits_type::insert((*maybes)[b], (*nextins)[c], mrit->first));
}
}
delete nextins;
swap(maybeaccum, maybes);
delete maybeaccum;
}
return maybes;
}
All these gymnastics are fortunately concealed within the private implementation of the Catenator. The user has a container that looks and feels like a vector, but handles the combinations of sub-strings without complaining. Problem solved; everyone can go home.
|
Sponsored Links
|