
Re: I think I figured out my answer

Posted: Nov 28, 2005 5:00 PM


> /* I think a hashchaining implementation can be > persistent if you dup the table, and insert at the head of > the list. Then, you get to reuse all existing data, and > its only the table that mutates. > > A Vlist is a form of such a structure. But, the table is > overhead in the hashchaining implementation that the > Vlist might, not necessarily have. > */ > > Originally, I had an image of Linus's git > (http://www.kernel.org/pub/software/scm/git/docs/). That > image was almost definitely wrong. > > Thinking it over, I would imagine that the "pointers" part > of the array of pointers would be replaced with indexes > into a VList. The index would be incremented on each > insert, so that each index would also identify when it was > inserted. > > However, that leaves me feeling that I'm missing > something, because that could have been done with a > std::vector. There's apparently something in the VList > implementation that would make a persistant hash possible. > I just can't figure out what.
From page 10 of Bagwell's paper at http://lampwww.epfl.ch/papers/techlists.pdf is the description of an algorithm to use a VList to create a persistent hash list:
"The problem forming associations is endemic to all programming languages. A modified VList structure can provide the basis for a functional hash list with insert and lookup times that approach those of a standard chained hash table with hash table doubling. In Fig 6 the basic form of the VList has been modified so that each block contains two portions, the data and a hash table, to form a hashlist. Thus each time the hashlist grows both the data area and the hashtable grow by the same factor. Each data element is a pair including a list pointer to the entry, normally a list containing the associated items, and a hashtable chain link pointer offset. The hashtable portion is a set of offsets from the block base to the first in a chain of associated items. The link is hidden, a hashlist behaves as a standard homogeneous list of lists allowing all the standard list operations cons, cdr, etc. Assuming a suitable hash function, each time a new element is consed to the hash list the hash value for the key, normally the first in the association pair, is computed. The offset value located at the corresponding hash table entry is stored in the new hash list entry and this entry offset stored in the hash table entry thus Fast Functional Lists, HashLists, Deques and Variable Length Arrays creating a standard chained hash table for all the entries in this memory block.
The search for an association is a little more complicated but can be accomplished in constant time, except in the case of degenerate list formation. Given a hash list, which may be an arbitrary tail of a hash list, then an association is found by first hashing the key. This hash value is used to index the blocks hash table and the chain followed until a match is found, using a provided comparison function. If this lies in the tail specified, that is the pointer is less than or equal to the list pointer, then return the entry tail list. If no match is found or the hash table entry was zero, then follow the block previous pointer and repeat for the next block in the list. However, the key hash value does not need to be recomputed. If the end of the list is reached then return the empty list.
As with the VList the probability of finding the association in the first block is higher than in the next block and so on. Hence if the growth is geometric then lookup will be, on average, a constant times the basic hashtable lookup time. This in turn will be governed by the hash table size, number of data entries and lookup time for chained hash tables, a well document relationship. Knuth [1998] and Sedgewick [1998]
Notice the structure is fully functional and persistent though there is a time penalty for degenerate hash lists, e.g. when every tail is extended by one. Also that the performance can be optimized further by allocating the majority of the block to the hash table and reducing it as the actual entries grow. Duplicate entries are found by taking the tail of the returned list and repeating the search on the tail, the order is maintained in the chaining sequence. The structure can be considered garbage collection friendly. All the chain links are offsets contained within a block and can be safely ignored by the garbage collector.
Naturally, the VList concept can be used to create nonfunctional hash tables too. It is common practice to speed up hash table implementations by periodically doubling the hash table size as the load factor increases. Since each entry must be rehashed and copied to the new table there is a high price to pay. The alternative is to avoid this by growing the hash table as described above. Shrinking also becomes more efficient. When the total load factor is such that the largest hash table block is no longer required table entries can be copied to the lower blocks. No rehashing is required and the operation can be integrated with deletes to become incremental."

