Weblogs Forum - Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Science

Articles |
News |
Weblogs |
Books |
Forums

Artima Forums | Articles | Weblogs | Java Answers | News

Sponsored Link •

Weblogs Forum
Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Science

40 replies on 3 pages. Most recent reply: Dec 17, 2005 3:41 PM by Gregg Wonderly

Welcome Guest
Sign In

Back to Topic List

Reply to this Topic

Search Forum

Threaded View


Previous Topic		Next Topic

Flat View: This topic has 40 replies on 3 pages [ « | 1 2 3 | » ]

Gregg Wonderly

Posts: 317
Nickname: greggwon
Registered: Apr, 2003

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 27, 2005 5:30 PM

> > I'm with Michael on this one (and was about to reply
> the
> > same, until I saw it had already been said): All the
> > definitions I've seen (including doing a search now),
> > points out that the "hash" in "hash table"/"hash map"
> > comes from it being the name of an implementation
> method
> > or data structure, that works by applying a
> mathematical
> > function (a hash function) to the item being searched
> for,
> > to locate it.
>
> I can live with that. So a hash-map is a map ADT with O(1)
> complexity element insertion and element access by key on
> average, which by definition requires the usage of
> a non-trivial hash-function in its implementation. There
> is however nothing said about the underlying
> representation of the data type other than it is a table.

A hash table as several attributes. First, there is the hashing function. The complexity of that function controls the collision factor associated with the keys. A simple function will execute quickly, in general, but may arrive at a high collision factory. For example, one could use the first byte of the key as the hash function.

More complex hash functions can include red-black tree navigation, complex byte manipulations etc. The more typical hashing implementation has a bounded table that the hash function computes an address within. That table then contains buckets that the data is stored into. The buckets can be of arbitrary size. A bucket might be of size one, or it might be of infinite size.

The implementation of the bucket might include another table, a linked list, a tree structure etc. The algorithm involved in place the data into its associated bucket is called the collision resolution algorithm.

The classical, in memory implementation of a hashtable is the hash-chaining implementation. The hashing function varies based on details deemed important to the implementer. The collision resolution algorithm involves a linked list.

The proportion of the bucket size to the total size of the hashtable is called the loading factor. In a hash chaining implementation, the collision resolution effeciency depends on a small loading factor.

In other implementations where bucket access might use a sorted structure such as a red-black tree, the loading factor can be much high, perhaps an order of magnitude higher.

The loading factor value reflects some form of space efficiency of the overall structure. A hash-chaining implementation with a low loading factor may have a fairly large array of pointers to chains that would be expensive in terms of total memory use.

The Java HashTable implementation is a hash-chaining implementation. More recent versions creates a table of 11 pointers by default (it used to be 101), and adds hash entries into that table until the loading factor, defaulting to .75) reaches the preset (or application set) level. It will then resize the table, and rehash it automatically. Applications which reach the loading factor value repeatedly while inserting large numbers of elements will encounter rehashing delays, repeatedly. This is a compromise between having to known the size of the data vs having an effecient hash table.

In a red-black tree collision resolution implementation, a loading factor that is much larger, would allow a relatively small table and larger trees for buckets. But, there is still traversal time involved for the red-black tree.

Persistent hashtable implementations typically store the buckets on disk, and cache them in memory with write through caches. A chaining bucket implementation can write the chains to individual files, or can always append to the end, or track empty places and reuse them etc. More complex bucket implementations might have a higher cost cache loading and unloading cost because of the complexity of the structure. But, that is really an implementation detail.

In languages such as C, or others that utilize static structure sizes in software more than dynamic sized structures, one might just use a bucket number as an lseek() offset to get the data items. In the Java Hashtable() and Hashmap() implementations, the content is so arbitrary, that it would be difficult to optimize disk usage absolutely, without some fairly complex algorithms to handle nested hash tables filed with nested lists etc.

Some may be familar with one of nicest complex disk persistence implementations in Java. There used to be a company called Object Design that developed a product like that stated with PSE (Persistent Storage Engine), and ended with the highend ObjectStore database. More about this is visible on the web and at http://research.sun.com/forest/COM.Sun.Labs.Forest.PJava.PJW2.14_pdf.pdf.

In the end, hashtables are hashtables. They are an ADT has suggested, and can have a wide range of implementations. Persistence is an attribute and is closely associated with the implementation. It has to be because of the close association with the bucket implementation.

The fact that there are not too many hash storage implementations is probably largely a fact related to the availability of RDBMS implementations which already provide high quality storage of the typically stored data values which have largely been string oriented.

We all know, of course, that this impedence mismatch between the types of data stored in persistence implementations and those effeciently used by software has long provided a large market for RDBMS vendors through associated conversion lockin.

Marcin Kowalczyk

Posts: 40
Nickname: qrczak
Registered: Oct, 2004

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 7:37 AM

Huh? Hash tables have nothing to do with VLists. Hash tables are dictionaries, VLists are sequences. Hash tables are indeed not persistent, and I indeed don't know a persistent variant of hash tables. VLists have a different interface: they are indexed by numbers rather than by arbitrary keys, and support efficient extension by an element, not replacement of an element. Wikipedia is right here.

Marcin Kowalczyk

Posts: 40
Nickname: qrczak
Registered: Oct, 2004

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 7:41 AM

Hmm, actually VLists aren't even persistent; here Wikipedia is wrong. You can't extend the same VList twice by a different element and have both versions usable at the same time (except by copying, which is O(n) and thus can be supported by any data structure).

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 7:52 AM

> Huh? Hash tables have nothing to do with VLists.

A hash table can be implemented using a VList.

> Hash
> tables are dictionaries, VLists are sequences. Hash tables
> are indeed not persistent,

Why? What property of hash tables, makes them unable to be persisten? I suggest you first take a look at http://lampwww.epfl.ch/papers/techlists.pdf before answering.

> and I indeed don't know a
> persistent variant of hash tables.

If you don't know something exists, it doesn't prove it doesn't or can't exist.

> VLists have a different
> interface:

I don't believe I ever made any claims to the contrary.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 8:11 AM

> Hmm, actually VLists aren't even persistent; here
> Wikipedia is wrong. You can't extend the same VList twice
> by a different element and have both versions usable at
> the same time (except by copying, which is O(n) and thus
> can be supported by any data structure).

As far as I can tell this is incorrect. Wikipedia does a good job of describing how to extend VLists by single elements:

"The trickier case, however, is adding a new item to the front of a list, call it A, which starts somewhere in the middle of the array-list data structure. This is the operation that allows VLists to be persistent. To accomplish this, we create a new array, and we link it to the array containing the first element of A. The new array must also store the offset of the first element of A in that array. Then, we can proceed to add any number of items we like to our new array, and any references into this new array will point to VLists which share a tail of values with the old array."

I fail to see any reason why multiple vlists can't share the same tail.

Marcin Kowalczyk

Posts: 40
Nickname: qrczak
Registered: Oct, 2004

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 8:56 AM

How would you implement a persistent hash table on top of a VList? It doesn't work when done as usually done on top of an array.

Let's suppose that the given VList underlying the representation of a hash table has 64 elements: the first block has 32 elements, the second has 16 etc. Some of them are filled with (key,value) pairs, some are free. We want to construct a hash table with one (key,value) pair added, such that the old hash table is still usable. Let's suppose that the new key hashes to the value 40 modulo 64, and the 40th slot is free. How to add it?

You can't overwrite the old slot in place, because it would change the old hash table.

Copying the part of the table from the beginning to the changed element (rounded to a block) costs O(n), the same as is archievable in turning any non-persistent data structure into a persistent one by copying it as a whole. Well, on average only half must be copied, but it's just a constant factor. Not impressive. VList doesn't provide any benefit here relative to plain arrays.

If it's not done this way, then how?

VLists provide a good cost only when adding elements at the front, and only when the same VList is not extended multiple times. Imagine that l is a VList, and you construct a VList with x added to the front of l (this overwrites the entry before the old beginning), and then a VList with y added to the front of l (this can't overwrite the same entry because it would change the previous VList, so a new block must be allocated). This doesn't have amortized O(1) cost of all insertions. The cost is amortized O(1) only when the given VList is extended at most once in its history. This is only partial persistency. And it applies only to adding at the front, which is not relevant for hash tables because their underlying arrays are updated in a random order.

Gregg Wonderly

Posts: 317
Nickname: greggwon
Registered: Apr, 2003

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 9:25 AM

The term persistent is used here repeatedly and immutable was used before as well. Are we talking about not moving the data that was referenced from the old version of the vlist, or are we talking about persisting data to storage where it can be loaded back on a subsequent invocation of the application?

I think you are talking about immutable references rather then persistent data.

In Java, it doesn't matter what you do to the data structure that provides access to the objects. The objects are always referencable. Are you more concerned about breaking access to objects already held by pointers in C, C++ or other address based reference languages? If so, then the hashtable implementation can solve that problem for you by not storing objects, but storing references to objects right?

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 9:40 AM

> The term persistent is used here repeatedly and immutable
> was used before as well Are we talking about not moving
> the data that was referenced from the old version of the
> vlist, or are we talking about persisting data to storage
> where it can be loaded back on a subsequent invocation of
> the application?
>
> I think you are talking about immutable references rather
> then persistent data.

We are using the term persistence as described at http://en.wikipedia.org/wiki/Persistent_data_structure

> In Java, it doesn't matter what you do to the data
> structure that provides access to the objects. The
> objects are always referencable. Are you more concerned
> about breaking access to objects already held by pointers
> in C, C++ or other address based reference languages?

At this point, we are discussing with the property of being able to efficiently sharing copies of a data structure, as is common in functional programming languages where all structures are immutable. In Java the data structures are all mutable.

Myself, I'm particularly interested in exactly the property of VLists which you describe for my own work.

> If
> so, then the hashtable implementation can solve that
> problem for you by not storing objects, but storing
> references to objects right?

Yes, that is the most widely advocated solution. In my work I don't like this solution because of the extra overhead of space and time. My data values are typically the same size as object references. There are other properties of the array doubling strategy used by most hash-tables which I don't like, particularly the horrible worst case performance and mediochre average case performance of insertion into a doubling array.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 9:58 AM

> If it's not done this way, then how?

Check out section 3 of Hagwell's paper.

Max Lybbert

Posts: 314
Nickname: mlybbert
Registered: Apr, 2005

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 11:12 AM

OK, I'm curious what CDiggins has in mind when he says:

"A hash table can be implemented using a VList."

I know I must be overlooking something. My understanding of common hash [map | table] implementations is that they are either (1) an array of pointers, or (2) an array of buckets that are themselves arrays. The hash function only determines the offset of the underlying array. For instance:

#include <string>
#include <list>
#include <locale>

std::cout.imbue(std::locale("")); // needed for toupper()

int hash_fun(std::string input)
{ char hash_index = std::toupper(input[0], std::cout.getloc());
return static_cast<int>(hash_index) -= static_cast<int>('A');
}

std::list<std::string>* array_of_pointers[26];

std::string* array_of_buckets[26][50];

***

Please note that this example was not sent through a compiler.

Storing a value in either hash table involves using the hash function to figure out which slot to add the data to, and then appending it to whatever is there. The first method is similar to what Stroustrup uses in "The C++ Programming Language," the second is useful to make sure the whole table fits within a certain memory footprint.

Based on this, I don't see why using a VList would be an improvement over an array.

Perhaps we're talking about using a VList instead of a std::list (first example)? In that case the improvement would be for a hash function with lots of collissions (as the VList becomes more effecient as the VList gets larger).

What am I missing?

Max Lybbert

Posts: 314
Nickname: mlybbert
Registered: Apr, 2005

I think I figured out my answer

Posted: Nov 28, 2005 11:14 AM

It's that "persistancy" issue we keep talking about, isn't it(g)?

Kay Schluehr

Posts: 302
Nickname: schluehk
Registered: Jan, 2005

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 11:31 AM

> > Huh? Hash tables have nothing to do with VLists.
>
> A hash table can be implemented using a VList.

Christopher, do you know the figure Humpty Dumpty of Lewis Carrols "Through looking glass"? It should be of course clear that the Humpty Dumpties in this world do not only have problems with Wikipedia...

Kay

Gregg Wonderly

Posts: 317
Nickname: greggwon
Registered: Apr, 2003

Re: I think I figured out my answer

Posted: Nov 28, 2005 11:36 AM

> It's that "persistancy" issue we keep talking about, isn't
> it(g)?

A Vlist as a bucket implementation would not solve the problem. The initial "table" would have to be persistant too because the "pointers" to the buckets would need to be persistent as well.

But, I think a hash-chaining implementation can be persistent if you dup the table, and insert at the head of the list. Then, you get to reuse all existing data, and its only the table that mutates.

A Vlist is a form of such a structure. But, the table is overhead in the hash-chaining implementation that the Vlist might, not necessarily have.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Nov 28, 2005 11:54 AM

> > > Huh? Hash tables have nothing to do with VLists.
> >
> > A hash table can be implemented using a VList.
>
> Christopher, do you know the figure Humpty Dumpty of Lewis
> Carrols "Through looking glass"? It should be of course
> clear that the Humpty Dumpties in this world do not only
> have problems with Wikipedia...

And people wonder why I lose my temper sometimes.

Max Lybbert

Posts: 314
Nickname: mlybbert
Registered: Apr, 2005

Re: I think I figured out my answer

Posted: Nov 28, 2005 12:29 PM

/* I think a hash-chaining implementation can be persistent if you dup the table, and insert at the head of the list. Then, you get to reuse all existing data, and its only the table that mutates.

A Vlist is a form of such a structure. But, the table is overhead in the hash-chaining implementation that the Vlist might, not necessarily have.
*/

Originally, I had an image of Linus's git (http://www.kernel.org/pub/software/scm/git/docs/). That image was almost definitely wrong.

Thinking it over, I would imagine that the "pointers" part of the array of pointers would be replaced with indexes into a VList. The index would be incremented on each insert, so that each index would also identify when it was inserted.

However, that leaves me feeling that I'm missing something, because that could have been done with a std::vector. There's apparently something in the VList implementation that would make a persistant hash possible. I just can't figure out what.

Flat View: This topic has 40 replies on 3 pages [ « | 1 2 3 | » ]

Previous Topic

Next Topic