Weblogs Forum - Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Science

Articles |
News |
Weblogs |
Books |
Forums

Artima Forums | Articles | Weblogs | Java Answers | News

Sponsored Link •

Weblogs Forum
Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Science

40 replies on 3 pages. Most recent reply: Dec 17, 2005 3:41 PM by Gregg Wonderly

Welcome Guest
Sign In

Back to Topic List

Reply to this Topic

Search Forum

Threaded View


Previous Topic		Next Topic

Flat View: This topic has 40 replies on 3 pages [ « | 1 2 3 ]

Max Lybbert

Posts: 314
Nickname: mlybbert
Registered: Apr, 2005

oops

Posted: Nov 28, 2005 12:32 PM

I should have linked to http://www.kernel.org/pub/software/scm/git/docs/#Discussion

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: I think I figured out my answer

Posted: Nov 28, 2005 5:00 PM

> /* I think a hash-chaining implementation can be
> persistent if you dup the table, and insert at the head of
> the list. Then, you get to reuse all existing data, and
> its only the table that mutates.
>
> A Vlist is a form of such a structure. But, the table is
> overhead in the hash-chaining implementation that the
> Vlist might, not necessarily have.
> */
>
> Originally, I had an image of Linus's git
> (http://www.kernel.org/pub/software/scm/git/docs/). That
> image was almost definitely wrong.
>
> Thinking it over, I would imagine that the "pointers" part
> of the array of pointers would be replaced with indexes
> into a VList. The index would be incremented on each
> insert, so that each index would also identify when it was
> inserted.
>
> However, that leaves me feeling that I'm missing
> something, because that could have been done with a
> std::vector. There's apparently something in the VList
> implementation that would make a persistant hash possible.
> I just can't figure out what.

From page 10 of Bagwell's paper at http://lampwww.epfl.ch/papers/techlists.pdf is the description of an algorithm to use a VList to create a persistent hash list:

"The problem forming associations is endemic to all programming languages. A modified VList structure can provide the basis for a functional hash list with insert
and lookup times that approach those of a standard chained hash table with hash table doubling. In Fig 6 the basic form of the VList has been modified so that each block contains two portions, the data and a hash table, to form a hash-list. Thus each time the hash-list grows both the data area and the hash-table grow by the same factor. Each data element is a pair including a list pointer to the entry,
normally a list containing the associated items, and a hash-table chain link pointer offset. The hash-table portion is a set of offsets from the block base to the first in
a chain of associated items. The link is hidden, a hash-list behaves as a standard homogeneous list of lists allowing all the standard list operations cons, cdr, etc.
Assuming a suitable hash function, each time a new element is consed to the hash list the hash value for the key, normally the first in the association pair, is
computed. The offset value located at the corresponding hash table entry is stored in the new hash list entry and this entry offset stored in the hash table entry thus
Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays creating a standard chained hash table for all the entries in this memory block.

The search for an association is a little more complicated but can be accomplished in constant time, except in the case of degenerate list formation. Given a hash list,
which may be an arbitrary tail of a hash list, then an association is found by first hashing the key. This hash value is used to index the blocks hash table and the
chain followed until a match is found, using a provided comparison function. If this lies in the tail specified, that is the pointer is less than or equal to the list pointer, then return the entry tail list. If no match is found or the hash table entry was zero, then follow the block previous pointer and repeat for the next block in the
list. However, the key hash value does not need to be recomputed. If the end of the list is reached then return the empty list.

As with the VList the probability of finding the association in the first block is higher than in the next block and so on. Hence if the growth is geometric then
lookup will be, on average, a constant times the basic hash-table lookup time. This in turn will be governed by the hash table size, number of data entries and lookup time for chained hash tables, a well document relationship. Knuth [1998] and Sedgewick [1998]

Notice the structure is fully functional and persistent though there is a time penalty for degenerate hash lists, e.g. when every tail is extended by one. Also
that the performance can be optimized further by allocating the majority of the block to the hash table and reducing it as the actual entries grow. Duplicate entries are found by taking the tail of the returned list and repeating the search on the tail, the order is maintained in the chaining sequence. The structure can be considered garbage collection friendly. All the chain links are offsets contained within a block and can be safely ignored by the garbage collector.

Naturally, the VList concept can be used to create non-functional hash tables too. It is common practice to speed up hash table implementations by periodically
doubling the hash table size as the load factor increases. Since each entry must be re-hashed and copied to the new table there is a high price to pay. The alternative is
to avoid this by growing the hash table as described above. Shrinking also becomes more efficient. When the total load factor is such that the largest hash table block
is no longer required table entries can be copied to the lower blocks. No re-hashing is required and the operation can be integrated with deletes to become incremental."

Max Lybbert

Posts: 314
Nickname: mlybbert
Registered: Apr, 2005

Re: I think I figured out my answer

Posted: Nov 28, 2005 5:39 PM

Thank you. I had started reading the paper, but dropped off well before page 10 because the information appeared Lisp-centric much earlier on (or, rather, Visp-centric).

Simon Francis

Posts: 2
Nickname: steves
Registered: Dec, 2005

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Dec 1, 2005 10:38 AM

A hash table is a data structure. A dictionary is an abstract data type that can be implemented by a hash table, VList, binary search tree, or many other data types.

An abstract data type (like dictionary) specifies an interface and possibly computational compelxity (although not nessiarily).

A data structure (like VList or hash table) is a concrete method of storing data. A data structure can implement an abstract data type. Because this is a concrete method of orginizing data, all computational complexity measurements can be stated.

Simon Francis

Posts: 2
Nickname: steves
Registered: Dec, 2005

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Dec 1, 2005 10:43 AM

Sorry that is incorrect, an abstract data type does not make any computational complexity guarentees.

Vincent O'Sullivan

Posts: 724
Nickname: vincent
Registered: Nov, 2002

Wikipedia Survives Research Test

Posted: Dec 15, 2005 7:18 AM

> Wikipedia and I have a hate-hate relationship. What drives
> me nuts about it, is the sheer idea that an anarchic
> shoutocracy (whoever shouts the loudest is right) can be
> trusted to provide the reader with anything resembling
> fact.

This article - from the BBC - may be of some interest:
http://news.bbc.co.uk/1/hi/technology/4530930.stm

In order to test its reliability, Nature conducted a peer review of scientific entries on Wikipedia and the well-established Encyclopedia Britannica.

Kannan Goundan

Posts: 18
Nickname: cakoose
Registered: Nov, 2005

Re: I think I figured out my answer

Posted: Dec 16, 2005 12:57 AM

I tried reading that section of the paper but couldn't follow what the guy was saying. The description of the memory structures and the insertion operation in particular made no sense to me. Could someone who understood it please explain?

Also, the characterization of performance is worrying: "Hence if the growth is geometric, then lookup will be, on average, a constant times the basic hash-table lookup time. [...] ...though there is a penalty for degenerate hash lists, e.g. when every tail is extended by one." With this restriction, it seems like one could achieve the same performance simply by chaining together regular old hashtables. No VList necessary. If you happen to only take "snapshots" of it at exponentially growing intervals, you'll get constant-time lookup. I'm missing something, aren't I?

Gregg Wonderly

Posts: 317
Nickname: greggwon
Registered: Apr, 2003

Re: I think I figured out my answer

Posted: Dec 16, 2005 7:24 AM

> Also, the characterization of performance is worrying:
> "Hence if the growth is geometric, then lookup will be, on
> average, a constant times the basic hash-table lookup
> time. [...]

The constant is the number of linked tables in the vlist. Which will be minimized by reasonable key distribution. Your concern is valid though. In computer science, we can never assume that keys are "reasonable distributed" for the general case. So, you'll have to know something about the key set to make vlist performance interesting. Also, remember that the main focus here is persistence. That is the creation of a data structure which any thread instance can alter, independently of any other thread and not have to worry about changes in other threads. The implementation of vlists also is thread safe because you always create a new head, never moving the tail, only point to the tail of the list that is after your point of mutation.

> though there is a penalty for degenerate
> hash lists, e.g. when every tail is extended by one."
> With this restriction, it seems like one could achieve
> e the same performance simply by chaining together regular
> old hashtables. No VList necessary.

A vlist is an implementation of chained together tables. There are implementation details that cause that table to automatically split into multiple tables (as a btree might) as it grows. This allows tails to be reasonably large, on the average about half the total data size.

Again, persistence is the issue, so there must be immutability of all parts of the structure as part of the total picture of implementation.

> If you happen to
> only take "snapshots" of it at exponentially growing
> intervals, you'll get constant-time lookup. I'm missing
> something, aren't I?

I think you are not putting persistence into the picture. That is a part of the puzzle that limits the choices of structure of the data.

Kannan Goundan

Posts: 18
Nickname: cakoose
Registered: Nov, 2005

Re: I think I figured out my answer

Posted: Dec 16, 2005 5:15 PM

> > Also, the characterization of performance is worrying:
> > "Hence if the growth is geometric, then lookup will be,
> > on average, a constant times the basic hash-table lookup
> > time. [...]

> The constant is the number of linked tables in the vlist.
> Which will be minimized by reasonable key distribution.
> Your concern is valid though. In computer science, we
> can never assume that keys are "reasonable distributed"
> for the general case.

Actually, I wasn't worried about key distribution as that is a problem for the standard collision chaining implementation as well. The worrying part was the "if the growth is geometric." Does that mean you have to add a geometrically increasing number of items to it as you hand out "persistent references" to it? If so, I don't think this can be fairly considered a persistent data structure.

> Also, remember that the main focus here is persistence.
> That is the creation of a data structure which any
> thread instance can alter, independently of any other
> thread and not have to worry about changes in other
> threads. The implementation of vlists also is thread safe
> because you always create a new head, never moving the
> tail, only point to the tail of the list that is after
> your point of mutation.

Could you explain exactly how the algorithm works? I just flat-out couldn't understand the paper's description. When you say "create a new head", do you mean allocate a new one of those [next ptr|data section|hashtable section] objects? If so, then how is there any advantage over chaining multiple standard hashtables?

> > "...though there is a penalty for degenerate
> > hash lists, e.g. when every tail is extended by one."
> > With this restriction, it seems like one could achieve
> > the same performance simply by chaining together
> > regular old hashtables. No VList necessary.

> A vlist is an implementation of chained together tables.
> There are implementation details that cause that table to
> automatically split into multiple tables (as a btree
> might) as it grows. This allows tails to be reasonably
> large, on the average about half the total data size.

So I understand how a vlist preserves O(1) indexing, but I don't see how to extend that to hashtables (without, of course, imposing nasty requirements on how many elements the user of a hashtable must insert at a time).

A precise algorithm of how the insert operation works would really help.

Cleo Saulnier

Posts: 77
Nickname: vorlath
Registered: Dec, 2005

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Science

Posted: Dec 17, 2005 8:02 AM

I'm not sure what all this is about, and I realise this is besides the point, but I've never had the need for a hash table. I've always used maps or SkipLists (linked list with O(logN) time for insertion, deletion and search). If the list is small, then I don't really care if it's O(1) or O(N^N) for that matter. There aren't enough items to make a difference. And with large amounts of data, the simplicity of a SkipList and the worst case O(logN) is much more attractive.

But what's the deal with this persistance? I don't even get how that could even be a problem for ANY data structure.

I do agree that Wikipedia is wrong a lot of times, or worse just slightly off which is where the real damage is caused. This can't be too surprising though. What field is more opiniated than computer science other than quantum theorists and astrophysicists anyhow? Sure... little invisible neutrinos that don't interact with anything? I'll believe that when I see it.

Gregg Wonderly

Posts: 317
Nickname: greggwon
Registered: Apr, 2003

Re: Persistent Hash-Tables and Why Wikipedia Still Sucks at Computer Scienc

Posted: Dec 17, 2005 3:41 PM

> I'm not sure what all this is about, and I realise this is
> besides the point, but I've never had the need for a hash
> table. I've always used maps or SkipLists (linked list
> with O(logN) time for insertion, deletion and search). If
> the list is small, then I don't really care if it's O(1)
> or O(N^N) for that matter. There aren't enough items to
> make a difference. And with large amounts of data, the
> simplicity of a SkipList and the worst case O(logN) is
> much more attractive.

Hashing has been around for ages. It was, and still is a very important type of "Map" implementation. There are of course times when it's not the right "Map" implementation for a particular problem.

> But what's the deal with this persistance? I don't even
> get how that could even be a problem for ANY data
> structure.

You can read the original note to get the whole perspective. But the main point is that in functional programming languages, no expression in these languages provides any side effects directly visible to other threads. Only through explicit sharing of references can two threads learn what the other has. And, at that point, those copied references are now mutually immutable and only through some future sharing of references can the threads again know what has changed. This, is a very particular aspect of functional programming that is foreign to most developers who haven't used functional languages.

Flat View: This topic has 40 replies on 3 pages [ « | 1 2 3 ]

Previous Topic

Next Topic