The Artima Developer Community

Turn, and Face the Strange
Hungry Disks: Storage but no files!
by Calum Shaw-Mackay
September 20, 2004
Summary
Anyone who has read David Gelernter’s The Second Coming: A Manifesto will have heard of the concept of filesystems that grab the data they want and categorise files based on their content. But how near is the reality of Content Based Data Stores (CBDS)?

Notice I said Content Based Data Stores, not Content Based File Systems, which is perhaps the more familiar term but is actually incorrect. You see, in the Manifesto, Gelernter outlines the problems with files and directories as the current realisation of data storage. Firstly, the analogy between disk files and paper files, and between disk directories and filing cabinets, is flawed, because a computer can take action on these collections of bytes rather than just being a passive container for them. You may say, well, programs load data and take action, possibly storing the results differently. But what if the “file system” itself could take action as the bytes were being stored?

If it weren’t for the twin concepts of separation and collision, the filing cabinet system of storing files might never have arisen, or at least not in the way that it has. Filing cabinets, folders and individual files are good for us techies, the programmers amongst you especially. I have two projects, ProjectA and ProjectB, and I need to enforce separation, so I put them in two different folders. I have a file called ‘automate.c’ for both a COM component and an attached system device, so I’ll have ‘com/automate.c’ and ‘dev/automate.c’. This separates the files and removes the collision, with the side effect of providing rudimentary namespaces. What’s wrong with this, I hear you cry?

Imagine now that you have a 55-year-old man at his PC, listening to an MP3 of Rachmaninov. Does he care that it is an MP3? Does he care that the file is called PnoConcertoNo3.mp3? No. What he cares about is that it is music he likes to listen to.

Now let’s take another scenario. Imagine a marketing person who can talk the talk and is computer literate. They download a brochure file for a competitor’s product. What do they call it? Where do they put it? There is actually no simple answer. I’ll explain why in a moment.

The reason the cabinet/file concept fails us today is that it fits programming, where there is no room for ambiguity and no multiple meanings, far better than it fits everyday data. In programming, the programmer’s concept of separation to remove collisions is a good one: I know that dev/automate.c’s sole purpose is the automation of the device.

Now let’s go back to the brochure. ‘What do I call it?’ People get hung up on names, ‘brochure.pdf’, ‘someNewFangledGadgetV1.html’, whatever, but it really shouldn’t matter. What matters is this.

So the file could go in ‘Competitors’, ‘Products’, ‘FooBar’s Gadgets’ or ‘ourOldFangledWidget’.

But we currently have to choose one. I know we could use symlinks, but that really isn’t the point.

Now let’s go back to the man. He likes his music: he likes piano concertos, he likes classical music, and he likes Rachmaninov. The only thing that combines all of these is ‘Music’, so if we have a folder called ‘Music’, he could just put a file in there called ‘RachmaninovPnoConcertoNo3.mp3’ and he’s on his way. Or is he?

You see, a CBDS really comes to the fore when you’re searching for something. My Program Files folder contains over 32,000 files and 2,481 directories, and believe me, that isn’t very much (my basic ‘Eclipse’ install has 2,831 files and 539 directories). Is this really a good thing? Most of the time I don’t even care what these files are called. Another problem with the filing metaphor is its inherent tree structure, going from the general to the specific, from d:\development to d:\development\java\j2sdk1.4.2_05\jre\bin\java.exe. Effectively, java.exe belongs to bin (and no other), which in turn belongs to jre (and no other), and so on.

What really needs to happen for a CBDS to be realistic is to change the concept from a tree to a graph. This would allow us to realise at least some of the things in Gelernter’s paper: a file can have one name, many names, or no name; files can share a name; and a file can exist in one directory, many directories or no directory.
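To make the tree-versus-graph point concrete, here is a minimal Python sketch of a store in which names and categories are just links to an item. The class `GraphStore` and its methods are hypothetical names invented for illustration; nothing here comes from Gelernter’s paper or from any real filesystem.

```python
from collections import defaultdict
from itertools import count


class GraphStore:
    """Toy store: each item can be linked to any number of names and categories."""

    def __init__(self):
        self._ids = count(1)
        self.data = {}                      # item id -> bytes
        self.names = defaultdict(set)       # name -> item ids (names may be shared)
        self.categories = defaultdict(set)  # category -> item ids

    def put(self, payload, names=(), categories=()):
        """Store a payload and link it to zero or more names and categories."""
        item_id = next(self._ids)
        self.data[item_id] = payload
        for name in names:
            self.names[name].add(item_id)
        for category in categories:
            self.categories[category].add(item_id)
        return item_id

    def names_of(self, item_id):
        return sorted(n for n, ids in self.names.items() if item_id in ids)

    def categories_of(self, item_id):
        return sorted(c for c, ids in self.categories.items() if item_id in ids)


store = GraphStore()
# Two different items share the name 'automate.c' without colliding.
com = store.put(b"/* COM component */", names=["automate.c"], categories=["com"])
dev = store.put(b"/* device */", names=["automate.c", "driver.c"],
                categories=["dev", "hardware"])
# An item with no name and no category is still stored and reachable by id.
anon = store.put(b"scratch notes")
```

Because a name is only an edge in the graph, the collision that forced ‘com/automate.c’ and ‘dev/automate.c’ apart in a tree simply does not arise here.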

Instead of storing the data to a directory, the file is just stored to an area. All folders are replaced by categories, and of course these will have sub-categories, but none of these are mutually exclusive. So ‘brochure.html’ could be stored simultaneously in ‘Competitors’, ‘Products’, ‘FooBar’s Gadgets’ and ‘ourOldFangledWidget’, because the filesystem realises that these are all relevant. If the user changes the file such that it is clear that the product is actually not like ‘ourOldFangledWidget’, then the file would be removed from that category automatically. Our brains work in a similar way: we remember things based on what happened or what we did, not by some symbolic name we gave to the event. In other words, a CBDS will check your data as it is stored (take action) and store it according to the metadata and the content. The user will not have to figure out how to store it or where to store it, and may not even have to give it a name. This basically forgoes the concept of having indexes for files and folders, and instead uses the indexes themselves as the way of storing and retrieving data.
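A rough Python sketch of the ‘take action as it is stored’ idea follows. It assumes that relevance can be approximated by simple keyword matching, which is a stand-in for whatever content analysis a real CBDS would use; the class `ContentStore` and the keyword rules are made up for illustration.

```python
class ContentStore:
    """Toy store that re-derives an item's categories every time it is saved."""

    def __init__(self, rules):
        self.rules = rules                        # category -> keywords (assumed)
        self.items = {}                           # item id -> text
        self.index = {category: set() for category in rules}

    def save(self, item_id, text):
        """Store or update an item, then file it purely according to its content."""
        self.items[item_id] = text
        words = set(text.lower().split())
        for category, keywords in self.rules.items():
            if words & keywords:
                self.index[category].add(item_id)
            else:
                # The content no longer fits, so the item drops out by itself.
                self.index[category].discard(item_id)

    def categories_of(self, item_id):
        return sorted(c for c, ids in self.index.items() if item_id in ids)


rules = {
    "Competitors": {"foobar"},
    "Products": {"gadget", "widget"},
    "ourOldFangledWidget": {"widget"},
}
store = ContentStore(rules)

store.save("brochure", "FooBar gadget replaces any widget")
first = store.categories_of("brochure")

# The user edits the brochure; it no longer mentions widgets.
store.save("brochure", "FooBar gadget for travellers")
second = store.categories_of("brochure")
```

The user never chooses a folder in this sketch. The only operation is `save`, and the index is both the filing system and the search structure.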

Now to the Hungry Disks. Once a ‘term’ (the CBDS notion of a folder) has been created, either by the user or by the computer, any file stored to the ‘unspecified’ area is checked by a number of high-level terms (Music, Documents, Programs), each of which checks whether the file is relevant to it. For instance, Cubase should exist in both Music and Programs (or sub-terms thereof, such as Programs|Audio). Moreover, each term can trigger a set of related terms and sub-terms, to see if they also find the data relevant to their topic.

So, taking the MP3 example, the file representing data called ‘Rachmaninov Piano Concerto No.3’ is put into the unspecified area. Instantly, the ‘Music’ term recognises it through the file’s ID3 tags and loads the file location into the Music index. Music has a number of sub-terms: MP3, WAV, WMA, Jazz, Rock, Classical. The terms that find the file relevant are MP3 and Classical. Classical has two sub-terms, ‘PerformedBy’ and ‘Composer’. Composer fires the ‘Rachmaninov’ sub-term, Rachmaninov fires ‘19th Century’ and ‘20th Century’ in the ‘History’ high-level term, and so on.

A quite contrived example, but the point is that this music file can be found in Music, MP3, Classical, Composer, Rachmaninov or even History/19th Century and History/20th Century. And the user didn’t have to do anything.
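The cascade in the music example can be sketched in Python as a set of terms, each with a relevance test, sub-terms, and related terms that fire elsewhere. This is only an illustration under assumptions: the `Term` class is hypothetical, and a plain dictionary stands in for the metadata a real system would read from the file’s ID3 tags.

```python
class Term:
    """A 'hungry' term: claims an item if relevant, then asks its sub-terms."""

    def __init__(self, name, matches, sub_terms=(), related=()):
        self.name = name
        self.matches = matches            # predicate: metadata dict -> bool
        self.sub_terms = list(sub_terms)
        self.related = list(related)      # paths of terms elsewhere that also fire

    def claim(self, meta, path=(), found=None):
        """Collect every term path under which the item can be found."""
        found = set() if found is None else found
        if not self.matches(meta):
            return found
        here = path + (self.name,)
        found.add("/".join(here))
        found.update(self.related)
        for term in self.sub_terms:
            term.claim(meta, here, found)
        return found


rachmaninov = Term("Rachmaninov", lambda m: m.get("composer") == "Rachmaninov",
                   related=["History/19th Century", "History/20th Century"])
composer = Term("Composer", lambda m: "composer" in m, [rachmaninov])
performed_by = Term("PerformedBy", lambda m: "performer" in m)
classical = Term("Classical", lambda m: m.get("genre") == "Classical",
                 [performed_by, composer])
jazz = Term("Jazz", lambda m: m.get("genre") == "Jazz")
mp3 = Term("MP3", lambda m: m.get("format") == "mp3")
music = Term("Music", lambda m: m.get("kind") == "audio", [mp3, jazz, classical])

# Stand-in for tags read from the file; the keys are assumptions.
meta = {"kind": "audio", "format": "mp3", "genre": "Classical",
        "composer": "Rachmaninov", "title": "Piano Concerto No.3"}
found = sorted(music.claim(meta))
```

Here `found` holds every path under which the file is reachable, including the two History terms, even though the only thing the user did was drop the file into the unspecified area.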

This is an extremely powerful concept, and one that takes a fair bit of ‘getting your head around’. But let’s take another business example: you mail a supplier, ‘Jones Bearings’, to say that the shipment of ball bearings you requested, dated the 16th August, has not arrived.

You can guess where I’m going with this. Instead of storing this in a pointless ‘Sent Items’ file (after all, you already knew it was sent), an email program with a CBDS would automatically store the reference in ‘Supplier correspondence’, in ‘Jones Bearings’, in ‘16th August’, in ‘Week Ending 22nd August’, in ‘Late deliveries’, and so on.
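As a small illustration of how such terms could be derived without the user lifting a finger, here is a Python sketch. The function `terms_for_email`, its parameters, and the ‘late’ phrases are all hypothetical, the year 2004 is assumed from the date of this entry, and dates are written in ISO form to keep the example simple.

```python
from datetime import date, timedelta


def terms_for_email(recipient, dated, body, suppliers,
                    late_phrases=("not arrived", "overdue")):
    """Return the set of terms an outgoing email would be filed under."""
    terms = {recipient, dated.isoformat()}
    # weekday() is 0 for Monday, so this lands on the Sunday that closes the week.
    week_end = dated + timedelta(days=6 - dated.weekday())
    terms.add(f"Week Ending {week_end.isoformat()}")
    if recipient in suppliers:
        terms.add("Supplier correspondence")
    # Crude content check standing in for real analysis of the message.
    if any(phrase in body.lower() for phrase in late_phrases):
        terms.add("Late deliveries")
    return terms


terms = terms_for_email(
    recipient="Jones Bearings",
    dated=date(2004, 8, 16),  # year is an assumption, taken from the post date
    body="The shipment of ball bearings has not arrived.",
    suppliers={"Jones Bearings", "Smith Washers"},
)
```

The week-ending term falls out of simple date arithmetic, which is exactly the kind of filing a person would never bother to do by hand.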

However, you can also jump from one term to a related one, so in ‘16th August’ I may have a reference to a file that is related to ‘Smith Washers’, and I can jump to that.
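One simple way to picture this jumping, sketched in Python, is to treat two terms as related whenever they share at least one item. That definition is an assumption made for the example, as are the function name `related_terms` and the item identifiers.

```python
def related_terms(index, term):
    """Terms that share at least one item with the given term."""
    items = index.get(term, set())
    return sorted(other for other, members in index.items()
                  if other != term and members & items)


# term -> references held by that term (hypothetical identifiers)
index = {
    "16th August": {"mail-jones", "invoice-smith"},
    "Jones Bearings": {"mail-jones"},
    "Smith Washers": {"invoice-smith"},
    "Late deliveries": {"mail-jones"},
}

from_date = related_terms(index, "16th August")
from_smith = related_terms(index, "Smith Washers")
```

From ‘16th August’ the user can reach ‘Smith Washers’ in one hop, and from there get back to the date, without either term sitting ‘inside’ the other.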

The point is that, to a CBDS, it’s all a reference to some set of bytes. Those bytes could be a proper file like a Word document, an email or an anniversary reminder; it doesn’t matter. It removes the need for storing things in files and folders, in particular places, with certain names or suffixes, and gets to the heart of the matter: I have some data and I want to read it, write it, delete it, create something new, or find it.

It remains to be seen whether CBDS (or something similar to it) will actually arrive.

About the Blogger

Calum Shaw-Mackay is an architect on Java and Jini systems, working in the UK. His interests lie in distributed computing, adaptability, and abstraction. Calum has been using Jini for longer than he would care to mention. His main area for taking the blame (some people would call it 'expertise') is systems integration and distributed frameworks, and he is an advocate of using Jini's unique strengths to build adaptable enterprise systems. His opinions are his own. He's tried to get other people to take his opinions off him, but they just won't.

This weblog entry is Copyright © 2004 Calum Shaw-Mackay. All rights reserved.
