Turn, and Face the Strange
Hungry Disks: Storage but no files!
by Calum Shaw-Mackay
September 20, 2004

Summary
Anyone who has read David Gelernter's The Second Coming: A Manifesto will have heard of the concept of filesystems that grab the data they want and categorise files based on their content. But how near is the reality of Content Based Data Stores (CBDS)?

Notice I said Content Based Data Stores, not Content Based File Systems, which is perhaps a more familiar term but is actually incorrect. You see, in the Manifesto, Gelernter outlines the problems associated with the idea of files and directories as the current realisation of data storage. Firstly, the correlation between disk files and paper files, and between disk directories and filing cabinets, is flawed because a computer can take action on these collections of bytes rather than just being a passive container for them. You may say, well, programs load data and take action, possibly storing the results differently. But what if the file system itself could take action as the bytes were being stored?

If it weren't for the twin concepts of separation and collision, the filing cabinet system of storing files might never have arisen, or at least not in the way that it has. Filing cabinets, folders and individual files are good for us techies, the programmers amongst you especially. I have two projects, ProjectA and ProjectB, and I need to enforce separation, so I put them in two different folders. I have a file called automate.c for both a COM component and an attached system device, so I'll have com/automate.c and dev/automate.c. This separates the files and removes the collision, with the side effect of giving me rudimentary namespaces. What's wrong with this, I hear you cry?

Imagine now a 55-year-old man at his PC, listening to an MP3 of Rachmaninov. Does he care that it is an MP3? Does he care that the file is called PnoConcertoNo3.mp3? No. What he cares about is that it is music he likes to listen to.

Now let's take another scenario. Imagine a marketing person: they can talk the talk, they're computer literate. They download a brochure file for a competitor's product. What do they call it? Where do they put it? There is actually no simple answer. I'll explain why in a moment.

The reason the cabinet/file concept fails us today is that it was shaped by programming, where there is no room for ambiguity and no multiple meanings, so the programmer's concept of separation to remove collisions is a good one. I know that dev/automate.c's sole purpose is the automation of the device. Everyday data is rarely that clear-cut.

Now let's go back to the brochure. What do I call it? People get hung up on names (brochure.pdf, someNewFangledGadgetV1.html, whatever), but it really shouldn't matter. What matters is this:

It's a competitor's product
The product is called someNewFangledGadget Version 1
The competitor is called FooBars Gadgets
It does a similar job to ourOldFangledWidget

So the file could go in Competitors, Products, FooBars Gadgets, or ourOldFangledWidget.

But we currently have to choose one. I know we could use symlinks, but that really isn't the point.

Now let's go back to the man. He likes his music: he likes piano concertos, he likes classical music, and he likes Rachmaninov. The only thing that combines all these is Music, so if we have a folder called Music, he could just put a file in there called RachmaninovPnoConcertoNo3.mp3 and he's on his way. Or is he?

You see, a CBDS really comes to the fore when you're searching for something. My Program Files folder contains over 32000 files and 2481 directories, and believe me, that isn't very much (my basic 'Eclipse' install has 2831 files and 539 directories). Is this really a good thing? Most of the time I don't even care what these files are called.

Another problem with the filing metaphor is its inherent tree structure, going from the general to the specific, from d:\development to d:\development\java\j2sdk1.4.2_05\jre\bin\java.exe. Effectively, java.exe belongs to bin (and no other), which in turn belongs to jre (and no other), and so on.

What really needs to happen for a CBDS to be realistic is to change the concept to a graph rather than a tree. This would allow us to at least realise some of the things in Gelernter's paper: a file can have one name, many names, or no name, and files can share a name; and a file can exist in one directory, many directories, or no directory.
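To make the graph idea a little more concrete, here is a minimal Python sketch. It is not anything from Gelernter's paper or from a real filesystem; the `Item` and `GraphStore` names are made up for illustration. The only point is that names and categories become optional, many-valued attributes of the data rather than a single path.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality: two items are never "the same" by accident
class Item:
    """Some bytes with any number of names and categories, possibly none."""
    data: bytes
    names: set = field(default_factory=set)
    categories: set = field(default_factory=set)


class GraphStore:
    """Hypothetical store: items are looked up by attribute, not by path."""

    def __init__(self):
        self.items = []

    def put(self, data, names=(), categories=()):
        item = Item(data, set(names), set(categories))
        self.items.append(item)
        return item

    def by_name(self, name):
        # Several items may share a name, so this returns a list.
        return [i for i in self.items if name in i.names]

    def by_category(self, category):
        return [i for i in self.items if category in i.categories]


store = GraphStore()
# Two different files share one name without colliding.
com = store.put(b"/* COM component */", names={"automate.c"}, categories={"com"})
dev = store.put(b"/* device */", names={"automate.c"}, categories={"dev"})
# One file with no name at all, living in four categories at once.
brochure = store.put(
    b"<html>...</html>",
    categories={"Competitors", "Products", "FooBars Gadgets", "ourOldFangledWidget"},
)
# One file with no name and no category.
scratch = store.put(b"untitled notes")
```

Looking up `automate.c` by name returns both files, and the brochure turns up under every one of its categories, which is exactly what a tree of directories cannot express without symlinks.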

Instead of storing the data to a directory, the file is just stored to an area. All folders are replaced by categories, and of course these will have sub-categories, but none of these are mutually exclusive. So brochure.html could be stored simultaneously in Competitors, Products, FooBars Gadgets and ourOldFangledWidget, because the filesystem realises that these are all relevant. If the user changes the file such that it is clear that the product is actually not like ourOldFangledWidget, then the file would be removed from that category automatically.

Our brains work in a similar way: we remember things based on what happened or what we did, not some kind of symbolic name we give to the event. In other words, a CBDS will check your data as it is stored (take action) and store it according to the metadata and the content. The user will not have to figure out how to store it or where to store it, and may not even have to give it a name. This basically forgoes the concept of keeping indexes alongside files and folders, and instead uses the indexes themselves as the way of storing and retrieving data.
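Here is a rough Python sketch of "check the data as it is stored", including the automatic removal from a category when the content changes. Everything in it is an assumption made for illustration: the `ContentStore` class is hypothetical, and naive keyword matching stands in for whatever content analysis a real CBDS would use.

```python
class ContentStore:
    """Hypothetical store where the category indexes ARE the storage layout."""

    def __init__(self, rules):
        self.rules = rules                             # category -> predicate over text
        self.index = {name: set() for name in rules}   # category -> item ids
        self.content = {}                              # item id -> text
        self._next_id = 0

    def save(self, text, item_id=None):
        """Store new content, or re-store changed content under the same id."""
        if item_id is None:
            item_id = self._next_id
            self._next_id += 1
        self.content[item_id] = text
        # Every save re-evaluates every category: join the relevant ones,
        # silently leave the ones that no longer apply.
        for category, matches in self.rules.items():
            if matches(text):
                self.index[category].add(item_id)
            else:
                self.index[category].discard(item_id)
        return item_id

    def categories_of(self, item_id):
        return sorted(c for c, ids in self.index.items() if item_id in ids)


# Assumed rules: plain substring checks, nothing cleverer.
rules = {
    "Competitors": lambda t: "foobars gadgets" in t.lower(),
    "ourOldFangledWidget": lambda t: "similar to ouroldfangledwidget" in t.lower(),
}
store = ContentStore(rules)

doc = store.save("FooBars Gadgets brochure: similar to ourOldFangledWidget.")
before = store.categories_of(doc)

# The user edits the brochure; the store drops it from the stale category.
store.save("FooBars Gadgets brochure: nothing like our widget after all.", item_id=doc)
after = store.categories_of(doc)
```

The user never chose a folder in either step. The first save files the brochure under both categories, and the second leaves it under Competitors only.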

Now to the Hungry Disks. A term (the CBDS notion of a folder) can be created either by the user or by the computer. Any file stored to the unspecified area is checked by a number of high-level terms (Music, Documents, Programs), each of which checks whether the file is relevant to it. For instance, Cubase should exist in both Music and Programs (or sub-terms thereof, such as Programs|Audio). Moreover, each term can trigger a set of related terms and sub-terms, to see whether they also find the data relevant to their topic.

So taking the MP3 example, the file representing data called Rachmaninov Piano Concerto No.3 is put into the unspecified area. Instantly, the Music term recognises it through the file's ID3 tags and loads the file location into the Music index. Music has a number of sub-terms: MP3, WAV, WMA, Jazz, Rock, Classical. The terms that find the file relevant are MP3 and Classical. Classical has two sub-terms, PerformedBy and Composer. Composer fires the Rachmaninov sub-term, Rachmaninov fires 19th Century and 20th Century in the History high-level term, and so on.

A quite contrived example, but the point is that this music file can be found in Music, MP3, Classical, Composers, Rachmaninov, or even History/19th Century and History/20th Century. But the user didn't have to do anything.
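The cascade of hungry terms in the Rachmaninov example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Term` class and its `offer` method are invented here, a plain dictionary stands in for the ID3 tags (no real tag parsing is done), and the related History terms simply accept whatever Rachmaninov passes to them.

```python
class Term:
    """Hypothetical 'hungry' term: it claims any item it finds relevant."""

    def __init__(self, name, wants, subterms=(), related=()):
        self.name = name
        self.wants = wants              # predicate over a metadata dict
        self.subterms = list(subterms)
        self.related = list(related)    # terms elsewhere in the graph
        self.index = []                 # references to the items claimed

    def offer(self, ref, meta, path=""):
        """Offer an item; return the paths of every term that claimed it."""
        # Skip items already claimed (this also guards against cycles).
        if ref in self.index or not self.wants(meta):
            return []
        self.index.append(ref)
        here = f"{path}/{self.name}" if path else self.name
        claimed = [here]
        for sub in self.subterms:       # sub-terms get a look next
            claimed += sub.offer(ref, meta, here)
        for rel in self.related:        # then related terms are fired
            claimed += rel.offer(ref, meta)
        return claimed


always = lambda meta: True
c19 = Term("History/19th Century", always)
c20 = Term("History/20th Century", always)
rach = Term("Rachmaninov", lambda m: m.get("composer") == "Rachmaninov",
            related=[c19, c20])
composer = Term("Composer", lambda m: "composer" in m, subterms=[rach])
performed = Term("PerformedBy", lambda m: "performer" in m)
classical = Term("Classical", lambda m: m.get("genre") == "Classical",
                 subterms=[performed, composer])
jazz = Term("Jazz", lambda m: m.get("genre") == "Jazz")
mp3 = Term("MP3", lambda m: m.get("format") == "mp3")
music = Term("Music", lambda m: m.get("format") in {"mp3", "wav", "wma"},
             subterms=[mp3, jazz, classical])
documents = Term("Documents", lambda m: m.get("format") in {"pdf", "html", "doc"})

# Stand-in for the file's ID3 tags; "blob-0001" is a made-up reference.
meta = {"format": "mp3", "genre": "Classical", "composer": "Rachmaninov",
        "title": "Piano Concerto No.3"}
claimed = []
for top in (music, documents):          # the high-level terms all get a look
    claimed += top.offer("blob-0001", meta)
```

After the loop, `claimed` lists Music, Music/MP3, Music/Classical, the Composer and Rachmaninov sub-terms beneath it, and both History terms. Documents, Jazz and PerformedBy looked at the file and left it alone.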

This is an extremely powerful concept, and one that takes a fair bit of getting your head around. But let's take another business example: you mail a supplier, Jones Bearings, to say that the shipment of ball bearings you requested, dated the 16th August, has not arrived.

You can guess where I'm going with this. Instead of storing this in a pointless Sent Items file (after all, you already knew they were sent), an email program with a CBDS would automatically store the reference in Supplier correspondence, in Jones Bearings, in 16th August, in Week Ending 22nd August, in Late deliveries, and so on.

However, you can also jump from one term to a related one, so in 16th August I may have a reference to a file that is related to Smith Washers, and I can jump to that.
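Jumping between terms falls out naturally once every term is just an index of references. The Python sketch below is one possible reading, not a description of any real mail client: the helper names and the `email-42` and `invoice-7` references are invented, and two terms count as "related" simply because they share at least one item.

```python
from collections import defaultdict

index = defaultdict(set)   # term -> references filed under it


def file_under(ref, *terms):
    """File one reference under any number of terms at once."""
    for term in terms:
        index[term].add(ref)


def items_in(term):
    return sorted(index.get(term, set()))


def terms_for(ref):
    return sorted(t for t, refs in index.items() if ref in refs)


def related_terms(term):
    """Terms you can jump to from here: those sharing an item with this one."""
    out = set()
    for ref in index.get(term, set()):
        out.update(terms_for(ref))
    out.discard(term)
    return sorted(out)


# The late-delivery email from the example above.
file_under("email-42", "Supplier correspondence", "Jones Bearings",
           "16th August", "Week Ending 22nd August", "Late deliveries")
# A made-up Smith Washers file that happens to carry the same date.
file_under("invoice-7", "Smith Washers", "16th August")
```

From 16th August both items are visible, so Smith Washers is one jump away. From Jones Bearings it is not directly reachable, because the two suppliers share no item of their own.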

The point is that, to a CBDS, it's all a reference to some set of bytes. Those bytes could be a proper file like a Word document, an email, or an anniversary reminder; it doesn't matter. It removes the need for storing things in files and folders, in particular places, with certain names or suffixes, and gets to the heart of the matter: I have some data and I either want to read it, write it, delete it, create something new, or find it.

It remains to be seen whether CBDS (or something similar to it) will actually arrive.

About the Blogger

Calum Shaw-Mackay is an architect on Java and Jini systems, working in the UK. His interests lie in distributed computing, adaptability, and abstraction. Calum has been using Jini for longer than he would care to mention. His main area for taking the blame (some people would call it 'expertise') is systems integration and distributed frameworks, and he is an advocate of using Jini's unique strengths to build adaptable enterprise systems. His opinions are his own. He's tried to get other people to take his opinions off him, but they just won't.

