Anyone who has read David Gelernter's The Second Coming: A Manifesto will have heard of the concept of filesystems that grab the data they want and categorise files based on their content. But how near is the reality of Content Based Data Stores (CBDS)?
Notice I said Content Based Data Stores, not Content Based File Systems, which is perhaps a more familiar term but is actually incorrect. You see, in the Manifesto, Gelernter outlines the problems associated with the idea of files and directories as the current realisation of data storage. Firstly, the correlation between disk files and paper files, disk directories and filing cabinets, is flawed, because a computer can take action on these collections of bytes rather than just being a passive container for them. You may say, well, programs load data and take action, possibly storing the results differently. But what if the file system itself could take action as the bytes were being stored?
If it weren't for the twin concepts of separation and collision, the filing cabinet system of storing files might never have arisen, or at least not in the way that it has. Filing cabinets, folders and individual files are good for us techies, the programmers amongst you especially. I have two projects, ProjectA and ProjectB, and I need to enforce separation, so I put them in two different folders. I have a file called automate.c for both a COM component and an attached system device, so I'll have com/automate.c and dev/automate.c. This separates the files and removes the collision, with the side effect of giving me rudimentary namespaces. What's wrong with this, I hear you cry?
Imagine now a 55-year-old man at his PC, listening to an MP3 of Rachmaninov. Does he care that it is an MP3? Does he care that the file is called PnoConcertoNo3.mp3? No. What he cares about is that it is music he likes to listen to.
Now let's take another scenario. Imagine a marketing person: they can talk the talk, they're computer literate. They download a brochure file for a competitor's product.
What do they call it? Where do they put it? There is actually no simple answer. I'll explain why in a moment.
The reason the cabinet/file concept fails us today is that it suits programming, where there is no room for ambiguity and no multiple meanings, so the programmer's concept of separation to remove collisions is a good one. I know that dev/automate.c's sole purpose is the automation of the device. Everyday data is rarely that unambiguous.
Now let's go back to the brochure. What do I call it? People get hung up on names (brochure.pdf, someNewFangledGadgetV1.html, whatever), but it really shouldn't matter. What matters is this:
It's a competitor's product
The product is called someNewFangledGadget Version 1
The competitor is called FooBars Gadgets
It does a similar job to ourOldFangledWidget
So the file could go in Competitors, Products, FooBars Gadgets, or ourOldFangledWidget.
But we currently have to choose one. I know we could use symlinks, but that really isn't the point.
Now let's go back to the man. He likes his music: he likes piano concertos, he likes classical music, and he likes Rachmaninov. The only thing that combines all of these is Music, so if we have a folder called Music, he could just put a file in there called RachmaninovPnoConcertoNo3.mp3 and he's on his way. Or is he?
You see, a CBDS really comes to the fore when you're searching for something. My Program Files folder contains over 32000 files and 2481 directories, and believe me, that isn't very much (my basic 'Eclipse' install has 2831 files and 539 directories). Is this really a good thing? Most of the time I don't even care what these files are called. Another problem with the filing metaphor is its inherent tree structure, going from the general to the specific: from d:\development to d:\development\java\j2sdk1.4.2_05\jre\bin\java.exe. Effectively, java.exe belongs to bin (and no other), which in turn belongs to jre (and no other), and so on.
What really needs to happen for a CBDS to be realistic is to change the concept from a tree to a graph. This would allow us to at least realise some of the things in Gelernter's paper: a file can have one name, many names, or no name, and files can share a name; a file can exist in one directory, many directories or no directory.
Instead of storing the data to a directory, the file is just stored to an area. All folders are replaced by categories, and of course these will have sub-categories, but none of these are mutually exclusive. So brochure.html could be stored simultaneously in Competitors, Products, FooBars Gadgets and ourOldFangledWidget, because the filesystem realises that these are all relevant. If the user changes the file such that it is clear that the product is actually not like ourOldFangledWidget, then the file would be removed from that category automatically. Our brains work in a similar way: we remember things based on what happened or what we did, not by some kind of symbolic name we give to the event. In other words, a CBDS will check your data as it is stored (take action) and store it according to the metadata and the content. The user will not have to figure out how to store it or where to store it, and maybe will not even have to give it a name. This is basically forgoing the concept of having indexes for files and folders, and instead using the indexes themselves as the way of storing and retrieving data.
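To make the tree-versus-graph point a little more concrete, here is a minimal sketch in Python. Everything in it (the CategoryStore class, its method names, the in-memory dictionaries) is invented for illustration and is not taken from Gelernter's paper or from any real filesystem. It only shows that an item can have any number of names and sit in any number of categories at once.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Item:
    """A set of bytes identified by an id, not by a path or a name."""
    item_id: int
    data: bytes
    names: set = field(default_factory=set)  # zero, one or many names


class CategoryStore:
    """Hypothetical store: categories index items, no item lives 'inside' one."""

    def __init__(self):
        self._ids = count(1)
        self.items = {}                  # item_id -> Item
        self.members = defaultdict(set)  # category -> set of item ids

    def put(self, data, names=(), categories=()):
        item = Item(next(self._ids), data, set(names))
        self.items[item.item_id] = item
        for category in categories:
            self.members[category].add(item.item_id)
        return item.item_id

    def uncategorise(self, item_id, category):
        # The item itself is untouched; only the index entry goes away.
        self.members[category].discard(item_id)

    def find(self, category):
        return sorted(self.members.get(category, set()))

    def categories_of(self, item_id):
        return sorted(c for c, ids in self.members.items() if item_id in ids)


store = CategoryStore()
brochure = store.put(
    b"<html>...</html>",
    names={"brochure.html"},
    categories=["Competitors", "Products", "FooBars Gadgets", "ourOldFangledWidget"],
)
unnamed = store.put(b"some bytes")  # no name and no category, but still stored

# Assume some content check has decided the brochure no longer resembles our widget.
store.uncategorise(brochure, "ourOldFangledWidget")
print(store.categories_of(brochure))
```

The interesting part, deciding which categories are relevant from the content, is deliberately left out here; the sketch only covers the storage shape.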
Now to the Hungry Disks. A term (the CBDS notion of a folder) is created either by the user or by the computer. Any file stored to the unspecified area is checked by a number of high-level terms (Music, Documents, Programs), each of which checks whether the file is relevant to it. For instance, Cubase should exist in both Music and Programs (or sub-terms thereof, such as Programs|Audio). Moreover, each term can trigger a set of related terms and sub-terms, to see whether they also find the data relevant to their topic.
So taking the MP3 example, the file representing data called Rachmaninov Piano Concerto No.3 is put into the unspecified area. Instantly, the Music term recognises it through the file's ID3 tags and loads the file location into the Music index. Music has a number of sub-terms: MP3, WAV, WMA, Jazz, Rock, Classical. The terms that find the file relevant are MP3 and Classical. Classical has two sub-terms, PerformedBy and Composer. Composer fires the Rachmaninov sub-term, Rachmaninov fires 19th Century and 20th Century in the History high-level term, and so on.
A quite contrived example, but the point is that this music file can be found in Music, MP3, Classical, Composer, Rachmaninov or even History/19th Century and History/20th Century. And the user didn't have to do anything.
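The cascade of terms firing sub-terms and related terms can be sketched in a few lines of Python. This is only an illustration under some loud assumptions: the Term class and the grab function are made up, the metadata dictionary stands in for whatever an ID3 reader would return, and nothing guards against related terms that point back at each other.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Term:
    """Hypothetical 'hungry' term: a name plus a test for relevance."""
    name: str
    is_relevant: Callable[[dict], bool]
    sub_terms: list = field(default_factory=list)  # narrower terms
    related: list = field(default_factory=list)    # terms elsewhere in the graph


def grab(term, meta, path=""):
    """Return the path of every term that claims the item described by meta."""
    if not term.is_relevant(meta):
        return []
    here = f"{path}/{term.name}" if path else term.name
    hits = [here]
    for sub in term.sub_terms:
        hits += grab(sub, meta, here)
    for rel in term.related:
        hits += grab(rel, meta)  # related terms start from their own root
    return hits


# Related terms fire unconditionally once the Rachmaninov term has fired.
c19 = Term("History/19th Century", lambda m: True)
c20 = Term("History/20th Century", lambda m: True)
rachmaninov = Term("Rachmaninov", lambda m: m.get("composer") == "Rachmaninov",
                   related=[c19, c20])
composer = Term("Composer", lambda m: "composer" in m, sub_terms=[rachmaninov])
classical = Term("Classical", lambda m: m.get("genre") == "Classical",
                 sub_terms=[composer])
mp3 = Term("MP3", lambda m: m.get("format") == "mp3")
jazz = Term("Jazz", lambda m: m.get("genre") == "Jazz")
music = Term("Music", lambda m: m.get("kind") == "audio",
             sub_terms=[mp3, jazz, classical])

# Stand-in for metadata read from the file's tags (assumed, not parsed here).
meta = {"kind": "audio", "format": "mp3", "genre": "Classical",
        "composer": "Rachmaninov", "title": "Piano Concerto No.3"}

for hit in grab(music, meta):
    print(hit)
```

Each term only needs to know how to recognise its own data; the user never chooses a location.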
This is an extremely powerful concept, and one that takes a fair bit of getting your head around. But let's take another business example: you mail a supplier, Jones Bearings, to say that the shipment of ball bearings you requested, dated the 16th August, has not arrived.
You can guess where I'm going with this. Instead of storing this in a pointless Sent Items file (after all, you already knew it was sent), an email program with a CBDS would automatically store the reference in Supplier correspondence, in Jones Bearings, in 16th August, in Week Ending 22nd August, in Late deliveries, and so on.
However, you can also jump from one related term to another, so in 16th August I may have a reference to a file that is related to Smith Washers, and I can jump to that.
The point is that, to a CBDS, it's all a reference to some set of bytes. Those bytes could be a proper file, like a Word document, or they could be an email or an anniversary reminder; it doesn't matter. It removes the need for storing things in files and folders, in particular places, with certain names or suffixes, and gets to the heart of the matter: I have some data and I either want to read it, write it, delete it, create something new, or find it.
It remains to be seen whether CBDS (or something similar to it) will actually arrive.
Isn't this what the Jini lookup server as well as JavaSpaces can enable? I.e. can't you just start putting entries into a JavaSpace that anchor a filename with a bunch of attributes that you can search with? :-)
And, some people will suggest RDBMS or WS-Somesuch. But, I'm partial to Jini, so that's how I'd solve the problem, particularly since there are other things that fall out of using Jini...
For my sins, I've been considering something along these lines, simply because it sounds so interesting. My main thoughts revolved around a set of VFS nodes as services, with the context indexes being stored and manipulated through a space. i.e. the 'FAT table' being stored in the space with the physical files using the VFS node 'disk'
For the 'undefined' area, using a file poller that enabled remote events to be registered seemed like a pretty good way to go. This works pretty well for standard files (although I have to say I haven't really looked at FileSystemView - mental note made!). Existing metadata really isn't that difficult to handle, as we have control over the entries representing the files. For the internal content of, say, Word documents, some sort of content reader service through OpenOffice.org could work. However, for more aesthetic or unidentifiable files, such as photographs, we would need to fall back on a user-supplied content description. But I agree, Jini is a perfect vehicle for this kind of thing - if I don't care about where to store it, I don't actually care where my disk is, or what the filesystem is.
As I understand it, you are saying that you want a "logical-physical" independence. That is, the physical file can be stored anywhere, but when a file's contents belong to a category, then we should be able to see it easily.
Well, one of the ways of doing that is by using a relational database that has a logical-physical separation as one of its foundational principles. I see no reason why the contents of any file of any type can not be stored in a database, using predefined keywords to allocate it to one or more searchable categories.
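As a rough illustration of the relational approach suggested here, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names and the two helper functions are assumptions made up for this example; the category names are borrowed from the brochure example in the post. A join table gives the many-to-many relationship between items and categories.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id      INTEGER PRIMARY KEY,
    content BLOB NOT NULL
);
CREATE TABLE category (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
-- Join table: one item in many categories, one category holding many items.
CREATE TABLE item_category (
    item_id     INTEGER NOT NULL REFERENCES item(id),
    category_id INTEGER NOT NULL REFERENCES category(id),
    PRIMARY KEY (item_id, category_id)
);
""")


def store(content, categories):
    """Store the bytes once and link them to every named category."""
    item_id = conn.execute(
        "INSERT INTO item (content) VALUES (?)", (content,)
    ).lastrowid
    for name in categories:
        conn.execute("INSERT OR IGNORE INTO category (name) VALUES (?)", (name,))
        (category_id,) = conn.execute(
            "SELECT id FROM category WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO item_category VALUES (?, ?)",
            (item_id, category_id),
        )
    return item_id


def items_in(name):
    """Return the ids of all items linked to the named category."""
    rows = conn.execute(
        """SELECT ic.item_id
           FROM item_category AS ic
           JOIN category AS c ON c.id = ic.category_id
           WHERE c.name = ?
           ORDER BY ic.item_id""",
        (name,),
    )
    return [row[0] for row in rows]


brochure = store(b"<html>...</html>", ["Competitors", "Products", "FooBars Gadgets"])
price_list = store(b"our prices", ["Products"])
print(items_in("Products"))
```

Note that the categories here are still supplied by the caller, which is exactly the 'predefined keywords' question raised in this thread; the schema says nothing about who decides them.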
At another level, it seems that the cause of the problem is the fact that a file system has a predefined rigid hierarchy. It is the rigid hierarchy that seems to be causing the problem.
It is interesting to observe that Object Oriented Design is based on designing a rigid hierarchy of classes, inheritance being one of the key features of OO Design.
If rigid hierarchies in file systems cause problems, why do they not cause problems in OO applications?
> Well, one of the ways of doing that is by using a relational database that has a logical-physical separation as one of its foundational principles. I see no reason why the contents of any file of any type can not be stored in a database, using predefined keywords to allocate it to one or more searchable categories.
Well, it's funny you should say that - Oracle's IFS (I think it's now called iFiles) supports the creation of filesystems through a database, and indeed file format translation. But you are still fundamentally addressing a generic-to-specific, i.e. tree-based, filesystem, because the current user view of a file system is presented in that fashion.
Yes, you need some form of logical-physical separation, but no more, in concept, than we have at the moment: the FAT table holds all the initial indexes into the disk, with pointers to cylinders and sectors, etc., for the physical data. One of my issues is the use of 'predefined keywords' - such things shouldn't really exist. The filesystem should determine these keywords and allocate them automatically, based on the content, and should only fall back on user-supplied metadata if the content cannot be sufficiently interpreted (for instance, holiday photographs).
Specifically, modelling a CBDS with a relational database can be difficult. A CBDS needs a table structure that can simultaneously model multiple cardinalities (one-to-one, one-to-many, many-to-one, many-to-many), with these relationships potentially changing rapidly. It is effectively rendering content-based indexes as folders in real time, where items can move and grow from term to term as the data changes or as other data provides more relevancy.
> At another level, it seems that the cause of the problem is the fact that a file system has a predefined rigid hierarchy. It is the rigid hierarchy that seems to be causing the problem.
It's not only that, but it's one of the issues with XML - indeed one of the reasons why hierarchical databases fell out of fashion and led to the rise of relational databases - generic data (after all you can write anything about anything into a word processor) is not hierarchical, and it is not distinctly, always, relational either.
> It is interesting to observe that Object Oriented Design is based on designing a rigid hierarchy of classes, inheritance being one of the key features of OO Design.
Ahh, but the concepts of aggregation and composition remove the strictly hierarchical nature of inheritance. Interfaces provide limited concepts of multiply inherited object protocols as well.
I always like to think of inheritance, aggregation and interfaces as the three a's (pron: ah's), 'is a', 'has a', and 'provides a'
> If rigid hierarchies in file systems cause problems, why do they not cause problems in OO applications?
But think about the point I made regarding why file systems became the way they are - programmers like separation, avoidance of collision, and hate ambiguity. That's why hierarchies are good for programmers - eventually you get down to the most specific level where you can guarantee the uniqueness of this source file.
I like to blog about things like this, because it really gives you something different to think about, and can provoke quite an interesting discussion. I liked your comments on the parity of hierarchical file systems to OO hierarchies, perhaps something could be gleaned from this - would filesystems be better if files and folders were true objects, allowing concepts like inheritance, protocols and composition?
ReiserFS has some rather interesting capabilities in it, and in Hans's thoughts for the future. The unfortunate thing is that ReiserFS is not a portable solution. It is locked into Linux, so to speak. More importantly, I think the level that it is integrated at creates certain problems for making it possible for applications to create the capabilities I describe below.
I think that the filesystem is a convenient way to separate files. Calum is suggesting that we need a way for the user to find files by using content information.
Imagine, if you will, that when a user wants to find some information (we currently say open a file), that a Google-like search box opened instead, and they typed in some keywords, and a list of matches came back. They then would select the match that they wanted and continue.
Currently, we use file extensions to constrain the user to using data that is of the correct type for a particular application (in general, I know that applications such as emacs and vi have no such limitations). Calum is talking about adding to such groupings, more information.
Every time a file is saved to disk, it would be indexed by the 'Google-like' indexer so that anytime I wanted to view a file, I could ask for it based on things inside of the file, instead of using the rather brief attributes of directory, filename and extension...
> I think that the filesystem is a convenient way to separate files. Calum is suggesting that we need a way for the user to find files by using content information.
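A toy version of that index-on-save idea might look like the following Python sketch. It is only an illustration: the ContentIndex class and its on_save hook are invented names, the text is assumed to have been extracted from the file already, and matching is plain whole-word lookup with none of the ranking a real search engine would do.

```python
import re
from collections import defaultdict


def words_of(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ContentIndex:
    """Hypothetical inverted index, refreshed every time an item is saved."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of items containing it

    def on_save(self, item_id, text):
        # Forget what the previous version contained, then index the new text.
        for ids in self.postings.values():
            ids.discard(item_id)
        for word in set(words_of(text)):
            self.postings[word].add(item_id)

    def search(self, query):
        """Return the ids of items containing every word in the query."""
        words = words_of(query)
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


index = ContentIndex()
index.on_save("mail-1", "The shipment of ball bearings dated 16th August has not arrived")
index.on_save("mail-2", "Washers order confirmed for 16th August")

print(index.search("bearings august"))
print(index.search("16th august"))

# Saving a changed version replaces the old index entries for that item.
index.on_save("mail-2", "Washers order cancelled")
print(index.search("august"))
```

The 'open file' box then only needs to call search with whatever the user typed, rather than asking for a directory, filename and extension.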
That is one way of looking at it - the filesystem concept is convenient as a way of physically storing and separating files, but the tree-based approach forces mutual exclusion between folders - an example of this is two filing cabinets both holding a file about the same person, but they are not physically holding the same information.
Sometimes you want varying degrees of separation between files and folders, because their content is related, so contextually six files may be related very tightly in their content but may actually be stored on separate disks or machines, with only a couple of them residing in the same directory to prompt a user that they may, in fact, be relevant to each other. The hungry disk/indexFS concept would allow for this context relationship to be easily presented to a user.
> Imagine, if you will, that when a user wants to find some information (we currently say open a file), that a Google-like search box opened instead, and they typed in some keywords, and a list of matches came back. They then would select the match that they wanted and continue.
This is exactly right. The OpenDialog metaphor is replaced with a search function and a 'directory view', but instead of displaying root directories, it shows the major generic terms, with each 'click-into' adding to the search criteria rather than being the exclusion filter that we currently have.
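One way to picture the 'click-into' behaviour is as a running intersection over a term index, sketched below in Python. The narrow function and the sample index are hypothetical, and the index is assumed to exist already (built by whatever does the content analysis). Each click adds a term, the matches are the items carrying every selected term, and the view offers only the terms that would still lead somewhere.

```python
def narrow(index, selected):
    """Return (matching items, terms still worth offering).

    index maps a term name to the set of item ids filed under it;
    selected is the list of terms the user has clicked into so far.
    """
    matches = None
    for term in selected:
        ids = index.get(term, set())
        matches = set(ids) if matches is None else matches & ids
    if matches is None:
        # Nothing clicked yet: every indexed item is a candidate.
        matches = set().union(*index.values())
    next_terms = sorted(
        term for term, ids in index.items()
        if term not in selected and ids & matches
    )
    return matches, next_terms


# Sample index using terms from the post; the item ids are made up.
index = {
    "Music": {"concerto.mp3", "cubase"},
    "Programs": {"cubase", "eclipse"},
    "Classical": {"concerto.mp3"},
    "Rachmaninov": {"concerto.mp3"},
}

print(narrow(index, []))
print(narrow(index, ["Music"]))
print(narrow(index, ["Music", "Programs"]))
```

Unlike walking a directory tree, clicking Music does not hide Programs: Cubase is reachable through either term, in either order.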
> Currently, we use file extensions to constrain the user to using data that is of the correct type for a particular application (in general, I know that applications such as emacs and vi have no such limitations). Calum is talking about adding to such groupings, more information.
Yes, because filenames say more about how we use files than about what their content represents. Say you create a business proposal cost breakdown called 'proposal.xls'. The system knows that it is an Excel file from the extension, and thus which application the user wants to use to view it, yet it actually knows nothing about what 'proposal.xls' contains. We often use the parent folder to provide a context for files, but files can have multiple contexts, and these contexts apply to differing degrees.
> Every time a file is saved to disk, it would be indexed by the 'Google-like' indexer so that anytime I wanted to view a file, I could ask for it based on things inside of the file, instead of using the rather brief attributes of directory, filename and extension...
Exactly, but the 'utopian' view, if you will, is that when you save the document, there really isn't a folder you store it to. You might give it a name just for your own peace of mind, and the combination of name and meta-information allows unique files to be stored even though many of these files might have the same name. Instead of giving it a directory context, you can just add some meta-information, and off the system goes and stores it using the indexes. There is also the idea that the indexer has multiple 'grabbers' for the groupings, where these 'directories' jump out and 'grab' the files most relevant to them.
The other thing is: how many times do you forget where you put a file, or just put a file in a big 'dumping ground' of a folder ('My Documents' anyone?)?