The Artima Developer Community
Sponsored Link

The C++ Source
Wild-card Searches of UNIX Directories with Random-Access Iterators
by Matthew Wilson
September 13, 2004

<<  Page 3 of 3

Advertisement

Implementation

Despite the fact that glob_sequence provides a more refined iterator concept (Random Access [7]) than does readdir_sequence (Input Iterator [8]), its implementation (see globsequencemethods.h) is conceptually simpler. This is because glob() returns an array of pointers to entries, hence, the const_iterator type is char const** [9]. Because it provides Random Access iterators, glob_sequence is able to also provide a subscript operator, within which the requested index is subject to a validity assertion, and also the rbegin() and rend() functions that provide reverse iterators to the managed range. All we would need to do in a dumb mapping is provide pointers to the first (&gl_pathv[0]) and one-past-the-end (&gl_pathv[gl_pathc]) locations in the array for our begin() and end() iterators.

Because we want to be able to select files and/or directories, and trim dots directories, we need to put in a little more effort. As demonstrated in [1] and [3], when dealing with sequential APIs such as opendir()/readdir() on UNIX, or FindFirstFile() on Win32, it's relatively straightforward to remove an element from the enumerated range: just call (readdir() or FindNextFile()) again. Such logic can be added to the iterator class. But glob() returns the entire unfiltered set en bloc, and to continue to support the Random Iterator concept we need to ensure that the pointers to all the selected entries are stored contiguously. This means we have to manipulate the pointers in situ.

All the action takes place in the glob_init_() member function, called by the two constructors. Each constructor initialises the m_flags members via the validate_flag_() static method, which adds both files and directories to the flags in the case where neither is specified in the constructor arguments. glob_init_() performs three main functions. First, it concatenates the directory, if specified, and the search pattern, and translates the directory to an absolute path if absolutePath is specified.

Second, it calls glob(), presenting the composite directory+pattern and the appropriate translation of the glob_sequence constructor flags (markDirs, noSort, files/directories) to those of the glob() API. If the call to glob() fails, a glob_sequence_exception is thrown.

The final role of glob_init_() is to process the results. This is done in four parts. First, if there are to be any changes, the entry pointers are copied into the member buffer m_buffer, which is an instance of stlsoft::auto_buffer [10]. This is to allow us to manipulate the array of pointers without altering the contents of the buffer returned by glob(), which would be very poor form [11].

Second, if the dots directories are to be elided, the function searches through the array of pointers until it's found them both. For each one found the pointer is swapped with the one at the base of the array, and then the base advanced one and the item count dropped. In the first implementation I simply incremented m_base past "." and ".." if they were present, since they were always the first two (or one) entries in the array. Unfortunately, this was only an artefact of my test environment [12], and when I moved to a different platform I quickly saw the error of my ways. This is an important point: when doing cross-platform development, you must test on several platforms.

Since glob() will append a path name separator to directories if asked, we cannot simply test the entry names for equivalence to "." or ".." using the filesystem_traits::is_dots() function we saw in the implementation of readdir_sequence. Instead, the member functions is_dots_maybe_slashed_() and is_end_of_path_elements_() provide the requisite functionality, and enable the processing to be terminated once both dots directories are found in a given set of results.

The third part of the filtering is to elide directories. Since glob() supports the GLOB_ONLYDIR to request only directories, we can trust it to not return files when we've not asked for them with the files flags. There's no flag to ask glob() to return only files, however, so we need to do that ourselves. As you saw in globsequencemethods.h, this is achieved in the same manner as for the dots elision. This is done in one of two ways. If we asked for directories to be marked with a trailing path name separator (with the markDirs flag) then we simply test whether the entry name is so marked. If it is, then we do the swap/advance. If not, we test it with stat(), and swap/advance if it's not a file.

All this swapping and advancing is quite efficient, but it leaves the entries out of their original order. Hence, the last part of the filtering task for glob_init_() is to call std::sort() on the remaining entries if the noSort flag was not specified, and if the resultant number of entries is different to that returned by glob().

The implementation is available online [2] as part of the STLSoft libraries, which includes sample/test programs. This class is also used within the UNIX implementation of the recls library [13].

[See the sidebar, Taking Exception to Error Handling for more information on glob_sequence and exception safety.]

Summary

You may be wondering, given the power of glob() / glob_sequence, why should one ever want to use readdir() / readdir_sequence? There are several reasons:

  1. glob() is implemented in terms of readdir(). Consequently, it's inevitable that if your requirement is more suitable to readdir(), then using readdir() / readdir_sequence directly will be more efficient.
  2. As we've seen, glob() returns all the matching entries in an array. Naturally, the search has to be conducted to completion before any results may be returned. If you are looking for a subset of entries in a given search, it may well be more efficient to be able to search and process them in a step-wise fashion with readdir(), especially if the entries you are after may occur earlier in the search.
  3. The last reason is a bit more subtle. Since glob() (or glob_sequence) accepts a pattern that may contain wildcards, and those wildcards may also appear in the directory parts of the given pattern, it increases the complexity of your program. If you want to search a given directory, then you constrain this search more clearly by using readdir(), which will balk at being given any wildcards.

Notwithstanding these specific aspects, glob() is a very powerful and useful tool on UNIX. Using it via glob_sequence affords all the advantages over the raw API demonstrated here, and costs you nothing.

Acknowledgements

Thanks to Greg Peet, Walter Bright, and Bjorn Karlsson for reviewing an early draft and providing their customary essential insightful observations and helpful comments.

Notes and references

  1. Reading UNIX Directories via STL-compliant Sequences, Matthew Wilson, The C++ Source, Volume 1, Number 1, June 2004.
  2. UNIXSTL is a sub-project of STLSoft, providing STL-compliant software for UNIX operating system APIs. It is located at http://unixstl.org/. STLSoft itself resides at http://stlsoft.org/.
  3. In Adapting Windows Enumeration Models to STL Iterator Concepts, Windows Developer Network, March 2003, I described the mapping of the Win32 FindFile enumeration API to STL iterators, in the form of the WinSTL (http://winstl.org) (basic_)findfile_sequence<> class. These three file enumeration classes represent an interesting spectrum of models: readdir_sequence does not take wildcards and returns all files and/or directories in a single directory; findfile_sequence does the same, but can specify wildcards for filtering the entries; glob_sequence may use wildcards in the entry names and in the directory names. If you keep wildcards out of the directory when using glob_sequence, then it has virtually identical semantics as findfile_sequence, and the two may be used to provide the file-searching functionality in cross-platform developments: client code can remain largely unchanged, save for the selection of the requisite component from its namespace [5, 6]
  4. The C++ Programming Language, Special Edition, Bjarne Stroustrup, Addison-Wesley, 1997, p. 366.
  5. Herb Sutter demonstrates this technique in items 13 and 14 of More Exceptional C++, Addison-Wesley, 2002.
  6. Flexible C++ #6: Flexible Implementations Without Using Directives, Matthew Wilson, C/C++ Users Journal Experts Forum, April 2004
  7. http://www.sgi.com/tech/stl/RandomAccessIterator.html
  8. http://www.sgi.com/tech/stl/InputIterator.html
  9. You can see in enumwithglob.cpp that the const_iterator is actually defined as stlsoft::pointer_iterator<value_type const , const_pointer , const_reference >::iterator_type. This is to ensure that it is compatible with standard library implementations whose random-access iterators are actually implemented as class objects. This could be the subject of an entire article in itself—and probably will be in the future—so you?ll have to trust me for now. :-)
  10. Efficient Variable Automatic Buffers, Matthew Wilson, C/C++ Users Journal, Volume 21 Number 12, December 2003. auto_buffer<> is an optimised dynamically (re-)sizable memory buffer, suitable for manipulating arrays of POD types. In most cases, its use in glob_sequence won't result in any memory allocation anyway, since the internal array size is set to 64 (see globsequence.h in this article).
  11. Note that my original implementation did exactly this, based on the assumption that no implementation of globfree() is not going to check that the order of the pointers in the gl_pathv buffer. It seemed too fantastic to me to imagine that any implementation would. Note that I was not so cavalier with the actual contents of the pointer array. Specifically they were swapped in situ, rather than, say, setting to NULL the ones selected out, in case the glob() implementation allocated the storage for each entry individually, rather than in a single block. Nonetheless, a few years older and wiser, and this assumption seemed much less smart. Making assumptions about the insides of APIs you've not implemented is a very bad habit to get into, and this instance certainly didn't warrant taking the risks.
  12. In order to have total control over the order of returned items in the globbing and to work in my development environment of choice—Visual Studio 98 IDE (albeit using compilers other than VC 6)—I wrote an implementation of glob() for Win32 using the FindFirstFile() API, available online at http://synesis.com.au/software/index.html#unixem. The initial version returned "." and ".." as the first two items—because FindFirstFile() orders the matched items—hence my consternation when I moved the class and test program to one of my Linux boxes.
  13. The recls ("recursive ls") library ( http://recls.org/) is a platform-independent search library, which provides recursive enumeration of file-system and, as of version 1.5, FTP systems. It has a C-API, and provides mappings to several languages/technologies, including C++, C#/.NET, COM, D, Java, and STL. It has been the exemplar for the first few installments of my Positive Integration column, for C/C++ Users Journal (http://www.cuj.com/).
  14. The author(s) of the excellent The Art Of UNIX Programming, (Eric Raymond, et al, Addison-Wesley, 2004) admit that one of the (few) flaws in the design and implementation of UNIX is the lack of a facility for receiving asynchronous notifications of file-system changes. UNIX has no equivalent to the Windows FindFileChangeNotification() function, which means that application code is inherently biased towards looking at file-system contents as static. Given the fact that UNIX focuses, or originally focused, on short-lived discrete tasks, this is understandable. Nonetheless, it is certainly possible for file-system contents to change, and when writing file-system code, one needs to be prepared to handle it.

About the Author

Matthew Wilson is a software development consultant, contributing editor for C/C++ User's Journal, and creator of the STLSoft libraries. He is author of the forthcoming book Imperfect C++ (Addison-Wesley, Oct 2004), and is currently working on his next two books, one of which is not about C++. Matthew can be contacted via http://imperfectcplusplus.com/.

<<  Page 3 of 3


Sponsored Links



Google
  Web Artima.com   
Copyright © 1996-2014 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use - Advertise with Us