Wild-card Searches of Unix Directories with Random-Access Iterators

by Matthew Wilson

September 13, 2004

Summary

STL meets glob(): Power, robustness, and genericity without sacrificing efficiency.

Wild-card Searches of UNIX Directories with Random-Access Iterators Summary

This article is the second of a two part series looking at techniques for adapting UNIX file-system enumeration APIs to STL-like sequences. In the first article [1] I described an adaptation of the UNIX opendir()/readdir() API to readdir_sequence, an STL-like sequence class supporting Input Iterators. readdir_sequence reflects the semantics of opendir()/readdir() in that it supports enumeration of the entries in a single directory. It provides the additional features of being able to select files and/or directories, and the ability to elide the dots directories—"." (current directory) and ".." (parent directory)—from the resultant range if required.

UNIX has another, more powerful, method of searching the file-system: the glob() API. Rather than taking the name of a directory and enumerating its contents, glob() takes a wildcard pattern, such as "/usr/*/std*h/", and returns all matching file-system entries.

`glob()`-ing Manually

We saw in [1] that there are several advantages to using an STL-like sequence class instead of a raw API, and that principle applies equally to using glob(). Consider the task of finding all hidden ("dot") files—such as .bashrc—in a given code directory using the glob() API directly, as shown in the following code.

// enumwithglob.cpp: Enumerating sub-directories using glob()
#include <glob.h>     // glob(), globfree()
#include <sys/stat.h> // stat()

#include <algorithm>  // std::copy
#include <iterator>   // std::ostream_iterator
#include <iostream>   // std::cout, std::endl
#include <string>     // std::string
#include <vector>     // std::vector

using std::copy;
using std::cout;
using std::endl;
using std::ostream_iterator;
using std::string;
using std::vector;

const char HOME[]     = "/home/matty/";
const char PATTERN[]  = ".*";

int main()
{
  vector<string>  dotNames;
  glob_t          gl;

  if(0 == glob((string(HOME) + PATTERN).c_str(), 0, NULL, &gl))
  {
    for(char **begin = &gl.gl_pathv[0], **end = &gl.gl_pathv[gl.gl_pathc]; begin != end; ++begin)
    {
      struct stat st;

      // Skip dots
      if( '.' == (*begin)[0] &&
          ( '\0' == (*begin)[1] ||
            ( '.' == (*begin)[1] &&
              '\0' == (*begin)[2])))
      {
        // do nothing
      }
      else
      {
        if(0 == stat(*begin, &st))
        {
          if(S_IFREG == (st.st_mode & S_IFREG))
          {
            dotNames.push_back((*begin));
          }
        }
      }
    }
    globfree(&gl);
  }

  cout << "Dumping . files in " << HOME << endl;
  copy(dotNames.begin(), dotNames.end(), ostream_iterator<string>(cout, "\n"));

  return 0;
}

As well as being quite a large amount of code, there are several specific problems with it. First, we must concatenate the directory and search pattern, before passing the combined string value as the first argument in the call to glob(). Admittedly, in many cases the two will not be separate, so that's a small thing, but there are other issues are not so trivial.

A successful call to glob() returns a block of memory in the gl_pathv member of the glob_t structure passed as the fourth argument. (glob()'s second and third arguments are used for flags, which I'll discuss later, and a callback error-handling function, which I don't discuss in this article.) The returned memory block contains the paths for all matching entries to the specified pattern at the time of the invocation of the call. In common with many UNIX library functions, the addition of the const keyword to C was too late in the game, so the type of gl_pathv is char**, rather than char const**. Hence, the first significant concern in the given code is that begin and end are declared to be of a non-const pointer type. Although I've obviously avoided it in this case, it's always possible to introduce bugs by writing to non-const pointers.

The second issue is that the memory block must be freed, and this is done by calling the beguilingly named globfree(). This issue of calling paired allocate/release functions represents a classic problem area in C++, since any statements occurring between them may be a source of exception unsafety. Indeed, the code in enumwithglob.cpp is not exception safe since the call to dotNames.push_back() may result in an exception being thrown, in which case globfree() would not be called, and the memory block would leak. To correct this would require inserting try-catch scoping into our sample, making it even more verbose.

The remaining issues with this code are more prosaic, but still detract from readability and maintainability. The elision of the dots directories is a manual process, which is a pain for all but the rare cases when you do actually want them. Finally, each entry must be stat()-ed and the return code of stat() tested along with the resultant flags. (For the non-UNIX folks, stat() is the logical equivalent of Win32's GetFileAttributeEx().)

(I must admit that in this particular case, the specific dots directory elision test is not needed because the stat() test ensures that directories are not added to dotNames vector. But I'm sure you can see how this test could be necessary in the more general case.)

`glob()`-ing using `unixstl::glob_sequence`

That's a lot of work for something conceptually simple. Now contrast that with the sublime simplicity of the functionally equivalent form using the UNIXSTL[2] glob_sequence class [3], defined in the following program.

// enumwithglobsequence.cpp: Enumerating sub-directories using unixstl::glob_sequence
#include <unixstl_glob_sequence.h>  // unixstl::glob_sequence

#include <algorithm>                // std::copy
#include <iterator>                 // std::ostream_iterator
#include <iostream>                 // std::cout, std::endl
#include <string>                   // std::string
#include <vector>                   // std::vector

using std::copy;
using std::cout;
using std::endl;
using std::ostream_iterator;
using std::string;
using std::vector;
using unixstl::glob_sequence;

const char HOME[]     = "/home/matty/";
const char PATTERN[]  = ".*";

int main()
{
  glob_sequence  dir(HOME, PATTERN, glob_sequence::files);
  vector<string> dotNames(dir.begin(), dir.end());

  cout << "Dumping . files in " << HOME << endl;
  copy(dotNames.begin(), dotNames.end(), ostream_iterator<string>(cout, "\n"));
  return 0;
}

I think you'll agree that there's a considerable saving in lines-of-code, about 25:2. Because the value_type of glob_sequence (see globsequence.h) is char const*, for which std::string provides a conversion constructor [4], we can pass the iterators from its begin() and end() methods to the constructor of the std::vector<std::string> instance dotNames.

It's Not Just About Dropping the SLOCs

The advantages afforded by using glob_sequence are more than just a reduction in client code quantity; the other issues of concern are also addressed. Automatic resource deallocation and exception-safety is provided by glob_sequence's destructor calling globfree(); classic Resource Acquisition Is Initialisation [4]. Because we are using RAII, a failure code from glob() results in an instance of glob_sequence_exception (see code below) being thrown from the constructor of glob_sequence, rather than requiring users to test the "validity" of the sequence instance after construction. Here is the class definition for glob_sequence_exception:

// globsequenceexception.h: Declaration of the unixstl::glob_sequence_exception class
// Includes (as shown in globsequence.h)

class glob_sequence_exception
  : public std::exception
{
/// Types
public:
  typedef std::exception          parent_class_type;
  typedef glob_sequence_exception class_type;

/// Construction
public:
  ss_explicit_k glob_sequence_exception(int globStatus, int errno_)
    : m_globStatus(globStatus)
    , m_errno(errno_)
  {}

/// Accessors
public:
  char const  *what() const /* throw() */
  {
    return "glob_sequence failure";
  }
  int get_globstatus() const
  {
    return m_globStatus;
  }
  int get_errno() const
  {
    return m_errno;
  }

// Members
private:
  int const  m_globStatus;
  int const  m_errno;

// Not to be implemented
private:
  class_type &operator =(class_type const &);
};

Just as with readdir_sequence, all the filtering of files and/or directories, and elision of dots directories is selected by specifying the appropriate flags to the glob_sequence constructor. You may select files, or directories (with or without dots directories), or both, just by specifying the requisite flags. Furthermore, there are a number of other flags that offer even more power than is provided by readdir_sequence. The flag noSort causes the flag GLOB_NOSORT to be passed to the underlying call to glob(), which prevents it from sorting the results. One can presume that this will result in faster searching for some/all implementations, so this flag is included in the flags parameter's default value. The flag markDirs causes the GLOB_MARK flag to be passed to glob(), which marks all directory entries with a terminating path name separator ('/'). This saves you the effort of having to check for the path name separator and adding one prior to concatenating on file names or other search patterns.

Finally, the absolutePath flag causes the search directory, and therefore all the results, to be evaluated as an absolute path. In other words, if the working directory is "/usr/include", and the search directory is given as "..", then the entries will be rooted from "/usr/", rather than from "../". This functionality is provided by the class (see globsequencemethods.h), rather than a part of glob()'s otherwise impressively rich option set.

Implementation

Despite the fact that glob_sequence provides a more refined iterator concept (Random Access [7]) than does readdir_sequence (Input Iterator [8]), its implementation (see globsequencemethods.h) is conceptually simpler. This is because glob() returns an array of pointers to entries, hence, the const_iterator type is char const** [9]. Because it provides Random Access iterators, glob_sequence is able to also provide a subscript operator, within which the requested index is subject to a validity assertion, and also the rbegin() and rend() functions that provide reverse iterators to the managed range. All we would need to do in a dumb mapping is provide pointers to the first (&gl_pathv[0]) and one-past-the-end (&gl_pathv[gl_pathc]) locations in the array for our begin() and end() iterators.

Because we want to be able to select files and/or directories, and trim dots directories, we need to put in a little more effort. As demonstrated in [1] and [3], when dealing with sequential APIs such as opendir()/readdir() on UNIX, or FindFirstFile() on Win32, it's relatively straightforward to remove an element from the enumerated range: just call (readdir() or FindNextFile()) again. Such logic can be added to the iterator class. But glob() returns the entire unfiltered set en bloc, and to continue to support the Random Iterator concept we need to ensure that the pointers to all the selected entries are stored contiguously. This means we have to manipulate the pointers in situ.

All the action takes place in the glob_init_() member function, called by the two constructors. Each constructor initialises the m_flags members via the validate_flag_() static method, which adds both files and directories to the flags in the case where neither is specified in the constructor arguments. glob_init_() performs three main functions. First, it concatenates the directory, if specified, and the search pattern, and translates the directory to an absolute path if absolutePath is specified.

Second, it calls glob(), presenting the composite directory+pattern and the appropriate translation of the glob_sequence constructor flags (markDirs, noSort, files/directories) to those of the glob() API. If the call to glob() fails, a glob_sequence_exception is thrown.

The final role of glob_init_() is to process the results. This is done in four parts. First, if there are to be any changes, the entry pointers are copied into the member buffer m_buffer, which is an instance of stlsoft::auto_buffer [10]. This is to allow us to manipulate the array of pointers without altering the contents of the buffer returned by glob(), which would be very poor form [11].

Second, if the dots directories are to be elided, the function searches through the array of pointers until it's found them both. For each one found the pointer is swapped with the one at the base of the array, and then the base advanced one and the item count dropped. In the first implementation I simply incremented m_base past "." and ".." if they were present, since they were always the first two (or one) entries in the array. Unfortunately, this was only an artefact of my test environment [12], and when I moved to a different platform I quickly saw the error of my ways. This is an important point: when doing cross-platform development, you must test on several platforms.

Since glob() will append a path name separator to directories if asked, we cannot simply test the entry names for equivalence to "." or ".." using the filesystem_traits::is_dots() function we saw in the implementation of readdir_sequence. Instead, the member functions is_dots_maybe_slashed_() and is_end_of_path_elements_() provide the requisite functionality, and enable the processing to be terminated once both dots directories are found in a given set of results.

The third part of the filtering is to elide directories. Since glob() supports the GLOB_ONLYDIR to request only directories, we can trust it to not return files when we've not asked for them with the files flags. There's no flag to ask glob() to return only files, however, so we need to do that ourselves. As you saw in globsequencemethods.h, this is achieved in the same manner as for the dots elision. This is done in one of two ways. If we asked for directories to be marked with a trailing path name separator (with the markDirs flag) then we simply test whether the entry name is so marked. If it is, then we do the swap/advance. If not, we test it with stat(), and swap/advance if it's not a file.

All this swapping and advancing is quite efficient, but it leaves the entries out of their original order. Hence, the last part of the filtering task for glob_init_() is to call std::sort() on the remaining entries if the noSort flag was not specified, and if the resultant number of entries is different to that returned by glob().

The implementation is available online [2] as part of the STLSoft libraries, which includes sample/test programs. This class is also used within the UNIX implementation of the recls library [13].

[See the sidebar, Taking Exception to Error Handling for more information on glob_sequence and exception safety.]

Summary

You may be wondering, given the power of glob() / glob_sequence, why should one ever want to use readdir() / readdir_sequence? There are several reasons:

glob() is implemented in terms of readdir(). Consequently, it's inevitable that if your requirement is more suitable to readdir(), then using readdir() / readdir_sequence directly will be more efficient.
As we've seen, glob() returns all the matching entries in an array. Naturally, the search has to be conducted to completion before any results may be returned. If you are looking for a subset of entries in a given search, it may well be more efficient to be able to search and process them in a step-wise fashion with readdir(), especially if the entries you are after may occur earlier in the search.
The last reason is a bit more subtle. Since glob() (or glob_sequence) accepts a pattern that may contain wildcards, and those wildcards may also appear in the directory parts of the given pattern, it increases the complexity of your program. If you want to search a given directory, then you constrain this search more clearly by using readdir(), which will balk at being given any wildcards.

Notwithstanding these specific aspects, glob() is a very powerful and useful tool on UNIX. Using it via glob_sequence affords all the advantages over the raw API demonstrated here, and costs you nothing.

Acknowledgements

Thanks to Greg Peet, Walter Bright, and Bjorn Karlsson for reviewing an early draft and providing their customary essential insightful observations and helpful comments.

Notes and references

Reading UNIX Directories via STL-compliant Sequences, Matthew Wilson, The C++ Source, Volume 1, Number 1, June 2004.
UNIXSTL is a sub-project of STLSoft, providing STL-compliant software for UNIX operating system APIs. It is located at http://unixstl.org/. STLSoft itself resides at http://stlsoft.org/.
In Adapting Windows Enumeration Models to STL Iterator Concepts, Windows Developer Network, March 2003, I described the mapping of the Win32 FindFile enumeration API to STL iterators, in the form of the WinSTL (http://winstl.org) (basic_)findfile_sequence<> class. These three file enumeration classes represent an interesting spectrum of models: readdir_sequence does not take wildcards and returns all files and/or directories in a single directory; findfile_sequence does the same, but can specify wildcards for filtering the entries; glob_sequence may use wildcards in the entry names and in the directory names. If you keep wildcards out of the directory when using glob_sequence, then it has virtually identical semantics as findfile_sequence, and the two may be used to provide the file-searching functionality in cross-platform developments: client code can remain largely unchanged, save for the selection of the requisite component from its namespace [5, 6]
The C++ Programming Language, Special Edition, Bjarne Stroustrup, Addison-Wesley, 1997, p. 366.
Herb Sutter demonstrates this technique in items 13 and 14 of More Exceptional C++, Addison-Wesley, 2002.
Flexible C++ #6: Flexible Implementations Without Using Directives, Matthew Wilson, C/C++ Users Journal Experts Forum, April 2004
http://www.sgi.com/tech/stl/RandomAccessIterator.html
http://www.sgi.com/tech/stl/InputIterator.html
You can see in enumwithglob.cpp that the const_iterator is actually defined as stlsoft::pointer_iterator<value_type const , const_pointer , const_reference >::iterator_type. This is to ensure that it is compatible with standard library implementations whose random-access iterators are actually implemented as class objects. This could be the subject of an entire article in itself—and probably will be in the future—so you?ll have to trust me for now. :-)
Efficient Variable Automatic Buffers, Matthew Wilson, C/C++ Users Journal, Volume 21 Number 12, December 2003. auto_buffer<> is an optimised dynamically (re-)sizable memory buffer, suitable for manipulating arrays of POD types. In most cases, its use in glob_sequence won't result in any memory allocation anyway, since the internal array size is set to 64 (see globsequence.h in this article).
Note that my original implementation did exactly this, based on the assumption that no implementation of globfree() is not going to check that the order of the pointers in the gl_pathv buffer. It seemed too fantastic to me to imagine that any implementation would. Note that I was not so cavalier with the actual contents of the pointer array. Specifically they were swapped in situ, rather than, say, setting to NULL the ones selected out, in case the glob() implementation allocated the storage for each entry individually, rather than in a single block. Nonetheless, a few years older and wiser, and this assumption seemed much less smart. Making assumptions about the insides of APIs you've not implemented is a very bad habit to get into, and this instance certainly didn't warrant taking the risks.
In order to have total control over the order of returned items in the globbing and to work in my development environment of choice—Visual Studio 98 IDE (albeit using compilers other than VC 6)—I wrote an implementation of glob() for Win32 using the FindFirstFile() API, available online at http://synesis.com.au/software/index.html#unixem. The initial version returned "." and ".." as the first two items—because FindFirstFile() orders the matched items—hence my consternation when I moved the class and test program to one of my Linux boxes.
The recls ("recursive ls") library ( http://recls.org/) is a platform-independent search library, which provides recursive enumeration of file-system and, as of version 1.5, FTP systems. It has a C-API, and provides mappings to several languages/technologies, including C++, C#/.NET, COM, D, Java, and STL. It has been the exemplar for the first few installments of my Positive Integration column, for C/C++ Users Journal (http://www.cuj.com/).
The author(s) of the excellent The Art Of UNIX Programming, (Eric Raymond, et al, Addison-Wesley, 2004) admit that one of the (few) flaws in the design and implementation of UNIX is the lack of a facility for receiving asynchronous notifications of file-system changes. UNIX has no equivalent to the Windows FindFileChangeNotification() function, which means that application code is inherently biased towards looking at file-system contents as static. Given the fact that UNIX focuses, or originally focused, on short-lived discrete tasks, this is understandable. Nonetheless, it is certainly possible for file-system contents to change, and when writing file-system code, one needs to be prepared to handle it.

Talk back!

Have an opinion? Readers have already posted 2 comments about this article. Why not add yours?

About the author

Matthew Wilson is a software development consultant for Synesis Software, and creator of the STLSoft libraries. He is author of the forthcoming book Imperfect C++ (to be published Sept 2004 by Addison-Wesley), and is currently working on his next two books, one of which is not about C++. Matthew can be contacted via http://imperfectcplusplus.com/.

glob()-ing Manually

glob()-ing using unixstl::glob_sequence

It's Not Just About Dropping the SLOCs

Implementation

Summary

Acknowledgements

Notes and references

Talk back!

About the author

`glob()`-ing Manually

`glob()`-ing using `unixstl::glob_sequence`