The C++ Source
Wild-card Searches of UNIX Directories with Random-Access Iterators
by Matthew Wilson
September 13, 2004

Page 1 of 3  >>

Summary
STL meets glob(): Power, robustness, and genericity without sacrificing efficiency.

This article is the second of a two-part series looking at techniques for adapting UNIX file-system enumeration APIs to STL-like sequences. In the first article [1] I described an adaptation of the UNIX opendir()/readdir() API to readdir_sequence, an STL-like sequence class supporting Input Iterators. readdir_sequence reflects the semantics of opendir()/readdir() in that it supports enumeration of the entries in a single directory. It also lets you select files and/or directories, and elide the dots directories, "." (current directory) and ".." (parent directory), from the resultant range if required.

UNIX has another, more powerful, method of searching the file-system: the glob() API. Rather than taking the name of a directory and enumerating its contents, glob() takes a wildcard pattern, such as "/usr/*/std*h/", and returns all matching file-system entries.

glob()-ing Manually

We saw in [1] that there are several advantages to using an STL-like sequence class instead of a raw API, and that principle applies equally to using glob(). Consider the task of finding all hidden ("dot") files, such as .bashrc, in a given directory using the glob() API directly, as shown in the following code.

// enumwithglob.cpp: Enumerating hidden (dot) files using glob()
#include <glob.h>     // glob(), globfree()
#include <sys/stat.h> // stat(), S_ISREG

#include <algorithm>  // std::copy
#include <cstring>    // std::strrchr
#include <iterator>   // std::ostream_iterator
#include <iostream>   // std::cout, std::endl
#include <string>     // std::string
#include <vector>     // std::vector

using std::copy;
using std::cout;
using std::endl;
using std::ostream_iterator;
using std::string;
using std::vector;

const char HOME[]     = "/home/matty/";
const char PATTERN[]  = ".*";

int main()
{
  vector<string>  dotNames;
  glob_t          gl;

  if(0 == glob((string(HOME) + PATTERN).c_str(), 0, NULL, &gl))
  {
    for(char **begin = &gl.gl_pathv[0], **end = &gl.gl_pathv[gl.gl_pathc]; begin != end; ++begin)
    {
      struct stat st;

      // glob() returns full paths, so test the final path component
      char const *slash = std::strrchr(*begin, '/');
      char const *name  = (NULL != slash) ? slash + 1 : *begin;

      // Skip dots
      if( '.' == name[0] &&
          ( '\0' == name[1] ||
            ( '.' == name[1] &&
              '\0' == name[2])))
      {
        // do nothing
      }
      else
      {
        if(0 == stat(*begin, &st))
        {
          // S_ISREG() compares the whole file-type field of st_mode
          if(S_ISREG(st.st_mode))
          {
            dotNames.push_back((*begin));
          }
        }
      }
    }
    globfree(&gl);
  }

  cout << "Dumping . files in " << HOME << endl;
  copy(dotNames.begin(), dotNames.end(), ostream_iterator<string>(cout, "\n"));

  return 0;
}

Besides being quite a large amount of code, this example has several specific problems. First, we must concatenate the directory and search pattern before passing the combined string as the first argument to glob(). Admittedly, in many cases the two will not be separate, so that's a small thing, but the other issues are not so trivial.

A successful call to glob() returns a block of memory in the gl_pathv member of the glob_t structure passed as the fourth argument. (glob()'s second and third arguments are used for flags, which I'll discuss later, and a callback error-handling function, which I don't discuss in this article.) The returned memory block contains the paths of all entries matching the specified pattern at the time of the call. As with many UNIX library functions, the const keyword was added to C too late in the game to influence the API, so the type of gl_pathv is char**, rather than char const**. Hence, the first significant concern in the given code is that begin and end are declared to be of a non-const pointer type. Although I've obviously avoided it in this case, it's always possible to introduce bugs by writing through non-const pointers.
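One way to reduce the risk of accidental writes is to read the results only through pointers-to-const. The following is a minimal sketch under that assumption: the helper name copy_paths is hypothetical, and it relies only on the standard implicit conversion from char** to char const* const*. It expects a glob_t that has been populated by a successful call to glob().

```cpp
#include <glob.h>   // glob_t

#include <string>   // std::string
#include <vector>   // std::vector

// Hypothetical helper (not part of the glob() API): copies the matched paths
// out of a populated glob_t, reading them only through pointers-to-const.
std::vector<std::string> copy_paths(glob_t const &gl)
{
  // Adding const at both levels is a legal implicit conversion in C++,
  // so no cast is needed, and the strings cannot be modified via begin/end.
  char const *const *begin = gl.gl_pathv;
  char const *const *end   = begin + gl.gl_pathc;

  // Each char const* in the range is converted to a std::string.
  return std::vector<std::string>(begin, end);
}
```

The caller is still responsible for calling globfree() on the glob_t; the copy only removes the temptation to write through gl_pathv.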

The second issue is that the memory block must be freed, and this is done by calling the beguilingly named globfree(). This issue of calling paired allocate/release functions represents a classic problem area in C++, since any statements occurring between them may be a source of exception unsafety. Indeed, the code in enumwithglob.cpp is not exception safe since the call to dotNames.push_back() may result in an exception being thrown, in which case globfree() would not be called, and the memory block would leak. To correct this would require inserting try-catch scoping into our sample, making it even more verbose.
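As a rough alternative to try-catch, a small scope guard can tie the globfree() call to a destructor, so that it runs however the enclosing scope is left. The names globfree_guard and match_paths below are hypothetical, and this is only a sketch of the idea, not the sequence class this series works towards.

```cpp
#include <glob.h>   // glob(), globfree(), glob_t

#include <cstddef>  // std::size_t
#include <string>   // std::string
#include <vector>   // std::vector

// Hypothetical scope guard: calls globfree() on the referenced glob_t when
// the guard goes out of scope, whether normally or via an exception.
class globfree_guard
{
public:
  explicit globfree_guard(glob_t &gl) : m_gl(gl) {}
  ~globfree_guard() { globfree(&m_gl); }

private:
  // Non-copyable, declared-but-not-defined in the C++98 manner
  globfree_guard(globfree_guard const &);
  globfree_guard &operator =(globfree_guard const &);

  glob_t &m_gl;
};

// Hypothetical helper: returns all paths matching the pattern, or an empty
// vector if glob() reports no match or an error.
std::vector<std::string> match_paths(std::string const &pattern)
{
  std::vector<std::string> paths;
  glob_t                   gl;

  if(0 == glob(pattern.c_str(), 0, NULL, &gl))
  {
    globfree_guard guard(gl); // globfree() now runs even if push_back() throws

    for(std::size_t i = 0; i != gl.gl_pathc; ++i)
    {
      paths.push_back(gl.gl_pathv[i]);
    }
  }

  return paths;
}
```

The guard is constructed only after glob() has succeeded, which mirrors the placement of globfree() in the original listing.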

The remaining issues with this code are more prosaic, but still detract from readability and maintainability. The elision of the dots directories is a manual process, which is a pain in all but the rare cases when you actually want them. Finally, each entry must be stat()-ed, and both the return code of stat() and the resultant flags must be tested. (For the non-UNIX folks, stat() is the logical equivalent of Win32's GetFileAttributesEx().)

(I must admit that in this particular case, the specific dots directory elision test is not needed because the stat() test ensures that directories are not added to the dotNames vector. But I'm sure you can see how this test could be necessary in the more general case.)
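A note on the stat() test itself: the file type in st_mode is a multi-bit field, so masking with S_IFREG alone can also accept other types whose values happen to share that bit (sockets, on Linux and the BSDs as far as I'm aware). The sketch below contrasts the two checks; both function names are hypothetical, and the second exists only for comparison.

```cpp
#include <sys/stat.h> // mode_t, S_IFMT, S_IFREG

// Hypothetical helper: true only when the file-type field of a st_mode value
// is exactly "regular file". This is equivalent to the S_ISREG() macro.
inline bool is_regular_mode(mode_t mode)
{
  return S_IFREG == (mode & S_IFMT);
}

// For comparison only: tests just the bits that S_IFREG sets, so any type
// whose value includes those bits passes as well.
inline bool has_regular_bits(mode_t mode)
{
  return S_IFREG == (mode & S_IFREG);
}
```

In application code, passing st.st_mode to S_ISREG() directly is the simpler choice.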
