Reading Unix-style Directories via STL-compliant Sequences
by Matthew Wilson
June 21, 2004

Summary
This article shows how to easily process UNIX-style directory entries as STL sequences. Copyright © 2004, Matthew
Reading Unix-style Directories via STL-compliant Sequences Summary
Wilson

The C++ community is gradually moving towards the STL paradigm—containers, iterators, function objects, algorithms and adaptors—which provides great advantages in terms of commonality of expression, reduction in developer effort, and greater robustness, maintainability and reuse. However, apart from the standard library and one or two exceptions (Boost [1], STLSoft [2]), there is relatively little STL-compliant code, which is, in part, due to the complexities involved in its development.

In this article I'll demonstrate how one kind of enumeration API—the UNIX opendir()/readdir() API—can be mapped into an STL-compliant sequence-like class providing Input Iterators. I examine the implementation of readdir_sequence, from UNIXSTL, the UNIX-specific sub-project of STLSoft ([2]).

Before getting into the details, consider the advantages of an STL-compliant approach. Let's say you need to load the names of the sub-directories of /home/matty/ into a vector of strings. Using the raw opendir() API, your code might look something like that shown in Listing 1.

It's not an enormous amount of code, to be sure, but it's still quite a bit for something so conceptually straightforward. Let's look at the readdir_sequence-based version, in Listing 2. It's clear in comparing the two code snippets that readdir_sequence represents a big win over the version using the opendir() API. There are several advantages:

  1. The code is more succinct. That's always a popular improvement.
  2. Closing the search, via closedir(), is automatically handled, via Resource Acquisition Is Initialisation, which improves robustness. Indeed, the first example is not exception-safe, since the std::string constructor and the std::vector<>::push_back() method may throw exceptions; the second version is exception-safe.
  3. Elision of the dots directories—a very common requirement—is done automatically.
  4. Directory and/or file selection is done via the appropriate constructor flag.
  5. The effort in remembering the search directory and prefixing it to each entry name prior to stat-ing them is eliminated. All these aspects are handled internally by readdir_sequence::const_iterator.

The one flaw in the design is that the value_type is struct dirent const*, which means you have to explicitly enumerate the entries, rather than use algorithms or iterator-based constructors (see Sidebar).

Let's look now how it's implemented. Listing 3 shows the definition of the readdir_sequence class. It provides begin() and end() methods, which return iterators of member type const_iterator, since only non-mutating access to the entries is provided. Hence, readdir_sequence provides a read-only view on a directory.

It also provides an empty() method, which tests begin() against end(). I've deliberately omitted a size() member, since the size could only be obtained by conducting an enumeration over the range, which is a costly operation. Not providing this method is an unequivocal documentation of this fact [3]. If one really wants to calculate the range size, it can be done via std::distance().

For convenience to client code, a get_directory() method is provided which returns the search directory, having ensured, in the constructor, that it has a trailing path name separator ('/'). This means that if you need to express the returned values in absolute form, you can do so simply, as in:

dirNames.push_back(dir.get_directory() + (*b)->d_name);

The two parameters to the constructor are the directory to search and the flags, which control the search. I'll look at these shortly when I discuss the readdir_sequence::const_iterator class.

Since all the member variables of this class are fully-fledged value types, there is no need to proscribe or provide explicit implementations of either the copy constructor or the copy assignment operator. The compiler provided ones will work quite nicely and safely.

Another notable aspect to the class is that the type of the string member, m_directory, depends on whether the symbol PATH_MAX is defined [4]. It if is, then the operating environment has a fixed maximum path limit, and the class uses the STLSoft string basic_static_string, which has an internal array of the given size, so there's no memory allocation involved. If PATH_MAX is not defined, then the operating environment does not have a fixed maximum path limit, so the STLSoft string basic_simple_string is used instead [5]. It's a small efficiency, certainly, but I'm funny like that.

A further saving comes from the use of the m_scratch member. Since the directory is constant for the life of the sequence/iterator, and it's contents are not revealed directly outside the iterator instance, you are able to reuse it for constructing the full path names of the entries, in order to call stat() on them. The m_dirLen member remembers the original length of the directory alone, so it can be truncated to that length (which includes path name separator) ready for each entry name to be appended to it.

The opendir() API provides a two-step process to directory enumeration (in contrast to, say, Win32's FindFirstFile()), so the sequence class starts the enumeration—in its begin() method—and hands the DIR pointer over to the iterator to walk through the matched items.

Listing 4 shows the definition of the readdir_sequence::const_iterator class, which is where all the action happens. Because opendir() affords only single-pass manipulation, only the Input Iterator concept [6] is supported. Iterator copy construction and copy assignment semantics are supported by the use of a reference-counted shared handle, of type readdir_sequence::const_iterator::rds_shared_handle [7], to support iterator instance copying. Multiple concurrent enumerations may be conducted by calling begin() but since we're dealing with file systems, whose contents may change at any time as a result of other processes, subsequent ranges obtained from begin() may not contain the same elements.

The opendir() API returns all the entries in a particular directory, irrespective of whether they are files or directories, and includes the dots directories—"." and "..". Specifying files without directories, or vice versa, as the second argument to the readdir_sequence constructor causes only the matching entries to be returned, which is a nice convenience. If neither are specified, it defaults to returning both, because this conforms to the API's behaviour and also because it is more efficient, as you'll see shortly. Since most directory enumeration—in my experience, at least—is not concerned with the dots directories, they are elided from the enumeration range by default. Specifying the readdir_sequence::includeDots flag causes them to be included in the range.

Entry filtering is performed within operator++(). Detection of files and/or directories is done by calling stat(), but this is only called when one type or the other is to be returned; hence no unnecessary costs ensue [8]. If the entry does not match, the for loop is not broken, and the next entry is retrieved. Dot elision is similarly done by detecting whether the entry is "." or ".." [9]. When readdir() returns NULL the enumeration is complete, and the iterator enters a state whereby a comparison with that returned by end() would evaluate to true, terminating the client iteration loop (assuming it's written correctly!).

In my next article, I'll describe the mapping of the UNIX glob() API, which supports a more refined STL Iterator model [10], and presents a number of different challenges to providing a simple and efficient sequence class. If you're interested, you can download the STLSoft libraries here.

Acknowledgements

Thanks to Jeremy Siek for wielding the scythe without mercy, and helping me dramatically improve on the first draft.

Notes

  1. Boost is an open-source organisation whose focus is the development of libraries that integrate with the C++ Standard Library, and is located at. It has thousands of members, including many of the top names in the C++ software community.
  2. STLSoft is an open-source organisation whose focus is the development of robust, lightweight, cross-platform STL-compatible software. It has fewer members than Boost.
  3. This is similar to the omission of operator[] from std::list.
  4. W. Richard Stevens, Advanced Programming In The UNIX Environment, Addison-Wesley, 1993.
  5. I prefer this string since I can use it without linking in a load of stuff from the standard library in contexts where I'm producing very small executable programs/dynamic-libraries (e.g. http://shellext.com/). It's also got a 32-bit footprint (on 32-bit systems), which is nice when you've got lots of empty ones, and affords its user predictable and consistent behaviour (i.e. no guessing whether it's got mad COW disease). It doesn't have all the unnecessary kitchen sink methods found in std::basic_string, since you can use algorithms for them. It is compatible with the IOStreams, while having no knowledge of them whatsoever. But I'm not given to attempting to sell its use to others; we can settle on it being an internal STLSoft implementation component.
  6. http://www.sgi.com/tech/stl/InputIterator.html
  7. As with the rest of the STL, the readdir_sequence is not written to be thread-safe, so rds_shared_handle does not use atomic integer operations. If you want thread-safety, you must handle that in client code.
  8. http://www.comeaucomputing.com/faqs/genfaq.html#whatcando
  9. The original version also incorrectly removed any other entry beginning with "..", e.g. "..any_file", which understandably made a couple of users unhappy.
  10. http://www.sgi.com/tech/stl/Iterators.html

Talk back!

Have an opinion? Be the first to post a comment about this article.

About the author

-