|
|
|
The C++ Source |
C++ Community News |
Discuss |
Print |
Email |
Screen Friendly Version |
Previous |
Next
|
|
Sponsored Link •
|
STL meets glob(): Power, robustness, and genericity without
sacrificing efficiency.
This article is the second of a two part
series looking at techniques for adapting UNIX file-system enumeration
APIs to STL-like sequences. In the first article [1] I described an adaptation of
the UNIX opendir()/readdir() API to
readdir_sequence, an STL-like sequence class supporting Input
Iterators. readdir_sequence reflects the semantics of
opendir()/readdir() in that it supports enumeration of the
entries in a single directory. It provides the additional features of
being able to select files and/or directories, and the ability to elide
the dots directories—"." (current directory) and
".." (parent directory)—from the resultant range if
required.
UNIX has another, more powerful, method of searching the file-system:
the glob() API. Rather than taking the name of a directory
and enumerating its contents, glob() takes a wildcard
pattern, such as "/usr/*/std*h/", and returns all matching
file-system entries.
glob()-ing ManuallyWe saw in [1] that there
are several advantages to using an STL-like sequence class instead of a
raw API, and that principle applies equally to using glob().
Consider the task of finding all hidden ("dot") files—such as
.bashrc—in a given code directory using the
glob() API directly, as shown in the following code.
// enumwithglob.cpp: Enumerating sub-directories using glob()
#include <glob.h> // glob(), globfree()
#include <sys/stat.h> // stat()
#include <algorithm> // std::copy
#include <iterator> // std::ostream_iterator
#include <iostream> // std::cout, std::endl
#include <string> // std::string
#include <vector> // std::vector
using std::copy;
using std::cout;
using std::endl;
using std::ostream_iterator;
using std::string;
using std::vector;
const char HOME[] = "/home/matty/";
const char PATTERN[] = ".*";
int main()
{
vector<string> dotNames;
glob_t gl;
if(0 == glob((string(HOME) + PATTERN).c_str(), 0, NULL, &gl))
{
for(char **begin = &gl.gl_pathv[0], **end = &gl.gl_pathv[gl.gl_pathc]; begin != end; ++begin)
{
struct stat st;
// Skip dots
if( '.' == (*begin)[0] &&
( '\0' == (*begin)[1] ||
( '.' == (*begin)[1] &&
'\0' == (*begin)[2])))
{
// do nothing
}
else
{
if(0 == stat(*begin, &st))
{
if(S_IFREG == (st.st_mode & S_IFREG))
{
dotNames.push_back((*begin));
}
}
}
}
globfree(&gl);
}
cout << "Dumping . files in " << HOME << endl;
copy(dotNames.begin(), dotNames.end(), ostream_iterator<string>(cout, "\n"));
return 0;
}
As well as being quite a large amount of code, there are several
specific problems with it. First, we must concatenate the directory and
search pattern, before passing the combined string value as the first
argument in the call to glob(). Admittedly, in many cases
the two will not be separate, so that's a small thing, but there are other
issues are not so trivial.
A successful call to glob() returns a block of memory in
the gl_pathv member of the glob_t structure
passed as the fourth argument. (glob()'s second and third
arguments are used for flags, which I'll discuss later, and a callback
error-handling function, which I don't discuss in this article.) The
returned memory block contains the paths for all matching entries to the
specified pattern at the time of the invocation of the call. In common
with many UNIX library functions, the addition of the const
keyword to C was too late in the game, so the type of
gl_pathv is char**, rather than char
const**. Hence, the first significant concern in the given code is
that begin and end are declared to be of a
non-const pointer type. Although I've obviously avoided it in this case,
it's always possible to introduce bugs by writing to non-const pointers.
The second issue is that the memory block must be freed, and this is
done by calling the beguilingly named globfree(). This issue
of calling paired allocate/release functions represents a classic problem
area in C++, since any statements occurring between them may be a source
of exception unsafety. Indeed, the code in enumwithglob.cpp is not exception
safe since the call to dotNames.push_back() may result in an
exception being thrown, in which case globfree() would not be
called, and the memory block would leak. To correct this would require
inserting try-catch scoping into our sample, making it even more verbose.
The remaining issues with this code are more prosaic, but still detract
from readability and maintainability. The elision of the dots directories
is a manual process, which is a pain for all but the rare cases when you
do actually want them. Finally, each entry must be stat()-ed
and the return code of stat() tested along with the resultant flags. (For
the non-UNIX folks, stat() is the logical equivalent of
Win32's GetFileAttributeEx().)
(I must admit that in this particular case, the specific dots directory
elision test is not needed because the stat() test ensures
that directories are not added to dotNames vector. But I'm
sure you can see how this test could be necessary in the more general
case.)
glob()-ing using unixstl::glob_sequenceThat's a lot of work for something conceptually simple. Now contrast
that with the sublime simplicity of the functionally equivalent form using
the UNIXSTL[2]
glob_sequence class [3], defined in the following
program.
// enumwithglobsequence.cpp: Enumerating sub-directories using unixstl::glob_sequence
#include <unixstl_glob_sequence.h> // unixstl::glob_sequence
#include <algorithm> // std::copy
#include <iterator> // std::ostream_iterator
#include <iostream> // std::cout, std::endl
#include <string> // std::string
#include <vector> // std::vector
using std::copy;
using std::cout;
using std::endl;
using std::ostream_iterator;
using std::string;
using std::vector;
using unixstl::glob_sequence;
const char HOME[] = "/home/matty/";
const char PATTERN[] = ".*";
int main()
{
glob_sequence dir(HOME, PATTERN, glob_sequence::files);
vector<string> dotNames(dir.begin(), dir.end());
cout << "Dumping . files in " << HOME << endl;
copy(dotNames.begin(), dotNames.end(), ostream_iterator<string>(cout, "\n"));
return 0;
}
I think you'll agree that there's a considerable saving in
lines-of-code, about 25:2. Because the value_type of
glob_sequence (see globsequence.h) is char
const*, for which std::string provides a conversion
constructor [4], we can pass
the iterators from its begin() and end() methods
to the constructor of the std::vector<std::string>
instance dotNames.
The advantages afforded by using glob_sequence are more than just
a reduction in client code quantity; the other issues of concern are also
addressed. Automatic resource deallocation and exception-safety is
provided by glob_sequence's destructor calling
globfree(); classic Resource Acquisition Is
Initialisation [4].
Because we are using RAII, a failure code from glob() results
in an instance of glob_sequence_exception (see code below)
being thrown from the constructor of glob_sequence, rather
than requiring users to test the "validity" of the sequence instance after
construction. Here is the class definition for
glob_sequence_exception:
// globsequenceexception.h: Declaration of the unixstl::glob_sequence_exception class
// Includes (as shown in globsequence.h)
class glob_sequence_exception
: public std::exception
{
/// Types
public:
typedef std::exception parent_class_type;
typedef glob_sequence_exception class_type;
/// Construction
public:
ss_explicit_k glob_sequence_exception(int globStatus, int errno_)
: m_globStatus(globStatus)
, m_errno(errno_)
{}
/// Accessors
public:
char const *what() const /* throw() */
{
return "glob_sequence failure";
}
int get_globstatus() const
{
return m_globStatus;
}
int get_errno() const
{
return m_errno;
}
// Members
private:
int const m_globStatus;
int const m_errno;
// Not to be implemented
private:
class_type &operator =(class_type const &);
};
Just as with readdir_sequence, all the filtering of files
and/or directories, and elision of dots directories is selected by
specifying the appropriate flags to the glob_sequence
constructor. You may select files, or directories (with or without dots
directories), or both, just by specifying the requisite flags.
Furthermore, there are a number of other flags that offer even more power
than is provided by readdir_sequence. The flag
noSort causes the flag GLOB_NOSORT to be passed
to the underlying call to glob(), which prevents it from
sorting the results. One can presume that this will result in faster
searching for some/all implementations, so this flag is included in the
flags parameter's default value. The flag markDirs causes
the GLOB_MARK flag to be passed to glob(), which
marks all directory entries with a terminating path name separator
('/'). This saves you the effort of having to check for the
path name separator and adding one prior to concatenating on file names or
other search patterns.
Finally, the absolutePath flag causes the search
directory, and therefore all the results, to be evaluated as an absolute
path. In other words, if the working directory is
"/usr/include", and the search directory is given as
"..", then the entries will be rooted from
"/usr/", rather than from "../". This
functionality is provided by the class (see globsequencemethods.h),
rather than a part of glob()'s otherwise impressively rich
option set.
Despite the fact that glob_sequence provides a more
refined iterator concept (Random Access [7]) than does
readdir_sequence (Input Iterator [8]), its implementation (see globsequencemethods.h)
is conceptually simpler. This is because glob() returns an
array of pointers to entries, hence, the const_iterator type
is char const** [9]. Because it provides Random
Access iterators, glob_sequence is able to also provide a
subscript operator, within which the requested index is subject to a
validity assertion, and also the rbegin() and
rend() functions that provide reverse iterators to the
managed range. All we would need to do in a dumb mapping is provide
pointers to the first (&gl_pathv[0]) and one-past-the-end
(&gl_pathv[gl_pathc]) locations in the array for our
begin() and end() iterators.
Because we want to be able to select files and/or directories, and trim
dots directories, we need to put in a little more effort. As demonstrated
in [1] and [3], when dealing with sequential
APIs such as opendir()/readdir() on UNIX, or
FindFirstFile() on Win32, it's relatively straightforward to
remove an element from the enumerated range: just call
(readdir() or FindNextFile()) again. Such logic
can be added to the iterator class. But glob() returns the
entire unfiltered set en bloc, and to continue to support the
Random Iterator concept we need to ensure that the pointers to all the
selected entries are stored contiguously. This means we have to
manipulate the pointers in situ.
All the action takes place in the glob_init_() member
function, called by the two constructors. Each constructor initialises
the m_flags members via the validate_flag_()
static method, which adds both files and directories to the flags in the
case where neither is specified in the constructor arguments.
glob_init_() performs three main functions. First, it
concatenates the directory, if specified, and the search pattern, and
translates the directory to an absolute path if absolutePath is specified.
Second, it calls glob(), presenting the composite
directory+pattern and the appropriate translation of the
glob_sequence constructor flags (markDirs,
noSort, files/directories) to those of the
glob() API. If the call to glob() fails, a
glob_sequence_exception is thrown.
The final role of glob_init_() is to process the results.
This is done in four parts. First, if there are to be any changes, the
entry pointers are copied into the member buffer m_buffer,
which is an instance of stlsoft::auto_buffer [10]. This is to allow us to
manipulate the array of pointers without altering the contents of the
buffer returned by glob(), which would be very poor form [11].
Second, if the dots directories are to be elided, the function searches
through the array of pointers until it's found them both. For each one
found the pointer is swapped with the one at the base of the array, and
then the base advanced one and the item count dropped. In the first
implementation I simply incremented m_base past
"." and ".." if they were present, since they
were always the first two (or one) entries in the array. Unfortunately,
this was only an artefact of my test environment [12], and when I moved to a
different platform I quickly saw the error of my ways. This is an
important point: when doing cross-platform development, you must test on
several platforms.
Since glob() will append a path name separator to
directories if asked, we cannot simply test the entry names for
equivalence to "." or ".." using the
filesystem_traits::is_dots() function we saw in the
implementation of readdir_sequence. Instead, the member
functions is_dots_maybe_slashed_() and
is_end_of_path_elements_() provide the requisite
functionality, and enable the processing to be terminated once both dots
directories are found in a given set of results.
The third part of the filtering is to elide directories. Since
glob() supports the GLOB_ONLYDIR to request only
directories, we can trust it to not return files when we've not asked for
them with the files flags. There's no flag to ask glob() to
return only files, however, so we need to do that ourselves. As you saw
in globsequencemethods.h, this is
achieved in the same manner as for the dots elision. This is done in one
of two ways. If we asked for directories to be marked with a trailing
path name separator (with the markDirs flag) then we simply
test whether the entry name is so marked. If it is, then we do the
swap/advance. If not, we test it with stat(), and
swap/advance if it's not a file.
All this swapping and advancing is quite efficient, but it leaves the
entries out of their original order. Hence, the last part of the
filtering task for glob_init_() is to call
std::sort() on the remaining entries if the
noSort flag was not specified, and if the resultant number of
entries is different to that returned by glob().
The implementation is available online [2] as part of the STLSoft
libraries, which includes sample/test programs. This class is also used
within the UNIX implementation of the recls library [13].
[See the sidebar, Taking Exception to
Error Handling for more information on glob_sequence and
exception safety.]
You may be wondering, given the power of glob() /
glob_sequence, why should one ever want to use readdir() /
readdir_sequence? There are several reasons:
glob() is implemented in terms of readdir().
Consequently, it's inevitable that if your requirement is more suitable to
readdir(), then using readdir() /
readdir_sequence directly will be more efficient.
glob() returns all the matching entries in an
array. Naturally, the search has to be conducted to completion before any
results may be returned. If you are looking for a subset of entries in a
given search, it may well be more efficient to be able to search and
process them in a step-wise fashion with readdir(),
especially if the entries you are after may occur earlier in the search.
glob() (or
glob_sequence) accepts a pattern that may contain wildcards,
and those wildcards may also appear in the directory parts of the given
pattern, it increases the complexity of your program. If you want to
search a given directory, then you constrain this search more clearly by
using readdir(), which will balk at being given any
wildcards.
Notwithstanding these specific aspects, glob() is a very
powerful and useful tool on UNIX. Using it via glob_sequence
affords all the advantages over the raw API demonstrated here, and costs
you nothing.
FindFile enumeration API to STL iterators, in the form of the
WinSTL (http://winstl.org)
(basic_)findfile_sequence<> class. These three file
enumeration classes represent an interesting spectrum of models:
readdir_sequence does not take wildcards and returns all
files and/or directories in a single directory;
findfile_sequence does the same, but can specify wildcards
for filtering the entries; glob_sequence may use wildcards in
the entry names and in the directory names. If you keep wildcards out of
the directory when using glob_sequence, then it has virtually
identical semantics as findfile_sequence, and the two may be
used to provide the file-searching functionality in cross-platform
developments: client code can remain largely unchanged, save for the
selection of the requisite component from its namespace [5, 6]
stlsoft::pointer_iterator<value_type const , const_pointer ,
const_reference >::iterator_type. This is to ensure that it is
compatible with standard library implementations whose random-access
iterators are actually implemented as class objects. This could be the
subject of an entire article in itself—and probably will be in the
future—so you?ll have to trust me for now. :-)
auto_buffer<> is an optimised dynamically (re-)sizable
memory buffer, suitable for manipulating arrays of POD types. In most
cases, its use in glob_sequence won't result in any memory
allocation anyway, since the internal array size is set to 64 (see globsequence.h in this article).
globfree() is not going
to check that the order of the pointers in the gl_pathv
buffer. It seemed too fantastic to me to imagine that any
implementation would. Note that I was not so cavalier with the actual
contents of the pointer array. Specifically they were swapped in situ,
rather than, say, setting to NULL the ones selected out, in case the
glob() implementation allocated the storage for each entry
individually, rather than in a single block. Nonetheless, a few years
older and wiser, and this assumption seemed much less smart. Making
assumptions about the insides of APIs you've not implemented is a very bad
habit to get into, and this instance certainly didn't warrant taking the
risks.
glob() for Win32 using the
FindFirstFile() API, available online at
http://synesis.com.au/software/index.html#unixem. The initial version
returned "." and ".." as the first two items—because
FindFirstFile() orders the matched items—hence my
consternation when I moved the class and test program to one of my Linux
boxes.
recls ("recursive ls") library ( http://recls.org/) is a platform-independent
search library, which provides recursive enumeration of file-system and,
as of version 1.5, FTP systems. It has a C-API, and provides mappings to
several languages/technologies, including C++, C#/.NET, COM, D, Java, and
STL. It has been the exemplar for the first few installments of my
Positive Integration column, for C/C++ Users Journal (http://www.cuj.com/).
FindFileChangeNotification()
function, which means that application code is inherently biased towards
looking at file-system contents as static. Given the fact that UNIX
focuses, or originally focused, on short-lived discrete tasks, this is
understandable. Nonetheless, it is certainly possible for file-system
contents to change, and when writing file-system code, one needs to be
prepared to handle it.
|
Sponsored Links
|