|
|
|
Sponsored Link •
|
|
Advertisement
|
Don't ever let anyone tell you that STL is nice and abstract and all that, but it just doesn't perform well. I poke holes in that myth in several places in this book (Extended STL, Volume 1: Collections and Iterators, Sections 17.1.1, 19.1.1, 27.9, 36.2.2, and 38.2.1), but this chapter blows it clean out of the water.
The subject matter of this chapter is Scatter/Gather I/O (also known as Scatter I/O), which means the exchange of data between application code and (usually kernel) I/O routines in the form of multiple blocks per action. Its primary intent is to allow client code to manipulate application-level packet headers separately from the payloads, but it can also be turned to other cunning ends, and can yield considerable performance benefits.
The cost is that it complicates the manipulation of data when logically contiguous data is spread over physically discontiguous blocks. To aid in handling such cases, we may use classes that (re)linearize the data back to an acceptably manipulable abstraction.
Since this is a book about extending STL, the abstraction
we will seek is that of an STL collection (Section 2.2) and its
associated iterator ranges. But as we will see, there is a cost in such
abstractions, so we will go further and examine how we might optimize
the transfer of information without diluting the power of the
abstraction. This will lead us into looking into the rules regarding
overriding functions in the std namespace and how we may accommodate one
with the other.
Scatter/Gather I/O involves the exchange of information between an I/O API
and client code in a physically discontiguous form. In all cases I've come
across, this discontiguous form involves a number of separate memory blocks.
For example, the UNIX readv() and writev() functions
act like their read() and write() siblings, but,
rather than a pointer to a single area of memory and its size, they are passed
an array of iovec structures:
struct iovec
{
void* iov_base;
size_t iov_len;
};
ssize_t readv(int fd, const struct iovec* vector, int count);
ssize_t writev(int fd, const struct iovec* vector, int count);
The Windows Sockets API has an analogous structure and corresponding functions:
struct WSABUF
{
u_long len;
char* buf;
};
int WSARecv(SOCKET s
, WSABUF* lpBuffers
, DWORD dwBufferCount
, . . . // And 4 more parameters);
int WSASend(SOCKET s
, WSABUF* lpBuffers
, DWORD dwBufferCount
, . . . // And 4 more parameters);
int WSARecvFrom(SOCKET s
, WSABUF* lpBuffers
, DWORD dwBufferCount
, . . . // And 6 more parameters);
int WSASendTo( SOCKET s
, WSABUF* lpBuffers
, DWORD dwBufferCount
, . . . // And 6 more parameters);
You might wonder why people would want to perform I/O in such a fashion, given the obvious complication to client code. Well, if your file or network data has a fixed format, you can read one or more records/packets in or out without any need to move, reformat, or coalesce them. This can be quite a convenience. Similarly, if your records/packets have variable format but a fixed-size header, you can read/write the header directly to/from a matching structure and treat the rest as an opaque variable-size blob. And there's a third reason: performance. I once created a network server architecture using Scatter/Gather I/O that used a multithreaded nonlocking memory allocation scheme. (Suffice to say, it was rather nippy.)
But however much Scatter/Gather I/O may help in terms of performance, when dealing with variable-length records/packets, or those whose payloads contain elements that are variable-length, the client code is complicated, usually low on transparency, and bug-prone. An efficient abstraction is needed.
The challenge with Scatter/Gather I/O is that using memory scattered over multiple blocks is not a trivial matter. On projects (on Windows platforms) in the 1990s, I tended to use a custom COM stream implementation from my company's proprietary libraries, which was implemented for a different task some years previously. Permit me to talk about the COM stream architecture for a moment. (I know I promised in the last chapter there would be no more COM, but there is a point to this, even for UNIX diehards. Trust me, I'm a doctor!)
A COM stream is an abstraction over an underlying storage medium having
much in common with the file abstractions we're used to. Essentially, it
has access to the underlying medium and defines a current point within
its logical extent. A stream object exhibits the IStream interface
(shown in abbreviated form in Listing 31.1), which contains a number of
methods, including Seek(), SetSize(), Stat(),
and Clone().
There are also methods for acquiring exclusive access to regions of the
underlying medium. The IStream interface derives from
ISequentialStream (also shown in Listing 31.1), which defines the
two methods Read() and Write().You can implement a
stream for a particular underlying medium directly by deriving from IStream and
providing suitable definitions for its methods.
Listing 31.1 Definition of the ISequentialStream and IStream Interfaces
interface ISequentialStream
: public IUnknown
{
virtual HRESULT Read(void* p, ULONG n, ULONG* numRead) = 0;
virtual HRESULT Write(void const* p, ULONG n, ULONG* numWritten) = 0;
};
interface IStream
: public ISequentialStream
{
virtual HRESULT Seek(. . .) = 0;
virtual HRESULT SetSize(. . .) = 0;
virtual HRESULT CopyTo(. . .) = 0;
virtual HRESULT Commit(. . .) = 0;
virtual HRESULT Revert(. . .) = 0;
virtual HRESULT LockRegion(. . .) = 0;
virtual HRESULT UnlockRegion(. . .) = 0;
virtual HRESULT Stat(. . .) = 0;
virtual HRESULT Clone(. . .) = 0;
};
COM defines another stream-related abstraction, in the form of the
ILockBytes interface (shown in abbreviated form in Listing 31.2). It
abstracts arbitrary underlying mediums as a logically contiguous array
of bytes. It does not maintain any positional state. Hence, it has
ReadAt() and WriteAt() methods rather than Read() and Write().
Listing 31.2 Definition of the ILockBytes Interface
interface ILockBytes
: public IUnknown
{
virtual HRESULT ReadAt( ULARGE_INTEGER pos, void* p
, ULONG n, ULONG* numRead) = 0;
virtual HRESULT WriteAt(ULARGE_INTEGER pos, void const* p
, ULONG n, ULONG* numWritten) = 0;
virtual HRESULT Flush() = 0;
virtual HRESULT SetSize(. . .) = 0;
virtual HRESULT LockRegion(. . .) = 0;
virtual HRESULT UnlockRegion(. . .) = 0;
virtual HRESULT Stat(. . .) = 0;
};
It is a relatively simple matter to implement a COM stream in terms
of (an object that exhibits) the ILockBytes interface. All that's
required is an ILockBytes* and a position. My company has just such an
entity, accessible via the CreateStreamOnLockBytes() function:
HRESULT CreateStreamOnLockBytes(ILockBytes* plb, unsigned flags
, IStream** ppstm);
Obviously, the next question is, "How do we get hold of an ILockBytes
object?" Again, there's a function for that, CreateLockBytesOnMemory():
HRESULT CreateLockBytesOnMemory(void* pv
, size_t si
, unsigned flags
, void* arena
, ILockBytes** pplb);
This supports a whole host of memory scenarios, including using a fixed
buffer, using Windows "global" memory, using a COM allocator
(IAllocator), and so on. One of the many flags is SYCLBOMF_FIXED_ARRAY,
which indicates that pv points to an array of MemLockBytesBlock
structures:
struct MemLockBytesBlock
{
size_t cb;
void* pv;
};
I'm not going to bang on about this much more, as hindsight is a harsh
judge of things such as opaque pointers whose meanings are moderated by
flags. The point I want to get across about this stuff is that I was
able to take a set of memory blocks containing the scattered packet
contents and get back an IStream pointer from which the packet
information can be extracted in a logical and linear manner. Such code
takes the following simple and reasonably transparent form. (Error
handling is elided for brevity.) The ref_ptr instances are used to
ensure that the reference counts are managed irrespective of any early
returns and/or exceptions.
std::vector<WSABUF> blocks = . . .
size_t payloadSize = . . .
ILockBytes* plb;
IStream* pstm;
SynesisCom::CreateLockBytesOnMemory(&blocks[1], payloadSize
, SYCLBOMF_FIXED_ARRAY | . . ., NULL, &plb);
stlsoft::ref_ptr<ILockBytes> lb(pbl, false); // false "eats" the ref
SynesisCom::CreateStreamOnLockBytes(plb, 0, &pstm);
stlsoft::ref_ptr<IStream> stm(pstm, false); // false "eats" the ref
. . . // Pass off stm to higher-layer processing
The stream can then be wrapped by a byte-order-aware Instance Adaptor class that works in partnership with a message object Factory, to complete the mechanism for efficient translation from TCP packet stream segments to instances of higher-level protocol (C++) objects. The high efficiencies obtainable by such a scheme result from there being no allocations of, and no copying into, memory that does not constitute part of the final translated message object instances.
This is a powerful basis for a communications server model, one that I've used several times, albeit in different guises. In the case described earlier, a number of characteristics of the approach might incline you to search, as I have done, for better, less technology-specific solutions.
First, the major downside of the described mechanism is that, being COM, the server code is effectively Windows-specific. Second, many developers (incorrectly) consider COM, as they (equally incorrectly) do C++ and STL, to be intrinsically inefficient, and it can be hard to disabuse them of that notion even with hard facts. Finally, add in the type-unsafe opaque pointers and the fact that the stream and lock-bytes classes were hidden proprietary implementations, and it all leaves something to be desired.
platformstl::scatter_slice_sequence—A Teaser TrailerAn alternate representation is to be found in a new, and still
evolving, component in the PlatformSTL subproject:
scatter_slice_sequence. This Facade class template maintains an array of
slice structures describing a set of I/O buffers and provides methods
for invoking native read/write functions on the set of buffers, in
addition to providing STL collection access (in the form of begin() and
end() methods). The class works with both iovec and WSABUF by
abstracting their features with attribute shims (Section 9.2.1)
get_scatter_slice_size, get_scatter_slice_ptr, and
get_scatter_slice_size_member_ptr, shown in Listing 31.3.
Listing 31.3 Attribute Shims for the iovec and WSABUF Structures
#if defined(PLATFORMSTL_OS_IS_UNIX)
inline void* const get_scatter_slice_ptr(struct iovec const& ss)
{
return ss.iov_base;
}
inline void*& get_scatter_slice_ptr(struct iovec& ss);
inline size_t get_scatter_slice_size(struct iovec const& ss)
{
return static_cast<size_t>(ss.iov_len);
}
inline size_t& get_scatter_slice_size(struct iovec& ss);
inline size_t iovec::*
get_scatter_slice_size_member_ptr(struct iovec const*)
{
return &iovec::iov_len;
}
#elif defined(PLATFORMSTL_OS_IS_WIN32)
inline void const* get_scatter_slice_ptr(WSABUF const& ss)
{
return ss.buf;
}
inline void*& get_scatter_slice_ptr(WSABUF& ss);
inline size_t get_scatter_slice_size(WSABUF const& ss)
{
return static_cast<size_t>(ss.len);
}
inline size_t& get_scatter_slice_size(WSABUF& ss);
inline u_long WSABUF::* get_scatter_slice_size_member_ptr(WSABUF const*)
{
return &WSABUF::len;
}
#endif /* operating system */
scatter_slice_sequence currently provides for
readv()/ writev() on UNIX and
WSARecv()/WSASend() and
WSARecvFrom()/WSASendTo() on Windows. Listing 31.4 shows an
example that uses an iovec specialization of the class template to read the
contents from one file descriptor into a number of buffers, processes the
content in an STL kind of way, and then writes the converted contents to
another file descriptor.
Listing 31.4 Example Use of scatter_slice_sequence with
readv() and writev()
int fs = . . . // Opened for read
int fd = . . . // Opened for write
for(;;)
{
const size_t BUFF_SIZE = 100;
const size_t MAX_BUFFS = 10;
char buffers[MAX_BUFFS][BUFF_SIZE];
const size_t numBuffers = rand() % MAX_BUFFS;
// Declare an instance with arity of numBuffers
platformstl::scatter_slice_sequence<iovec> sss(numBuffers);
// Set up each slice in the sequence, which may be of
// different sizes in reality
{ for(size_t i = 0; i < numBuffers; ++i)
{
sss.set_slice(i, &buffers[i][0], sizeof(buffers[i]));
}}
if(0 != numBuffers) // In real scenario, might get 0 buffers
{
size_t n = sss.read(::readv, fs); // Read from fs using ::readv()
if(0 == n)
{
break;
}
// "Process" the contents
std::transform( sss.payload().begin(), sss.payload().begin() + n
, sss.payload().begin(), ::toupper);
sss.write(::writev, fd, n); // Write n to fd using ::writev()
}
}
Obviously this example is very stripped down, but I trust your abilities to
imagine that fs and fd might represent sockets, that
the buffers shown here would be obtained from a shared memory arena (which may
not have any to spare at a given time), and that the "processing" would be
something less trivial than setting the contents to uppercase before
(re)transmission.
The sequence's payload (available via payload()) provides random
access iterators over the contents of its memory blocks. Just as with
std::deque, it's important to realize that these iterators are not
contiguous (Section 2.3.6)! Pointer arithmetic on the iterators is a
constant-time operation, but iterating the range is not a linear-time
operation. The scatter_slice_sequence is still a work in progress, and
its interface might evolve further before it's released into the
PlatformSTL subproject proper. (It is on the CD.) But what it clearly
provides is the ability to represent a given set of data blocks as an
STL sequence (Section 2.2), along with adaptor methods read() and
write() that take a file/socket handle and a Scatter/Gather I/O function
and apply them to the blocks. This is the logical equivalent of the COM
stream object created via CreateLockBytesOnMemory() +
SYCLBOMF_FIXED_ARRAY and CreateStreamOnLockBytes(). The one apparent
disadvantage is that its contents have to be traversed one element at a
time, something that may have performance costs. (Hint: This is a clue
about something interesting to follow. . . .)
Adapting ACE_Message_QueueThe main subject of this chapter covers my efforts to adapt the
memory queues of the Adaptive Communications Environment (ACE) to the
STL collection concept, to serve the requirements of one of my recent
commercial networking projects, a middleware routing service. To use the
ACE Reactor framework, you derive event handler classes from
ACE_Event_Handler (overriding the requisite I/O event handler methods)
and register instances of them with the program's reactor singleton.
When the reactor encounters an I/O event of a type for which an instance
is registered, it invokes the appropriate callback method on the
handler. When used with TCP, the Internet's stream-oriented transport
protocol, the common idiom is to handle received data into instances of
ACE_Message_Block and queue them in an instance of (a specialization of)
the class template ACE_Message_Queue, as shown (with error handling
omitted for brevity) in Listing 31.5.
Listing 31.5 A Simple Event Handler for the ACE Reactor Framework
class SimpleTCPReceiver
: public ACE_Event_Handler
{
. . .
virtual int handle_input(ACE_HANDLE h)
{
const size_t BLOCK_SIZE = 1024;
ACE_Message_Block* mb = new ACE_Message_Block(BLOCK_SIZE);
ssize_t n = m_peer.recv(mb->base(), mb->size());
mb->wr_ptr(n);
m_mq.enqueue_tail(mb);
return 0;
}
. . .
private: // Member Variables
ACE_SOCK_Stream m_peer; // Connection socket
ACE_Message_Queue<ACE_SYNCH_USE> m_mq; // Message queue
};
The ACE_Message_Queue class acts as an ordered repository for all
blocks, thereby faithfully representing the data stream. But
ACE_Message_Queue is strictly a container of blocks; it does not attempt
to provide any kind of abstracted access to the contents of the blocks.
To access the contents of a message queue, you can use the associated
class template, ACE_Message_Queue_Iterator, to iterate the blocks, as
shown in Listing 31.6. The ACE_Message_Queue_Iterator::next() method
returns a nonzero result and sets the given pointer reference to the
block if a next block is available; otherwise, it returns 0. The
advance() method moves the current enumeration point to the next block
(if any).
Listing 31.6 Example Code That Uses ACE_Message_Queue_Iterator
void SimpleTCPReceiver::ProcessQueue()
{
ACE_Message_Queue_Iterator<ACE_NULL_SYNCH> mqi(m_mq);
ACE_Message_Block* mb;
for(; mqi.next(mb); mqi.advance())
{
{ for(size_t i = 0; i < mb->length(); ++i)
{
printf("%c", i[mb->rd_ptr()];
}}
mb->rd_ptr(mb->length()); // Advance read ptr to "exhaust" block
}
}
Obviously, if you want to process a set of blocks as a logically contiguous single block, it's going to be a bit messy. We need a sequence to flatten the stream for STL manipulation.
acestl::message_queue_sequence, Version 1The ACESTL subproject contains a number of components for adapting ACE
to STL (and for making ACE components easier to use).
acestl::message_queue_sequence is a class template that acts as an
Instance Adaptor for the ACE_Message_Queue. Since this component's got
quite a kick, I'm going to play my usual author's dirty trick of
presenting you with a progression of implementations. Thankfully, unlike
some material covered in other chapters, the changes between the
versions are entirely additive, which should help keep me under 40 pages
for this topic. Listing 31.7 shows the definition of the first version.
Listing 31.7 Definition of message_queue_sequence
// In namespace acestl
template <ACE_SYNCH_DECL>
class message_queue_sequence
{
public: // Member Types
typedef char value_type;
typedef ACE_Message_Queue<ACE_SYNCH_USE> sequence_type;
typedef message_queue_sequence<ACE_SYNCH_USE> class_type;
typedef size_t size_type;
class iterator;
public: // Construction
explicit message_queue_sequence(sequence_type> mq);
public: // Iteration
iterator begin();
iterator end();
public: // Attributes
size_type size() const;
bool empty() const;
private: // Member Variables
sequence_type& m_mq;
private: // Not to be implemented
message_queue_sequence(class_type const&);
class_type& operator =(class_type const&);
};
Given what we've seen with previous sequences, there's little here that
needs to be remarked on; the interesting stuff will be in the iterator class.
Note that the value type is char, meaning that size() returns the
number of bytes in the queue, and [begin(), end())
defines the range of bytes. No methods pertain to message blocks.
31.4.2 acestl::message_queue_sequence::iterator
Listing 31.8 shows the definition of the
acestl::message_queue_sequence::iterator class. Again, a lot here should
be familiar based on prior experience. (I hope by now you're building a
familiarity with these techniques, recognizing their similarities and
identifying the differences between the different cases of their
application. Naturally, it's my hope that this stands you in great stead
for writing your own STL extensions.)
The iterator category is input
iterator (Section 1.3.1). The element reference category (Section 3.3)
is transient or higher; in fact, it's fixed, with the caveat that no
other code, within or without the defining thread, changes the contents
of the underlying message queue or its blocks (in which case it would be
invalidatable). The iterator is implemented in terms of a shared_handle,
discussed shortly. I've not shown the canonical manipulation of the
shared_handle in the construction methods since we've seen it before in
other sequences (Sections 19.3 and 20.5).
Listing 31.8 Definition of message_queue_sequence::iterator
class message_queue_sequence<. . .>::iterator
: public std::iterator>std::input_iterator_tag
, char, ptrdiff_t
, char*, char&
>
{
private: // Member Types
friend class message_queue_sequence<ACE_SYNCH_USE>;
typedef ACE_Message_Queue_Iterator<ACE_SYNCH_USE> mq_iterator_type;
struct shared_handle;
public:
typedef iterator class_type;
typedef char value_type;
private: // Construction
iterator(sequence_type& mq)
: m_handle(new shared_handle(mq))
{}
public:
iterator()
: m_handle(NULL)
{}
iterator(class_type const& rhs); // Share handle via AddRef() (+)
~iterator() throw(); // Call Release() (-) if non-NULL
class_type& operator =(class_type const& rhs); // (+) new; (-) old
public: // Input Iteration
class_type& operator ++()
{
ACESTL_ASSERT(NULL != m_handle);
if(!m_handle->advance())
{
m_handle->Release();
m_handle = NULL;
}
return *this;
}
class_type operator ++(int); // Canonical implementation
value_type& operator *()
{
ACESTL_ASSERT(NULL != m_handle);
return m_handle->current();
}
value_type operator *() const
{
ACESTL_ASSERT(NULL != m_handle);
return m_handle->current();
}
bool equal(class_type const& rhs) const
{
return lhs.is_end_point() == rhs.is_end_point();
}
private: // Implementation
bool is_end_point() const
{
return NULL == m_handle || m_handle->is_end_point();
}
private: // Member Variables
shared_handle* m_handle;
};
The iteration methods are implemented in terms of the methods of
shared_handle. The endpoint state is identified by either a NULL handle
or a handle that identifies itself as being at the endpoint. The
preincrement operator advances by calling shared_handle::advance() and
releases the handle when advance() returns false. The dereference
operator overloads are implemented in terms of the current() overloads
of shared_handle. Note that the mutating (non-const) overload returns a
mutating reference, whereas the nonmutating (const) overload returns a
char by value.
The main action lies in the shared_handle. Listing 31.9 shows its
implementation. I'm going to invoke another low author tactic now and not
explain the fine detail of the algorithm. I'll leave it as an exercise for you
to figure out. To be fair, though, I will note that it skips empty
ACE_Message_Block instances, which is how its endpoint condition can be so
simple.
Listing 31.9 Definition of shared_handle
struct message_queue_sequence<. . .>::iterator::shared_handle
{
public: // Member Types
typedef shared_handle class_type;
public: // Member Variables
mq_iterator_type m_mqi;
ACE_Message_Block* m_entry;
size_t m_entryLength;
size_t m_entryIndex;
private:
sint32_t m_refCount;
public: // Construction
explicit shared_handle(sequence_type& mq)
: m_mqi(mq)
, m_entry(NULL)
, m_entryLength(0)
, m_entryIndex(0)
, m_refCount(1)
{
if(m_mqi.next(m_entry))
{
for(;;)
{
if(0 != (m_entryLength = m_entry->length()))
{
break;
}
else if(NULL == (m_entry = nextEntry()))
{
break;
}
}
}
}
private:
~shared_handle() throw()
{
ACESTL_MESSAGE_ASSERT("Shared handle destroyed with outstanding references!", 0 == m_refCount);
}
public:
sint32_t AddRef(); // Canonical implementation
sint32_t Release(); // Canonical implementation
public: // Iteration Methods
bool is_end_point() const
{
return m_entryIndex == m_entryLength;
}
char& current()
{
ACESTL_ASSERT(NULL != m_entry);
ACESTL_ASSERT(m_entryIndex != m_entryLength);
return m_entryIndex[m_entry->rd_ptr()];
}
char current() const
{
ACESTL_ASSERT(NULL != m_entry);
ACESTL_ASSERT(m_entryIndex != m_entryLength);
return m_entryIndex[m_entry->rd_ptr()];
}
bool advance()
{
ACESTL_MESSAGE_ASSERT("Invalid index", m_entryIndex < m_entryLength);
if(++m_entryIndex == m_entryLength)
{
m_entryIndex = 0;
for(;;)
{
if(NULL == (m_entry = nextEntry()))
{
return false;
}
else if(0 != (m_entryLength = m_entry->length()))
{
break;
}
}
}
return true;
}
private: // Implementation
ACE_Message_Block* nextEntry()
{
ACE_Message_Block* entry = NULL;
return m_mqi.advance() ? (m_mqi.next(entry), entry) : NULL;
}
private: // Not to be implemented
shared_handle(class_type const&);
class_type& operator =(class_type const&);
};
I hope you'll agree that being able to treat a set of
ACE_Message_Block communications stream fragments as a logically
contiguous stream eases considerably the task of dealing with such
streams. In our middleware project, this enabled us to unmarshal the
high-level protocol messages as objects, by the combination of another
Instance Adaptor and a message Factory.
Permit me a small digression about the protocol. Standards Australia defines a protocol for the exchange of electronic payment implementation, called AS2805. This is a very flexible protocol, with a significant drawback. The messages do not contain any message-size information in their fixed format header, and each message can contain a variable number of fields, some of which are of variable size. This means that you can't know whether all of a message has been received from the peer until the message has been fully parsed. Consequently, being able to easily and efficiently deconstruct a message is critical.
This was achieved by applying another Instance Adaptor to the
acestl::message_queue_sequence instance to make it be treated as a
streamable object, similar to how the logically contiguous ILockBytes
instance was turned into a streamable object with
CreateStreamFromLockBytes(). The streamable object is used by the
message Factory, which understands how to read the message type from the
packet header and then uses that type to dispatch the appropriate
unmarshaling function to read the remainder of the message contents and
create a corresponding message instance. If insufficient data is
available, the Factory invocation fails benignly, and the queue contents
remain unchanged until the next I/O event. Only when a full message is
retrieved is the requisite portion of the head of the queue removed and
released back to the memory cache. If the message parsing fails for bad
contents, the peer has sent bad data, and the connection is torn down.
So, we've got some nice abstractions going—some of which have seen
service in large-scale deployments, which is never to be sniffed at—but
you may have been receiving portents from your skeptical
subconsciousness. We're manipulating a number of blocks of contiguous
memory, but the nature of the acestl::message_queue_sequence::iterator
means that each byte in those blocks must be processed one at a time.
This has to have an efficiency cost. And it does.
Before we proceed, I want to trot out the received wisdom on performance,
namely, to avoid premature optimization. More precisely, although it's
something of a simplification, you've usually got only one bottleneck in a
system at a time. Inefficiencies usually make themselves felt only when a
greater inefficiency has been resolved. In none of the applications in which
I've used message_queue_sequence has it been associated with the bottleneck.
However, I tend to be a little efficiency obsessed—What's that? You've
noticed?—and since STLSoft is an open-source publicly available library, the
message_queue_sequence component might find itself being a bottleneck in
someone else's project, and that would never do. So, I want to show you how to
have your cake and eat it too, that is, how to linearize data block contents in
an STL sequence and yet have block-like efficiency.
acestl::message_queue_sequence, Version 2 First, we need to identify when a block transfer is valid. The ACE
libraries define opaque memory in terms of char, that is, the pointers are
either char const* or char*, presumably to make pointer arithmetic
straightforward. I don't hold with this strategy, but that's irrelevant; it is
what it is. When transferring contents between STL iterators of type char* or
char const* and acestl::message_queue_sequence::iterator, we want the
sequence's contents to be block transferred. In other words, the following code
should result in 2 calls to memcpy(), rather than 120 calls to
shared_handle::advance():
ACE_Message_Queue<ACE_NULL_SYNCH> mq; // 2 message blocks, 120 bytes char results[120]; acestl::message_queue_sequence<ACE_NULL_SYNCH> mqs(mq); std::copy(mqs.begin(), mqs.end(), &results[120]);
We want the same efficiency when transferring from contiguous memory into the message queue sequence, as in the following:
std::copy(&results[0], &results[0] + STLSOFT_NUM_ELEMENTS(results)
, mqs.begin());
The first thing we need for this is to define block copy operations for
message_queue_sequence. Listing 31.10 shows the definition of two new
static methods for the sequence class, overloads named fast_copy().
Listing 31.10 Definition of the message_queue_sequence Algorithm Worker Methods
template <ACE_SYNCH_DECL>
class message_queue_sequence
{
. . .
static char* fast_copy(iterator from, iterator to, char* o)
{
#if defined(ACESTL_MQS_NO_FAST_COPY_TO)
for(; from != to; ++from, ++o)
{
*o = *from;
}
#else /* ? ACESTL_MQS_NO_FAST_COPY_TO */
from.fast_copy(to, o);
#endif /* ACESTL_MQS_NO_FAST_COPY_TO */
return o;
}
static iterator fast_copy(char const* from, char const* to
, iterator o)
{
#if defined(ACESTL_MQS_NO_FAST_COPY_FROM)
for(;from != to; ++from, ++o)
{
*o = *from;
}
#else /* ? ACESTL_MQS_NO_FAST_COPY_FROM */
o.fast_copy(from, to);
#endif /* ACESTL_MQS_NO_FAST_COPY_FROM */
return o;
}
. . .
I've deliberately left in the #defines that suppress the block operations,
just to illustrate in code what the alternative, default, behavior is. These
#defines also facilitate tests with and without block copying enabled. (Anyone
sniff a performance test in the near future?) The block mode code uses new
iterator::fast_copy() methods, shown in Listing 31.11.
Listing 31.11 Definition of the iterator Algorithm Worker Methods
class message_queue_sequence<. . .>::iterator
{
. . .
void fast_copy(char const* from, char const* to)
{
if(from != to)
{
ACESTL_ASSERT(NULL != m_handle);
m_handle->fast_copy(from, to, static_cast<size_type>(to - from));
}
}
void fast_copy(class_type const& to, char* o)
{
if(*this != to)
{
ACESTL_ASSERT(NULL != m_handle);
m_handle->fast_copy(to.m_handle, o);
}
}
Tantalizingly, these do very little beyond invoking the same-named new
methods of the shared_handle class, shown in Listing 31.12. For both in and out
transfers, these methods calculate the appropriate portion of each block to be
read/written and effect the transfer with memcpy().
Listing 31.12 Definition of the shared_handle Algorithm Worker Methods
struct message_queue_sequence<. . .>::iterator::shared_handle
{
. . .
void fast_copy(char const* from, char const* to, size_type n)
{
ACESTL_ASSERT(0 != n);
ACESTL_ASSERT(from != to);
if(0 != n)
{
size_type n1 = m_entryLength - m_entryIndex;
if(n <= n1)
{
::memcpy(&m_entryIndex[m_entry->rd_ptr()], from, n);
}
else
{
::memcpy(&m_entryIndex[m_entry->rd_ptr()], from, n1);
from += n1;
m_entry = nextEntry();
ACESTL_ASSERT(NULL != m_entry);
fast_copy(from, to, n - n1);
}
}
}
void fast_copy(class_type const* to, char* o)
{
size_type n1 = m_entryLength - m_entryIndex;
if( NULL != to &&
m_entry == to ->m_entry)
{
::memcpy(o, &m_entryIndex[m_entry->rd_ptr()], n1);
}
else
{
::memcpy(o, &m_entryIndex[m_entry->rd_ptr()], n1);
o += n1;
m_entry = nextEntry();
if(NULL != m_entry)
{
fast_copy(to, o);
}
}
}
. . .
So far so good, but no one wants to write client code such as the following:
ACE_Message_Queue<ACE_NULL_SYNCH> mq; // 2 locks; total 120 bytes
char results[120];
acestl::message_queue_sequence<ACE_NULL_SYNCH> mqs(mq);
acestl::message_queue_sequence<ACE_NULL_SYNCH>::fast_copy(mqs.begin()
, mqs.end(), &results[120]);
We want the invocation std::copy() to pick up our fast version
automatically when the other iterator type is char (const)*. For this we need
to specialize std::copy().
For a number of reasons, defining partial template specializations in the
std namespace is prohibited. This proves inconvenient in two ways. First, and
most importantly, because message_queue_sequence is a template, we want to
cater to all its specializations and so would want to do something like that
shown in Listing 31.13. (For brevity I'm omitting the namespace qualification
acestl from each message_queue_sequence<S>::iterator shown in this
listing and the next.)
Listing 31.13 Illegal Specializations of std::copy()
// In namespace std
template <typename S>
char* copy( typename message_queue_sequence<S>::iterator from
, typename message_queue_sequence<S>::iterator to, char* o)
{
return message_queue_sequence<S>::fast_copy(from, to, o);
}
template <typename S>
typename message_queue_sequence<S>::iterator copy(char* from, char* to
, typename message_queue_sequence<S>::iterator o)
{
return message_queue_sequence<S>::fast_copy(from, to, o);
}
Since we may not do this, we are forced to anticipate the
specializations of message_queue_sequence and (fully) specialize
std::copy() accordingly, as in Listing 31.14. Note that separate char*
and char const* specializations are required for the
char-pointer-to-iterator block transfer, to ensure that copying from
char* and char const* uses the optimization.
Listing 31.14 Legal Specializations of std::copy()
// In namespace std
template <>
char*
copy( typename message_queue_sequence<ACE_NULL_SYNCH>::iterator from
, typename message_queue_sequence<ACE_NULL_SYNCH>::iterator to
, char* o)
{
return message_queue_sequence<ACE_NULL_SYNCH>::fast_copy(from, to, o);
}
. . . // Same as above, but for ACE_MT_SYNCH
template <>
typename message_queue_sequence<ACE_NULL_SYNCH>::iterator
copy(char* from
, char* to
, typename message_queue_sequence<ACE_NULL_SYNCH>::iterator o)
{
return message_queue_sequence<ACE_NULL_SYNCH>::fast_copy(from, to, o);
}
. . . // Same as above, but for ACE_MT_SYNCH
template <>
typename message_queue_sequence<ACE_NULL_SYNCH>::iterator
copy(char const* from
, char const* to
, typename message_queue_sequence<ACE_NULL_SYNCH>::iterator o)
{
return message_queue_sequence<ACE_NULL_SYNCH>::fast_copy(from, to, o);
}
. . . // Same as above, but for ACE_MT_SYNCH
Fortunately, ACE offers only two specializations, in the form of
ACE_NULL_SYNCH (a #define for
ACE_Null_Mutex, ACE_Null_Condition) and
ACE_MT_SYNCH (a #define for
ACE_Thread_Mutex, ACE_Condition_Thread_Mutex),
yielding only six specializations.
But there's more. If, like me, you avoid like the plague the use of
char as a substitute for C++'s missing byte type, you probably
instead use signed char or unsigned
char, both of which are distinct types from char when it comes to
overload resolution (and template resolution). Passing these to an invocation
of std::copy() will not succeed in invoking the optimized transfer
methods. So, with heads bowed low, we need to provide another six
specializations for signed char and six for unsigned char, yielding a total of
eighteen specializations, for what we'd like to have been two, or at most
three, were we able to partially specialize in the std namespace.
Thankfully, all this effort is worth the payoff. Before we look at
that, I just want to answer one question you might be pondering: Why
only std::copy()? In principle there is no reason to not specialize all
possible standard algorithms. The reason I've not done so is twofold.
First, the sheer effort in doing so would be onerous, to say the least;
to avoid a lot of manual plugging around we'd be pragmatically bound to
use macros, and who likes macros? The second reason is more, well,
reasoned. The whole reason for this optimization is to facilitate
high-speed interpretation of data in its original memory block and
high-speed exchange of data into new storage. In my experience, both of
these involve std::copy(). I should admit one exception to this in our
middleware project that required copy_n(). The copy_n() algorithm was
overlooked for incorporation into the C++-98 standard (but will be
included in the next version) and so appears in STLSoft. There are
specializations of it, this time in the stlsoft namespace, in the same
fashion as for std::copy(). Hence, there are a total of 36 function
specializations in the <acestl/collections/message_queue_sequence.hpp>
header file.
Now that we've examined the optimization mechanism, we'd better make
sure it's worth the not inconsiderable effort. In order to demonstrate
the differences in performance between the optimized block copying
version and the original version, I used a test program that creates an
ACE_Message_Queue instance to which it adds a number of blocks, copies
the contents from a char array using std::copy(),copies them back to
another char array (again with std::copy()), and verifies that the
contents of the two arrays are identical. The number of blocks ranged
from 1 to 10. The block size ranged from 10 to 10,000. The times for
copying from the source char array to the sequence, and from the
sequence to the destination char array, were taken separately, using the
PlatformSTL component performance_counter. Each copying operation was
repeated 20,000 times, in order to obtain measurement resolution in
milliseconds. The code is shown in the extra material for this chapter
on the CD.
Table 31.1 shows a representative sample of the results, expressed as percentages of the time (in milliseconds) taken by the equivalent nonoptimized version. As we might expect, with the very small block size of 10, the difference is negligible. For a buffer size of 100, there's an advantage with the optimized form, but it's not stunning. However, when we get to the more realistic buffer sizes of 1,000 and 10,000, there's no competition-the optimized form is 40-50 times faster.
| Array to Iterator | Iterator to Array | ||||||
|---|---|---|---|---|---|---|---|
| Number of Blocks | Block Size | Nonoptimized | Block | % | Nonoptimized | Block | % |
| 1 | 10 | 7 | 6 | 85.7% | 7 | 9 | 128.6% |
| 2 | 10 | 9 | 8 | 88.9% | 9 | 7 | 77.8% |
| 5 | 10 | 16 | 10 | 62.5% | 16 | 10 | 62.5% |
| 10 | 10 | 27 | 14 | 51.9% | 26 | 14 | 53.8% |
| 1 | 100 | 25 | 7 | 28.0% | 23 | 7 | 30.4% |
| 2 | 100 | 46 | 9 | 19.6% | 42 | 9 | 21.4% |
| 5 | 100 | 108 | 14 | 13.0% | 99 | 14 | 14.1% |
| 10 | 100 | 211 | 23 | 10.9% | 188 | 23 | 12.2% |
| 1 | 1,000 | 207 | 10 | 4.8% | 184 | 11 | 6.0% |
| 2 | 1,000 | 416 | 16 | 3.8% | 391 | 17 | 4.3% |
| 5 | 1,000 | 1,025 | 32 | 3.1% | 898 | 32 | 3.6% |
| 10 | 1,000 | 2,042 | 60 | 2.9% | 1,793 | 61 | 3.4% |
| 1 | 10,000 | 2,038 | 29 | 1.4% | 1,786 | 29 | 1.6% |
| 2 | 10,000 | 4,100 | 101 | 2.5% | 3,570 | 101 | 2.8% |
| 5 | 10,000 | 4,143 | 109 | 2.6% | 3,606 | 101 | 2.8% |
| 10 | 10,000 | 4,103 | 102 | 2.5% | 3,573 | 102 | 2.9% |
This chapter has looked at the features of Scatter/Gather I/O, whose APIs
present considerable challenges to STL adaptation. We've examined an
adaptation, in the form of the scatter_slice_sequence component, and have seen
that such sequences must have genuinely random access iterators (i.e., not
contiguous iterators), for which the identity &*it + 2 == &*(it + 2) does not
hold (see Section 2.3.6). Notwithstanding, we've seen how we can take advantage
of their partial contiguity in order to effect significant performance
improvements, something that is particularly important given their use in file
and/or socket I/O. With minimal sacrifice of the Principle of Transparency,
we've made big gains in the Principle of Composition (and also served the
Principle of Diversity along the way).
Have an opinion about scatter/gather I/O?
Discuss this article in the Articles Forum topic, Gathering Scattered I/O in C++.
STLSoft
http://stlsoft.org
Pantheios
http://pantheios.org
Extended STL
http://extendedstl.com
Imperfect C++
http://imperfectcplusplus.com
ACE
http://www.cse.wustl.edu/~schmidt/ACE.html
|
Sponsored Links
|