The Artima Developer Community
Sponsored Link

The Road to Code
The Taxation of Representation
by Kevlin Henney
July 30, 2003
Summary
Be careful not to confuse the map with the territory. Presentation and representation are different concepts for different purposes. The everyday notation for many quantities, such as money or time of day, is not necessarily a good indicator of how they should best be represented in a program.

Advertisement

Be careful not to confuse the map with the territory. Presentation and representation are different concepts for different purposes. The everyday notation for many quantities, such as money or time of day, is not necessarily a good indicator of how they should best be represented in a program.

The presentation form of many quantities is field based and familiar to human readers... which of course makes not a jot of difference to machine or object. It is natural and understandable for programmers to latch onto the presentation form as the first choice for the representation. For example, representing monetary quantities as units and hundredths — dollar and cent, euro and cent, pound and penny, etc — or time of day as hour, minute and second fields. However, when the main aim is calculation rather than presentation, this field-based representation can make the arithmetic code unnecessarily complex and error prone, and generally takes more development effort and lines of code than choosing a different more obviously scalar representation: the total count of the smallest subdivision, e.g. total cents or total seconds.

The Simplest Thing?

"But surely it's the simplest thing that could possibly work?" comes the pre-digested-mantra defence of field-based representation. Well, not really. Sometimes it's not so much simple as simplistic. It can take more work to make it "work", and even then only "possibly". Looking at a program's data structures can tell you a lot about what to expect of its algorithms. In the case of presentation–representation literalism you can tell that much calculation code is going to be awkward and error prone.

You can readily compare the consequences of the two approaches to representation by considering how to write code to perform an addition or calculate a difference. To add two monetary quantities together using a multiple field representation requires you to perform an addition with an overflow check and a carry followed by another addition, similarly for difference but with an underflow. It feels like going back to school and learning how to do arithmetic all over again.

The most simple-minded representation for time of day would, in addition to the obvious hour, minute and second fields, include a flag to indicate AM or PM. This redundancy has the effect of making the carry-and-cascade control flow a little more convoluted. Indeed, even leaving aside questions of clarity and logic, perhaps the single most compelling reason for generally letting go of the 12-hour clock in favour of the more intuitive 24-hour model is that among the few nations still using the 12-hour model widely, the proportion of people who actually know how to use it correctly are in the minority: midday and midnight are frequently expressed the wrong way around.

So much for "simple" — plenty of "work", though. In contrast to these modulo minutiae, adding two piles of pennies together or subtracting one number of seconds from another is single-statemently trivial. The simplest representation that could possibly work is the one that makes the dependent code simplest to write. Unless the primary use for these types in a program is for presentation — so they are simply holders rather than calculators of values — then a single scalar field will generally make code easier. To recover the presentation view requires no more than the lightest sprinkling of modulo arithmetic.

Home on the Range

What about range issues? Two or three fields can cover a far greater state space than a single field. This is in principle true, but in practice largely irrelevant. Sixty seconds in a minute is not exactly a challenge for any integer type. For a valid time of day you are limited to one of 86400 seconds, all of which fit comfortably inside any self-respecting 32-bit integer with room to spare. And although you may have a larger state space available by using separate fields, you have more state to manage. The additional management of that state model is what takes up the bulk of the calculation code when compared to the more compact single-field representation.

For monetary quantities the exact scaling you use must take into account the application in question. Do you need to work to two decimal places or four? Are you tracking pennies or hundredths of pennies? And what kinds of values are you recording? Trade deficits or domestic budgets? A 64-bit integer allows you to represent over 1019 different values. To put this into some kind of perspective, it has recently been estimated that there are at least of the order of 1023 stars in the universe. There are at least ten times more stars in the universe than there are grains of sand in the deserts and beaches of Earth. So you may not be able to count all the stars or terrestrial grains of sand with it, but a 64-bit integer for money should be more than enough for most purposes, whether you are working in millionths of a penny or in Turkish Lira.

Placing an Order

Of course, the one thing I am not recommending is that you mash all numeric field-based representations into a single integer variable regardless of application. First of all, not all multiple field arrangements define ordered and linear values. Longitude and latitude are respectively scalar quantities, but their pairing is not. You can impose an ordering on them for sorting convenience, but the ordering is extrinsic rather than intrinsic.

Sometimes there are indeed range issues, but for every example that needs that extended range, there are many, many more that don't. Splitting fields for range is the exception rather than the norm.

What most affects the choice of data structure is that of usage and, therefore, algorithms. The terms "data structure" and "algorithm" conjure up a high-brow, mathematical, and academic image in the minds of many programmers. A focus on the inaccessible and the irrelevant, putting the "Oh?" or "Uh oh!" into O notation, with little bearing on the workaday world of commercial application programming. However, far from being a minority sport, data organization and the control flow of tasks against that data are the ingredients of any program. Put like that, data structures and algorithms are the concern of any and every programmer. This is not a call for everyone to go out and inwardly digest the stuff of deepest Knuth. It is a recognition of the basic principle that what lies beneath a well-considered interface should also be well considered. We program against interfaces to help separate concerns, not divorce them.

If you really are focused on presentation rather than calculation, then a field-based representation is a contender. If your need for representing the time of day is for display and setting, and the only arithmetic you perform is to increment by one second at a time, then holding the hour, minute and second explicitly is a reasonable option. In the absence of general arithmetic, the field-based representation is indeed the simplest and most direct.

Note that comparison is always, in principle easy, regardless of representation. Comparing total seconds against total seconds is trivial. It is also not hard to convert from hour, minute and second into total seconds (3600 × the hour + 60 × the minute + the second) to achieve the same comparison. There is no need to compare the hour, the minute and the second separately in a conditional cascade.

The same is also true with field-based dates, but not perhaps in the way that you might first think. Take the date 6th April 2002. In literal fields it can be represented in the common little-endian day-in-the-month, month and year form; the hierarchical big-endian year, month and day-in-the-month form; or the American middle-endian month, day-in-the-month and year form. The big-endian form is the one that has the most appeal to programmers: it is, quite literally, ordered. You could indeed successively compare the year, the month and the day against 2002, 4 and 6, respectively. Or you could compare 10000 × the year + 100 × the month + the day in the month against 20020406. If you are implementing an operation that compares two dates, returning an integer that is less than, equal to or greater than zero depending on whether the left-hand operand is less than, equal to or greater than the right-hand operand, you simply have to subtract one composite number from the other to get the correct result.

Making a Date

In general, dates present perhaps the most interesting representation challenge. Dates form an ordered sequence of values, but the fields into which they are broken depend on history, geography and, well, date. Beyond the mild quirkiness of Western calendars there are date systems that are richer and quirkier. However, regardless of system, there are none that are able to successfully align the day, the year, and any of the subdivisions between them into a system that could be described as regular. This makes date handling all the more interesting.

Leap seconds are not common (roughly one in fifty million), so there is little to dispute when boldly characterizing a minute as having sixty seconds, which in turn forms one sixtieth of an hour, which is in turn one twenty-fourth of a solar day. But how many days in a month? That depends on the month. And the year. You move from straight modulo arithmetic against constants into lookups for lookup tables. Representing a date as year, month, and day-in-the-month fields is good for presentation of, well, the year, the month and the day in the month, but is clumsy for general arithmetic operations. Quick, what will be the date 37 days from now? Given today's date, on what day of the week did 6th April 2002 fall?

With an epoch-based date system — the number of days from a fixed date — the answers are trivial. It is simply a matter of addition or modulo-seven arithmetic. For example, 1st January 1900 (day 1) was a Monday; therefore, 6th April 2002 (day 37351) was a Saturday (Monday + 37350 modulo 7). With an explicit year–month–day representation you have to use the intricacies of Zeller's Congruence to obtain that result.

On the other hand, if your primary goal is to present the date in human-readable form, you have more work to do with an epoch-based scalar than with the more literal representation. Converting back from 37351 to 6th April 2002 requires some careful calculation.

And there are other shades in between these representations. For example, holding the day in the year and the year offers a halfway house suitable for some programs. Pick an application, then pick the representation. Many date-handling libraries claim to be fully general purpose, which is either not quite true — they make some specific assumptions, e.g. handling the Gregorian calendar only — or not that useful — their generality is often at the expense of simplicity. Much generality is wasted on applications that are specific.

A good interface is arrived at by careful and empirical consideration of what is required rather than the power of pure abstract thinking. To obtain economy of expression and execution on both sides of that interface requires more than a casual nod at the concept. Design is context sensitive. Inside many complicated context-free solutions are simpler more specific solutions struggling to get out. The common day-to-day presentation of a concept, whether the value-based examples discussed here or more involved domain models, is not always the best guide to the representation.

Talk Back!

Have an opinion? Readers have already posted 2 comments about this weblog entry. Why not add yours?

RSS Feed

If you'd like to be notified whenever Kevlin Henney adds a new entry to his weblog, subscribe to his RSS feed.

About the Blogger

Kevlin is an independent consultant and trainer based in the UK. His development interests are in patterns, programming, practice and process. He has been a columnist for various magazines and web sites, including Better Software, The Register, Application Development Advisor, Java Report and the C/C++ Users Journal. Kevlin is co-author of A Pattern Language for Distributed Computing and On Patterns and Pattern Languages, two volumes in the Pattern-Oriented Software Architecture series. He is also editor of the 97 Things Every Programmer Should Know site and book.

This weblog entry is Copyright © 2003 Kevlin Henney. All rights reserved.

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use