The Artima Developer Community
Sponsored Link

Heron-Centric: Ruminations of a Language Designer
Trees and Labelled S-Expressions
by Christopher Diggins
November 4, 2005
Summary
An enormous amount of data can be conveniently represented as a labelled tree. This is the basis of many markup languages such as SGML, XML, HTML, and others. Herein I propose an alternative, Labelled S-Expressions and provide code for a working parser in C++.

Advertisement

S-Expressions are a well known way of representing hierarchical data-structures in text form, but they lack the extra information and structure which is gained by associating a label with a data element.

Consider the following S-Expression:

  (cd
    (title Maggie May)
    (artist Rod Stewart)
    (country UK)
    (company Pickwick)
    (price 8.50)
    (year 1990)
  )
One drawback of the format is that the label is only identified as being the first element in the S-Expression. The other problem is that white-space is significant as being a delimiter between labels and the rest of the data. For textual data, it is important that white-space not be part of the grammar, otherwise subtle errors can easily arise. There is no standardized S-Expression format, and each has different ways of grouping data. XML on the other hand is more structured. The following is what the example data structure would look like as XML.
  <cd>
    <title>Maggie May</title>
    <artist>Rod Stewart</artist>
    <country>UK</country>
    <company>Pickwick</company>
    <price>8.50</price>
    <year>1990</year>
  </cd>
XML however is very complex and slow and painful to parse. It is also verbose. Labelled S-Expressions provide the best of both formats, while being extremely easy to parse.
  (cd:
    (title:Maggie May)
    (artist:Rod Stewart)
    (country:UK)
    (company:Pickwick)
    (price:8.50)
    (year:1990)
  )

Grammar

The precise syntax of the Labelled S-Expression format was inspired by Derek Parnell's unnamed markup language. Here is the Grammar for Labelled S-Expressions in Extended Backus-Naur Form (EBNF):
  document ::=
    tagged_content

  tagged_content ::=
    "(" text ":" content ")"

  content ::=
    text? (tagged_content | text)*

  text ::=
    (^("(" | ":" | ")" | "'") | escaped_char)+

  escaped_char ::=
    "'" .
And that is it, only five production rules! Encoding is UTF-8.

Design Goals

Labelled S-Expressions was designed so that it could be parsed as quickly as possible, and be as simple as possible, while still maintaining usefulness as a portable text-based data format.

Parser

The following is a snippet of C++ code for parsing labelled S-Expressions. It breaks a buffer up into nodes, where a node can either point to a label: "(xxx:" or a textual node. The input must already be validated to be a well-formed document.
// Public Domain code by Christopher Diggins

struct lsx_node {
  lsx_node* sibling;
  lsx_node* child;
  const char* text;
};

lsx_node* lsx_parse(const char** pp) {
  if (**pp == ')') {
    ++*pp;
    return NULL;
  }
  lsx_node* node = (lsx_node*)calloc(sizeof(lsx_node), 1);
  node->text = *pp;
  if (**pp == '(') {
    ++*pp;
    while (*(*pp)++ != ':') {
      if (**pp == '\'') ++*pp;
    }
    node->child = lsx_parse(pp);
    lsx_node* child = node->child;
    while (child != NULL) {
      child->sibling = lsx_parse(pp);
      child = child->sibling;
    }
  }
  else {
    while ((**pp != '(') && (**pp != ')')) {
      if (**pp == '\'') ++*pp;
      ++*pp;
    }
  }
  return node;
}

Checking Well-Formedness

Here is a function for checking well-formedness and returns the number of nodes in the data:
// Public Domain code by Christopher Diggins
// returns the number of lsx_nodes
// and returns 0 if document is ill-formed

int lsx_node_count(const char* p) {
  int ret = 0;
  int match = 0;
  if (*p == NULL) return 0;
  while (*p != NULL) {
    switch (*p) {
      case '(' :
        ++ret;
        ++match;
        while (*p != ':') {
          if (*p == NULL) return 0;
          if (*p == '\'') ++p;
          if (*p == NULL) return 0;
          ++p;
        }
        ++p;
        break;
      case ')' :
        --match;
        ++p;
        break;
      case ':' :
        return 0;
      default:
        ++ret;
        while ((*p != '(') && (*p != ')') && (*p != ':'))  {
          if (*p == NULL) return 0;
          if (*p == '\'') ++p;
          if (*p == NULL) return 0;
          ++p;
        }
        break;
    }
  }
  if (*p != NULL) return 0;
  if (match != 0) return 0;
  return ret;
}

XML Converter

Here is a rudimentary XML converter for LSX data:
// Public Domain code by Christopher Diggins

bool lsx_is_label(lsx_node* x) {
  return (*x->text == '(');
}

std::string lsx_get_label(lsx_node* x) {
  assert(lsx_is_label(x));
  const char* p = x->text;
  while (*(++p) != ':') {
    if (*p == '\'') ++p;
  }
  return std::string(x->text + 1, p);
}

std::string lsx_get_text(lsx_node* x) {
  assert(!lsx_is_label(x));
  const char* p = x->text;
  while (*p != ')') {
    if (*p == '\'') ++p;
    ++p;
  }
  return std::string(x->text, p);
}

void lsx_output_as_xml(lsx_node* x) {
  if (x == NULL)
    return;
  if (!lsx_is_label(x)) {
    std::cout << lsx_get_text(x) << std::endl;
  }
  else {
    std::cout << "<" << lsx_get_label(x) << ">" << std::endl;
    lsx_output_as_xml(x->child);
    std::cout << "</" << lsx_get_label(x) << ">" << std::endl;
  }
  lsx_output_as_xml(x->sibling);
};

XSLT

Xml can be mapped to Labelled S-Expressions as evidenced by the following XSLT stylesheet for transforming XML into Labelled S-Expressions.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- public domain by Christopher Diggins http://www.cdiggins.com -->
<t:transform version="1.0" xmlns:t="http://www.w3.org/1999/XSL/Transform">
  <t:output method = "html" version = "1.0" encoding = "ISO-8859-1" omit-xml-declaration = "no" standalone = "yes" indent = "yes" />

  <t:template match="/">
    <html>
    <body>
    <dl><dt>(lsx:</dt>
    <dl><t:apply-templates/></dl>
    <dt>)</dt></dl>
    </body>
    </html>
  </t:template>

  <t:template match="node()">
    <dt>(<t:value-of select="name()"/>:</dt>
      <dl><t:apply-templates select="attribute::*"/></dl>
      <dl><t:apply-templates select="child::node()"/></dl>
    <dt>)</dt>
  </t:template>

  <t:template match="attribute::*">
    (<t:value-of select="name()"/>:<t:value-of select="."/>)
  </t:template>

  <t:template match="text()"><t:value-of select="."/></t:template>
</t:transform>

Similar Work

There are a multitude of markup languages floating around, http://www.pault.com/pault/pxml/xmlalternatives.html has a nice list of markup languages. I don't believe any even come close to the simplicity and efficiency of Labelled S-Expressions.

Talk Back!

Have an opinion? Readers have already posted 13 comments about this weblog entry. Why not add yours?

RSS Feed

If you'd like to be notified whenever Christopher Diggins adds a new entry to his weblog, subscribe to his RSS feed.

About the Blogger

Christopher Diggins is a software developer and freelance writer. Christopher loves programming, but is eternally frustrated by the shortcomings of modern programming languages. As would any reasonable person in his shoes, he decided to quit his day job to write his own ( www.heron-language.com ). Christopher is the co-author of the C++ Cookbook from O'Reilly. Christopher can be reached through his home page at www.cdiggins.com.

This weblog entry is Copyright © 2005 Christopher Diggins. All rights reserved.

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use