The Artima Developer Community
Sponsored Link

Weblogs Forum
Trees and Labelled S-Expressions

13 replies on 1 page. Most recent reply: Nov 12, 2005 9:33 PM by Christopher Diggins

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 13 replies on 1 page
Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Trees and Labelled S-Expressions (View in Weblogs)
Posted: Nov 3, 2005 10:05 PM
Reply to this message Reply
Summary
An enormous amount of data can be conveniently represented as a labelled tree. This is the basis of many markup languages such as SGML, XML, HTML, and others. Herein I propose an alternative, Labelled S-Expressions and provide code for a working parser in C++.
Advertisement
S-Expressions are a well known way of representing hierarchical data-structures in text form, but they lack the extra information and structure which is gained by associating a label with a data element.

Consider the following S-Expression:

  (cd
    (title Maggie May)
    (artist Rod Stewart)
    (country UK)
    (company Pickwick)
    (price 8.50)
    (year 1990)
  )
One drawback of the format is that the label is only identified as being the first element in the S-Expression. The other problem is that white-space is significant as being a delimiter between labels and the rest of the data. For textual data, it is important that white-space not be part of the grammar, otherwise subtle errors can easily arise. There is no standardized S-Expression format, and each has different ways of grouping data. XML on the other hand is more structured. The following is what the example data structure would look like as XML.
  <cd>
    <title>Maggie May</title>
    <artist>Rod Stewart</artist>
    <country>UK</country>
    <company>Pickwick</company>
    <price>8.50</price>
    <year>1990</year>
  </cd>
XML however is very complex and slow and painful to parse. It is also verbose. Labelled S-Expressions provide the best of both formats, while being extremely easy to parse.
  (cd:
    (title:Maggie May)
    (artist:Rod Stewart)
    (country:UK)
    (company:Pickwick)
    (price:8.50)
    (year:1990)
  )

Grammar

The precise syntax of the Labelled S-Expression format was inspired by Derek Parnell's unnamed markup language. Here is the Grammar for Labelled S-Expressions in Extended Backus-Naur Form (EBNF):
  document ::=
    tagged_content

  tagged_content ::=
    "(" text ":" content ")"

  content ::=
    text? (tagged_content | text)*

  text ::=
    (^("(" | ":" | ")" | "'") | escaped_char)+

  escaped_char ::=
    "'" .
And that is it, only five production rules! Encoding is UTF-8.

Design Goals

Labelled S-Expressions was designed so that it could be parsed as quickly as possible, and be as simple as possible, while still maintaining usefulness as a portable text-based data format.

Parser

The following is a snippet of C++ code for parsing labelled S-Expressions. It breaks a buffer up into nodes, where a node can either point to a label: "(xxx:" or a textual node. The input must already be validated to be a well-formed document.
// Public Domain code by Christopher Diggins

struct lsx_node {
  lsx_node* sibling;
  lsx_node* child;
  const char* text;
};

lsx_node* lsx_parse(const char** pp) {
  if (**pp == ')') {
    ++*pp;
    return NULL;
  }
  lsx_node* node = (lsx_node*)calloc(sizeof(lsx_node), 1);
  node->text = *pp;
  if (**pp == '(') {
    ++*pp;
    while (*(*pp)++ != ':') {
      if (**pp == '\'') ++*pp;
    }
    node->child = lsx_parse(pp);
    lsx_node* child = node->child;
    while (child != NULL) {
      child->sibling = lsx_parse(pp);
      child = child->sibling;
    }
  }
  else {
    while ((**pp != '(') && (**pp != ')')) {
      if (**pp == '\'') ++*pp;
      ++*pp;
    }
  }
  return node;
}

Checking Well-Formedness

Here is a function for checking well-formedness and returns the number of nodes in the data:
// Public Domain code by Christopher Diggins
// returns the number of lsx_nodes
// and returns 0 if document is ill-formed

int lsx_node_count(const char* p) {
  int ret = 0;
  int match = 0;
  if (*p == NULL) return 0;
  while (*p != NULL) {
    switch (*p) {
      case '(' :
        ++ret;
        ++match;
        while (*p != ':') {
          if (*p == NULL) return 0;
          if (*p == '\'') ++p;
          if (*p == NULL) return 0;
          ++p;
        }
        ++p;
        break;
      case ')' :
        --match;
        ++p;
        break;
      case ':' :
        return 0;
      default:
        ++ret;
        while ((*p != '(') && (*p != ')') && (*p != ':'))  {
          if (*p == NULL) return 0;
          if (*p == '\'') ++p;
          if (*p == NULL) return 0;
          ++p;
        }
        break;
    }
  }
  if (*p != NULL) return 0;
  if (match != 0) return 0;
  return ret;
}

XML Converter

Here is a rudimentary XML converter for LSX data:
// Public Domain code by Christopher Diggins

bool lsx_is_label(lsx_node* x) {
  return (*x->text == '(');
}

std::string lsx_get_label(lsx_node* x) {
  assert(lsx_is_label(x));
  const char* p = x->text;
  while (*(++p) != ':') {
    if (*p == '\'') ++p;
  }
  return std::string(x->text + 1, p);
}

std::string lsx_get_text(lsx_node* x) {
  assert(!lsx_is_label(x));
  const char* p = x->text;
  while (*p != ')') {
    if (*p == '\'') ++p;
    ++p;
  }
  return std::string(x->text, p);
}

void lsx_output_as_xml(lsx_node* x) {
  if (x == NULL)
    return;
  if (!lsx_is_label(x)) {
    std::cout << lsx_get_text(x) << std::endl;
  }
  else {
    std::cout << "<" << lsx_get_label(x) << ">" << std::endl;
    lsx_output_as_xml(x->child);
    std::cout << "</" << lsx_get_label(x) << ">" << std::endl;
  }
  lsx_output_as_xml(x->sibling);
};

XSLT

Xml can be mapped to Labelled S-Expressions as evidenced by the following XSLT stylesheet for transforming XML into Labelled S-Expressions.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- public domain by Christopher Diggins http://www.cdiggins.com -->
<t:transform version="1.0" xmlns:t="http://www.w3.org/1999/XSL/Transform">
  <t:output method = "html" version = "1.0" encoding = "ISO-8859-1" omit-xml-declaration = "no" standalone = "yes" indent = "yes" />

  <t:template match="/">
    <html>
    <body>
    <dl><dt>(lsx:</dt>
    <dl><t:apply-templates/></dl>
    <dt>)</dt></dl>
    </body>
    </html>
  </t:template>

  <t:template match="node()">
    <dt>(<t:value-of select="name()"/>:</dt>
      <dl><t:apply-templates select="attribute::*"/></dl>
      <dl><t:apply-templates select="child::node()"/></dl>
    <dt>)</dt>
  </t:template>

  <t:template match="attribute::*">
    (<t:value-of select="name()"/>:<t:value-of select="."/>)
  </t:template>

  <t:template match="text()"><t:value-of select="."/></t:template>
</t:transform>

Similar Work

There are a multitude of markup languages floating around, http://www.pault.com/pault/pxml/xmlalternatives.html has a nice list of markup languages. I don't believe any even come close to the simplicity and efficiency of Labelled S-Expressions.


Sean Conner

Posts: 19
Nickname: spc476
Registered: Aug, 2005

Re: Trees and Labelled S-Expressions Posted: Nov 3, 2005 11:39 PM
Reply to this message Reply
Your code may be fast, but I'd rather have slow code that works. I'll give you the benefit of the doubt and assume the code is C99 (since it sure isn't C89 standard). But even then, it doesn't quite work as intended. I have the following for a print routine:


void lsx_print(struct lsx_node *n,int l)
{
while(n != NULL)
{
if (n->text != NULL)
printf("%*.*c%s\n",l,l,n->text);
if (n->child)
lsx_print(n->child,l+1);
n = n->sibling;
}
}


I feed your parser the following:

('(cd'::blah)

I get the following:


(cd'::blah)
:blah)


Which I don't think is correct (never mind the fact that as written, the entire labelled S expression has to exist as one line).

I'm guessing you probably knocked this code out without really testing it to show just how easy it is to parse, but the code as is, doesn't really work as you intended.

Greg Jorgensen

Posts: 65
Nickname: gregjor
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 2:08 AM
Reply to this message Reply
I think you'll find lots of prior art. This looks just like nested tuples/lists/dictionaries in Python, or PHP's nested arrays. Perl and Ruby have similar constructions. I wrote a web application a couple of years ago that stores inventory data in a text file as nested PHP arrays. The entire inventory can be loaded into PHP using the eval function to parse it:

array (
1009 =>
array (
'type' => 'n',
'price' => '',
'year' => '2004',
'make' => 'FOREST RIVER',
'model' => 'LEXINGTON',
'length' => '27',
'status' => 'Available',
'stock' => '101',
),
1017 =>
array (
'type' => 'n',
'price' => '',
'year' => '2004',
'make' => 'CHINOOK',
'model' => 'CONCOURSE',
'length' => '21',
'status' => 'Available',
'stock' => '173',
),
...
)

The parser looks like this:

$f = fopen($filename, 'rb');
$s = fread($f, filesize($filename));
eval('$inventory=' . $s . ';');
// display inventory
print_r($inventory);

The code to store inventory in a file is just as simple with PHP's var_export function.

I've seen the same kind of thing many many times, and back before XML had any traction.

If you've looked at YAML you've noticed that it is pretty much the same thing without the parentheses.

> The following is a snippet of C code for parsing labelled
> S-Expressions (warning: it is much faster than what you are
> used to!)

It may be faster but I'm used to code that compiles and works. Did you get that code to run? Hold off bragging about how fast it is until it works, is robust enough to handle human-edited inputs, can deal with syntax errors, doesn't store read-only pointers into the original text buffer, etc. Code snippets used for illustrating a technique should be a lot more readable than this.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 9:10 AM
Reply to this message Reply
> Your code may be fast, but I'd rather have slow code that
> works.

Well then I invite you to fix it!

> I'll give you the benefit of the doubt and assume
> the code is C99 (since it sure isn't C89 standard).

It's actually C++. I tried to make it as close to C as possible, but I probably screwed up.

> I'm guessing you probably knocked this code out without
> really testing it to show just how easy it is to parse,
> but the code as is, doesn't really work as you intended.

I did test it some, but apparently I made some mistakes. Do you ever make mistakes in your code? Or do you spend all your time making criticisms, without offering alternatives?

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 9:18 AM
Reply to this message Reply
> I think you'll find lots of prior art.

Which I linked to!!

> If you've looked at YAML you've noticed that it is pretty
> much the same thing without the parentheses.

I don't know how you make that connection:
http://yaml.org/spec/current.html

YAML is a lot more complicated, and whitespace is significant.

> It may be faster but I'm used to code that compiles and
> works. Did you get that code to run?

It's C++.

> Hold off bragging
> about how fast it is until it works, is robust enough to
> handle human-edited inputs, can deal with syntax errors,

Here is the code for validating well-formedness (you have to call this first):


// returns the number of lsx_nodes
// and returns 0 if document is ill-formed
int lsx_node_count(const char* p) {
int ret = 0;
int match = 0;
if (*p == NULL) return 0;
while (*p != NULL) {
switch (*p) {
case '(' :
++ret;
++match;
while (*p != ':') {
if (*p == NULL) return 0;
if (*p == '\'') ++p;
if (*p == NULL) return 0;
++p;
}
++p;
break;
case ')' :
--match;
++p;
break;
case ':' :
return 0;
default:
++ret;
while ((*p != '(') && (*p != ')') && (*p != ':')) {
if (*p == NULL) return 0;
if (*p == '\'') ++p;
if (*p == NULL) return 0;
++p;
}
break;
}


> doesn't store read-only pointers into the original text
> buffer, etc.

That is an implementation detail. You should be able to figure it out yourself, if you actuallly are a programmer.

> Code snippets used for illustrating a
> technique should be a lot more readable than this.

Then I invite you to rewrite it. It shouldn't be hard for you, since you are obviously so much smarter than everyone else.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 9:38 AM
Reply to this message Reply
It turns out that your print function is simply insufficient. You have to parse the text when outputting it. Here is a more complete output code in C++:

string lsx_get_label(lsx_node* x) {
  assert(lsx_is_label(x));
  const char* p = x->text;
  while (*(++p) != ':') {
    if (*p == '\'') ++p;
  }
  return string(x->text + 1, p);
}
 
string lsx_get_text(lsx_node* x) {
  assert(!lsx_is_label(x));
  const char* p = x->text;
  while (*p != ')') {
    if (*p == '\'') ++p;
    ++p;
  }
  return string(x->text, p);
}
 
void lsx_output_as_text(lsx_node* x) {
  if (x == NULL)
    return;
  if (!lsx_is_label(x)) {
    cout << lsx_get_text(x) << endl;
  }
  else {
    cout << "(" << lsx_get_label(x) << ":" <<
    lsx_output_as_text(x->child);
    cout << ")" << endl;
  }
  lsx_output_as_text(x->sibling);
}

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 10:14 AM
Reply to this message Reply
> I did test it some, but apparently I made some mistakes.
> Do you ever make mistakes in your code? Or do you spend
> all your time making criticisms, without offering
> alternatives?

I am sorry Sean, this was uncalled for. I realize that you were trying to be helpful. To be honest, my frustration was more at Jorgenson. Lately he has been ignorantly criticizing everything I post. When you are subject to a constant barrage of criticism, you can become rather thin-skinned.

Reno C.

Posts: 10
Nickname: saxml
Registered: Oct, 2005

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 10:45 AM
Reply to this message Reply
> content ::= text? (tagged_content | text?)*

Shouldn't it be
text? (tagged_content | text)*
instead ?

I'm using my own YARD-like parser and I've got to remove this optional quantifier.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 10:54 AM
Reply to this message Reply
> > content ::= text? (tagged_content | text?)*
>
> Shouldn't it be
text? (tagged_content | text)*

> instead ?
>
> I'm using my own YARD-like parser and I've got to remove
> this optional quantifier.

Yes! You are right, my bad. That rule, as it was, would match anything. Thanks for pointing that out.

Out of curiosity, is your parser open-source and available online?

Reno C.

Posts: 10
Nickname: saxml
Registered: Oct, 2005

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 11:24 AM
Reply to this message Reply
> Out of curiosity, is your parser open-source and available online?

It was not available online because it's only a personal test. (and i'm not a professional programer).
I've just put it online: http://sax.scriptsdb.org/birdy.hpp

YARD and biscuit (mb2sync) were the main inspirations!

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 11:41 AM
Reply to this message Reply
> > Out of curiosity, is your parser open-source and
> available online?
>
> It was not available online because it's only a personal
> test. (and i'm not a professional programer).
> I've just put it online:
> http://sax.scriptsdb.org/birdy.hpp
>
> YARD and biscuit (mb2sync) were the main inspirations!

Very nice code! It has inspired me to look into simplifying the YARD code. Thanks for sharing it.

Greg Jorgensen

Posts: 65
Nickname: gregjor
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 1:41 PM
Reply to this message Reply
> You should be able to figure it out yourself,
> if you actuallly are a programmer.
> ...
> Then I invite you to rewrite it. It shouldn't be
> hard for you, since you are obviously so much
> smarter than everyone else.
> ...
> To be honest, my frustration was more at Jorgenson.
> Lately he has been ignorantly criticizing everything
> I post. When you are subject to a constant barrage of
> criticism, you can become rather thin-skinned.

I'm not interested in rewriting your code. You are the one posting code samples, so you shouldn't be too surprised when people criticize your code, especially after they've actually bothered to try to compile and run it. If you don't like your code or ideas criticized stop posting a new article every day in a forum visited by programmers.

You were thin-skinned before I ever responded to your articles . You routinely insult people who try to engage you in discussion, and you bristle at any criticism or attempt to get you to explain yourself. You present yourself as a language designer and C++ expert, so you should expect readers to hold you to a high standard. If you post vague statements about design flaws or performance problems or how such-and-such is unsuitable for your purposes, and then refuse to explain yourself or engage in a professional discussion about your ideas, how can anyone in this forum take you seriously?

As a long-time Artima reader I am a little upset at how you've spammed the site to promote yourself and to try to build a buzz around the non-existent Heron language. Even your Artima profile is essentially fishing for a job offer. I'm done ignorantly responding to your posts, but I do enjoy reading the responses -- I have actually learned a few things from astute readers who have taken the time to refute or expand on your topics. Unfortunately you eventually insult or bore them away.

Sean Conner

Posts: 19
Nickname: spc476
Registered: Aug, 2005

Re: Trees and Labelled S-Expressions Posted: Nov 4, 2005 8:04 PM
Reply to this message Reply
Using your validator and wrapping some code around it, I got the following:


GOOD: ('(a'::b)
GOOD: ( : )
GOOD: (:)
GOOD: (a:b)
GOOD: (a:)
GOOD: (:b)
BAD: (a:b:c)
GOOD: (:(:))
BAD: (a:(b)
BAD: (a:())
BAD: ((a:b):b)
GOOD: (a:b(c:d))
GOOD: (a : b)
GOOD: ( a : b)
GOOD: ('::':)
GOOD: ('':b)
BAD: ()
BAD: (:()()())
GOOD: (:(:)(:)(:))
BAD: (::)


So, which are actually good and which are not?

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions Posted: Nov 12, 2005 9:33 PM
Reply to this message Reply
> Using your validator and wrapping some code around it, I
> got the following:
>
>

> GOOD: ('(a'::b)
> GOOD: ( : )
> GOOD: (:)
> GOOD: (a:b)
> GOOD: (a:)
> GOOD: (:b)
> BAD: (a:b:c)
> GOOD: (:(:))
> BAD: (a:(b)
> BAD: (a:())
> BAD: ((a:b):b)
> GOOD: (a:b(c:d))
> GOOD: (a : b)
> GOOD: ( a : b)
> GOOD: ('::':)
> GOOD: ('':b)
> BAD: ()
> BAD: (:()()())
> GOOD: (:(:)(:)(:))
> BAD: (::)
>

>
> So, which are actually good and which are not?

AFAIK that labelling is entirely correct according to the grammar I posted.

Flat View: This topic has 13 replies on 1 page
Topic: Generics and covariant return types Previous Topic   Next Topic Topic: Separation of Concerns


Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2014 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use - Advertise with Us