The Artima Developer Community

Artima Developer Spotlight Forum
Ted Neward: So You Say You Want to Kill XML...

16 replies on 2 pages. Most recent reply: Jul 28, 2008 7:55 AM by rodolfodpk

Frank Sommers

Posts: 2642
Nickname: fsommers
Registered: Jan, 2002

Ted Neward: So You Say You Want to Kill XML... Posted: Jul 15, 2008 8:27 PM

When Google open-sourced its internal Protocol Buffers data serialization format, its description of the tool invited a comparison with XML:

A flexible, efficient, automated mechanism for serializing structured data—think XML, but smaller, faster, and simpler...

As a de facto standard for Web-based data distribution, as well as an internal data storage format for many enterprise applications, XML has become so ubiquitous that few question its use for data serialization. Google's wide public support for a non-XML data serialization format generated a fair amount of controversy. Among the more reasoned analyses is Ted Neward's recent blog post, So You Say You Want to Kill XML...., where he compares Protocol Buffers—and binary formats more generally—to XML:

The Protocol Buffer approach looks like a good one, but let's not let the details get lost in the shouting: Protocol Buffers, as with any binary protocol format and/or RPC mechanism... are great for those situations where performance is critical and both ends of the system are well-known and controlled. If Google wants to open up their services such that third-parties can call into those systems using the Protocol Buffers approach, then more power to them...

In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is ICE, or even CORBA

One of Neward's issues is with the claim that Protocol Buffers are programming language-neutral, similar to XML:

Protocol Buffers' claim to be language and/or platform-neutral is hardly justifiable, given that they have zero support for the .NET platform out of the box... Without it coming out of the box, it's not fair to claim language- and platform-neutrality, unless, of course, they are willing to suggest that COM's Structured Storage was also language- and platform-neutral... Frankly, any binary format, regardless of how convoluted, could be claimed to be language- and platform-neutral under those conditions, which I think makes the claim spurious to make.

XML still holds the edge here, by a long shot--until we see implementations of Protocol Buffers for Perl/Parrot, C, D, .NET, Ruby, JavaScript, mainframes and others, PBs will have to take second-place status behind XML in terms of "reach" across the wide array of systems. Does that diminish their usefulness? Hardly. It just depends on how far a developer wants their data format to stretch.

Neward also takes issue with the frequent criticism that XML is verbose and inefficient, something that the Protocol Buffers Google project description also mentions:

The goal of XML was never to be small or fast, but still clearly simple. And, despite your personal opinion about the ecosystem that has grown up around XML (SOAP, WS-*, and so on), it's still fairly easy to defend the idea that XML itself is a simple technology, particularly if we make some basic assumptions around things that usually complicate text like character sets and encoding and such... Why, then, did XML take on a role as data-transfer format if, on the surface of things, using text here was such a bad idea?

Certainly the interoperability argument doesn't require a text-based format, it was just always cited that way. In fact, both the CORBA supporters and the developers over at ZeroC will both agree with Google in suggesting that a binary format can and will be an efficient and effective interoperability format.

Another issue Neward discusses is XML's claim to be self-descriptive, something that can also be turned around in Protocol Buffers' favor:

Because XML documents are intended to be self-descriptive, the Protocol Buffer format can contain just the data, and leave the format and structure to be enforced by the code on either side of the producer/consumer relationship. Whether you consider this a Good Thing or a Bad Thing probably stands as a good indicator of whether you like the Protocol Buffer approach or the XML approach better.

What do you think of Neward's analysis of Protocol Buffers vs XML?
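To make the "smaller" claim concrete, here is a minimal Java sketch. The hand-rolled `encode` method is purely illustrative of the "schema lives outside the message" idea; it is not the actual Protocol Buffers wire format:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WireSizeDemo {
    // Hand-rolled "schema on the side" encoding (an illustration of the idea,
    // not the real Protocol Buffers wire format): just the values, in an
    // order both ends have agreed on out of band.
    static byte[] encode(String name, int id) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(name); // 2-byte length prefix + UTF-8 bytes
        out.writeInt(id);   // 4 bytes
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Self-describing: the field names travel with every message.
        String xml = "<person><name>Ada</name><id>7</id></person>";
        System.out.println(xml.getBytes(StandardCharsets.UTF_8).length); // 43
        System.out.println(encode("Ada", 7).length);                     // 9
    }
}
```

The binary record is a fraction of the XML's size, but nothing in those nine bytes tells a receiver what they mean; that knowledge lives entirely in the shared definition.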


James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 16, 2008 9:47 AM
> Because XML documents are intended to be
> self-descriptive, the Protocol Buffer format can contain
> just the data, and leave the format and structure to be
> enforced by the code on either side of the
> producer/consumer relationship. Whether you consider this
> a Good Thing or a Bad Thing probably stands as a good
> indicator of whether you like the Protocol Buffer approach
> or the XML approach better.

I think this is the heart of the question that needs to be asked before adopting the format. The main problem with formats (whether they are 'binary' is irrelevant at this point) where each message doesn't describe its structure is that it's impossible (in general) to validate that the data you received is what the sender intended. You need to be absolutely sure that the definitions on both sides of the wire are the same at all times.

That's not necessarily a deal-breaker, it's just important to consider.

The other point that I think is poorly understood in general is "The Myth of the One True Schema." Some technical leads at my previous employer decided that we would have one allowed schema for each type of message and built the message-processing system accordingly. However, the organizations we wanted to work with had already picked out their own preferred formats, and often we needed them more than they needed us, so we used their format. This was a painful learning experience, as we had all our eggs in the "one schema" basket: we didn't have a good strategy for addressing this and ended up with an unmaintainable mess as we scrambled to deal with it and with versioning issues.

One solution to this issue is transformation tools such as XSLT. My opinion is that anyone planning to build an integration layer needs to allow for transformation at all communication points. If PB does in fact not provide anything for this, then I agree that this is a major gap. I disagree with Ted Neward that developers dislike XSLT because it is complex; the problem for most developers is that it's not a strictly imperative language.

It seems to me that all hierarchical formats are basically equivalent in terms of what they can represent. Assuming that's the case, it should be possible to build SAX and DOM parsers over any hierarchical format and, with a little magic, enable tools like XSLT to operate on those formats. Of course, some parts of XML are not present in other formats, so not all of XSLT would be pertinent, but I don't see why you couldn't take advantage of the general capabilities of these tools, including XML-to-object binding.
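As a sketch of that idea, here is a small Java program, using only the JDK's built-in SAX and JAXP classes, that fires SAX events from a non-XML hierarchical structure (a nested `Map`, standing in for any hierarchical format) and serializes them through an identity transformer:

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class MapToSax {
    // Walk a nested Map (a stand-in for any hierarchical format)
    // and fire the corresponding SAX events.
    @SuppressWarnings("unchecked")
    static void walk(Map<String, Object> node, TransformerHandler h) throws Exception {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            h.startElement("", e.getKey(), e.getKey(), new AttributesImpl());
            if (e.getValue() instanceof Map) {
                walk((Map<String, Object>) e.getValue(), h);
            } else {
                char[] text = e.getValue().toString().toCharArray();
                h.characters(text, 0, text.length);
            }
            h.endElement("", e.getKey(), e.getKey());
        }
    }

    // Serialize the Map through an identity transformer consuming our SAX events.
    static String toXml(Map<String, Object> root) throws Exception {
        TransformerHandler h = ((SAXTransformerFactory)
                SAXTransformerFactory.newInstance()).newTransformerHandler();
        h.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        h.setResult(new StreamResult(out));
        h.startDocument();
        walk(root, h);
        h.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("name", "Ada");
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("person", inner);
        System.out.println(toXml(root)); // <person><name>Ada</name></person>
    }
}
```

Anything downstream of the `TransformerHandler` (an XSLT stylesheet, for instance) never knows the events didn't come from an XML parser.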

Gregor Zeitlinger

Posts: 108
Nickname: gregor
Registered: Aug, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 16, 2008 12:31 PM
> I disagree with Ted Neward that developers dislike XSLT
> because it is complex.
You don't think XSLT is too complex?
Even the easiest things are confusing in XSLT.
And I'm not even speaking of the loose typing...

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 16, 2008 2:16 PM
> > I disagree with Ted Neward that developers dislike XSLT
> > because it is complex.
> You don't think XSLT is too complex?
> Even the easiest things are confusing in XSLT.

Can you give an example?

I really don't find XSLT to be very complicated, and I think it's well adapted to its intended purpose.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 7:44 AM
I got sidetracked yesterday and forgot to point out that Ted Neward makes a fairly nonsensical argument for XML:

> The goal of XML was never to be small or fast, but
> still clearly simple. And, despite your personal opinion
> about the ecosystem that has grown up around XML (SOAP,
> WS-*, and so on), it's still fairly easy to defend the
> idea that XML itself is a simple technology, particularly
> if we make some basic assumptions around things that
> usually complicate text like character sets and encoding
> and such... Why, then, did XML take on a role as
> data-transfer format if, on the surface of things, using
> text here was such a bad idea?

Would this argument hold up in court?

"Look, I have never promised I would never kill anyone."

Whether the point of XML was to be fast or not is completely irrelevant. If it's slow, it's slow.

I also take issue with the idea that it's slow and bloated because it's text-based. There are other text-based protocols that are less bloated and can be parsed faster.

There are two orthogonal choices here:

1. text-based or not
2. self-describing* or not

This article seems to conflate them.

You can have self-descriptive formats that are not text-based and non-self-describing formats that are text-based.

* I hate this term because it leads to a common belief that XML allows things that can't be done with an equivalent non-self-describing format, which is of course not true. Unfortunately, I don't know of a better term.

Vijay Kandy

Posts: 37
Nickname: vijaykandy
Registered: Jan, 2007

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 10:29 AM
XSLT uses DOM and loads the entire input XML into memory. I worked for an insurance company where input files were more than 20MB each. The stylesheet would be at least twice the size (or more), because XSLT is itself XML and is too verbose. XSLT is good for smaller XML files, but I'd use XPath plus a simple template engine for input files in the megabytes.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 10:52 AM
> XSLT uses DOM and loads up the entire input XML into
> memory. I worked for an insurance company where input
> files were more than 20MB each. The style sheet would be
> at least twice the size (or more) because XSLT is again
> XML and is too verbose.

I can't be sure without any details, but if your XSLT is longer than the input, you probably aren't using the full power of the language. I've seen a lot of stylesheets that are basically hardcoded if-else chains, which completely misses the point.

It's also possible that XSLT was not the proper tool for what you were doing. XSLT is a great language for transforming documents; based on personal experience, I don't like it for implementing logic.

> XSLT is good for smaller XML files
> but I'd use XPath + a simple template engine for input
> files in MB.

XSLT doesn't use DOM or any other object model; XSLT is a language. The XSLT processor you were using may use DOM, or may be configured to use it, but you can use SAX or other underlying models for XSLT. In Java, the underlying model is determined by the type of source you provide: if you provide a DOMSource, it will use DOM (surprise); if you provide a SAX source, it will use SAX. The disadvantage with SAX is that if you are constantly going back to the root of your document, SAX may take too long. Luckily, most transformations don't require this, and even if you need to do it, you can usually do two transformations to get around the limitation.
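For illustration, here is a minimal JAXP example (JDK built-ins only; the stylesheet and input are invented) that feeds the transformer `StreamSource`s, letting the processor pick its own internal representation rather than a caller-built DOM:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class StreamTransform {
    // Run an XSLT transform from StreamSources; the processor chooses
    // its own internal model instead of requiring a user-built DOM.
    static String transform(String xslt, String input) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(input)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output omit-xml-declaration='yes'/>"
          + "<xsl:template match='/data'><out><xsl:value-of select='field'/></out></xsl:template>"
          + "</xsl:stylesheet>";
        String input = "<data><field>hello</field></data>";
        System.out.println(transform(xslt, input)); // <out>hello</out>
    }
}
```

Had the caller built a `DOMSource` instead, the whole input tree would sit in memory in the caller's chosen model before the transform even started.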

Bill Pyne

Posts: 165
Nickname: billpyne
Registered: Jan, 2007

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 12:15 PM
The July 2008 issue of "Communications of the ACM" has a good article related to this topic entitled "XML Fever".

Gregor Zeitlinger

Posts: 108
Nickname: gregor
Registered: Aug, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 1:09 PM
> > > I disagree with Ted Neward that developers dislike
> XSLT
> > > because it is complex.
> > You don't think XSLT is too complex?
> > Even the easiest things are confusing in XSLT.
>
> Can you give an example?
Loops are complicated if you cannot use for-each; then you have to use recursion.
The fact that XSLT is itself XML makes the syntax fairly hard to read, too.

> I really don't find XSLT to be very complicated and I
> think it's well adapted to its intended purpose.
Maybe I've just seen examples where it was not used for its intended purpose. Or maybe there just aren't that many intended uses for data-centric XML (which is what most people have to deal with).

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 1:44 PM
> > > > I disagree with Ted Neward that developers dislike
> > XSLT
> > > > because it is complex.
> > > You don't think XSLT is too complex?
> > > Even the easiest things are confusing in XSLT.
> >
> > Can you give an example?
> Loops are complicated, if you cannot use for-each. Then
> you have to use recursion.

I generally don't need loops outside of for-each, and I use even that rarely. You need to use XSLT as a functional language and really make use of pattern matching; otherwise it's going to be a pain in the ass. I think the functional aspect of XSLT is what makes it hard for most developers.

I don't want to minimize that hurdle, I just don't think 'complexity' is the right word.

> The fact that XSLT is XML itself makes the syntax fairly
> hard to read, too.

A good editor makes all the difference here (fairly weak argument, I know.) JEdit does some pretty nice syntax highlighting.

Maybe being written in XML isn't necessary to accomplish this, but one nice feature of XSLT is that you can write XML directly into the template for output, instead of making calls to functions for that purpose. It took me a while to realize that.

> > I really don't find XSLT to be very complicated and I
> > think it's well adapted to its intended purpose.
> Maybe I've just seen examples where it was not used for
> its intended purpose. Or maybe there just aren't that many
> intended uses for data-centric XML (which is what most
> people have to deal with)

I think it's really powerful for transformations that don't require much logic. Once you get into calculations and other stuff, it starts to get a little hairy. My ideal transformation layer would allow composing XSLT with imperative transformations.

One good example of data-centric XSLT is an input document of the style:

<data>
<field name="foo" value="one"/>
<field name="bar" value="two"/>
</data>


and you need:


<data>
<foo>one</foo>
<bar>two</bar>
</data>


or vice versa.

You can write this in a completely generic way in a few lines of code. You can add a few tweaks as needed and layer other transformations, and it's a lot easier to work with and maintain than a standard Java (for example) approach.
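A sketch of such a generic stylesheet, embedded in a small Java driver so it runs standalone (the stylesheet is the interesting part; the driver is boilerplate, and all names are illustrative):

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class FieldsToElements {
    // Generic rule: each <field name="n" value="v"/> becomes <n>v</n>.
    // The element name is computed from the attribute via the {@name} AVT.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='data'><data><xsl:apply-templates select='field'/></data></xsl:template>"
      + "<xsl:template match='field'>"
      + "<xsl:element name='{@name}'><xsl:value-of select='@value'/></xsl:element>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String input) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(input)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String in = "<data><field name='foo' value='one'/><field name='bar' value='two'/></data>";
        System.out.println(transform(in)); // <data><foo>one</foo><bar>two</bar></data>
    }
}
```

Nothing in the stylesheet mentions `foo` or `bar`; new field names flow through without any change to the transform.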

Vijay Kandy

Posts: 37
Nickname: vijaykandy
Registered: Jan, 2007

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 4:30 PM
> I can't be sure without any details but if your XSLT is
> longer than the input, you probably aren't using the full
> power of the language. I've seen a lot of stylesheets
> that are basically hardcoded if-else chains which
> completely miss the point.
>
> It's also possible that XSLT was not the proper tool for
> what you were doing. XSLT is a great language for
> transforming documents. I don't like it for implementing
> logic based on personal experience.

I was converting an XML file to another XML format, and XSLT is one tool for that job. But from experience, I think XPath/XQuery code ends up a lot shorter than XSLT code for the same input. That's just my opinion. Did you try XPath?

> XSLT doesn't use DOM or any other object model. XSLT is a
> language. XSLT processor you were using may use DOM or is
> configured to use DOM. You can use SAX or other
> underlying models for XSLT. In Java, the underlying model
> is defined by the type of source you provide. If you
> provide a DOMSource you will use DOM (surprise). If you
> provide a SAX source it will use SAX. The disadvantage
> with SAX being that if you are constantly going back to
> the root of your document, SAX might take too long.
> Luckily most transformations don't require this and even
> if you need to do it, you can usually do two transformations
> to get around this limitation.

I was talking about the XSL transformation itself. Xalan-J, for example, loads the source and builds a tree or some other data structure representing the source in memory (the source can be an InputSource, a File, or a DOM). If I use a SAX source, does that avoid building an internal data structure, so that the transformer relies only on SAX events?

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 5:36 PM
> I was converting an XML file to another XML format. XSLT
> is one tool for that job. But from experience I think
> XPath/XQuery code ends up a lot shorter than XSLT code
> for the same input. That's just my opinion. Did you try XPath?

I use XPath in XSLT stylesheets. I wouldn't want to use XSLT without XPath. I find the question quite odd, actually.

> I was talking about the XSL Transformation. Xalan-J for
> e.g., loads the source and build a tree or a data
> structure representing the source in memory (the source
> can be an InputSource or File or DOM). If I use a SAX
> source, that won't result in an internal data structure
> and instead the transformer relies only on SAX events?

I imagine it would, but I'm not sure it's DOM. Even if it is, the ratio of stylesheets to input/output documents is generally very small. I have a hard time believing that the memory use of a stylesheet's in-memory representation is enough to be a serious issue for you. How many stylesheets do you have loaded at any one time?

Wilfred Springer

Posts: 176
Nickname: springerw
Registered: Sep, 2006

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 17, 2008 11:54 PM
> I think this is the heart of the question that needs to be
> asked before adopting the format. The main problem with
> formats (whether they are 'binary' is irrelevant at this
> point) where each message doesn't describe its
> structure is that it's impossible (in general) to validate
> that the data you received is what the sender intended.
> You need to be absolutely sure that the definitions on
> both sides of the wire are the same at all times.

Are you saying that with XML this is all well understood? I am not so sure about that.

XML namespaces may be used for identifying the structure, but many people consider that to be malpractice, since XML namespaces do not uniquely map to a (version of a) schema. In situations where the XML namespace also uniquely identifies the version of the schema, that is local application policy, not something defined anywhere by XML itself.

In fact, many people consider that malpractice, since it would break forward compatibility. So as an alternative, many people have adopted the policy of having an XML namespace identify a range of minor versions, with the minor version encoded in the payload itself.

I guess what I am saying is: even XML doesn't directly support validating "that the data you received was what the sender intended." It's all policy, not defined by XML itself.

> It seems to me that all hierarchical formats are basically
> equivalent in terms of what they can represent. Assuming
> that's the case, it should be possible to build SAX and
> DOM parsers over any hierarchical format and with a little
> magic, enable tools like XSLT to operate on those formats.
> Of course some parts of XML are not present in other
> formats so not all of XSLT would be pertinent but I
> don't see why you couldn't take advantage of the general
> capabilities of these tools, including XML to object
> binding.

Unfortunately, the model behind XML (the Infoset) was reverse-engineered from the syntax, and is not general enough to map easily to other representations. For example, it defines unparsed entities (I think), attributes, and elements, which are hardly relevant in, say, JSON representations.

I tried to build a SAX parser on a hierarchical binary data representation format in the past, and it was awkward and painful...

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 18, 2008 6:19 AM
> XML namespaces may be used for identifying the structure,
> but many people consider that to be malpractice, since XML
> namespaces do not uniquely map to a (version of a) schema.
> In situations where the XML namespace also uniquely
> identifies the version of the schema, that is local
> application policy, and not defined anywhere by XML
> itself.
>
> In fact, many people consider that malpractice, since it
> would break forward compatibility. So as an alternative,
> many people have adopted the policy of having an XML
> namespace identify a range of minor versions, and have the
> minor version encoded in the payload itself.
>
> I guess what I am saying is: even XML doesn't directly
> support validating "that the data you received was what
> the sender intended." It's all policy, not defined by XML
> itself.

I think you are taking my statement as being stronger than I intended. This is likely my own fault for not being clear.

All I mean is that in a format where the structure is not included in each message, you could end up in situations where you parse the message, get some data, and think everything is fine, but what you see is not what the sender meant. With a format like XML, you can validate against a schema, and if the format is not what you expect, you will know.

This problem pops up with fixed-length text files. The sender can put two fields together and not pad the first one properly, for example. The receiver may not know this has occurred, especially if the total message length isn't checked (which it often isn't): the beginning of the data from the second field will be pulled into the first, and everything will appear to process just fine. I had a recurring issue with data from COBOL that contained null characters. The parser we were using treated the low characters as string terminators and would ignore everything that followed them. Because there was no structure in the data, there was no way for the parser to detect that something had gone wrong.
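A tiny Java sketch of that failure mode (the record layout and data are invented): a fixed-width parse that "succeeds" on a mis-padded record and silently returns wrong fields:

```java
public class FixedWidthDemo {
    // Hypothetical agreed layout: name = columns 0-9, city = columns 10-19.
    static String[] parse(String record) {
        return new String[] { record.substring(0, 10).trim(),
                              record.substring(10, 20).trim() };
    }

    public static void main(String[] args) {
        String good = "ALICE     CHICAGO   ";
        String bad  = "ALICE CHICAGO       "; // sender forgot to pad "ALICE" to 10 chars
        // Both parse without any error; the second silently yields wrong fields.
        System.out.println(String.join("|", parse(good))); // ALICE|CHICAGO
        System.out.println(String.join("|", parse(bad)));  // ALICE CHIC|AGO
    }
}
```

Nothing in the record itself lets the receiver detect the shift; only an out-of-band check (or a self-describing format) would catch it.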

I definitely don't want to give the impression that I believe that the XML document is all you need and there isn't a need for an agreement on structure. I'll leave that to the jive-asses.

> Unfortunately, the model behind XML (Infoset) has been
> reverse engineered from the syntax, and is not general
> enough to easily map to other representations. Like, it
> defines unparsed entities (I think) attributes and
> elements. Hardly relevant say in JSON representations.
>
> I tried to build a SAX parser on an hierarchical binary
> data representation format in the past, and it was awkward
> and painful...

My thought is that it's OK that there are things in XML that are not relevant to other structures. What's important is that everything in those other structures can be represented as some subset of XML. It seems to me that if you can convert a given hierarchical format to XML in a completely generic way, you can create a SAX parser over it.

Of course going from XML to the other format would be more difficult. I have less use for that, however.

Sebastian Kübeck

Posts: 44
Nickname: sebastiank
Registered: Sep, 2005

Re: Ted Neward: So You Say You Want to Kill XML... Posted: Jul 19, 2008 11:39 AM
In the last couple of years, I have implemented numerous protocols. Some of them were XML-based, some not. Here are the lessons I learned from that experience:

1. The effort of implementing a protocol is in no way related to whether it's XML, or human-readable at all. Whether an implementation is simple depends on various things: one is the protocol design itself, another is how it is implemented on both ends.

2. An XML language is self-descriptive only if it is designed that way. In that way, it's similar to source code.

3. The whole WS-* thing is more suitable for sales people than for real-life implementation. Its size and complexity are almost a guarantee of incompatibilities, and it delivers barely more than CICS did back in the '60s.

4. XML wasn't really made as a lingua franca for just everything in IT. Especially when it comes to data-transfer applications, people seem to find it hard to design and implement useful XML languages. Protocol Buffers seem much more focused on data-transfer and storage purposes, which could simplify the design and implementation of protocols for most developers.

5. The impact of XML parsers and data-binding tools on the challenge of protocol implementation is mostly overrated. Often enough, they introduce another level of complexity as well as another source of possible incompatibilities.

My personal conclusion: XML is a great language when used properly for the right purpose! I like the Protocol Buffers approach for its simplicity and elegance.

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use