Saturday, January 9, 2010

XML can suck as a data format

There's a lot of hype about XML as a data format these days. Everything is supposed to be XML. This isn't such a bad idea. Any time you decide to use XML you don't need to invent your own syntax.

(Side note: I think many people are confused about what XML actually is. I like to think of it as a syntax, rather than a language. When you make your own XML format, you're basically inventing the semantics that go with the syntax.)

Now, I think XML is great as a format for documents. I think the combination of XML and XSLT could go a long way toward the kind of awesome flexibility LaTeX offers, while providing a stronger separation of semantics and presentation.

The trouble is that XML often falls down when it comes to dealing with data (as opposed to a document). The bread-and-butter data structures of any programming language are sequences (variously called lists, vectors, or arrays) and maps (variously called hashes, dictionaries, or associative arrays). For serializing these structures, YAML or JSON probably works better. As Tim Bray said: "It seems to me that the great thing about JSON is that it exists for one purpose: to put structs on the wire." [1] Also, data is often much less deeply nested and much more predictable than the documents XML was designed to handle. For such cases, Eric S. Raymond describes several simpler plain-text syntax styles in The Art of Unix Programming, each suited to a common requirement.

The advantage of the non-XML data formats Raymond provides is that they are much less verbose than XML. The poor human editor can easily get lost in all the elements and attributes, which defeats the whole purpose of a text format: to be easily scannable by us poor humans.
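As a rough illustration of how little syntax a list and a map need in JSON, here is a minimal sketch using Python's standard json module. The puzzle dictionary and its keys are hypothetical, chosen only for this example.

```python
import json

# Hypothetical data: a map holding a sequence of squares and a
# sequence of regions. None stands for an empty square.
puzzle = {
    "squares": [1, None, None, 5],
    "regions": [[0, 1], [2, 3]],
}

# Serialize to text, then read it back into the same structures.
text = json.dumps(puzzle)
restored = json.loads(text)

print(text)
```

The lists and the map survive the round trip unchanged, and nothing in the serialized text is there except the data and a handful of brackets, braces, and quotes.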

Here is an example of a data format I use for my Sudoku project. It is based on Raymond's RFC 822 metaformat.

squares: . .|. . .|. . . .
1 .|. . 5|3 4 2 .
------
. .|1 3 .|.|. . .
---------
7 2|8 . .|. . . 9
----- --- --
.|. 2|. 9 .|8 .|.
-- --- -----
5 . . .|. . 1|9 6
---------
. . .|.|. 8 5|. .
------
. 3 5 1|7 . .|. 2
. . . .|. . .|. .
regions: 0 1 9 10 18 19 27 28 36,
2 3 4 11 12 13 20 21 22,
5 6 7 8 14 15 16 17 23,
24 25 26 32 33 34 35 42 43,
29 30 31 39 40 41 49 50 51,
37 38 45 46 47 48 54 55 56,
57 63 64 65 66 72 73 74 75,
58 59 60 67 68 69 76 77 78,
44 52 53 61 62 70 71 79 80

This describes a Sudoku puzzle with irregular regions. The parser treats every "." as an empty square, and ignores any character in the squares field that is neither a digit nor a ".". This allows the inclusion of pipes and hyphens as delimiters for the human editor. (They're a lot easier to interpret than the region indices.)
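The parsing rules described above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the actual parser from the project: the function names parse_fields, parse_squares, and parse_regions are hypothetical, a field is assumed to start on any line of the form "name:", and every other line is treated as a continuation of the current field whether or not it is indented.

```python
import re

# Assumption: a field starts on a line beginning with a name and a colon.
FIELD_START = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*):(.*)$")


def parse_fields(text):
    """Split the text into a {field name: raw value} dictionary."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = FIELD_START.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2)
        elif current is not None:
            # Any other line continues the field we are in.
            fields[current] += "\n" + line
    return fields


def parse_squares(value):
    """Keep digits and dots; pipes, hyphens, and spaces are decoration."""
    return [None if ch == "." else int(ch)
            for ch in value if ch in "0123456789."]


def parse_regions(value):
    """Regions are comma-separated groups of square indices."""
    return [[int(token) for token in group.split()]
            for group in value.split(",") if group.strip()]


sample = """squares: 1 .|. 5
----
. 3|2 .
regions: 0 1 4 5,
2 3 6 7
"""

fields = parse_fields(sample)
print(parse_squares(fields["squares"]))
print(parse_regions(fields["regions"]))
```

Because each square is read as a single character, this sketch only works for puzzles whose values fit in one digit, which is fine for a 9x9 grid.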

The obvious XML representation showcases how high the ratio of markup to data can get. I have uploaded it here. Admittedly, it's over the top, but I think it illustrates the point: in some ways, XML is definitely not an improvement over the file description above.
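To get a feel for the markup-to-data ratio, here is a rough sketch (not the uploaded file) that writes one row of the puzzle as XML with Python's standard xml.etree.ElementTree and compares its length with the plain-text row. The element and attribute names puzzle, row, square, and value are hypothetical, as is the function squares_to_xml.

```python
import xml.etree.ElementTree as ET


def squares_to_xml(squares, width=9):
    """Serialize a flat list of squares as nested XML elements."""
    puzzle = ET.Element("puzzle")
    for start in range(0, len(squares), width):
        row = ET.SubElement(puzzle, "row")
        for value in squares[start:start + width]:
            cell = ET.SubElement(row, "square")
            if value is not None:
                # Empty squares get no value attribute.
                cell.set("value", str(value))
    return ET.tostring(puzzle, encoding="unicode")


# Second row of the puzzle above, with None for empty squares.
row = [1, None, None, None, 5, 3, 4, 2, None]
xml_text = squares_to_xml(row)
plain_text = "1 .|. . 5|3 4 2 ."

print(xml_text)
print(len(xml_text), "characters of XML versus", len(plain_text), "of plain text")
```

Even this stripped-down encoding, which carries no region information at all, is several times longer than the line it replaces.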

It's not like it's very hard to roll your own parser for a simple format like this. I have a recipe for it on ActiveState.

1 comment:

  1. Karl,

    In webkit browsers, .xml is rendered like .html, so your uploaded file comes off looking like a string of numbers with no tags. While this is hardly a pressing issue, I thought you might like to know.
