Wednesday, January 27, 2010

Wrestling with C

I am by no means a critic of the C language. I think of it as portable assembler: if you need fine-grained control of every machine cycle, you use C; if not, you don't, because it's inappropriate for just about anything else.

However, if you have control over everything the machine does, you also have to tell it to do things that would happen automatically in other languages, such as initializing integers to zero instead of leaving whatever happens to be lying around in memory. This caused a surprising problem in my advanced graphics class today.

The professor has a library of matrix manipulation routines designed for translating, rotating, and scaling objects in three dimensions. He used them in a sample program with two-dimensional graphics. Sometimes, however, the graphic would show up as a bunch of points at the corner of the window, even though it was supposed to be in the middle. What he eventually figured out was that he hadn't initialized the array containing the z-coordinates. He had assumed, quite reasonably, that since he was working in two dimensions, the z-coordinates would be irrelevant. The problem occurred whenever a NaN (not a number) happened to be sitting in the memory allocated for the z array. The NaN would propagate through the matrix multiplication, turning all the coordinates into NaNs.
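The failure mode is easy to reproduce in any language with IEEE 754 floats. Here's a minimal Python sketch of the same propagation through one step of a matrix multiply (my own illustration, not the professor's code):

```python
# NaN propagates through arithmetic: once one coordinate is NaN,
# every result that touches it is NaN too -- even multiplied by zero.
nan = float("nan")

def dot(row, col):
    """Dot product of two sequences: one entry of a matrix multiply."""
    return sum(a * b for a, b in zip(row, col))

point = [3.0, 4.0, nan, 1.0]   # x, y, uninitialized z, w
row = [1.0, 0.0, 0.0, 0.0]     # top row of an identity transform

result = dot(row, point)
print(result)  # nan -- the z component is multiplied by 0, but 0 * NaN is NaN
```

Note that the x row of the identity matrix never "uses" z in any meaningful sense, yet the result is still NaN, which is exactly why every coordinate in the professor's graphic was wrecked.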

I have a healthy respect for C, but I can't deny that it causes all manner of headaches.

Wednesday, January 20, 2010

The elegance of Haskell

I'm usually pretty enamored with Python, but I think Haskell has it beat when it comes to generating the Fibonacci sequence. The most obvious way to write the Fibonacci sequence in Python is with a generator.

def fib():
    x, y = 0, 1
    while True:
        yield x
        x, y = y, x + y
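To pull actual numbers out of it, itertools.islice is the standard way to take a finite prefix of an infinite generator (the definition is repeated here so the snippet stands alone):

```python
from itertools import islice

def fib():
    x, y = 0, 1
    while True:
        yield x
        x, y = y, x + y

# Take the first ten Fibonacci numbers from the infinite generator.
print(list(islice(fib(), 10)))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```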

Haskell, in contrast, uses a lazily evaluated list. At first I thought of such a list as just another kind of generator, but when I saw this example I realized that the lazy list is a lot more powerful.

fibs = 0 : 1 : [ a + b | (a, b) <- zip fibs (tail fibs)]

fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

In Haskell, 0 : 1 : 2 : 3 : [] is equivalent to [0, 1, 2, 3]: the : operator prepends an element to a list. In this case, you start with [0, 1], then add more elements to the end. The first example is a list comprehension, approximately equivalent to [a + b for (a, b) in zip(fibs, fibs[1:])] in Python. However, you can't build a succinct one-liner like this in Python, because fibs has to be fully defined before you can iterate over it. In Haskell, you start with a short list and build the rest of it implicitly, in terms of itself.

The second line is even cooler. Instead of going to the trouble of writing a list comprehension, you just use zipWith. It's equivalent to the following Python snippet:

def zip_with(function, sequence1, sequence2):
    return [function(item1, item2) for item1, item2 in zip(sequence1, sequence2)]
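For example, feeding zip_with a prefix of the Fibonacci sequence and its own tail reproduces the later terms, which is exactly what the Haskell one-liner does lazily on an infinite list:

```python
def zip_with(function, sequence1, sequence2):
    return [function(item1, item2) for item1, item2 in zip(sequence1, sequence2)]

fibs = [0, 1, 1, 2, 3, 5, 8]
# Pairwise sums of the list and its tail give the sequence shifted by two.
print(zip_with(lambda a, b: a + b, fibs, fibs[1:]))  # [1, 2, 3, 5, 8, 13]
```

The difference is that the Python version needs the finite list up front, while Haskell's laziness lets fibs refer to itself.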

This is a great example of how carefully chosen higher-order functions can simplify your code. When I first started using list comprehensions in Python, I was writing things like [str(i) for i in xrange(10)] all the time, when it's much clearer and more succinct to use map(str, xrange(10)).

Tuesday, January 12, 2010

Implementing DES

It's quite outdated now, but back in the day the venerable Data Encryption Standard was a really popular form of symmetric-key cryptography. I have learned about DES twice in my life: once in my freshman year of high school, and once this year in my Cryptography class. The second time around, having taken Computer Architecture, I was struck by how neatly DES's operations map to the most common machine instructions.

I decided that this would be a good opportunity to practice some assembly programming. It seemed tailor-made for the purpose.

I haven't actually written the assembly program yet. I wanted to get a working copy running in Python first, as a point of reference. I have made two implementations in Python. The first converts all the integers to lists of ones and zeros and uses Python's list manipulation abilities; the second stays at a low level and just uses bit shifting and that sort of thing.
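To give a flavor of the two styles, here is a bit permutation, the operation DES leans on most heavily, written both ways. These helpers are my own illustrative sketch of the two approaches, not the code from either implementation:

```python
# Style 1: represent a block as a list of bits and permute with indexing.
def int_to_bits(n, width):
    """Most-significant bit first."""
    return [(n >> (width - 1 - i)) & 1 for i in range(width)]

def bits_to_int(bits):
    result = 0
    for bit in bits:
        result = (result << 1) | bit
    return result

def permute_bits(bits, table):
    # table holds 1-based source positions, as in the DES specification.
    return [bits[p - 1] for p in table]

# Style 2: keep the block as an integer and permute with shifts and masks.
def permute_int(block, table, width):
    result = 0
    for p in table:
        bit = (block >> (width - p)) & 1
        result = (result << 1) | bit
    return result

# A toy 4-bit permutation that reverses the bit order.
table = [4, 3, 2, 1]
block = 0b1010
assert bits_to_int(permute_bits(int_to_bits(block, 4), table)) == 0b0101
assert permute_int(block, table, 4) == 0b0101
```

The list version is easier to read and debug; the integer version is much closer to what the assembly would do.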

I'll post a more extensive write-up later. For now, let's just say I'm quite disappointed that the low-level version isn't noticeably faster than the high-level one. Both encrypt about 150,000 blocks per second. It may just be that Python has too much overhead in general, and that if you want the benefits of operating closer to the metal you have to write it in C. Global variables and function calls are both rather slow in Python, according to Guido.[1] I'll have to do some profiling to see what needs speeding up.
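Guido's point about call overhead is easy to check with the timeit module. This sketch times the same bit operation with and without a Python-level function call per element (the actual numbers will vary by machine):

```python
import timeit

def shift_and_mask(x):
    """The kind of tiny operation DES code does constantly."""
    return (x >> 4) & 0xF

# Time the same work with a function call per iteration, then inlined.
with_call = timeit.timeit("[shift_and_mask(x) for x in range(1000)]",
                          globals=globals(), number=200)
inline = timeit.timeit("[(x >> 4) & 0xF for x in range(1000)]",
                       number=200)
print(f"with call: {with_call:.4f}s  inline: {inline:.4f}s")
```

On a typical CPython build the inlined version wins by a wide margin, which suggests the low-level DES implementation may be paying for its many small helper calls.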

Saturday, January 9, 2010

XML can suck as a data format

There's a lot of hype about XML as a data format these days. Everything is supposed to be XML. This isn't such a bad idea: any time you use XML, you're spared from inventing your own syntax.

(Side note: I think many people are confused about what XML actually is. I like to think of it as a syntax rather than a language. When you make your own XML format, you're basically inventing the semantics to go with the syntax.)

Now, I think XML is great as a format for documents. I think that the combination of XML and XSLT could do a lot for achieving the kind of awesome flexibility of LaTeX while providing a stronger separation of semantics and presentation.

The trouble is that XML often falls down when it comes to dealing with data (as opposed to documents). The bread-and-butter data structures of any programming language are sequences (variously called lists, vectors, or arrays) and maps (variously called hashes, dictionaries, or associative arrays). For serializing those structures, YAML or JSON probably work better. As Tim Bray said: "It seems to me that the great thing about JSON is that it exists for one purpose: to put structs on the wire." [1] Data is also often much less deeply nested and much more predictable than the documents XML was designed to handle. Eric S. Raymond offers several syntax styles for common requirements.

The advantage of the non-XML data formats Raymond describes is that they are much less verbose than XML. A human editor can easily get lost in all the elements and attributes, which defeats the whole purpose of a text format: being easily scannable by us poor humans.
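Bray's "structs on the wire" remark is easy to demonstrate: the basic sequences and maps round-trip through JSON with almost no ceremony. A quick Python sketch (the record here is a made-up example):

```python
import json

# A typical "struct": nested maps and sequences, no schema required.
record = {"name": "sudoku", "size": 9, "tags": ["puzzle", "grid"]}

wire = json.dumps(record)
print(wire)  # {"name": "sudoku", "size": 9, "tags": ["puzzle", "grid"]}
assert json.loads(wire) == record
```

The wire form is barely longer than the in-memory literal, which is about the best verbosity ratio a text format can hope for.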

Here is an example of a data format I use for my Sudoku project. It is based on Raymond's RFC 822 metaformat.

squares: . .|. . .|. . . .
1 .|. . 5|3 4 2 .
. .|1 3 .|.|. . .
7 2|8 . .|. . . 9
----- --- --
.|. 2|. 9 .|8 .|.
-- --- -----
5 . . .|. . 1|9 6
. . .|.|. 8 5|. .
. 3 5 1|7 . .|. 2
. . . .|. . .|. .
regions: 0 1 9 10 18 19 27 28 36,
2 3 4 11 12 13 20 21 22,
5 6 7 8 14 15 16 17 23,
24 25 26 32 33 34 35 42 43,
29 30 31 39 40 41 49 50 51,
37 38 45 46 47 48 54 55 56,
57 63 64 65 66 72 73 74 75,
58 59 60 67 68 69 76 77 78,
44 52 53 61 62 70 71 79 80

This describes a Sudoku puzzle with irregular regions. The parser treats each . as an empty square, and ignores anything in the squares field that is not a number or a period. This allows the inclusion of pipes and hyphens as delimiters for the human editor. (They're a lot easier to interpret than the region indices.)

The obvious representation in XML showcases the huge ratio of markup to data that can occur when using XML. I have uploaded it here. Admittedly, it's over the top, but I think it illustrates the point that XML is, in some ways, definitely not an improvement over the format described above.

It's not like it's very hard to roll your own parser for a simple format like this. I have a recipe for it on ActiveState.
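The recipe itself is on ActiveState, but the core idea fits in a few lines: match "key:" lines, and let a value continue over following lines until the next key appears. This is my own rough reconstruction of that idea, not the actual recipe:

```python
import re

def parse_metaformat(text):
    """Parse a simple RFC-822-style 'key: value' format where values
    may continue over several lines (until the next 'key:' line)."""
    fields = {}
    key = None
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s*(.*)$", line)
        if match:
            key, value = match.group(1), match.group(2)
            fields[key] = value
        elif key is not None:
            # Continuation line: append to the current field.
            fields[key] += "\n" + line
    return fields

sample = "squares: . . 1\n. 2 .\nregions: 0 1 2,\n3 4 5"
parsed = parse_metaformat(sample)
assert parsed["squares"] == ". . 1\n. 2 ."
assert parsed["regions"] == "0 1 2,\n3 4 5"
```

The continuation rule is what lets the squares grid and the wrapped region lists above span multiple lines without any escaping.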