Sunday, November 22, 2009

Smart XML Processing with Regexes

Recently, Jeff Atwood wrote about parsing HTML with regular expressions. I want to speak about it briefly, because I came across this issue last week. I gathered from his post that the lesson is to consider your options with an open mind, and only rule out a possible solution if you really understand the alternatives. Use facts and knowledge to choose your implementation details, not superstition and theoretical best practices. Best practices usually exist for a reason, but that's not to say there's never a reason to turn your back on them.

This post hit home with me because I had an XML file to parse that was over a gigabyte. From this XML file, I needed a very small handful of the data, and it was very regular XML. XML parsing is a solved problem, but most XML libraries I've used would easily choke on such a file.

Instead of even considering attempting to process this data with a normal XML processor, I wrote a simple Ruby script to extract the information. It looped over each line, looking for key parts of the data with lines like:

if line["<expectedTag>"]
  # deal with this tag
end

Then, I processed the key tags and data I was looking for with regular expressions, such as:

data = line[/<expectedTag>(.+)<\/expectedTag>/, 1]

The above was done within the if blocks. The key point is that regexes alone would have been too slow, so I used the simple indexer method (String#[] with a plain string, which is just a fast substring check) to quickly determine whether the line contained something that mattered to me. Only then did I use the regex to pull out the data I actually wanted.
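Putting the two steps together, a minimal sketch of the approach looks like the following. The tag name and the sample input are hypothetical stand-ins; the real script read a file line by line instead of a string:

```ruby
require "stringio"

# Hypothetical sample standing in for the gigabyte-plus XML file.
xml = StringIO.new(<<~XML)
  <root>
    <otherTag>skip me</otherTag>
    <expectedTag>value one</expectedTag>
    <expectedTag>value two</expectedTag>
  </root>
XML

results = []
xml.each_line do |line|
  # Cheap substring check (String#[] with a string argument) screens out
  # the vast majority of lines before any regex runs.
  next unless line["<expectedTag>"]

  # Regex capture runs only on the few lines that passed the filter.
  data = line[%r{<expectedTag>(.+?)</expectedTag>}, 1]
  results << data if data
end

results # => ["value one", "value two"]
```

Note the `%r{}` regex literal, which avoids having to escape the `/` in the closing tag.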

Can you write XML to break my processing? Of course! The question is... does it matter? And that answer was no. I only need to process this data once, maybe another time sometime in the distant future, but the XML is so regular that I know it will work for all the data. On top of this, if I missed some data, it wouldn't matter in the slightest for my purposes. So, in short, proper XML processing would have severely slowed me down (ignoring all lines that don't contain a keyword is much faster), and it would have produced no real benefit.

I ended up processing all the data in a little over a minute or two, and I considered it a huge success. Over a gigabyte of XML to process seemed a rather daunting task initially!


Noah Gibbs said...

More importantly, you knew where the gigabyte of XML came from, and you were able to do reasonable inspection on it.

A lot of Jeff Atwood's advice there comes from web sites, where you spend a lot of time and effort trying to keep attackers from being able to worm past badly-written regexp-based parsers. Fundamentally, parsing known-good data from a trusted source is a whole other world from that.

Mike Stone said...

Thanks for the comment Noah! I think you reiterated my point, or at least the point I had intended to make. That is, think about your problem and what may solve it best, not what someone says to blindly always or never do. Granted, using regexes for XML/HTML processing has a completely different tone when the source of the data is known vs unknown, but that doesn't change the fact that you should always consider your alternatives.

Anonymous said...

You might also want to look at vtd-xml