Sunday, November 22, 2009

Smart XML Processing with Regexes

Recently, Jeff Atwood wrote about parsing HTML with regular expressions. I want to talk about it briefly, because I ran into this exact issue last week. The lesson I took from his post is to consider your options with an open mind, and only rule out a possible solution if you really understand the alternatives. Use facts and knowledge to choose your implementation details, not superstition and theoretical best practices. Best practices usually exist for a reason, but that's not to say there's never a reason to turn your back on them.

This post hit home with me because I had an XML file to parse that was over a gigabyte. I needed only a small handful of the data from it, and the XML was very regular. XML parsing is a solved problem, but most XML libraries I've used would easily choke on a file that size.

Rather than even attempt to process this data with a normal XML parser, I wrote a simple Ruby script to extract the information. It looped over each line, looking for key parts of the data with checks like:

# String#[] with a string argument returns the substring if found (or nil),
# so this works as a quick containment check
if line["<expectedTag>"]
  # deal with this tag
end


Then, I processed the key tags and data I was looking for with regular expressions, such as:

# pull out the text between the tags (capture group 1)
data = line[/<expectedTag>(.+)<\/expectedTag>/, 1]


The above was done within the if blocks. The key point is that running regexes alone on every line would have been too slow, so I used the simple indexer check to quickly determine whether a line contained something that mattered to me, and only then used the regex to pull out the data I actually wanted.
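To make the shape of the script concrete, here is a minimal sketch of the approach. The file name and tag name (bigfile.xml, expectedTag) are placeholders for illustration, not the actual data I was working with:

# Stream the file line by line, skip irrelevant lines with a cheap
# substring check, and only run the regex on lines that might matter.
results = []

File.foreach("bigfile.xml") do |line|
  # String#[] with a string is a fast containment check (nil if absent)
  next unless line["<expectedTag>"]

  # Extract the text between the tags (capture group 1)
  data = line[/<expectedTag>(.+)<\/expectedTag>/, 1]
  results << data if data
end

puts "Found #{results.size} values"

Because File.foreach reads one line at a time, memory use stays flat no matter how large the file is, which is exactly what a DOM-style parser can't give you here.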

Can you write XML to break my processing? Of course! The question is... does it matter? And the answer was no. I only need to process this data once, maybe once more in the distant future, and the XML is so regular that I know it will work for all the data. On top of that, if I missed some data, it wouldn't matter in the slightest for my purposes. So, in short, proper XML processing would have severely slowed me down (ignoring all lines that don't contain a keyword is much faster), and it would have produced no real benefit.

I ended up processing all the data in a little over a minute or two, which I considered a huge success. Over a gigabyte of XML seemed a rather daunting task at first!

2 comments:

Noah Gibbs said...

More importantly, you knew where the gigabyte of XML came from, and you were able to do reasonable inspection on it.

A lot of Jeff Atwood's advice there comes from web sites, where you spend a lot of time and effort trying to keep attackers from worming past badly-written regexp-based parsers. Fundamentally, parsing known-good data from a trusted source is a whole other world from that.

Mike Stone said...

Thanks for the comment Noah! I think you reiterated my point, or at least the point I had intended to make. That is, think about your problem and what may solve it best, not what someone tells you to blindly always or never do. Granted, regexes for XML/HTML processing have a completely different tone when the source of the data is known vs unknown, but that doesn't change the fact that you should always consider your alternatives.