This post hit home with me because I had an XML file to parse that was over a gigabyte. I needed only a small handful of the data from it, and the XML was very regular. XML parsing is a solved problem, but most XML libraries I've used would easily choke on a file that size.
Rather than even attempt to process this data with a normal XML parser, I wrote a simple Ruby script to extract the information. It looped over each line, looking for the key parts of the data with checks like:
if line["<expectedTag>"]
  # deal with this tag
end
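For anyone unfamiliar with that idiom, String#[] with a plain string argument returns the matching substring if it's present and nil otherwise, so it doubles as a cheap containment test. A quick illustration, with made-up line contents:

line = "  <expectedTag>42</expectedTag>"
line["<expectedTag>"]  # => "<expectedTag>" (truthy, so the if body runs)
line["<otherTag>"]     # => nil (falsy, so the line is skipped)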
Then, I processed the key tags and data I was looking for with regular expressions, such as:
data = line[/<expectedTag>(.+)<\/expectedTag>/, 1]
The extraction above was done inside the if blocks. The key point is that running a regex against every line would have been too slow, so I used the simple indexer method with a plain substring to quickly determine whether a line contained something that mattered to me, and only then used the regex to pull out the data I actually wanted.
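Putting the pieces together, the whole loop looked roughly like this. This is a minimal sketch with a made-up file name and tag, not my actual script:

results = []
File.foreach("huge_dump.xml") do |line|  # streams the file one line at a time
  next unless line["<expectedTag>"]      # cheap substring check first
  value = line[/<expectedTag>(.+?)<\/expectedTag>/, 1]  # regex only on lines that matter
  results << value if value
end
puts "Extracted #{results.size} values"

Because File.foreach streams the input, the whole gigabyte never has to sit in memory at once.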
Can you write XML that breaks my processing? Of course! The question is... does it matter? And the answer was no. I only need to process this data once, maybe once more in the distant future, and the XML is so regular that I know it will work for all the data. On top of this, if I missed some data, it wouldn't matter in the slightest for my purposes. So, in short, proper XML processing would have severely slowed me down (ignoring all lines that don't contain a keyword is much faster), and it would have produced no real benefit.
I ended up processing all the data in little more than a minute or two, and I considered that a huge success. Over a gigabyte of XML had seemed a rather daunting task initially!
2 comments:
More importantly, you knew where the gigabyte of XML came from, and you were able to do reasonable inspection on it.
A lot of Jeff Atwood's advice there comes from web sites, where you spend a lot of time and effort trying to keep attackers from worming past badly written regexp-based parsers. Fundamentally, parsing known-good data from a trusted source is a whole other world from that.
Thanks for the comment, Noah! I think you reiterated my point, or at least the point I had intended to make: think about your problem and what may solve it best, not what someone says you should blindly always or never do. Granted, using regexes for XML/HTML processing has a completely different tone when the source of the data is known versus unknown, but that doesn't change the fact that you should always consider your alternatives.