One of the first things I needed to do while building the GeeQE iPhone application was process the CC data dump from Stack Overflow. The dump contains XML files representing tables from Stack Overflow, the largest being posts.xml, which weighed in at 1.2G as of September. I decided it would be pretty easy to use Ruby to parse the XML and load the data into MySQL, so I went about finding the right parser for the job.
If you haven't processed large amounts of XML before, one thing to realize is that you don't want to use a DOM parser, because it is going to load the entire XML structure into memory. What you want is a SAX parser that can work on the XML stream as it comes in. With this in mind I started looking around and quickly found an older benchmark post suggesting that the LibXML library was going to be the fastest parser for Ruby. After figuring out how to use it, I decided to give a couple of other libraries a shot as well to see how they stacked up. The other two I looked at were REXML and Nokogiri.
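To make the DOM versus SAX difference concrete, here is a minimal sketch using REXML on a tiny in-memory document. The sample XML, its attribute names, and the `RowCounter` class are made up for illustration; they are not taken from the real dump.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Tiny stand-in for posts.xml; the attribute names are hypothetical.
SAMPLE_XML = <<~XML
  <posts>
    <row Id="1" Score="10" />
    <row Id="2" Score="3" />
    <row Id="3" Score="7" />
  </posts>
XML

# DOM style: the whole tree is built in memory before we can query it.
doc = REXML::Document.new(SAMPLE_XML)
dom_count = doc.root.get_elements('row').size

# SAX style: the parser calls the listener once per tag and keeps no tree.
class RowCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attributes)
    @count += 1 if name == 'row'
  end
end

counter = RowCounter.new
REXML::Document.parse_stream(StringIO.new(SAMPLE_XML), counter)
sax_count = counter.count

puts "DOM rows: #{dom_count}, SAX rows: #{sax_count}"
```

Both approaches see the same rows. The difference is that the DOM version holds every element in memory at once, which is what becomes a problem with a file the size of posts.xml.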
The following is a set of examples using each library in streaming SAX mode. Each one processes the 1.2G posts.xml file from the dump and does nothing more than check that the element represents a "row". I have also included a sample runtime for each:
REXML SAX example
    require 'rubygems'
    require 'rexml/document'
    require 'rexml/streamlistener'
    include REXML

    class PostCallbacks
      include StreamListener

      def tag_start(element, attributes)
        if element == 'row'
          # Process row of data here
        end
      end
    end

    source = File.new "posts.xml"
    Document.parse_stream(source, PostCallbacks.new)
REXML runtime
    time ruby rexmltest.rb
    real 47m22.871s
    user 42m0.711s
    sys 3m31.943s
Nokogiri SAX example
    require 'rubygems'
    require 'nokogiri'
    include Nokogiri

    class PostCallbacks < XML::SAX::Document
      def start_element(element, attributes)
        if element == 'row'
          # Process row of data here
        end
      end
    end

    parser = XML::SAX::Parser.new(PostCallbacks.new)
    parser.parse_file("posts.xml")

Nokogiri runtime

    time ruby nokogiri.rb
    real 4m45.347s
    user 4m7.504s
    sys 0m19.332s
LibXML SAX example
    require 'rubygems'
    require 'libxml'
    include LibXML

    class PostCallbacks
      include XML::SaxParser::Callbacks

      def on_start_element(element, attributes)
        if element == 'row'
          # Process row of data here
        end
      end
    end

    parser = XML::SaxParser.file("posts.xml")
    parser.callbacks = PostCallbacks.new
    parser.parse
LibXML runtime
    time ruby libxmltest.rb
    real 1m55.657s
    user 1m41.938s
    sys 0m5.718s
From the above you can see that LibXML is the fastest. I thought that Nokogiri would be a lot closer in execution time given that it uses libxml2, but it still took roughly 2.5 times as long. The slowest by far was REXML, which took more than 20 times as long as LibXML. Nokogiri seemed easier to debug than LibXML when things went wrong, so had I needed to build a more complex application to load the data, I probably would have used it instead.
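The examples above leave the "Process row of data here" step empty. As a rough sketch of what could go there, the listener below collects each row's attributes and hands them off in batches, which is one plausible shape for a bulk load. It uses REXML only because it ships with Ruby. The `BatchingRowListener` class, the batch size, and the `Id` attribute in the sample data are assumptions for illustration, and the block that receives each batch stands in for a database insert that this sketch does not perform.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical helper: buffers <row> attributes and yields them in batches.
class BatchingRowListener
  include REXML::StreamListener

  def initialize(batch_size, &on_batch)
    @batch_size = batch_size
    @on_batch = on_batch
    @pending = []
  end

  # REXML passes the attributes as a Hash of name => value strings.
  def tag_start(name, attributes)
    return unless name == 'row'
    @pending << attributes
    flush if @pending.size >= @batch_size
  end

  # Hands over whatever is buffered; call once more after parsing ends.
  def flush
    return if @pending.empty?
    @on_batch.call(@pending)
    @pending = []
  end
end

# Made-up sample data standing in for posts.xml.
sample = <<~XML
  <posts>
    <row Id="1" />
    <row Id="2" />
    <row Id="3" />
  </posts>
XML

batches = []
listener = BatchingRowListener.new(2) do |rows|
  # A real loader would insert the rows into the database here.
  batches << rows.map { |row| row['Id'] }
end

REXML::Document.parse_stream(StringIO.new(sample), listener)
listener.flush # pick up the final, partially filled batch

p batches
```

Batching keeps memory use flat regardless of file size while avoiding one database round trip per row. The same idea carries over to the Nokogiri and LibXML callbacks, though the form in which each library delivers attributes may differ.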