Handling large XML files in PHP


Handling XML files in PHP is relatively easy as long as the documents are small. Consider the following example XML document with products:

<?xml version="1.0" encoding="UTF-8"?>
<products>
    <product id="1">
        <name>product1</name>
        <price>10.5</price>
    </product>
    <product id="2">
        <name>product2</name>
        <price>9.5</price>
    </product>
    <product id="3">
        <name>product3</name>
        <price>11.5</price>
    </product>

</products>

Reading this one with SimpleXML is ... simple:

function print_product($product) {
    $attrs = $product->attributes();
    echo "\n" . $attrs['id'] . ": " . $product->name .
        " (" . $product->price . ")";       
}

$xml = simplexml_load_file('../resources/xml-sample1.xml');

foreach ($xml->product as $product) {
    print_product($product);
}

Output:

1: product1 (10.5)
2: product2 (9.5)
3: product3 (11.5)

We can also search for a specific product by id or name:

echo "Products with id=2:\n";
foreach ($xml->product as $product) {
    $attrs = $product->attributes();
    if ($attrs['id'] == 2) {
        print_product($product);
    }
}

Output:

Products with id=2:

2: product2 (9.5)


echo "Products with name=product3:\n";
foreach ($xml->product as $product) {
    if ($product->name == 'product3') {
        print_product($product);
    }
}

Output:

Products with name=product3:

3: product3 (11.5)
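
The loops above visit every product and test each one. SimpleXML can also select matching nodes directly with an XPath expression through SimpleXMLElement::xpath(). The following is a minimal sketch, assuming the same document structure; the helpers find_by_id() and find_by_name() are hypothetical names introduced for illustration, and the feed is inlined so that the snippet runs on its own:

```php
<?php
// Inline copy of the sample feed so the snippet is self-contained.
$source = <<<FEED
<?xml version="1.0" encoding="UTF-8"?>
<products>
    <product id="1"><name>product1</name><price>10.5</price></product>
    <product id="2"><name>product2</name><price>9.5</price></product>
    <product id="3"><name>product3</name><price>11.5</price></product>
</products>
FEED;

$xml = simplexml_load_string($source);

// Hypothetical helper: all <product> elements whose id attribute equals $id.
function find_by_id(SimpleXMLElement $xml, $id) {
    // Casting to int keeps the value safe to embed in the expression.
    return $xml->xpath('/products/product[@id="' . (int)$id . '"]');
}

// Hypothetical helper: all <product> elements whose <name> equals $name.
// Assumes $name contains no double quotes (XPath 1.0 has no string escaping).
function find_by_name(SimpleXMLElement $xml, $name) {
    return $xml->xpath('/products/product[name="' . $name . '"]');
}

foreach (find_by_id($xml, 2) as $product) {
    echo $product['id'] . ": " . $product->name . " (" . $product->price . ")\n";
}
```

Note that this still requires the whole document to be loaded into memory, so it only suits small files.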


Now let's expand our product feed to 1000000 products:

$fp = fopen('../resources/xml-sample2.xml', 'w');
fwrite($fp, '<?xml version="1.0" encoding="UTF-8"?><products>');
for ($i=0; $i<1000000; $i++) {
    fwrite($fp, "<product id=\"$i\">
            <name>product$i</name>
            <price>" . rand(10, 20). "</price>
        </product>");
}
fwrite($fp, '</products>');
fclose($fp);

$xml = simplexml_load_file('../resources/xml-sample2.xml');
print_r($xml);

Output:

PHP Fatal error:  Allowed memory size of 33554432 bytes exhausted (tried to allocate 12 bytes) in xml-sample1.php on line 45

So we have exhausted the available memory; xml-sample2.xml is around 88M. Note that the following does not trigger an out-of-memory error, but the PHP process still allocates almost 600M of system memory:

$xml = simplexml_load_file('../resources/xml-sample2.xml');

foreach ($xml->product as $product) {
    print_product($product);
}

To handle huge XML files with a minimal memory footprint we should use a streaming XML parser. A streaming parser processes the XML document as a stream and produces an event whenever it finds one of the following in the stream: an opening tag, a closing tag, or a character data fragment. All you need to implement is a handler for each event type. The following is a simple implementation that prints all products in the feed:

class ProductsParser {
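    public $product;      // product currently being assembled, or null
    public $product_elem; // name of the child element being read, or null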
   
    // invoked every time open tag is encountered in the stream
    // $tag contains the name of the tag and $attributes is a key-value array of tag attributes
    function startElement($parser, $tag, $attributes) {
        switch($tag) {
            case 'product':
                $this->product=array('id'=>$attributes['id'], 'name'=>'', 'price'=>'');
                break;
            case 'name':
            case 'price':
                if ($this->product) {
                    $this->product_elem = $tag;
                }
                break;
        }
    }
   
    // invoked on each closing tag
    function endElement($parser, $tag) {
        switch($tag) {
            case 'product':
                if ($this->product) {
                    $this->handle_product();
                    $this->product = null;
                }
                break;
            case 'name':
            case 'price':
                $this->product_elem = null;
                break;
        }
    }
   
    // invoked each time character data (element text) is processed
    // note that this may be just a fragment of an element's text, so
    // the text of a single element can be delivered
    // through multiple invocations of this method
    function cdata($parser, $cdata) {
        if ($this->product && $this->product_elem) {
            $this->product[$this->product_elem] .= $cdata;
        }
    }
   
    // invoked each time a complete product is decoded from the stream
    function handle_product() {
        $this->print_product();
    }

    // prints decoded product
    function print_product() {
        echo "\n" . $this->product['id'] . ": " . $this->product['name'] .
            " (" . $this->product['price'] . ")";       
    }
   
}

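$xml_handler = new ProductsParser();
$parser = xml_parser_create();

// keep tag names as written in the document instead of upper-casing them
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
// array callables bind the handlers to $xml_handler;
// xml_set_object() is deprecated as of PHP 8.4
xml_set_element_handler($parser, array($xml_handler, "startElement"), array($xml_handler, "endElement"));
xml_set_character_data_handler($parser, array($xml_handler, "cdata"));

$fp = fopen('../resources/xml-sample2.xml', 'r');
while (!feof($fp)) {
    $data = fread($fp, 4096);
    // stop at the first parse error instead of silently continuing
    if (!xml_parse($parser, $data, feof($fp))) {
        die("\nXML error: " . xml_error_string(xml_get_error_code($parser)) .
            " at line " . xml_get_current_line_number($parser));
    }
    flush();
}
fclose($fp);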
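
To see the order in which the parser delivers its events, it can help to trace them on a tiny document first. The sketch below is an illustration only: trace_events() is a hypothetical helper, and it uses closures as handlers rather than the class above. It records one entry per opening tag, closing tag and run of text:

```php
<?php
// Records the events a streaming parser emits for a small document.
// trace_events() is a hypothetical helper used only for illustration.
function trace_events($xml) {
    $events = array();
    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $tag, $attributes) use (&$events) {
            $events[] = "start $tag";
        },
        function ($p, $tag) use (&$events) {
            $events[] = "end $tag";
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $text) use (&$events) {
            // Text may arrive in several pieces; merge consecutive pieces.
            $last = count($events) - 1;
            if ($last >= 0 && strpos($events[$last], 'text ') === 0) {
                $events[$last] .= $text;
            } else {
                $events[] = "text $text";
            }
        }
    );

    xml_parse($parser, $xml, true);
    return $events;
}

print_r(trace_events('<product id="1"><name>product1</name></product>'));
```

For this input the recorded sequence is: start product, start name, text product1, end name, end product. Whitespace between tags would show up as text events too, which is why the parser class only appends text while it is inside name or price.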

A few clarifications. The file is read in 4k chunks and only a single chunk is processed at a time. Each time an opening tag is found in the current chunk, its name and attributes are read and startElement() is invoked. If the tag is <product>, we have a new product and initialize the variable holding the current product with its attributes (in our case just id).

The next tag in the stream is <name>. We raise a flag saying that we are about to read the text content of the name element. One or more invocations of cdata() follow, and each fragment is appended to the current product's name. Then the closing tag for name is encountered and we lower the flag. The same happens with price.

When </product> triggers endElement(), the current product has been read completely and we can pass it to the product handling function, which here prints it on the screen.
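
The point about text arriving in fragments can be checked by feeding the parser deliberately small chunks. The following sketch is an illustration under stated assumptions: collect_names() is a hypothetical helper, the document is passed as a string and split with str_split() instead of being read from a file, and parse errors are turned into exceptions. Because the text handler appends rather than assigns, the result should not depend on where the chunk boundaries fall:

```php
<?php
// Hypothetical helper: feeds $xml to the parser in pieces of $chunkSize bytes
// and returns the text of every <name> element.
function collect_names($xml, $chunkSize) {
    $names = array();
    $current = null;   // text of the <name> being read, or null outside one

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $tag, $attributes) use (&$current) {
            if ($tag === 'name') {
                $current = '';              // raise the flag
            }
        },
        function ($p, $tag) use (&$current, &$names) {
            if ($tag === 'name') {
                $names[] = $current;        // element complete
                $current = null;            // lower the flag
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $text) use (&$current) {
            if ($current !== null) {
                $current .= $text;          // append, never assign
            }
        }
    );

    $chunks = str_split($xml, $chunkSize);
    foreach ($chunks as $i => $chunk) {
        $isFinal = ($i === count($chunks) - 1);
        // xml_parse() returns 0 when the input is not well-formed.
        if (!xml_parse($parser, $chunk, $isFinal)) {
            throw new RuntimeException(
                xml_error_string(xml_get_error_code($parser)) .
                ' at line ' . xml_get_current_line_number($parser)
            );
        }
    }
    return $names;
}
```

Calling collect_names() on the same document with a chunk size of 7 and of 4096 should return the same list of names.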

It is better to put all product handling logic, such as filtering, searching or database operations, in the handle_product() function. For example, we print each product as soon as it is decoded instead of accumulating all products and printing them at once. This minimizes the memory usage of the script.
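
One way to keep the per-product logic separate from the parsing code is to pass it in as a callback. The sketch below is not the class from this article but a compact variant of the same idea, written under assumptions: stream_products() is a hypothetical helper, it uses closures as handlers, and it expects the same feed structure (product elements with an id attribute and name and price children):

```php
<?php
// Hypothetical helper: streams the file at $path and calls $onProduct with
// one array per <product>, so only one product is held in memory at a time.
function stream_products($path, callable $onProduct) {
    $product = null;   // product being assembled, or null
    $field = null;     // 'name' or 'price' while inside that element

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $tag, $attrs) use (&$product, &$field) {
            if ($tag === 'product') {
                $id = isset($attrs['id']) ? $attrs['id'] : null;
                $product = array('id' => $id, 'name' => '', 'price' => '');
            } elseif ($product !== null && ($tag === 'name' || $tag === 'price')) {
                $field = $tag;
            }
        },
        function ($p, $tag) use (&$product, &$field, $onProduct) {
            if ($tag === 'product' && $product !== null) {
                call_user_func($onProduct, $product);  // per-product logic lives here
                $product = null;
            } elseif ($tag === 'name' || $tag === 'price') {
                $field = null;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $text) use (&$product, &$field) {
            if ($product !== null && $field !== null) {
                $product[$field] .= $text;
            }
        }
    );

    $fp = fopen($path, 'r');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (!feof($fp)) {
        $data = fread($fp, 4096);
        if ($data === false) {
            break;   // read error: stop feeding the parser
        }
        if (!xml_parse($parser, $data, feof($fp))) {
            fclose($fp);
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    fclose($fp);
}
```

A filter then becomes a callback that checks $product['price'] before printing, and a database import becomes a callback that executes an INSERT for each product, ideally through a prepared statement. In both cases nothing is accumulated between products.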

Comments:

Martin (01-01-2011 16:15) :
How can I add this data to a SQL DB? When I try it the following way, only one set of data is inserted. I am thankful for any help.

---
$db="INSERT into my_table values ('".$this->product['name']."','".$this->product['price']."')";
mysql_query($db);
---

Only one set of data will be inserted.

bobi (03-01-2011 17:01) :
Just put this code in the handle_product() body - it'll be executed for every product in the document.

Nikolas (21-07-2011 15:56) :
Hello, thank you for this article. I am new on handling big files and I found this very helpful.

I have two questions though.

Where can I find the xml-sample2.xml for testing purposes?

Do you think it is better to use this method instead of XMLReader?

Tommy (11-08-2011 22:05) :
Thank you.. simple, elegant solution. Was having lots of issues with other methods!!

mario (08-05-2013 11:49) :
Hi! I just want to ask how to parse an XML document with UTF-8 encoding and where to set this encoding. When I set UTF-8 encoding in the XML document and then run this parser, some letters come out wrong, for example Slovak č, ľ, ž and others.

Thanks a lot!

Nadejda Chotorova (BG) (04-08-2014 13:05) :
Thank you for your post! You helped me a lot!

Tom (27-05-2015 20:33) :
What if we have a nested element with the same name, like:
<products>
    <product id="1">
        <name>product1</name>
        <price>10.5</price>
        <desc>
            <product>othername1</product>
        </desc>
    </product>
    <product id="2">
        <name>product2</name>
        <price>9.5</price>
        <desc>
            <product>othername2</product>
        </desc>
    </product>
</products>

What would the code be for startElement....?

bobi (28-05-2015 07:34) :
You need to save some state in this case, for example:

// new property of ProductsParser: nesting depth of <product> elements
var $product_stack = 0;

function startElement($parser, $tag, $attributes) {
    switch($tag) {
        case 'product':
            // only the outermost <product> starts a new product
            if ($this->product_stack == 0) {
                $this->product=array('id'=>$attributes['id'], 'name'=>'', 'price'=>'');
            }
            $this->product_stack++;
            break;
        case 'name':
        case 'price':
            if ($this->product) {
                $this->product_elem = $tag;
            }
            break;
    }
}

function endElement($parser, $tag) {
    switch($tag) {
        case 'product':
            // only the outermost </product> completes the product
            if ($this->product && $this->product_stack == 1) {
                $this->handle_product();
                $this->product = null;
            }
            $this->product_stack--;
            break;
        case 'name':
        case 'price':
            $this->product_elem = null;
            break;
    }
}


This page was last modified on 2018-12-09 23:05:15