pugixml documentation

Introduction
Quick start
Reference
W3C compliance
Comparison with existing parsers
FAQ
Bugs
Future work
Changelog
Acknowledgements
License

Introduction

pugixml is just another XML parser. It is a successor to pugxml (to be honest, the only part left as is is the wildcard matching code; the rest was either heavily refactored or rewritten from scratch). The main features are:

low memory consumption and fragmentation (the win over pugxml is ~1.3 times, over TinyXML ~2.5 times, over Xerces (DOM) ~4.3 times ¹). Exact numbers are given in the Comparison with existing parsers section.
extremely high parsing speed (the win over pugxml is ~6 times, over TinyXML ~10 times, over Xerces (DOM) ~17.6 times ¹)
extremely high parsing speed (well, I'm repeating myself, but it is so fast that it outperforms Expat by 2.8 times on the test XML ²)
more or less standard-conformant (it will parse any standard-compliant file correctly, with the exception of DTD related issues)
pretty much error-ignorant (it will not choke on something like <text>You & Me</text>, as Expat will; it will parse files with data in the wrong encoding; and so on)
clean interface (a heavily refactored pugxml's one)
more or less Unicode-aware (it assumes UTF-8 encoding of the input data, though it will readily work with ANSI; there is no UTF-16 support for now, see Future work), with helper conversion functions (UTF-8 <-> UTF-16/32, whichever is the default for std::wstring and wchar_t)
fully standard compliant C++ code (approved by Comeau strict mode); the library is multiplatform (see Reference for the list of platforms)
high flexibility. You can control many aspects of file parsing and DOM tree building via parsing options.
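
The UTF-8 side of the helper conversions mentioned in the list above comes down to packing a code point into one to four bytes. The following is a minimal standalone sketch of that encoding step; encode_utf8 is a hypothetical name used for illustration and is not part of the pugixml API.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 and append the bytes to `out`.
// Hypothetical helper for illustration; not a pugixml function.
void encode_utf8(char32_t cp, std::string& out)
{
    // Surrogates and values above U+10FFFF are not valid code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Calling encode_utf8 once per character of a wide string gives a UTF-8 string of the kind the parser expects as input.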

Okay, you might ask - what's the catch? Everything is so cute - it's a small, fast, robust, clean solution for parsing XML. What is missing? Ok, we are fair developers - so here is a misfeature list:

memory consumption. It beats every DOM-based parser that I know of, but it cannot compete with a SAX parser. You can't process a 2 Gb XML file with less than 4 Gb of memory and do it fast. Still, pugixml behaves better than all other DOM-based parsers, so if you're stuck with DOM, it's not a problem.
memory consumption. Ok, I'm repeating myself. Again. Where other parsers allow you to provide the XML file in constant storage (or even as a memory mapped area), pugixml does not. So you'll have to copy the entire data into non-constant storage. Moreover, it must persist for the parser's lifetime (the reasons for that and more about lifetimes are given below). Again, if you're ok with DOM, it should not be a problem, because the overall memory consumption is lower (though you'll need a contiguous chunk of memory, which can be a problem).
lack of validation, DTD processing, XML namespaces and proper handling of encoding. If you need those, use MSXML, XercesC or something similar.
lack of UTF-16/32 parsing. This is not implemented for now, but it is a feature planned for the next release.
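
The requirement for non-constant, long-lived storage follows from parsing in place: names and values are not copied out, they stay inside the input buffer and are terminated there. The toy function below illustrates the idea on a single opening tag. first_tag_name_in_place is a hypothetical name, and the code is a simplified sketch, not pugixml's actual parser.

```cpp
#include <cassert>
#include <cstring>

// Returns a pointer to the name of the first opening tag in `buffer`,
// or nullptr if no complete tag is found. The name is terminated by
// overwriting one character of the buffer, so the buffer is modified and
// the result stays valid only while the buffer is alive.
char* first_tag_name_in_place(char* buffer)
{
    assert(buffer != nullptr);

    char* open = std::strchr(buffer, '<');
    if (!open) return nullptr;

    char* name = open + 1;
    char* end = name + std::strcspn(name, " \t\r\n/>");
    if (*end == '\0') return nullptr; // the tag is never closed

    *end = '\0'; // destructive step: the terminator is written into the input
    return name;
}
```

Because the returned pointer refers to the caller's buffer, a string literal or a read-only memory mapping cannot be used as input, and freeing the buffer invalidates every name obtained from it.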

¹ The tests were done on a 1 Mb XML file with a tree 4 levels deep and a small amount of text. The times are those of building the DOM tree. pugixml was run in default parsing mode, so the differences in speed are even bigger with minimal settings.
² Obviously, you can't measure the time of building a DOM tree for a SAX parser, so the time of reading the data into storage that closely represents the structure of the XML file was measured instead.

Quick start

Here is a small collection of code snippets to help the reader begin using pugixml.

For everything you can do with pugixml, you need a document. There are several ways to obtain it:


#include <cstring>
#include <fstream>
#include <iostream>

#include "pugixml.hpp"

using namespace std;
using namespace pugi;

int main()
{
    // Several ways to get XML document

    {
        // Load from string
        xml_document doc;

        cout << doc.load("<sample-xml>some text <b>in bold</b> here</sample-xml>") << endl;
    }

    {
        // Load from file
        xml_document doc;

        cout << doc.load_file("sample.xml") << endl;
    }

    {
        // Load from any input stream (STL)
        xml_document doc;

        std::ifstream in("sample.xml");
        cout << doc.load(in) << endl;
    }

    {
        // More advanced: parse the specified string without duplicating it
        xml_document doc;

        char* s = new char[100];
        strcpy(s, "<sample-xml>some text <b>in bold</b> here</sample-xml>");
        cout << doc.parse(transfer_ownership_tag(), s) << endl;
    }

    {
        // Even more advanced: parse in place and control the buffer lifetime manually
        xml_document doc;

        char* s = new char[100];
        strcpy(s, "<sample-xml>some text <b>in bold</b> here</sample-xml>");
        cout << doc.parse(s) << endl; // no ownership transfer: the caller still owns s

        delete[] s; // <-- after this point, all string contents of the document are invalid!
    }

    {
        // Or just create document from code?
        xml_document doc;

        // add nodes to document (see next samples)
    }
}


This sample should print 1 on each line, meaning that all load/parse functions returned true (of course, if sample.xml does not exist or is malformed, there will be 0's).

Once you have your document, there are several ways to extract data from it.


#include <iostream>

#include "pugixml.hpp"

using namespace std;
using namespace pugi;

struct bookstore_traverser: public xml_tree_walker
{
    virtual bool for_each(xml_node& n)
    {
        for (int i = 0; i < depth(); ++i) cout << "  "; // indentation

        if (n.type() == node_element) cout << n.name() << endl;
        else cout << n.value() << endl;

        return true; // continue traversal
    }
};

int main()
{
    xml_document doc;
    doc.load("<bookstore><book title='ShaderX'><price>3</price></book><book title='GPU Gems'><price>4</price></book></bookstore>");

    // If you want to iterate through nodes...

    {
        // Get a bookstore node
        xml_node bookstore = doc.child("bookstore");

        // Iterate through books
        for (xml_node book = bookstore.child("book"); book; book = book.next_sibling("book"))
        {
            cout << "Book " << book.attribute("title").value() << ", price " << book.child("price").first_child().value() << endl;
        }

        // Output:
        // Book ShaderX, price 3
        // Book GPU Gems, price 4
    }

    {
        // Alternative way to get a bookstore node (wildcards)
        xml_node bookstore = doc.child_w("*[sS]tore"); // this will select bookstore, anyStore, Store, etc.

        // Iterate through books with STL compatible iterators
        for (xml_node::iterator it = bookstore.begin(); it != bookstore.end(); ++it)
        {
            // Note the use of helper function child_value()
            cout << "Book " << it->attribute("title").value() << ", price " << it->child_value("price") << endl;
        }
        
        // Output:
        // Book ShaderX, price 3
        // Book GPU Gems, price 4
    }

    {
        // You can also traverse the whole tree (or a subtree)
        bookstore_traverser t;

        doc.traverse(t);
        
        // Output:
        // bookstore
        //   book
        //     price
        //       3
        //   book
        //     price
        //       4

        doc.first_child().traverse(t);

        // Output:
        // book
        //   price
        //     3
        // book
        //   price
        //     4
    }

    // If you want a distinct node...

    {
        // You can specify the way to it through child() functions
        cout << doc.child("bookstore").child("book").next_sibling().attribute("title").value() << endl;

        // Output:
        // GPU Gems
    
        // You can use a sometimes convenient path function
        cout << doc.first_element_by_path("bookstore/book/price").child_value() << endl;
        
        // Output:
        // 3

        // And you can use powerful XPath expressions
        cout << doc.select_single_node("/bookstore/book[@title = 'ShaderX']/price").node().child_value() << endl;
        
        // Output:
        // 3

        // Of course, XPath is much more powerful

        // Compile a query that computes the total price of all Gems books in the store
        xpath_query query("sum(/bookstore/book[contains(@title, 'Gems')]/price)");

        cout << query.evaluate_number(doc) << endl;

        // Output:
        // 4

        // You can apply the same XPath query to any document. For example, let's add another Gems
        // book (more detail about modifying tree in next sample):
        xml_node book = doc.child("bookstore").append_child();
        book.set_name("book");
        book.append_attribute("title") = "Game Programming Gems 2";
        
        xml_node price = book.append_child();
        price.set_name("price");

        xml_node price_text = price.append_child(node_pcdata);
        price_text.set_value("5.3");
    
        // Now let's reevaluate query
        cout << query.evaluate_number(doc) << endl;

        // Output:
        // 9.3
    }
}
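
The child_w call in the sample above matches node names against a wildcard pattern. As a rough illustration of how such matching can work, here is a small recursive matcher that supports * (any run of characters) and [...] (any one character from a set), the two constructs used in *[sS]tore. wildcard_match is a hypothetical name; this is not pugixml's implementation, and the library's own matcher may accept more syntax.

```cpp
#include <cassert>

// Returns true if `text` matches `pattern`.
// '*' matches any sequence of characters (including an empty one),
// "[abc]" matches exactly one character from the set,
// any other character matches itself.
bool wildcard_match(const char* pattern, const char* text)
{
    assert(pattern != nullptr && text != nullptr);

    while (*pattern)
    {
        if (*pattern == '*')
        {
            // Try every possible length for the run matched by '*'.
            for (const char* t = text; ; ++t)
            {
                if (wildcard_match(pattern + 1, t)) return true;
                if (*t == '\0') return false;
            }
        }
        else if (*pattern == '[')
        {
            const char* p = pattern + 1;
            bool found = false;

            while (*p && *p != ']')
            {
                if (*p == *text) found = true;
                ++p;
            }

            if (*p != ']') return false; // malformed pattern: missing ']'
            if (!found) return false;    // also covers the end of text

            pattern = p + 1;
            ++text;
        }
        else
        {
            if (*pattern != *text) return false;
            ++pattern;
            ++text;
        }
    }

    return *text == '\0';
}
```

With this sketch, wildcard_match("*[sS]tore", name) is true for bookstore, anyStore and Store, and false for names such as storage.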


Finally, let's get into more details about tree modification and saving.


#include <iostream>

#include "pugixml.hpp"

using namespace std;
using namespace pugi;

int main()
{
    // For this example, we'll start with an empty document and create nodes in it from code
    xml_document doc;

    // Append several children and set values/names at once
    doc.append_child(node_comment).set_value("This is a test comment");
    doc.append_child().set_name("application");

    // Let's add a few modules
    xml_node application = doc.child("application");

    // Save node wrapper for convenience
    xml_node module_a = application.append_child();
    module_a.set_name("module");
    
    // Add an attribute, immediately setting its value
    module_a.append_attribute("name").set_value("A");

    // You can use operator=
    module_a.append_attribute("folder") = "/work/app/module_a";

    // Or even assign numbers
    module_a.append_attribute("status") = 85.4;

    // Let's add another module
    xml_node module_c = application.append_child();
    module_c.set_name("module");
    module_c.append_attribute("name") = "C";
    module_c.append_attribute("folder") = "/work/app/module_c";

    // Oh, we missed module B. Not a problem, let's insert it before module C
    xml_node module_b = application.insert_child_before(node_element, module_c);
    module_b.set_name("module");
    module_b.append_attribute("folder") = "/work/app/module_b";

    // We can do the same thing for attributes
    module_b.insert_attribute_before("name", module_b.attribute("folder")) = "B";
    
    // Let's add some text in module A
    module_a.append_child(node_pcdata).set_value("Module A description");

    // Well, there's not much left to do here. Let's output our document to file using several formatting options

    doc.save_file("sample_saved_1.xml");
    
    // Contents of file sample_saved_1.xml (tab size = 4):
    // <?xml version="1.0"?>
    // <!--This is a test comment-->
    // <application>
    //     <module name="A" folder="/work/app/module_a" status="85.4">Module A description</module>
    //     <module name="B" folder="/work/app/module_b" />
    //     <module name="C" folder="/work/app/module_c" />
    // </application>

    // Let's use two spaces for indentation instead of tab character
    doc.save_file("sample_saved_2.xml", "  ");

    // Contents of file sample_saved_2.xml:
    // <?xml version="1.0"?>
    // <!--This is a test comment-->
    // <application>
    //   <module name="A" folder="/work/app/module_a" status="85.4">Module A description</module>
    //   <module name="B" folder="/work/app/module_b" />
    //   <module name="C" folder="/work/app/module_c" />
    // </application>
    
    // Let's save a raw XML file
    doc.save_file("sample_saved_3.xml", "", format_raw);
    
    // Contents of file sample_saved_3.xml:
    // <?xml version="1.0"?><!--This is a test comment--><application><module name="A" folder="/work/app/module_a" status="85.4">Module A description</module><module name="B" folder="/work/app/module_b" /><module name="C" folder="/work/app/module_c" /></application>

    // Finally, you can print a subtree to any output stream (including cout)
    doc.child("application").child("module").print(cout);

    // Output:
    // <module name="A" folder="/work/app/module_a" status="85.4">Module A description</module>
}


Note that these examples do not cover the whole pugixml API. For further information, see the Reference section.

Reference

pugixml is a library for parsing XML files: you give it XML data in some way, and it gives you a DOM tree along with ways to traverse it and extract useful information from it. The library source consists of two headers, pugixml.hpp and pugiconfig.hpp, and two source files, pugixml.cpp and pugixpath.cpp. You can either compile the cpp files as part of your project or build a static library. All library classes reside in namespace pugi, so you can either use fully qualified names (pugi::xml_node) or write a using directive or declaration (using namespace pugi; or using pugi::xml_node;) and use plain names. All classes have either an xml_ or an xpath_ prefix.

By default it's assumed that you compile the source files with your project (add them to your project, add the relevant entries to your Makefile, or do whatever your build environment requires). The library is written in standard-conformant C++ and was tested on the following platforms:

Windows 32-bit (MSVC³ 7.0 (2002), MSVC 7.1 (2003), MSVC 8.0 (2005), ICC⁴ 8.0, ICC 8.1, GCC 3.4.2 (MinGW), BCC⁵ 5.82)
Linux 32-bit (GCC 3.2)
Sony Playstation Portable (GCC 3.4.2; in PUGIXML_NO_STL mode)
Microsoft Xbox (MSVC 7.1)

The documentation for pugixml classes, functions and constants is available here.

³ MSVC is Microsoft Visual C++ Compiler
⁴ ICC is Intel C++ Compiler
⁵ BCC is Borland C++ Compiler

W3C compliance

pugixml is not a compliant XML parser. The main reason is that it does not reject most malformed XML files. A more or less complete list of incompatibilities follows (assuming parse_w3c mode is used):

The parser is completely DOCTYPE-ignorant, that is, it does not even skip all possible DOCTYPEs correctly, let alone use them for parsing
It accepts multiple attributes with the same name in one node
It is charset-ignorant
It accepts invalid attribute values (those with < in them) and does not reject invalid entity references or character references (in fact, it does not do DOCTYPE parsing, so it does not perform entity reference expansion)
It does not reject comments with -- inside
It does not reject processing instructions (PI) with names such as 'xml'
And some other things that I forgot to mention

In short, it accepts some malformed XML files and does not do anything related to DOCTYPE. This is because the main goal was to develop a fast, easy-to-use and error-ignorant parser (so you can get something even from a malformed document); there are already some good validating and conformant parsers.
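
If an application needs one of the checks that pugixml skips, it can sometimes be done separately. For example, the XML 1.0 grammar forbids -- inside a comment and a - immediately before the closing -->. The sketch below tests a comment's text against that rule; is_well_formed_comment is a hypothetical helper and is not provided by pugixml.

```cpp
#include <cassert>
#include <cstring>

// `body` is the text between "<!--" and "-->".
// Returns true if the body is allowed by the XML 1.0 comment grammar.
bool is_well_formed_comment(const char* body)
{
    assert(body != nullptr);

    // "--" must not appear anywhere inside a comment.
    if (std::strstr(body, "--")) return false;

    // The body must not end with '-', otherwise the comment would close with "--->".
    std::size_t length = std::strlen(body);
    if (length > 0 && body[length - 1] == '-') return false;

    return true;
}
```

Judging by the tree modification sample in Quick start, the text to pass in would presumably be the value of a node_comment node.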

Comparison with existing parsers

This table summarizes the comparison in terms of time and memory consumption between pugixml and other parsers. For DOM parsers (all, except Expat, irrXML and SAX parser of XercesC), the process is as follows:

construct the DOM tree from a file that is preloaded in memory (all parsers take const char* and size as input). 'parse time' is the number of CPU clocks spent, 'parse allocs' the number of allocations, 'parse memory' the peak memory consumption
traverse the DOM tree to copy information from it into some structure (which is the same for all parsers, of course). 'walk time' is the number of CPU clocks spent, 'walk allocs' the number of allocations

For SAX parsers, the parse step is skipped (hence the N/A in the relevant table cells); the structure is filled during the 'walk' step.

For all parsers, 'total time' column means total time spent on the whole process, 'total allocs' - total allocation count, 'total memory' - peak memory consumption for the whole process.

The tests were performed on a 1 Mb XML file with a small amount of text. They were compiled with Microsoft Visual C++ 8.0 (2005) compiler in Release mode, with checked iterators/secure STL turned off. The test system is AMD Sempron 2500+, 512 Mb RAM.

parser	parse time	parse allocs	parse memory	walk time	walk allocs	total time	total allocs	total memory
irrXML	N/A	N/A	N/A	352 Mclocks	697 245	356 Mclocks	697 284	906 kb
Expat	N/A	N/A	N/A	97 Mclocks	19	97 Mclocks	23	1028 kb
TinyXML	168 Mclocks	50 163	5447 kb	37 Mclocks	0	242 Mclocks	50 163	5447 kb
PugXML	100 Mclocks	106 597	2747 kb	38 Mclocks	0	206 Mclocks	131 677	2855 kb
XercesC SAX	N/A	N/A	N/A	411 Mclocks	70 380	411 Mclocks	70 495	243 kb
XercesC DOM	300 Mclocks	30 491	9251 kb	65 Mclocks	1	367 Mclocks	30 492	9251 kb
pugixml	17 Mclocks	40	2154 kb	14 Mclocks	0	32 Mclocks	40	2154 kb
pugixml (test of non-destructive parsing)	12 Mclocks	51	1632 kb	21 Mclocks	0	34 Mclocks	51	1632 kb

Note that non-destructive parsing mode was just a test and is not yet in pugixml.
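
The approximate ratios quoted in the Introduction can be recomputed from the 'parse time' and 'parse memory' columns of the table. Below is a small sketch of that arithmetic, with the table values typed in by hand; ratio and print_ratios are illustrative helpers only.

```cpp
#include <cassert>
#include <cstdio>

// How many times larger `other` is than `baseline`.
double ratio(double other, double baseline)
{
    assert(baseline > 0);
    return other / baseline;
}

// Prints parse time and parse memory of each DOM parser relative to pugixml,
// using the figures from the table (Mclocks and kb).
void print_ratios()
{
    const double pugixml_time = 17, pugixml_memory = 2154;

    struct row { const char* name; double time; double memory; };

    const row rows[] = {
        { "PugXML",      100, 2747 },
        { "TinyXML",     168, 5447 },
        { "XercesC DOM", 300, 9251 },
    };

    for (const row& r : rows)
    {
        std::printf("%-12s time x%.1f, memory x%.1f\n", r.name,
                    ratio(r.time, pugixml_time), ratio(r.memory, pugixml_memory));
    }
}
```

The printed values are x5.9 / x1.3 for PugXML, x9.9 / x2.5 for TinyXML and x17.6 / x4.3 for XercesC DOM, which round to the figures given in the Introduction.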

FAQ

Q: I do not have/want STL support. How can I compile pugixml without STL?

A: There is an undocumented define, PUGIXML_NO_STL. If you uncomment the relevant line in the pugixml header file, the library will compile without any STL classes. The reason it is undocumented is that it makes some documented functions unavailable (specifically, xml_document::load that operates on std::istream, the xml_node::path function, the saving functions (xml_node::print, xml_document::save), XPath-related functions and classes, and the as_utf16 and as_utf8 conversion functions). Otherwise, it will work fine.

Q: Do paths that are accepted by first_element_by_path have to end with delimiter?

A: Either way will work; both /path/to/node/ and /path/to/node are fine.
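
One way to see why the trailing delimiter does not matter is to split the path into components and drop the empty ones. The function below is a standalone sketch of that idea; split_path is a hypothetical name, and first_element_by_path is not necessarily implemented this way.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits `path` into its non-empty components.
std::vector<std::string> split_path(const char* path, char delimiter = '/')
{
    assert(path != nullptr);

    std::vector<std::string> components;
    std::string current;

    for (const char* p = path; ; ++p)
    {
        if (*p == delimiter || *p == '\0')
        {
            // Empty components come from leading, trailing or doubled delimiters.
            if (!current.empty()) components.push_back(current);
            current.clear();

            if (*p == '\0') break;
        }
        else
        {
            current += *p;
        }
    }

    return components;
}
```

Both /path/to/node/ and /path/to/node produce the same three components here. A real implementation may also give a leading delimiter a meaning of its own (for example, starting the search from the document root), which this sketch ignores.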

I'm always open for questions; feel free to write them to arseny.kapoulkine@gmail.com.

Bugs

I'm always open for bug reports; feel free to write them to arseny.kapoulkine@gmail.com. Please provide as much information as possible - version of pugixml, compiling and OS environment (compiler and its version, STL version, OS version, etc.), the description of the situation in which the bug arises, the code and data files that show the bug, etc. - the more, the better. Though, please, do not send executable files.

Note that you can also submit bug reports/suggestions on the project page.

Future work

Here are some improvements planned for future versions (they are sorted by priority; the upper ones will get there sooner).

Support for UTF-16 files (parsing BOM to get file's type and converting UTF-16 file to UTF-8 buffer if necessary)
More intelligent parsing of DOCTYPE (it does not always skip DOCTYPE for now)
XML 1.1 changes (changed EOL handling, normalization issues, etc.)
Name your own?
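
The first step of the planned UTF-16 support, recognizing the byte order mark, can be sketched independently of the parser. The names encoding_hint and detect_bom below are hypothetical and do not exist in pugixml; the byte sequences are the standard Unicode BOMs.

```cpp
#include <cassert>
#include <cstddef>

enum encoding_hint
{
    encoding_unknown,  // no BOM found; the data may still be UTF-8 or ANSI
    encoding_utf8,
    encoding_utf16_le,
    encoding_utf16_be
};

// Looks at the first bytes of a file and reports which BOM, if any, is present.
// UTF-32 BOMs are not distinguished in this sketch.
encoding_hint detect_bom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return encoding_utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) return encoding_utf16_le;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) return encoding_utf16_be;

    return encoding_unknown;
}
```

A loader could use the result to decide whether the file contents need to be converted to a UTF-8 buffer before parsing.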

Changelog

15.07.2006 - v0.1

First private release for testing purposes

6.11.2006 - v0.2

First public release. Changes:

Introduced child_value(name) and child_value_w(name)
Fixed child_value() (for empty nodes)
Fixed xml_parser_impl warning at W4
parse_eol_pcdata and parse_eol_attribute flags + parse_minimal optimizations
Optimizations of strconv_t

21.02.2007 - v0.3

Refactored, reworked and improved version. Changes:

Interface:
- Added XPath
- Added tree modification functions
- Added no STL compilation mode
- Added saving document to file
- Refactored parsing flags
- Removed xml_parser class in favor of xml_document
- Added transfer ownership parsing mode
- Modified the way xml_tree_walker works
- Iterators are now non-constant
Implementation:
- Support of several compilers and platforms
- Refactored and sped up parsing core
- Improved standard compliance
- Added XPath implementation
- Fixed several bugs

Acknowledgements

Kristen Wegner for pugxml parser
Neville Franks for contributions to pugxml parser

License

The pugixml parser is distributed under the MIT license:

Copyright (c) 2006-2007 Arseny Kapoulkine

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

Revised 21 February, 2007