From 69cc3fcb3a28d4b7f69dfa5f4dcc025eb53332d8 Mon Sep 17 00:00:00 2001
From: "arseny.kapoulkine"
+pugixml is just another XML parser. It is a successor to
+pugxml (well, to be honest, the only part
+that is left as is is the wildcard matching code; the rest was either heavily refactored or rewritten
+from scratch). The main features (call it USP) are:
+
+Okay, you might ask - what's the catch? Everything is so cute - it's a small, fast, robust, clean solution
+for parsing XML. What is missing? Ok, we are fair developers - so here is a misfeature list:
+
+pugixml is a DOM-based parser. This means that the XML document is converted to a tree.
+Each XML tag is converted to a node in the DOM tree. If a tag is contained in some other tag, its node
+is a child of the outer tag's node. Comments, CDATA sections and PIs (Processing Instructions) are also
+transformed into tree nodes, as is standalone text. Each node has its type. Here is an example of an XML document:
+
+pugixml documentation
+
+Contents
+
+Introduction
+
+1 The tests were done on a 1 MB XML file with a 4-level-deep tree
+and a small amount of text. The times are those of building the DOM tree.
+2 Obviously, you can't measure the time of building a DOM tree for a
+SAX parser, so what was measured instead is the time of reading the data into storage that closely
+represents the structure of the XML file.
+
+Document Object Model
+
+<?xml version="1.0"?>
+<mesh name="mesh_root">
+    <!-- here is a mesh node -->
+    some text
+    <![CDATA[[someothertext]]>
+    some more text
+    <node attr1="value1" />
+    <node attr1="value2">
+        <?TARGET somedata?>
+        <innernode/>
+    </node>
+</mesh>
+
+
+It gets converted to the following tree. Note that with some parsing options comments, PIs and CDATA
+sections are not stored in the tree, with some options there are also whitespace-only nodes,
+and the contents of PCDATA sections can differ slightly (leading/trailing whitespace may be kept). So, in general,
+the resulting DOM tree depends on the parsing options:
+The parent-children relations are shown with lines. Some nodes have previous and next siblings
+(for example, the next sibling of the node_comment node is the node_pcdata node with value "some text", and the
+previous sibling of the node_element with name "mesh" is the node_pi with target "xml"; the target of a PI node
+is stored in the node name).
+pugixml is a library for parsing XML files, which means that you give it XML data in some way,
+and it gives you the DOM tree and the means to traverse it and extract useful information from it.
+The library source consists of two files, the header pugixml.hpp and the source file pugixml.cpp.
+You can either compile the cpp file in your project, or build a static library (or perhaps even a DLL),
+or make the whole code use inline linkage and make one big file (as was done in pugxml).
+All library classes reside in namespace pugi, so you can either use fully qualified
+names (pugi::xml_node) or write a using directive or declaration (using namespace pugi;, using
+pugi::xml_node) and use plain names. All classes have the xml_ prefix.
+
+By default it is assumed that you compile the source file with your project (add it to your
+project, or add the relevant entry to your Makefile, or do whatever your build
+environment requires). The library is written in standard-conformant C++ and was tested on the win32 platform
+(MSVC 7.1 (2003), MSVC 8.0 (2005)).
+
+The xml_parser class is the core of the parsing process: you initiate parsing with it, you get the DOM
+tree from it, and the nodes and attributes are stored in it. You have two ways to load a file: either
+provide a string with XML data (it has to be null-terminated, and it will be modified during parsing,
+so it cannot be a piece of read-only memory), or provide a std::istream object (any input
+stream, like std::ifstream, std::istringstream, etc.), in which case the parser will allocate
+the necessary amount of memory (equivalent to the stream's size) and read everything from the stream.
+
+The functions for parsing are:
+
+    void parse(std::istream& stream, unsigned int optmsk = parse_noset);
+
+Allocates a buffer large enough for the contents of stream, reads the data from the stream into it
+and parses it with the provided options (optmsk).
+The stream does not have to persist after the call; the lifetime of the internal buffer
+with the stream's data is managed by pugixml.
+
+    char* parse(char* xmlstr, unsigned int optmsk = parse_noset);
+
+    xml_parser(std::istream& stream, unsigned int optmsk = parse_default);
+
+    xml_parser(char* xmlstr, unsigned int optmsk = parse_default);
+
+If you want to provide XML data after the parser has been created, use the default ctor and call a
+parsing function later. Otherwise you are free to use either one of the parsing ctors or the default ctor
+followed by a parsing function.
+
+After parsing an XML file, you'll get a DOM tree. To get access to it (or, more precisely, to its
+root), either call the document() function or cast the xml_parser object to xml_node by
+using the following functions:
+
+    operator xml_node() const;
+    xml_node document() const;
+
+Ok, the easy part is behind us - now let's dive into parsing options. There is a variety of them, and you
+must choose them wisely to get the needed results and the best speed/least memory overhead. First,
+there are flags that determine which parts of the document will be put into the DOM tree, and which will
+just be skipped:
+
+Then there are flags that determine how the retrieved data is processed. There are
+several reasons for these flags, mainly:
+
+Finally, there are two more flags that control closing tag parsing. When pugixml meets a
+closing tag, there are three ways to handle it:
+
+Did I say finally? Ok, so finally there are some helper flags, or rather groups of flags.
+These are:
+
+A couple of words on flag usage. The parsing options are just a set of bits, with each bit corresponding
+to one flag. You can turn a flag on by OR-ing the options value with this flag's constant:
+
+    parse_w3c | parse_wnorm_pcdata
+
+or turn a flag off by AND-ing the options value with the negation of this flag's constant:
+
+    parse_w3c & ~parse_comments
+
+You can access the current options of the parser with the options() method:
+
+    unsigned int options() const;
+    unsigned int options(unsigned int optmsk);
+
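
The OR / AND-NOT idiom for option bits can be tried in isolation. The sketch below is plain C++ and does not use pugixml at all: the demo_* constants and their bit values are made up for the illustration (the real flag values live in pugixml.hpp and are not shown here), and the helper names flag_on, flag_off and flag_is_set are hypothetical as well.

```cpp
#include <cassert>

// Hypothetical flag values, chosen only for this illustration.
// They are NOT the values of the pugixml parse_* constants.
const unsigned int demo_parse_comments     = 1u << 0;
const unsigned int demo_parse_cdata        = 1u << 1;
const unsigned int demo_parse_wnorm_pcdata = 1u << 2;

// A "group" is simply several flags OR-ed together.
const unsigned int demo_parse_group = demo_parse_comments | demo_parse_cdata;

// True if exactly one bit is set.
inline bool is_single_flag(unsigned int flag)
{
    return flag != 0 && (flag & (flag - 1)) == 0;
}

// Turn a flag on: OR the options value with the flag's constant.
inline unsigned int flag_on(unsigned int options, unsigned int flag)
{
    assert(is_single_flag(flag));
    return options | flag;
}

// Turn a flag off: AND the options value with the negated constant.
inline unsigned int flag_off(unsigned int options, unsigned int flag)
{
    assert(is_single_flag(flag));
    return options & ~flag;
}

// Check whether a flag is currently set.
inline bool flag_is_set(unsigned int options, unsigned int flag)
{
    assert(is_single_flag(flag));
    return (options & flag) != 0;
}
```

With these helpers, flag_off(flag_on(demo_parse_group, demo_parse_wnorm_pcdata), demo_parse_comments) leaves only the cdata and wnorm_pcdata bits set, which mirrors the two expressions shown in the text.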
+If xml_parser is the heart of constructing a DOM tree from a file, xml_node is the heart
+of processing the tree. It is a simple wrapper, so it's small (4/8 bytes, depending on the size of a
+pointer), you're free to copy it, and it does not own anything. I'll continue with a list of methods
+with their descriptions, with one note in advance. Some functions that take a
+string-like parameter have a counterpart with the suffix _w. The _w suffix means that the
+function does wildcard matching instead of a simple string comparison. You're free to use the wildcards
+* (matches any sequence of characters, possibly empty), ? (matches
+any single character) and character sets ([Abc] means 'any one of A, b and c', [A-Z4] means
+'any character from A to Z, or 4', [!0-9] means 'any character that is not a digit'). So the wildcard
+?ell_[0-9][0-9]_* will match strings like 'cell_23_xref' or 'hell_00_', but will not match
+strings like 'ell_23_xref', 'cell_0_x' or 'cell_0a_x'.
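
To make the wildcard rules concrete, here is a minimal matcher written from the description above. It is only a sketch: it is not the wildcard code that pugixml inherited from pugxml, the names wildcard_match and match_set are made up, and details the text does not specify (for example how an unterminated '[' is treated) are assumptions of this sketch.

```cpp
#include <cassert>

// Checks character c against the set that starts right after '['.
// Supports a leading '!' (negation), single characters and ranges like A-Z.
// On return, *end points just past the closing ']' (or at the end of the
// pattern if the set is unterminated; that choice is an assumption).
static bool match_set(const char* set, char c, const char** end)
{
    bool negate = (*set == '!');
    if (negate) ++set;

    bool found = false;
    while (*set && *set != ']')
    {
        if (set[1] == '-' && set[2] && set[2] != ']')
        {
            // Range such as 0-9.
            if (set[0] <= c && c <= set[2]) found = true;
            set += 3;
        }
        else
        {
            if (*set == c) found = true;
            ++set;
        }
    }

    if (*set == ']') ++set;
    *end = set;
    return found != negate;
}

// Returns true if the whole text matches the whole pattern.
bool wildcard_match(const char* pattern, const char* text)
{
    assert(pattern && text);

    while (*pattern)
    {
        if (*pattern == '*')
        {
            // Try every possible length for the sequence matched by '*'.
            for (const char* t = text; ; ++t)
            {
                if (wildcard_match(pattern + 1, t)) return true;
                if (!*t) return false;
            }
        }

        if (!*text) return false; // pattern needs a character, text is over

        if (*pattern == '?')
        {
            ++pattern; // any single character
        }
        else if (*pattern == '[')
        {
            const char* end;
            if (!match_set(pattern + 1, *text, &end)) return false;
            pattern = end;
        }
        else
        {
            if (*pattern != *text) return false;
            ++pattern;
        }

        ++text;
    }

    return *text == 0;
}
```

The recursive handling of '*' is the simplest correct approach, not the fastest one; patterns with many '*' can take exponential time with it.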
+
+    /// Access iterators for this node's collection of child nodes.
+    iterator begin() const;
+    iterator end() const;
+
+    /// Access iterators for this node's collection of child nodes (same as begin/end).
+    iterator children_begin() const;
+    iterator children_end() const;
+
+    /// Access iterators for this node's collection of attributes.
+    attribute_iterator attributes_begin() const;
+    attribute_iterator attributes_end() const;
+
+    /// Access iterators for this node's collection of siblings.
+    iterator siblings_begin() const;
+    iterator siblings_end() const;
+
+These functions return iterators for walking through children/siblings/attributes. More on that in the
+Iterators section.
+
+    operator unspecified_bool_type() const;
+
+This is a safe bool-like conversion operator. You can use it to check a node's validity (if (xml_node),
+if (!xml_node), if (node1 && node2 && !node3 && cond1 && ...) - you get the idea).
+
+    bool operator==(const xml_node& r) const;
+    bool operator!=(const xml_node& r) const;
+    bool operator<(const xml_node& r) const;
+    bool operator>(const xml_node& r) const;
+    bool operator<=(const xml_node& r) const;
+    bool operator>=(const xml_node& r) const;
+
+Comparison operators.
+
+    bool empty() const;
+
+if (node.empty()) is equivalent to if (!node).
+
+    xml_node_type type() const;
+    const char* name() const;
+    const char* value() const;
+
+Access the node's properties (type, name and value). If there is no name/value, the corresponding functions
+return "" - they never return NULL.
+
+    xml_node child(const char* name) const;
+    xml_node child_w(const char* name) const;
+
+Get a child node with the specified name, or xml_node() (an invalid node) if nothing is
+found.
+
+    xml_attribute attribute(const char* name) const;
+    xml_attribute attribute_w(const char* name) const;
+
+Get an attribute with the specified name, or xml_attribute() (an invalid attribute) if
+nothing is found.
+
+    xml_node sibling(const char* name) const;
+    xml_node sibling_w(const char* name) const;
+
+Get a sibling of the node with the specified name, or xml_node() if nothing is found.
+node.sibling(name) is equivalent to node.parent().child(name).
+
+    xml_node next_sibling(const char* name) const;
+    xml_node next_sibling_w(const char* name) const;
+    xml_node next_sibling() const;
+
+These functions get the next sibling, that is, one of the siblings of the node that is to its
+right. next_sibling() just returns the node's right neighbour (or xml_node() if there is none);
+the other two functions search for the next sibling with the given name.
+
+    xml_node previous_sibling(const char* name) const;
+    xml_node previous_sibling_w(const char* name) const;
+    xml_node previous_sibling() const;
+
+These functions do exactly the same as the next_sibling ones, except that they
+search among the left siblings.
+
+    xml_node parent() const;
+
+Get the parent node. The parent of the root node (the document) is considered to be the document
+itself.
+
+    const char* child_value() const;
+
+Look for the first node of type node_pcdata or node_cdata among the
+children of the current node and return its contents (or "" if nothing is found).
+
+    xml_attribute first_attribute() const;
+    xml_attribute last_attribute() const;
+
+These functions get the first and last attributes of the node (or xml_attribute() if the node
+has no attributes).
+
+    xml_node first_child() const;
+    xml_node last_child() const;
+
+These functions get the first and last children of the node (or xml_node() if the node has
+no children).
+
+    template <typename OutputIterator> void all_elements_by_name(const char* name, OutputIterator it) const;
+    template <typename OutputIterator> void all_elements_by_name_w(const char* name, OutputIterator it) const;
+
+Get all elements with the specified name in the subtree (depth-first search) and return them through
+an output iterator (e.g. std::back_inserter).
+
+    template <typename Predicate> xml_attribute find_attribute(Predicate pred) const;
+    template <typename Predicate> xml_node find_child(Predicate pred) const;
+    template <typename Predicate> xml_node find_element(Predicate pred) const;
+
+Find an attribute, a child or a node in the subtree (find_element - depth-first search) with the help
+of the given predicate. The predicate should behave like a function that accepts an xml_node (or an
+xml_attribute, for find_attribute) parameter and returns bool. The first entity for which
+the predicate returns true is returned. If the predicate returns false for all entities, xml_node()
+or xml_attribute() is returned.
+
+    xml_node first_element(const char* name) const;
+    xml_node first_element_w(const char* name) const;
+
+    xml_node first_element_by_value(const char* name, const char* value) const;
+    xml_node first_element_by_value_w(const char* name, const char* value) const;
+
+    xml_node first_element_by_attribute(const char* name, const char* attr_name, const char* attr_value) const;
+    xml_node first_element_by_attribute_w(const char* name, const char* attr_name, const char* attr_value) const;
+
+    xml_node first_element_by_attribute(const char* attr_name, const char* attr_value) const;
+    xml_node first_element_by_attribute_w(const char* attr_name, const char* attr_value) const;
+
+Find the first node (depth-first search) that meets the given criteria (i.e. either has
+a matching name, or a matching name and value, or an attribute with the given name/value, or such an attribute
+and a matching name). Note that the _w versions treat all parameters as wildcards.
+
+    xml_node first_node(xml_node_type type) const;
+
+Return the first node (depth-first search) with the given type, or xml_node().
+
+    std::string path(char delimiter = '/') const;
+
+Get the path of the node, i.e. the names of the nodes on the path from the DOM tree root
+to the node, separated with delimiter (/ by default).
+
+    xml_node first_element_by_path(const char* path, char delimiter = '/') const;
+
+Get the first element that has the given path. The path can be absolute (beginning with the delimiter) or
+relative; '..' means 'one level up' (so if we are at the path mesh/fragment/geometry/stream, ../..
+will lead us to mesh/fragment, and /mesh will lead us to mesh).
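
The absolute/relative path rules can be illustrated on plain strings. The sketch below only manipulates path strings and never touches a DOM tree, so it is not how first_element_by_path is implemented; split_path and resolve_path are hypothetical names, and ignoring a '..' that would go above the root is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split a path into its non-empty components.
static std::vector<std::string> split_path(const std::string& s, char delimiter)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;

    while (start <= s.size())
    {
        std::string::size_type end = s.find(delimiter, start);
        if (end == std::string::npos) end = s.size();
        if (end > start) parts.push_back(s.substr(start, end - start));
        start = end + 1;
    }

    return parts;
}

// Resolve 'path' against the path of the current node.
// A path that begins with the delimiter is absolute; '..' goes one level up.
std::string resolve_path(const std::string& current, const std::string& path,
                         char delimiter = '/')
{
    assert(delimiter != 0);

    std::vector<std::string> parts;
    const bool absolute = !path.empty() && path[0] == delimiter;
    if (!absolute) parts = split_path(current, delimiter);

    const std::vector<std::string> steps = split_path(path, delimiter);
    for (std::size_t i = 0; i < steps.size(); ++i)
    {
        if (steps[i] == "..")
        {
            // Assumption: going above the root is silently ignored.
            if (!parts.empty()) parts.pop_back();
        }
        else
        {
            parts.push_back(steps[i]);
        }
    }

    std::string result;
    for (std::size_t i = 0; i < parts.size(); ++i)
    {
        if (i) result += delimiter;
        result += parts[i];
    }

    return result;
}
```

For the example from the text, resolving ../.. against mesh/fragment/geometry/stream yields mesh/fragment, and resolving /mesh yields mesh regardless of the current path.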
+
+    bool traverse(xml_tree_walker& walker) const;
+
+Traverse the subtree (beginning with the current node) with the walker and return the result. See the
+Miscellaneous section for details.
+
+Like xml_node, xml_attribute is a simple wrapper around a node's attribute.
+
+    bool operator==(const xml_attribute& r) const;
+    bool operator!=(const xml_attribute& r) const;
+    bool operator<(const xml_attribute& r) const;
+    bool operator>(const xml_attribute& r) const;
+    bool operator<=(const xml_attribute& r) const;
+    bool operator>=(const xml_attribute& r) const;
+
+Comparison operators.
+
+    operator unspecified_bool_type() const;
+
+Safe bool conversion - as with xml_node, use this to check for validity.
+
+    bool empty() const;
+
+As with xml_node, if (attr.empty()) is equivalent to if (!attr).
+
+    xml_attribute next_attribute() const;
+    xml_attribute previous_attribute() const;
+
+Get the next/previous attribute of the node that owns the current attribute. Return xml_attribute()
+if there is no such attribute.
+
+    const char* name() const;
+    const char* value() const;
+
+Get the name and value of the attribute. These methods never return NULL - they return "" instead.
+
+    int as_int() const;
+    double as_double() const;
+    float as_float() const;
+
+Convert the value of the attribute to the desired type. If the conversion is not successful, return the
+default value (0 for int, 0.0 for double, 0.0f for float). These functions rely on the CRT ato* functions.
+
+    bool as_bool() const;
+
+Convert the value of the attribute to bool. This method returns true if the first character of the
+value is '1', 't', 'T', 'y' or 'Y'. Otherwise it returns false.
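
The conversion rules above are easy to reproduce on plain C strings, which helps when deciding how to spell attribute values. The sketch below is an illustration only: value_as_int and value_as_bool are hypothetical helpers, and the use of std::atoi is an assumption based on the remark about the CRT ato* functions.

```cpp
#include <cassert>
#include <cstdlib>

// Integer conversion in the spirit of as_int(): delegate to atoi, which
// yields 0 when the string does not start with a number.
int value_as_int(const char* value)
{
    assert(value); // the text above says value() never returns NULL
    return std::atoi(value);
}

// Boolean conversion following the stated rule: only the first character
// of the value is inspected.
bool value_as_bool(const char* value)
{
    assert(value);
    const char c = value[0];
    return c == '1' || c == 't' || c == 'T' || c == 'y' || c == 'Y';
}
```

Two consequences of these rules are worth keeping in mind: a failed integer conversion is indistinguishable from a genuine 0, and spellings such as "on" count as false because only the first character is checked.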
+
+Sometimes you have to cycle through the children or the attributes of a node. You can do it either
+by using next_sibling, previous_sibling, next_attribute and previous_attribute
+(along with first_child, last_child, first_attribute and last_attribute),
+or you can use an iterator-like interface. There are two iterator types, xml_node_iterator and
+xml_attribute_iterator. They are bidirectional constant iterators, which means that you can
+either increment or decrement them, and use the dereferencing and member access operators to get constant
+access to the node/attribute (the constness of the iterators may change with the introduction of mutable trees).
+
+In order to get the iterators, use the corresponding functions of xml_node. Note that the _end()
+functions return a past-the-end iterator, so in order to get the last attribute, you'll have to
+do something like:
+
+    if (node.attributes_begin() != node.attributes_end()) // we have at least one attribute
+    {
+        xml_attribute last_attrib = *(--node.attributes_end());
+        ...
+    }
+
+If you want to traverse a subtree, you can use the traverse function. There is a class,
+xml_tree_walker, with functions that you can override in order to customize the traversal
+(the default one just does nothing).
+
+    virtual bool begin(const xml_node&);
+    virtual bool end(const xml_node&);
+
+These functions are called when the processing of a node starts/ends. First begin()
+is called, then all children of the node are processed recursively, then end() is called. If
+any of these functions returns false, the traversal is stopped and the traverse() function
+returns false.
+
+    virtual void push();
+    virtual void pop();
+
+These functions are called before and after the processing of a node's children. If the node has no children,
+neither of them is called. The default behavior is to increment/decrement the current node depth.
+
+    virtual int depth() const;
+
+Get the current depth. You can use this function to do your own indentation, for example.
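
The call order of begin(), push(), pop() and end() is easier to see on a toy tree. The sketch below re-creates the protocol as described above with stand-in types: toy_node, toy_walker, outline_walker and the free function traverse are all hypothetical and are not pugixml classes. What happens to the depth when the traversal is stopped early is not specified in the text, so this sketch simply returns without calling pop().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a DOM node (not the pugixml type).
struct toy_node
{
    std::string name;
    std::vector<toy_node> children;
};

// Walker interface modelled on the description above.
struct toy_walker
{
    int level;

    toy_walker(): level(0) {}
    virtual ~toy_walker() {}

    virtual bool begin(const toy_node&) { return true; }
    virtual bool end(const toy_node&) { return true; }
    virtual void push() { ++level; } // entering the children of a node
    virtual void pop() { --level; }  // leaving the children of a node
    virtual int depth() const { return level; }
};

// begin(), then the children (wrapped in push/pop), then end().
bool traverse(const toy_node& node, toy_walker& walker)
{
    if (!walker.begin(node)) return false;

    if (!node.children.empty())
    {
        walker.push();

        for (std::size_t i = 0; i < node.children.size(); ++i)
            if (!traverse(node.children[i], walker)) return false;

        walker.pop();
    }

    return walker.end(node);
}

// Example walker: builds an outline, indenting names by the current depth.
struct outline_walker: toy_walker
{
    std::string out;

    virtual bool begin(const toy_node& node)
    {
        assert(depth() >= 0);
        out += std::string(depth() * 2, ' ') + node.name + "\n";
        return true;
    }
};
```

Running outline_walker over a mesh node that contains a fragment, which in turn contains a geometry node, produces three lines indented by 0, 2 and 4 spaces, and the depth is back at 0 when traverse returns.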
+
+Let's get to some minor notes. You can safely write something like:
+
+    bool value = node.child("stream").attribute("compress").as_bool();
+
+As parsing is done in-situ, the XML data has to persist during the lifetime of xml_parser. If
+parsing is started via a function of xml_parser that accepts char*, you have to ensure
+yourself that the string outlives the xml_parser object.
+
+The memory for nodes and attributes is allocated in blocks that form a linked list.
+The default block size is 32 kb, though you can change it via the memory_block_size
+constant in the pugixml.hpp file. Remember that the first block resides
+inside the xml_parser object (so it is on the stack when the parser is a local variable), and all
+subsequent blocks are allocated on the heap, so expect a stack overflow
+if you set the memory block size too large. Because the xml_parser object contains the blocks, it
+should outlive all xml_node and xml_attribute objects (as well as iterators) that belong
+to the parser's tree. Again, you have to ensure this yourself.
+
+Ok, so you are not much of a documentation reader, are you? Neither am I. Let's assume that you're going
+to parse an XML file... something like this:
+
+<?xml version="1.0" encoding="UTF-8"?>
+<mesh name="Cathedral">
+    <fragment name="Cathedral">
+        <geometry>
+            <stream usage="main" source="StAnna.dmesh" compress="true" />
+            <stream usage="ao" source="StAnna.ao" />
+        </geometry>
+    </fragment>
+    <fragment name="Cathedral">
+        ...
+    </fragment>
+    ...
+</mesh>
+
+<mesh> is the root node; it has 0 or more <fragment>s, each of them has a <geometry>
+node, and there are <stream> nodes with the attributes shown. We'd like to parse the file and...
+well, do something with its contents. There are several ways of doing that; I'll show 2 of them
+(the remaining one uses iterators).
+
+Here we exploit our knowledge of the strict hierarchy of the XML document and read the nodes from the
+DOM tree accordingly. When we have an xml_node object, we can get the desired information from
+it (name, value, attribute list, nearby nodes in the tree - siblings, parent and children).
+
+    #include <fstream>
+    #include <vector>
+    #include <algorithm>
+    #include <iterator>
+
+    #include "pugixml.hpp"
+
+    using namespace pugi;
+
+    int main()
+    {
+        std::ifstream in("mesh.xml");
+        in.unsetf(std::ios::skipws); // keep whitespace while copying characters
+
+        std::vector<char> buf;
+        std::copy(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::back_inserter(buf));
+        buf.push_back(0); // zero-terminate
+
+        // parsing is done in-situ, so buf has to outlive parser
+        xml_parser parser(&buf[0], pugi::parse_w3c);
+
+        xml_node doc = parser.document();
+
+        if (xml_node mesh = doc.first_element("mesh"))
+        {
+            // store mesh.attribute("name").value()
+
+            // next_sibling("fragment") skips siblings that are not <fragment> nodes
+            for (xml_node fragment = mesh.first_element("fragment"); fragment; fragment = fragment.next_sibling("fragment"))
+            {
+                // store fragment.attribute("name").value()
+
+                if (xml_node geometry = fragment.first_element("geometry"))
+                    for (xml_node stream = geometry.first_element("stream"); stream; stream = stream.next_sibling("stream"))
+                    {
+                        // store stream.attribute("usage").value()
+                        // store stream.attribute("source").value()
+
+                        if (stream.attribute("compress"))
+                        {
+                            // store stream.attribute("compress").as_bool()
+                        }
+                    }
+            }
+        }
+    }
+
+We can also write a class that will traverse the DOM tree and store the information from nodes based
+on their names, depths, attributes, etc. This approach is well known to the users of SAX parsers. To do that,
+we have to write an implementation of the xml_tree_walker interface.
+
+    #include <cstring>
+    #include <fstream>
+    #include <vector>
+    #include <algorithm>
+    #include <iterator>
+
+    #include "pugixml.hpp"
+
+    using namespace pugi;
+
+    struct mesh_parser: public xml_tree_walker
+    {
+        virtual bool begin(const xml_node& node)
+        {
+            if (std::strcmp(node.name(), "mesh") == 0)
+            {
+                // store node.attribute("name").value()
+            }
+            else if (std::strcmp(node.name(), "fragment") == 0)
+            {
+                // store node.attribute("name").value()
+            }
+            else if (std::strcmp(node.name(), "geometry") == 0)
+            {
+                // ...
+            }
+            else if (std::strcmp(node.name(), "stream") == 0)
+            {
+                // store node.attribute("usage").value()
+                // store node.attribute("source").value()
+
+                if (node.attribute("compress"))
+                {
+                    // store node.attribute("compress").as_bool()
+                }
+            }
+            else return false; // unexpected node: stop the traversal
+
+            return true;
+        }
+    };
+
+    int main()
+    {
+        std::ifstream in("mesh.xml");
+        in.unsetf(std::ios::skipws); // keep whitespace while copying characters
+
+        std::vector<char> buf;
+        std::copy(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::back_inserter(buf));
+        buf.push_back(0); // zero-terminate
+
+        xml_parser parser(&buf[0], pugi::parse_w3c);
+
+        mesh_parser mp;
+
+        if (!parser.document().traverse(mp))
+        {
+            // generate an error
+        }
+    }
+
+So, let's talk a bit about the parsing process, and about the reason for providing XML data as a contiguous
+writeable block of memory. Parsing is done in-situ. This means that the strings representing the
+parts of the DOM tree (node names, attribute names and values, CDATA content, etc.) are not separately
+allocated on the heap, but instead are parts of the original data. This is the key to parsing speed,
+because it helps achieve the minimal number of memory allocations (more on that below) and the minimal
+amount of data copying.
+
+In-situ parsing can be done in two ways: by zero-segmenting the string (that is, setting the past-the-end
+character of each part of the XML string to 0, see
+this image for further details), or by storing a pointer plus the size of the string instead of a pointer to
+the beginning of an ASCIIZ string.
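
The difference between the two ways can be shown on a tiny example that extracts only the name of the first tag from a buffer. This is a sketch under stated assumptions and not the pugixml parser: tag_name_destructive, tag_name_nondestructive and string_ref are hypothetical names, and the set of characters that end a name is simplified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Way 1: zero-segmenting. The character right after the tag name is
// overwritten with 0, so the name becomes an ordinary C string that lives
// inside the original buffer. The buffer is modified.
char* tag_name_destructive(char* xml)
{
    assert(xml);

    char* lt = std::strchr(xml, '<');
    if (!lt) return 0;

    char* name = lt + 1;
    char* end = name + std::strcspn(name, " \t\r\n/>");
    *end = 0; // segment the string in place

    return name;
}

// Way 2: pointer + size. Nothing is written, so the data may be constant
// (for example a read-only memory-mapped file).
struct string_ref
{
    const char* begin;
    std::size_t size;
};

string_ref tag_name_nondestructive(const char* xml)
{
    assert(xml);

    string_ref result = { 0, 0 };

    const char* lt = std::strchr(xml, '<');
    if (!lt) return result;

    result.begin = lt + 1;
    result.size = std::strcspn(result.begin, " \t\r\n/>");

    return result;
}
```

With the first function the caller gets a string usable with strcmp and friends at the price of a modified buffer; with the second one every consumer has to carry the size around, but the buffer stays intact.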
+
+Originally, pugxml had only the first way, but then its authors added the second method, 'non-segmenting'
+or non-destructive parsing. The advantages of this method are that you no longer need non-constant storage
+and that you can even read data from memory-mapped files directly. Well, there are disadvantages too.
+For one thing, you cannot do any of the transformations in-situ. The transformations that are required
+by the XML standard are:
+
+In order to be able to modify the tree (change attribute/node names & values) with in-situ parsing,
+one needs to implement two ways of storing data (both in-situ and not). The DOM tree is currently immutable,
+but this will change in future releases (without introducing speed/memory overhead, except at the clean-up
+stage).
+
+The parsing process itself looks more or less straightforward when you see it - but the impression
+is false, because explicit jumps are made (i.e. we know that if we come to a closing angle bracket (>),
+we should expect CDATA after it (or a new tag), so let's just jump to the corresponding code), and,
+well, there can be bugs (see the Bugs section).
+
+And, to make things worse, memory allocation (which is done only for node and attribute structures)
+is done in pools. The pools are singly linked lists of blocks with a predefined size (32 kb by default), and,
+well, this increases speed a lot (allocations are slow, and the memory gets fragmented when allocating
+a bunch of 16-byte (attribute) or 40-byte (node) structures).
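
A pool of this kind takes only a few lines. The sketch below is an illustration, not pugixml's allocator: the names pool, pool_block and block_size are made up, every block is taken from the heap (the text above says that pugixml keeps the first block inside the parser object), and rounding each request up to the pointer size is an assumption about alignment.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t block_size = 32 * 1024; // 32 kb, as in the text

// One block of the pool; blocks form a singly linked list.
struct pool_block
{
    pool_block* next;
    std::size_t used;
    char data[block_size];
};

class pool
{
public:
    pool(): head_(0), blocks_(0) {}

    ~pool()
    {
        // Everything is released at once, block by block.
        while (head_)
        {
            pool_block* next = head_->next;
            delete head_;
            head_ = next;
        }
    }

    // Hand out 'size' bytes from the current block; start a new block
    // when the current one cannot hold the request.
    void* allocate(std::size_t size)
    {
        const std::size_t align = sizeof(void*);
        size = (size + align - 1) / align * align; // round up (assumption)
        assert(size <= block_size);

        if (!head_ || head_->used + size > block_size)
        {
            pool_block* block = new pool_block;
            block->next = head_;
            block->used = 0;
            head_ = block;
            ++blocks_;
        }

        void* result = head_->data + head_->used;
        head_->used += size;
        return result;
    }

    std::size_t block_count() const { return blocks_; }

private:
    pool(const pool&);            // non-copyable
    pool& operator=(const pool&);

    pool_block* head_;
    std::size_t blocks_;
};
```

With 16-byte structures, 2048 allocations fit into one 32 kb block, so the general-purpose allocator is called once instead of 2048 times, and individual structures are never freed on their own.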
+
+pugixml is not a compliant XML parser. The main reason is that it does not reject
+most malformed XML files. A more or less complete list of incompatibilities follows (I will be talking
+about the ones that apply when using parse_w3c mode):
+
+I'm always open to questions; feel free to send them to zeux@mathcentre.com.
+
+I'm always open to bug reports; feel free to send them to zeux@mathcentre.com.
+Please provide as much information as possible - the version of pugixml, the compiler and OS environment
+(compiler and its version, STL version, OS version, etc.), a description of the situation in which
+the bug arises, the code and data files that reproduce the bug, etc. - the more, the better. Please
+do not send executable files, though.
+
+Here are some improvements that will be made in future versions (they are sorted by priority; the
+upper ones will get there sooner).
+
+The pugixml parser is released into the public domain (though this may change).
+
+Revised 15 July, 2006
+© Copyright Zeux 2006. All Rights Reserved.
+
+
diff --git a/docs/tree.png b/docs/tree.png
new file mode 100644
index 0000000..14d48d6
Binary files /dev/null and b/docs/tree.png differ
--
cgit v1.2.3