Loading document


Loading document from file
Loading document from memory
Loading document from C++ IOstreams
Handling parsing errors
Parsing options
Encodings
Conformance to W3C specification

pugixml provides several functions for loading XML data from various places: files, C++ iostreams, and memory buffers. All functions use an extremely fast non-validating parser. This parser is not fully W3C conformant: it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation is not performed for performance reasons. Some XML transformations (e.g. EOL handling or attribute value normalization) can also impact parsing speed and thus can be disabled. However, for the vast majority of XML documents there is no performance difference between different parsing options. Parsing options also control whether certain XML nodes are parsed; see Parsing options for more information.

XML data is always converted to the internal character format (see Unicode interface) before parsing. pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions automatically. Unless an explicit encoding is specified, loading functions perform automatic encoding detection based on the first few characters of XML data, so in almost all cases you do not have to specify the document encoding. Encoding conversion is described in more detail in Encodings.

Loading document from file

The most common source of XML data is files; pugixml provides a separate function for loading an XML document from a file:

xml_parse_result xml_document::load_file(const char* path, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);

This function accepts the file path as its first argument, and also two optional arguments, which specify parsing options (see Parsing options) and input data encoding (see Encodings). The path uses the target operating system's format: it can be relative or absolute, it should use the delimiters of the target system, it should have the exact case if the target file system is case-sensitive, and so on. The file path is passed to the system file opening function as is.

load_file destroys the existing document tree and then tries to load the new tree from the specified file. The result of the operation is returned in an xml_parse_result object; this object contains the operation status and the related information (e.g. the last successfully parsed position in the input file, if parsing fails). See Handling parsing errors for error handling details.

	Note
	As of version 0.9, there is no function for loading an XML document from a wide character path. Unfortunately, there is no portable way to do this; version 1.0 will provide such a function only for platforms with the corresponding functionality. You can use stream-loading functions as a workaround if your STL implementation can open file streams via `wchar_t` paths.

This is an example of loading an XML document from a file (samples/load_file.cpp):

pugi::xml_document doc;

pugi::xml_parse_result result = doc.load_file("tree.xml");

std::cout << "Load result: " << result.description() << ", mesh name: " << doc.child("mesh").attribute("name").value() << std::endl;

Loading document from memory

Sometimes XML data should be loaded from a source other than a file, e.g. an HTTP URL; you may also want to load XML data from a file using non-standard functions, e.g. to use your virtual file system facilities or to load XML from gzip-compressed files. All these scenarios require loading the document from memory. First you should prepare a contiguous memory block with all XML data; then you have to invoke one of the buffer loading functions. These functions handle the necessary encoding conversions, if any, and then parse the data into the corresponding XML tree. There are several buffer loading functions, which differ in behavior and thus in performance/memory usage:

xml_parse_result xml_document::load_buffer(const void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_buffer_inplace(void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_buffer_inplace_own(void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);

All functions accept the buffer, which is represented by a pointer to XML data, contents, and the data size in bytes. There are also two optional arguments, which specify parsing options (see Parsing options) and input data encoding (see Encodings). The buffer does not have to be zero-terminated.

The load_buffer function works with an immutable buffer: it never modifies the buffer. Because of this restriction it has to create a private buffer and copy XML data to it before parsing (applying encoding conversions if necessary). This copy operation carries a performance penalty, so inplace functions are provided: load_buffer_inplace and load_buffer_inplace_own store the document data in the buffer, modifying it in the process. For the document to stay valid, you have to make sure that the buffer's lifetime exceeds that of the tree if you're using inplace functions. In addition, load_buffer_inplace does not assume ownership of the buffer, so you'll have to destroy it yourself; load_buffer_inplace_own assumes ownership of the buffer and destroys it once it is no longer needed. This means that if you're using load_buffer_inplace_own, you have to allocate memory with the pugixml allocation function (you can get it via get_memory_allocation_function).

From the performance/memory point of view, the best way to load a document is load_buffer_inplace_own; this function has maximum control of the buffer with XML data, so it is able to avoid redundant copies and reduce peak memory usage while parsing. This is the recommended function if you have to load the document from memory and performance is critical.

There is also a simple helper function for cases when you want to load the XML document from a null-terminated character string:

xml_parse_result xml_document::load(const char_t* contents, unsigned int options = parse_default);

It is equivalent to calling load_buffer with size = strlen(contents). This function assumes native encoding for input data, so it does not do any encoding conversion. In general, this function is fine for loading small documents from string literals, but has more overhead and less functionality than buffer loading functions.

This is an example of loading an XML document from memory using different functions (samples/load_memory.cpp):

const char source[] = "<mesh name='sphere'><bounds>0 0 1 1</bounds></mesh>";
size_t size = sizeof(source);

// You can use load_buffer to load document from immutable memory block:
pugi::xml_parse_result result = doc.load_buffer(source, size);

// You can use load_buffer_inplace to load document from mutable memory block; the block's lifetime must exceed that of document
char* buffer = new char[size];
memcpy(buffer, source, size);

// The block can be allocated by any method; the block is modified during parsing
pugi::xml_parse_result result = doc.load_buffer_inplace(buffer, size);

// You have to destroy the block yourself after the document is no longer used
delete[] buffer;

// You can use load_buffer_inplace_own to load document from mutable memory block and to pass the ownership of this block
// The block has to be allocated via pugixml allocation function - using e.g. operator new here is incorrect
char* buffer = static_cast<char*>(pugi::get_memory_allocation_function()(size));
memcpy(buffer, source, size);

// The block will be deleted by the document
pugi::xml_parse_result result = doc.load_buffer_inplace_own(buffer, size);

// You can use load to load document from null-terminated strings, for example literals:
pugi::xml_parse_result result = doc.load("<mesh name='sphere'><bounds>0 0 1 1</bounds></mesh>");

Loading document from C++ IOstreams

For additional interoperability pugixml provides functions for loading a document from any object which implements the C++ std::istream interface. This allows you to load documents from any standard C++ stream (e.g. a file stream) or any third-party compliant implementation (e.g. Boost Iostreams). There are two functions: one works with narrow character streams, the other handles wide character ones:

xml_parse_result xml_document::load(std::istream& stream, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load(std::wistream& stream, unsigned int options = parse_default);

load with a std::istream argument loads the document from the stream, from the current read position to the end, treating the stream contents as a byte stream of the specified encoding (with encoding autodetection as necessary). Thus calling xml_document::load on an opened std::ifstream object is equivalent to calling xml_document::load_file.

load with a std::wistream argument treats the stream contents as a wide character stream (the encoding is always encoding_wchar). Because of this, using load with wide character streams requires careful (usually platform-specific) stream setup (e.g. using the imbue function). Generally, use of wide streams is discouraged; however, it gives you the ability to load documents from non-Unicode encodings, e.g. you can load Shift-JIS encoded data if you set the correct locale.

This is a simple example of loading an XML document from a file using streams (samples/load_stream.cpp); read the sample code for more complex examples involving wide streams and locales:

std::ifstream stream("weekly-utf-8.xml");
pugi::xml_parse_result result = doc.load(stream);

Stream loading requires working seek/tell functions and therefore may fail when used with some stream implementations like gzstream.
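
For streams that cannot seek, one possible workaround is to read the data sequentially into a contiguous buffer and pass that buffer to load_buffer. The following is a minimal sketch under that assumption; read_all is a hypothetical helper written for this example and is not part of pugixml, and the pugixml call appears only in a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper (not a pugixml function): reads everything that is left
// in the stream using only sequential reads, so no seek/tell support is needed.
std::vector<char> read_all(std::istream& stream)
{
    std::vector<char> data;
    char chunk[4096];

    // read() sets failbit on the final, short read; gcount() still reports
    // how many characters were actually extracted by that call.
    while (stream.read(chunk, sizeof(chunk)) || stream.gcount() > 0)
        data.insert(data.end(), chunk, chunk + stream.gcount());

    return data;
}

// Usage sketch, assuming doc is a pugi::xml_document and stream is any
// std::istream (for example a std::istringstream or a decompressing stream):
//   std::vector<char> data = read_all(stream);
//   pugi::xml_parse_result result = doc.load_buffer(data.data(), data.size());
```

Because load_buffer copies the data into a private buffer, the vector can be destroyed right after the call; the price is that the document is held in memory twice for a short time.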

Handling parsing errors

All document loading functions return the parsing result via an xml_parse_result object. It contains the parsing status, the offset of the last successfully parsed character from the beginning of the source stream, and the encoding of the source stream:

struct xml_parse_result
{
    xml_parse_status status;
    ptrdiff_t offset;
    xml_encoding encoding;

    operator bool() const;
    const char* description() const;
};

Parsing status is represented as the xml_parse_status enumeration and can be one of the following:

- status_ok means that no error was encountered during parsing; the source stream represents a valid XML document which was fully parsed and converted to a tree.

- status_file_not_found is only returned by the load_file function and means that the file could not be opened.
- status_io_error is returned by the load_file function and by load functions with std::istream/std::wistream arguments; it means that some I/O error has occurred while reading the file/stream.
- status_out_of_memory means that there was not enough memory during some allocation; any allocation failure during parsing results in this error.
- status_internal_error means that something went horribly wrong; currently this error does not occur.

- status_unrecognized_tag means that parsing stopped due to a tag with either an empty name or a name which starts with an incorrect character, such as #.
- status_bad_pi means that parsing stopped due to an incorrect document declaration/processing instruction.
- status_bad_comment, status_bad_cdata, status_bad_doctype and status_bad_pcdata mean that parsing stopped due to an invalid construct of the respective type.
- status_bad_start_element means that parsing stopped because the starting tag either had no closing > symbol or contained some incorrect symbol.
- status_bad_attribute means that parsing stopped because there was an incorrect attribute, such as an attribute without a value or with a value that is not quoted (note that <node attr=1> is incorrect in XML).
- status_bad_end_element means that parsing stopped because the ending tag had incorrect syntax (e.g. extra non-whitespace symbols between the tag name and >).
- status_end_element_mismatch means that parsing stopped because the closing tag did not match the opening one (e.g. <node></nedo>) or because some tag was not closed at all.

The description() member function can be used to convert the parsing status to a string; the returned message is always in English, so you'll have to write your own function if you need a localized string. Note, however, that the exact messages returned by description() may change from version to version, so any complex status handling should be based on the status value.

If parsing failed because the source data was not valid XML, the resulting tree is not destroyed: although the load function returns an error, you can use the part of the tree that was successfully parsed. Obviously, the last element may have an unexpected name/value; for example, if the attribute value does not end with the necessary quotation mark, as in <node attr="value>some data</node>, the value of the attribute attr will contain the string value>some data</node>.

In addition to the status code, the parsing result has an offset member, which contains the offset of the last successfully parsed character if parsing failed because of an error in the source data; otherwise offset is 0. For parsing efficiency reasons, pugixml does not track the current line during parsing; this offset is in units of pugi::char_t (bytes for character mode, wide characters for wide character mode). Many text editors support a 'Go To Position' feature, which you can use to locate the exact error position. Alternatively, if you're loading the document from memory, you can display the error chunk along with the error description (see the example code below).
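
Since the parser reports only an offset, an application that wants a line and column can compute them from the original buffer. The following is a minimal sketch, assuming character mode and a buffer that was parsed without encoding conversion; text_position and offset_to_position are hypothetical names introduced for this example and are not part of pugixml.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (not part of pugixml): a 1-based line/column pair.
struct text_position
{
    std::size_t line;
    std::size_t column;
};

// Converts an offset into the original buffer to a line/column pair by
// counting the '\n' characters that precede it. Columns are counted in bytes,
// so multi-byte UTF-8 characters advance the column by more than one.
text_position offset_to_position(const char* data, std::size_t size, std::ptrdiff_t offset)
{
    assert(offset >= 0 && static_cast<std::size_t>(offset) <= size);

    text_position pos = {1, 1};

    for (std::size_t i = 0; i < static_cast<std::size_t>(offset); ++i)
    {
        if (data[i] == '\n')
        {
            ++pos.line;
            pos.column = 1;
        }
        else
            ++pos.column;
    }

    return pos;
}
```

A caller would pass the same buffer that was given to the loading function together with result.offset. The scan is linear in the offset, which is usually acceptable because it only runs when an error is reported.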

	Caution
	Offset is calculated in the XML buffer in native encoding; if encoding conversion is performed during parsing, the offset cannot be used to reliably track the error position.

The parsing result also has an encoding member, which can be used to check that the source data encoding was correctly guessed. It is equal to the exact encoding used during parsing (i.e. with the exact endianness); see Encodings for more information.

The parsing result object can be implicitly converted to bool; if you do not want to handle parsing errors thoroughly, you can just check the return value of load functions as if it were a bool: if (doc.load_file("file.xml")) { ... } else { ... }.

This is an example of handling loading errors (samples/load_error_handling.cpp):

pugi::xml_document doc;
pugi::xml_parse_result result = doc.load(source);

if (result)
    std::cout << "XML [" << source << "] parsed without errors, attr value: [" << doc.child("node").attribute("attr").value() << "]\n\n";
else
{
    std::cout << "XML [" << source << "] parsed with errors, attr value: [" << doc.child("node").attribute("attr").value() << "]\n";
    std::cout << "Error description: " << result.description() << "\n";
    std::cout << "Error offset: " << result.offset << " (error at [..." << (source + result.offset) << "]\n\n";
}

Parsing options

All document loading functions accept the optional parameter options. This is a bitmask that customizes the parsing process: you can select the node types that are parsed and various transformations that are performed on the XML text. Disabling certain transformations can improve parsing performance for some documents; however, the code for all transformations is very well optimized, and thus the majority of documents won't get any performance benefit. As a rule of thumb, only modify parsing flags if you want to get some nodes in the document that are excluded by default (e.g. declaration or comment nodes).

	Note
	You should use the usual bitwise arithmetic to manipulate the bitmask: to enable a flag, use `mask | flag`; to disable a flag, use `mask & ~flag`.
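
The bitmask operations from the note can be sketched in isolation as follows. The flag constants below are stand-ins with arbitrary values chosen for this example; real code would use the pugi::parse_* constants, whose numeric values are not assumed here, and the helper functions are hypothetical.

```cpp
#include <cassert>

// Stand-in flags for illustration only; these are NOT pugixml's constants.
const unsigned int flag_comments = 0x01;
const unsigned int flag_escapes  = 0x02;
const unsigned int flag_eol      = 0x04;

// A stand-in "default" mask with escapes and EOL handling enabled.
const unsigned int mask_default = flag_escapes | flag_eol;

// Enables a flag without touching the other bits.
unsigned int enable_flag(unsigned int mask, unsigned int flag)
{
    return mask | flag;
}

// Disables a flag without touching the other bits.
unsigned int disable_flag(unsigned int mask, unsigned int flag)
{
    return mask & ~flag;
}

// Checks whether a flag is set in the mask.
bool has_flag(unsigned int mask, unsigned int flag)
{
    return (mask & flag) != 0;
}
```

Starting from the default mask and adjusting single bits keeps the remaining options at their defaults, which is usually safer than building a mask from scratch.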

These flags control the resulting tree contents:

- parse_declaration determines if the XML document declaration (node with type node_declaration) is to be put in the DOM tree. If this flag is off, it is not put in the tree, but is still parsed and checked for correctness. This flag is off by default.
- parse_pi determines if processing instructions (nodes with type node_pi) are to be put in the DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. Note that <?xml ...?> (document declaration) is not considered to be a PI. This flag is off by default.
- parse_comments determines if comments (nodes with type node_comment) are to be put in the DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is off by default.
- parse_cdata determines if CDATA sections (nodes with type node_cdata) are to be put in the DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is on by default.
- parse_ws_pcdata determines if PCDATA nodes (nodes with type node_pcdata) that consist only of whitespace characters are to be put in the DOM tree. Often whitespace-only data is not significant for the application, and the cost of allocating and storing such nodes (both memory and speed-wise) can be significant. For example, after parsing the XML string <node> <a/> </node>, the <node> element will have three children when parse_ws_pcdata is set (a child with type node_pcdata and value " ", a child with type node_element and name "a", and another child with type node_pcdata and value " "), and only one child when parse_ws_pcdata is not set. This flag is off by default.

These flags control the transformation of tree element contents:

- parse_escapes determines if character and entity references are to be expanded during the parsing process. Character references have the form &#...; or &#x...; (... is the Unicode numeric representation of the character in either decimal (&#...;) or hexadecimal (&#x...;) form); entity references are &lt;, &gt;, &amp;, &apos; and &quot; (note that as pugixml does not handle DTD, the only allowed entities are the predefined ones). If a character/entity reference cannot be expanded, it is left as is, so you can do additional processing later. Reference expansion is performed in attribute values and PCDATA content. This flag is on by default.
- parse_eol determines if EOL handling (that is, replacing sequences 0x0d 0x0a by a single 0x0a character, and replacing all standalone 0x0d characters by 0x0a) is to be performed on input data (that is, comment contents, PCDATA/CDATA contents and attribute values). This flag is on by default.
- parse_wconv_attribute determines if attribute value normalization should be performed for all attributes. This means that whitespace characters (new line, tab and space) are replaced with a space (' '). New line characters are always treated as if parse_eol is set, i.e. \r\n is converted to a single space. This flag is on by default.
- parse_wnorm_attribute determines if extended attribute value normalization should be performed for all attributes. This means that after attribute values are normalized as if parse_wconv_attribute was set, leading and trailing space characters are removed, and all sequences of space characters are replaced by a single space character. The value of parse_wconv_attribute has no effect if this flag is on. This flag is off by default.
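
The EOL handling described for parse_eol can be illustrated with a small standalone function. This is a sketch of the transformation as stated above, not pugixml's implementation; normalize_eol is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Illustrative re-implementation of the EOL handling described for parse_eol:
// a 0x0d 0x0a pair becomes a single 0x0a, and a standalone 0x0d becomes 0x0a.
std::string normalize_eol(const std::string& text)
{
    std::string result;
    result.reserve(text.size());

    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r')
        {
            result += '\n';

            // Skip the line feed of a CR LF pair so it is not emitted twice.
            if (i + 1 < text.size() && text[i + 1] == '\n')
                ++i;
        }
        else
            result += text[i];
    }

    // The transformation only ever removes characters.
    assert(result.size() <= text.size());

    return result;
}
```

After this step the application sees only 0x0a line breaks, regardless of the platform the document was written on.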

	Note
	The `parse_wconv_attribute` option performs transformations that are required by the W3C specification for attributes that are declared as `CDATA`; `parse_wnorm_attribute` performs transformations required for `NMTOKENS` attributes. In the absence of a document type declaration all attributes behave as if they are declared as `CDATA`, thus `parse_wconv_attribute` is the default option.
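
The difference between the two normalization modes can be seen in a standalone sketch. The functions below follow the descriptions of parse_wconv_attribute and parse_wnorm_attribute given above; they are hypothetical illustrations, not pugixml's code.

```cpp
#include <cassert>
#include <string>

// parse_wconv_attribute-style conversion: tab, CR and LF each become a space,
// except that a CR LF pair becomes a single space.
std::string wconv_attribute(const std::string& value)
{
    std::string result;

    for (std::string::size_type i = 0; i < value.size(); ++i)
    {
        char c = value[i];

        if (c == '\r')
        {
            result += ' ';

            // Treat CR LF as one line break.
            if (i + 1 < value.size() && value[i + 1] == '\n')
                ++i;
        }
        else if (c == '\n' || c == '\t')
            result += ' ';
        else
            result += c;
    }

    return result;
}

// parse_wnorm_attribute-style normalization: apply the conversion above, then
// drop leading/trailing spaces and collapse runs of spaces into one space.
std::string wnorm_attribute(const std::string& value)
{
    const std::string converted = wconv_attribute(value);
    std::string result;

    for (std::string::size_type i = 0; i < converted.size(); ++i)
    {
        if (converted[i] != ' ')
            result += converted[i];
        else if (!result.empty() && result[result.size() - 1] != ' ')
            result += ' '; // first space of a run that is not a leading run
    }

    // The loop can leave a single trailing space behind; remove it.
    if (!result.empty() && result[result.size() - 1] == ' ')
        result.erase(result.size() - 1);

    assert(result.size() <= value.size());

    return result;
}
```

For the value "  a \t b\r\n", the first function keeps every whitespace position as a space, while the second reduces the value to "a b".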

Additionally there are two predefined option masks:

- parse_minimal has all options turned off. This option mask means that pugixml does not add declaration nodes, PI nodes, CDATA sections and comments to the resulting tree and does not perform any conversion for input data, so theoretically it is the fastest mode. However, as discussed above, in practice parse_default is usually equally fast.
- parse_default is the default set of flags, i.e. it has all options set to their default values. It includes parsing CDATA sections (comments/PIs are not parsed), performing character and entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note that PCDATA sections consisting only of whitespace characters are not parsed (by default) for performance reasons.

This is an example of using different parsing options (samples/load_options.cpp):

const char* source = "<!--comment--><node>&lt;</node>";

// Parsing with default options; note that comment node is not added to the tree, and entity reference &lt; is expanded
doc.load(source);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with additional parse_comments option; comment node is now added to the tree
doc.load(source, pugi::parse_default | pugi::parse_comments);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with additional parse_comments option and without the (default) parse_escapes option; &lt; is not expanded
doc.load(source, (pugi::parse_default | pugi::parse_comments) & ~pugi::parse_escapes);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with minimal option mask; comment node is not added to the tree, and &lt; is not expanded
doc.load(source, pugi::parse_minimal);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

Encodings

pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions. Most loading functions accept the optional parameter encoding. This is a value of the enumeration type xml_encoding, which can have the following values:

- encoding_auto means that pugixml will try to guess the encoding based on source XML data. The algorithm is a modified version of the one presented in Appendix F.1 of the XML recommendation; it tries to match the first few bytes of input data with the following patterns in strict order:
  - If the first four bytes match the UTF-32 BOM (Byte Order Mark), the encoding is assumed to be UTF-32 with the endianness equal to that of the BOM;
  - If the first two bytes match the UTF-16 BOM, the encoding is assumed to be UTF-16 with the endianness equal to that of the BOM;
  - If the first three bytes match the UTF-8 BOM, the encoding is assumed to be UTF-8;
  - If the first four bytes match the UTF-32 representation of <, the encoding is assumed to be UTF-32 with the corresponding endianness;
  - If the first four bytes match the UTF-16 representation of <?, the encoding is assumed to be UTF-16 with the corresponding endianness;
  - If the first two bytes match the UTF-16 representation of <, the encoding is assumed to be UTF-16 with the corresponding endianness (this guess may yield an incorrect result, but it's better than UTF-8);
  - Otherwise the encoding is assumed to be UTF-8.
- encoding_utf8 corresponds to UTF-8 encoding as defined in the Unicode standard; UTF-8 sequences with length equal to 5 or 6 are not standard and are rejected.
- encoding_utf16_le corresponds to little-endian UTF-16 encoding as defined in the Unicode standard; surrogate pairs are supported.
- encoding_utf16_be corresponds to big-endian UTF-16 encoding as defined in the Unicode standard; surrogate pairs are supported.
- encoding_utf16 corresponds to UTF-16 encoding as defined in the Unicode standard; the endianness is assumed to be that of the target platform.
- encoding_utf32_le corresponds to little-endian UTF-32 encoding as defined in the Unicode standard.
- encoding_utf32_be corresponds to big-endian UTF-32 encoding as defined in the Unicode standard.
- encoding_utf32 corresponds to UTF-32 encoding as defined in the Unicode standard; the endianness is assumed to be that of the target platform.
- encoding_wchar corresponds to the encoding of the wchar_t type; it has the same meaning as either encoding_utf16 or encoding_utf32, depending on wchar_t size.

The algorithm used for encoding_auto correctly detects any supported Unicode encoding for all well-formed XML documents (since they start with document declaration) and for all other XML documents that start with <; if your XML document does not start with < and has an encoding that is different from UTF-8, specify the encoding explicitly.
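
The detection order listed for encoding_auto can be sketched as a standalone function. This is an illustration of the documented pattern order, not pugixml's implementation; detected_encoding and guess_encoding are hypothetical names, and falling back to UTF-8 for inputs shorter than four bytes is a simplifying assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for this sketch; pugixml's own type is xml_encoding.
enum detected_encoding { enc_utf8, enc_utf16_le, enc_utf16_be, enc_utf32_le, enc_utf32_be };

detected_encoding guess_encoding(const unsigned char* data, std::size_t size)
{
    assert(data != 0 || size == 0);

    // Assumption of this sketch: too little data means UTF-8.
    if (size < 4) return enc_utf8;

    unsigned char d0 = data[0], d1 = data[1], d2 = data[2], d3 = data[3];

    // 1. UTF-32 BOM (checked first: the UTF-32 LE BOM starts with the UTF-16 LE BOM)
    if (d0 == 0x00 && d1 == 0x00 && d2 == 0xfe && d3 == 0xff) return enc_utf32_be;
    if (d0 == 0xff && d1 == 0xfe && d2 == 0x00 && d3 == 0x00) return enc_utf32_le;

    // 2. UTF-16 BOM
    if (d0 == 0xfe && d1 == 0xff) return enc_utf16_be;
    if (d0 == 0xff && d1 == 0xfe) return enc_utf16_le;

    // 3. UTF-8 BOM
    if (d0 == 0xef && d1 == 0xbb && d2 == 0xbf) return enc_utf8;

    // 4. '<' encoded as UTF-32
    if (d0 == 0x00 && d1 == 0x00 && d2 == 0x00 && d3 == 0x3c) return enc_utf32_be;
    if (d0 == 0x3c && d1 == 0x00 && d2 == 0x00 && d3 == 0x00) return enc_utf32_le;

    // 5. "<?" encoded as UTF-16
    if (d0 == 0x00 && d1 == 0x3c && d2 == 0x00 && d3 == 0x3f) return enc_utf16_be;
    if (d0 == 0x3c && d1 == 0x00 && d2 == 0x3f && d3 == 0x00) return enc_utf16_le;

    // 6. '<' encoded as UTF-16 (a guess that may be wrong)
    if (d0 == 0x00 && d1 == 0x3c) return enc_utf16_be;
    if (d0 == 0x3c && d1 == 0x00) return enc_utf16_le;

    // 7. Everything else is treated as UTF-8.
    return enc_utf8;
}
```

The order of the checks matters: testing the UTF-16 BOM before the UTF-32 BOM would misclassify little-endian UTF-32 data, because both byte sequences begin with 0xff 0xfe.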

	Note
	The current behavior for Unicode conversion is to skip all invalid UTF sequences during conversion. This behavior should not be relied upon; moreover, in case no encoding conversion is performed, the invalid sequences are not removed, so you'll get them as is in node/attribute contents.

Conformance to W3C specification

pugixml is not fully W3C conformant: it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation is not performed for performance reasons.

There is only one non-conformant behavior when dealing with valid XML documents: pugixml does not use information supplied in the document type declaration for parsing. This means that entities declared in DOCTYPE are not expanded, and all attribute/PCDATA values are always processed in a uniform way that depends only on parsing options.

As for rejecting invalid XML documents, there are a number of incompatibilities with the W3C specification, including:

- Multiple attributes of the same node can have equal names.
- All non-ASCII characters are treated in the same way as symbols of the English alphabet, so some invalid tag names are not rejected.
- Attribute values which contain < are not rejected.
- Invalid entity/character references are not rejected and are instead left as is.
- Comment values can contain --.
- XML data is not required to begin with a document declaration; additionally, a document declaration can appear after comments and other nodes.
- Invalid document type declarations are silently ignored in some cases.