From 8630ce4e99264ebfff51276ffa457c06cd549869 Mon Sep 17 00:00:00 2001 From: Arseny Kapoulkine Date: Sun, 18 Oct 2015 20:15:09 -0700 Subject: docs: Update to 1.7 --- docs/manual.adoc | 82 +++++++++++++++------- docs/manual.html | 192 +++++++++++++++++++++++++++++++++------------------ docs/quickstart.html | 10 +-- 3 files changed, 185 insertions(+), 99 deletions(-) (limited to 'docs') diff --git a/docs/manual.adoc b/docs/manual.adoc index bab2f80..03db435 100644 --- a/docs/manual.adoc +++ b/docs/manual.adoc @@ -38,7 +38,7 @@ Thanks to *Neville Franks* for contributions to pugxml parser. Thanks to *Artyom Palvelev* for suggesting a lazy gap contraction approach. -Thanks to *Vyacheslav Egorov* for documentation proofreading. +Thanks to *Vyacheslav Egorov* for documentation proofreading and fuzz testing. [[overview.license]] === License @@ -216,6 +216,8 @@ pugixml uses several defines to control the compilation process. There are two w [[PUGIXML_WCHAR_MODE]]`PUGIXML_WCHAR_MODE` define toggles between UTF-8 style interface (the in-memory text encoding is assumed to be UTF-8, most functions use `char` as character type) and UTF-16/32 style interface (the in-memory text encoding is assumed to be UTF-16/32, depending on `wchar_t` size, most functions use `wchar_t` as character type). See <> for more details. +[[PUGIXML_COMPACT]]`PUGIXML_COMPACT` define activates a different internal representation of document storage that is much more memory efficient for documents with a lot of markup (i.e. nodes and attributes), but is slightly slower to parse and access. For details see <>. + [[PUGIXML_NO_XPATH]]`PUGIXML_NO_XPATH` define disables XPath. Both XPath interfaces and XPath implementation are excluded from compilation. This option is provided in case you do not need XPath functionality and need to save code space. [[PUGIXML_NO_STL]]`PUGIXML_NO_STL` define disables use of STL in pugixml. The functions that operate on STL types are no longer present (i.e. 
load/save via iostream) if this macro is defined. This option is provided in case your target platform does not have a standard-compliant STL implementation. @@ -233,25 +235,15 @@ NOTE: In that example `PUGIXML_API` is inconsistent between several source files [[install.portability]] === Portability -pugixml is written in standard-compliant C{plus}{plus} with some compiler-specific workarounds where appropriate. pugixml is compatible with the C{plus}{plus}11 standard, but does not require C{plus}{plus}11 support. Each version is tested with a unit test suite (with code coverage about 99%) on the following platforms: - -* Microsoft Windows: -** Borland C{plus}{plus} Compiler 5.82 -** Digital Mars C{plus}{plus} Compiler 8.51 -** Intel C{plus}{plus} Compiler 8.0, 9.0 x86/x64, 10.0 x86/x64, 11.0 x86/x64 -** Metrowerks CodeWarrior 8.0 -** Microsoft Visual C{plus}{plus} 6.0, 7.0 (2002), 7.1 (2003), 8.0 (2005) x86/x64, 9.0 (2008) x86/x64, 10.0 (2010) x86/x64, 11.0 (2011) x86/x64/ARM, 12.0 (2013) x86/x64/ARM and some CLR versions -** MinGW (GCC) 3.4, 4.4, 4.5, 4.6 x64 - -* Linux (GCC 4.4.3 x86/x64, GCC 4.8.1 x64, Clang 3.2 x64) -* FreeBSD (GCC 4.2.1 x86/x64) -* Apple MacOSX (GCC 4.0.1 x86/x64/PowerPC, Clang 3.5 x64) -* Sun Solaris (sunCC x86/x64) -* Microsoft Xbox 360 -* Nintendo Wii (Metrowerks CodeWarrior 4.1) -* Sony Playstation Portable (GCC 3.4.2) -* Sony Playstation 3 (GCC 4.1.1, SNC 310.1) -* Various portable platforms (Android NDK, BlackBerry NDK, Samsung bada, Windows CE) +pugixml is written in standard-compliant C{plus}{plus} with some compiler-specific workarounds where appropriate. pugixml is compatible with the C{plus}{plus}11 standard, but does not require C{plus}{plus}11 support. Each version is tested with a unit test suite with code coverage exceeding 99%. 
+ +pugixml runs on a variety of desktop platforms (including Microsoft Windows, Linux, FreeBSD, Apple MacOSX and Sun Solaris), game consoles (including Microsoft Xbox 360, Microsoft Xbox One, Nintendo Wii, Sony Playstation Portable and Sony Playstation 3) and mobile platforms (including Android, BlackBerry, Samsung bada and Microsoft Windows CE). + +pugixml supports various architectures, such as x86/x86-64, PowerPC, ARM, MIPS and SPARC. In general it should run on any architecture since it does not use architecture-specific code and does not rely on features such as unaligned memory access. + +pugixml can be compiled using any C++ compiler; it was tested with all versions of Microsoft Visual C{plus}{plus} from 6.0 up to 2015, GCC from 3.4 up to 5.2, Clang from 3.2 up to 3.7, as well as a variety of other compilers (e.g. Borland C{plus}{plus}, Digital Mars C{plus}{plus}, Intel C{plus}{plus}, Metrowerks CodeWarrior and PathScale). The code is written to avoid compilation warnings even on reasonably high warning levels. + +Note that some platforms may have very bare-bones support of C++; in some cases you'll have to use `PUGIXML_NO_STL` and/or `PUGIXML_NO_EXCEPTIONS` to compile without issues. This mostly applies to old game consoles and embedded systems. [[dom]] == Document object model @@ -379,7 +371,7 @@ Both `xml_node` and `xml_attribute` have the default constructor which initializ `xml_node` and `xml_attribute` try to behave like pointers, that is, they can be compared with other objects of the same type, making it possible to use them as keys in associative containers. All handles to the same underlying object are equal, and any two handles to different underlying objects are not equal. Null handles only compare as equal to themselves. The result of relational comparison can not be reliably determined from the order of nodes in file or in any other way. Do not use relational comparison operators except for search optimization (i.e. associative container keys). 
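The pointer-like comparison rules described above can be modeled with a small standalone type. This is only a sketch, not pugixml's implementation: `toy_node`, `toy_handle` and `count_visits` are hypothetical names, and the code uses nothing but the standard library. It shows why handles to the same object compare equal, why null handles equal only each other, and why ordering by address is good enough for associative container keys while saying nothing about document order.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for the object a handle refers to.
struct toy_node
{
    std::string name;
};

// Hypothetical pointer-like handle: copying a handle never copies the node.
class toy_handle
{
public:
    toy_handle() : _ptr(nullptr) {}
    explicit toy_handle(toy_node* ptr) : _ptr(ptr) {}

    // Handles are equal exactly when they refer to the same underlying object.
    bool operator==(const toy_handle& rhs) const { return _ptr == rhs._ptr; }
    bool operator!=(const toy_handle& rhs) const { return _ptr != rhs._ptr; }

    // Ordering is by address only; std::less gives a total order for any pointers.
    bool operator<(const toy_handle& rhs) const
    {
        return std::less<toy_node*>()(_ptr, rhs._ptr);
    }

    bool empty() const { return _ptr == nullptr; }

private:
    toy_node* _ptr;
};

// Counts how often each handle occurs; handles to the same node share one map entry.
inline std::map<toy_handle, int> count_visits(const toy_handle* handles, int count)
{
    assert(count >= 0);

    std::map<toy_handle, int> visits;
    for (int i = 0; i < count; ++i)
        ++visits[handles[i]];
    return visits;
}
```

Two nodes with identical content still produce two distinct keys here, because identity is decided by the underlying object and not by what it contains.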
[[xml_attribute::hash_value]][[xml_node::hash_value]] -If you want to use `xml_node` or `xml_attribute` objects as keys in hash-based associative containers, you can use the `hash_value` member functions. They return the hash values that are guaranteed to be the same for all handles to the same underlying object. The hash value for null handles is 0. +If you want to use `xml_node` or `xml_attribute` objects as keys in hash-based associative containers, you can use the `hash_value` member functions. They return the hash values that are guaranteed to be the same for all handles to the same underlying object. The hash value for null handles is 0. Note that hash value does not depend on the content of the node, only on the location of the underlying structure in memory - this means that loading the same document twice will likely produce different hash values, and copying the node will not preserve the hash. [[xml_attribute::unspecified_bool_type]][[xml_node::unspecified_bool_type]][[xml_attribute::empty]][[xml_node::empty]] Finally handles can be implicitly cast to boolean-like objects, so that you can test if the node/attribute is empty with the following code: `if (node) { ... }` or `if (!node) { ... } else { ... }`. Alternatively you can check if a given `xml_node`/`xml_attribute` handle is null by calling the following methods: @@ -418,7 +410,7 @@ bool xml_node::set_name(const wchar_t* value); [[char_t]][[string_t]] There is a special type, `pugi::char_t`, that is defined as the character type and depends on the library configuration; it will be also used in the documentation hereafter. There is also a type `pugi::string_t`, which is defined as the STL string of the character type; it corresponds to `std::string` in char mode and to `std::wstring` in wchar_t mode. -In addition to the interface, the internal implementation changes to store XML data as `pugi::char_t`; this means that these two modes have different memory usage characteristics. 
The conversion to `pugi::char_t` upon document loading and from `pugi::char_t` upon document saving happen automatically, which also carries minor performance penalty. The general advice however is to select the character mode based on usage scenario, i.e. if UTF-8 is inconvenient to process and most of your XML data is non-ASCII, wchar_t mode is probably a better choice. +In addition to the interface, the internal implementation changes to store XML data as `pugi::char_t`; this means that these two modes have different memory usage characteristics - generally UTF-8 mode is more memory and performance efficient, especially if `sizeof(wchar_t)` is 4. The conversion to `pugi::char_t` upon document loading and from `pugi::char_t` upon document saving happen automatically, which also carries minor performance penalty. The general advice however is to select the character mode based on usage scenario, i.e. if UTF-8 is inconvenient to process and most of your XML data is non-ASCII, wchar_t mode is probably a better choice. [[as_utf8]][[as_wide]] There are cases when you'll have to convert string data between UTF-8 and wchar_t encodings; the following helper functions are provided for such purposes: @@ -497,7 +489,7 @@ allocation_function get_memory_allocation_function(); deallocation_function get_memory_deallocation_function(); ---- -Allocation function is called with the size (in bytes) as an argument and should return a pointer to a memory block with alignment that is suitable for storage of primitive types (usually a maximum of `void*` and `double` types alignment is sufficient) and size that is greater than or equal to the requested one. If the allocation fails, the function has to return null pointer (throwing an exception from allocation function results in undefined behavior). 
+Allocation function is called with the size (in bytes) as an argument and should return a pointer to a memory block with alignment that is suitable for storage of primitive types (usually a maximum of `void*` and `double` types alignment is sufficient) and size that is greater than or equal to the requested one. If the allocation fails, the function has to either return a null pointer or throw an exception. Deallocation function is called with the pointer that was returned by some call to allocation function; it is never called with a null pointer. If memory management functions are not thread-safe, library thread safety is not guaranteed. @@ -536,6 +528,17 @@ When the document is loaded from file/buffer, unless an inplace loading function All additional memory, such as memory for document structure (node/attribute objects) and memory for node/attribute names/values is allocated in pages on the order of 32 Kb; actual objects are allocated inside the pages using a memory management scheme optimized for fast allocation/deallocation of many small objects. Because of the scheme specifics, the pages are only destroyed if all objects inside them are destroyed; also, generally destroying an object does not mean that subsequent object creation will reuse the same memory. This means that it is possible to devise a usage scheme which will lead to higher memory usage than expected; one example is adding a lot of nodes, and then removing all even-numbered ones; not a single page is reclaimed in the process. However this is an example specifically crafted to produce unsatisfying behavior; in all practical usage scenarios the memory consumption is less than that of a general-purpose allocator because allocation meta-data is very small in size. +[[dom.memory.compact]] +==== Compact mode + +By default nodes and attributes are optimized for efficiency of access. 
This can cause them to take a significant amount of memory - for documents with a lot of nodes and not a lot of contents (short attribute values/node text), and depending on the pointer size, the document structure can take noticeably more memory than the document itself (e.g. on a 64-bit platform in UTF-8 mode a markup-heavy document with the file size of 2.1 Mb can use 2.1 Mb for document buffer and 8.3 Mb for document structure). + +If you are processing big documents or your platform is memory constrained and you're willing to sacrifice a bit of performance for memory, you can compile pugixml with `PUGIXML_COMPACT` define which will activate compact mode. Compact mode uses a different representation of the document structure that assumes locality of reference between nodes and attributes to optimize memory usage. As a result you get significantly smaller node/attribute objects; usually most objects in most documents don't require additional storage, but in the worst case - if assumptions about locality of reference don't hold - additional memory will be allocated to store the extra data required. + +The compact storage supports all existing operations - including tree modification - with the same amortized complexity (that is, all basic document manipulations are still O(1) on average). The operations are slightly slower; you can usually expect 10-50% slowdown in terms of processing time unless your processing was memory-bound. + +On 32-bit architectures document structure in compact mode is typically reduced by around 2.5x; on 64-bit architectures the ratio is around 5x. Thus for big markup-heavy documents compact mode can make the difference between the processing of a multi-gigabyte document running completely from RAM vs requiring swapping to disk. Even if the document fits into memory, compact storage can use CPU caches more efficiently by taking less space and causing fewer cache/TLB misses. 
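As a rough worked example of the ratios quoted above, the sketch below estimates how large the document structure might be once compact mode is enabled. It is plain arithmetic on the approximate figures from this section (around 2.5x on 32-bit, around 5x on 64-bit). `estimate_compact_structure_mb` and `estimate_total_mb` are hypothetical helpers, not pugixml functions, and the sketch assumes the document buffer itself is unaffected by compact mode; actual savings depend on the document.

```cpp
#include <cassert>

// Hypothetical helper: applies the approximate reduction ratios quoted in the text.
// pointer_bits must be 32 or 64; the ratios are rough averages, not guarantees.
inline double estimate_compact_structure_mb(double default_structure_mb, int pointer_bits)
{
    assert(pointer_bits == 32 || pointer_bits == 64);

    const double ratio = (pointer_bits == 64) ? 5.0 : 2.5;
    return default_structure_mb / ratio;
}

// Total footprint = document buffer + document structure.
// Assumption of this sketch: compact mode leaves the buffer size unchanged.
inline double estimate_total_mb(double buffer_mb, double structure_mb)
{
    return buffer_mb + structure_mb;
}
```

For the 64-bit example above (2.1 Mb buffer, 8.3 Mb structure) this gives about 8.3 / 5 ≈ 1.7 Mb of structure, so roughly 3.8 Mb in total instead of 10.4 Mb.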
+ [[loading]] == Loading document @@ -1668,7 +1671,9 @@ NOTE: You should use the usual bitwise arithmetics to manipulate the bitmask: to These flags control the resulting tree contents: -* [[format_indent]]`format_indent` determines if all nodes should be indented with the indentation string (this is an additional parameter for all saving functions, and is `"\t"` by default). If this flag is on, before every node the indentation string is output several times, where the amount of indentation depends on the node's depth relative to the output subtree. This flag has no effect if <> is enabled. This flag is *on* by default. +* [[format_indent]]`format_indent` determines if all nodes should be indented with the indentation string (this is an additional parameter for all saving functions, and is `"\t"` by default). If this flag is on, the indentation string is printed several times before every node, where the amount of indentation depends on the node's depth relative to the output subtree. This flag has no effect if <> is enabled. This flag is *on* by default. + +* [[format_indent_attributes]]`format_indent_attributes` determines if all attributes should be printed on a new line, indented with the indentation string according to the attribute's depth. This flag implies <>. This flag has no effect if <> is enabled. This flag is *off* by default. * [[format_raw]]`format_raw` switches between formatted and raw output. If this flag is on, the nodes are not indented in any way, and also no newlines that are not part of document text are printed. Raw mode can be used for serialization where the result is not intended to be read by humans; also it can be useful if the document was parsed with <> flag, to preserve the original document formatting as much as possible. This flag is *off* by default. 
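The interaction between the three flags described above (`format_indent`, `format_indent_attributes`, `format_raw`) can be summarized as a small decision function. The sketch below is a standalone model, not library code: the `toy_format_*` constants use stand-in values chosen for this example, and `toy_layout` and `resolve_layout` are hypothetical names. Real code should use the `format_*` constants from the library rather than these numbers.

```cpp
#include <cassert>

// Stand-in flag values for this sketch only (not the library's actual values).
const unsigned int toy_format_indent            = 0x01;
const unsigned int toy_format_raw               = 0x02;
const unsigned int toy_format_indent_attributes = 0x04;
const unsigned int toy_known_flags =
    toy_format_indent | toy_format_raw | toy_format_indent_attributes;

// What the writer would end up doing for a given bitmask.
struct toy_layout
{
    bool indent_nodes;
    bool attributes_on_new_lines;
};

inline toy_layout resolve_layout(unsigned int flags)
{
    assert((flags & ~toy_known_flags) == 0);

    toy_layout result = { false, false };

    // Raw output wins: neither indentation flag has any effect.
    if (flags & toy_format_raw) return result;

    // Indenting attributes implies indenting nodes as well.
    result.attributes_on_new_lines = (flags & toy_format_indent_attributes) != 0;
    result.indent_nodes = (flags & toy_format_indent) != 0 || result.attributes_on_new_lines;

    return result;
}
```

Flags are combined with bitwise OR and cleared with bitwise AND-NOT, as the note on bitmask manipulation says; for instance `toy_format_indent | toy_format_indent_attributes` requests both behaviors in this model.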
@@ -1865,7 +1870,7 @@ When you call `select_nodes` with an expression string as an argument, a query o * You can use query objects to evaluate XPath expressions which result in booleans, numbers or strings; * You can get the type of expression value via query object. -Query objects correspond to `xpath_query` type. They are immutable and non-copyable: they are bound to the expression at creation time and can not be cloned. If you want to put query objects in a container, allocate them on heap via `new` operator and store pointers to `xpath_query` in the container. +Query objects correspond to `xpath_query` type. They are immutable and non-copyable: they are bound to the expression at creation time and can not be cloned. If you want to put query objects in a container, either allocate them on heap via `new` operator and store pointers to `xpath_query` in the container, or use a C++11 compiler (query objects are movable in C++11). [[xpath_query::ctor]] You can create a query object with the constructor that takes XPath expression as an argument: @@ -2097,6 +2102,31 @@ Because of the differences in document object models, performance considerations :!numbered: +[[v1.7]] +=== v1.7 ^19.10.2015^ + +Major release, featuring performance and memory improvements along with some new features. Changes: + +* Compact mode: + . Introduced a new tree storage mode that takes significantly less memory (2-5x smaller DOM) at some performance cost. + . The mode can be enabled using `PUGIXML_COMPACT` define. + +* New integer parsing/formatting implementation: + . Functions that convert from and to integers (e.g. `as_int`/`set_value`) do not rely on CRT any more. + . New implementation is 3-5x faster and is always correct wrt overflow or underflow. This is a behavior change - where previously `as_uint()` would return UINT_MAX on a value "-1", it now returns 0. + +* New features: + . 
XPath objects (`xpath_query`, `xpath_node_set`, `xpath_variable_set`) are now movable if your compiler supports C++11. Additionally, `xpath_variable_set` is copyable. + . Added `format_indent_attributes` that makes the resulting XML friendlier to line diff/merge tools. + . Added a variant of `xml_node::attribute` function with a hint that can improve lookup performance. + . Custom allocation functions are now allowed (but not required) to throw instead of returning a null pointer. + +* Bug fixes: + . Fix Clang 3.7 crashes in out-of-memory cases (C++ DR 1748) + . Fix XPath crashes on SPARC64 (and other 32-bit architectures where doubles have to be aligned to 8 bytes) + . Fix xpath_node_set assignment to provide strong exception guarantee + . Fix saving for custom xml_writer implementations that can throw from write() + [[v1.6]] === v1.6 ^10.04.2015^ @@ -2459,6 +2489,7 @@ This is the reference for all macros, types, enumerations, classes and functions [source,subs="+macros"] ---- #define +++PUGIXML_WCHAR_MODE+++ +#define +++PUGIXML_COMPACT+++ #define +++PUGIXML_NO_XPATH+++ #define +++PUGIXML_NO_STL+++ #define +++PUGIXML_NO_EXCEPTIONS+++ @@ -2546,6 +2577,7 @@ enum +++xpath_value_type+++ // Formatting options bit flags: const unsigned int +++format_default+++ const unsigned int +++format_indent+++ +const unsigned int +++format_indent_attributes+++ const unsigned int +++format_no_declaration+++ const unsigned int +++format_no_escapes+++ const unsigned int +++format_raw+++ diff --git a/docs/manual.html b/docs/manual.html index 8b23adc..380215d 100644 --- a/docs/manual.html +++ b/docs/manual.html @@ -6,7 +6,7 @@ -pugixml 1.6 manual +pugixml 1.7 manual
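The integer parsing change listed in the v1.7 changelog above (a value of "-1" read as unsigned now yields 0 instead of UINT_MAX) amounts to clamping rather than wrapping. The sketch below illustrates that idea with a standalone decimal parser. It is not pugixml's implementation: `toy_parse_uint` is a hypothetical name, and clamping too-large values to UINT_MAX is an assumption of this sketch rather than something the changelog states.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper: decimal string -> unsigned int, clamping out-of-range
// input to the nearest representable value instead of wrapping around.
inline unsigned int toy_parse_uint(const char* s)
{
    assert(s);

    // Skip leading whitespace.
    while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r') ++s;

    // Optional sign.
    const bool negative = (*s == '-');
    if (*s == '-' || *s == '+') ++s;

    unsigned int result = 0;
    bool overflow = false;

    for (; *s >= '0' && *s <= '9'; ++s)
    {
        const unsigned int digit = static_cast<unsigned int>(*s - '0');

        // result * 10 + digit fits exactly when result <= (UINT_MAX - digit) / 10.
        if (result > (UINT_MAX - digit) / 10)
            overflow = true;
        else
            result = result * 10 + digit;
    }

    // Negative input cannot be represented as unsigned: clamp to 0.
    if (negative) return 0;

    // Assumption of this sketch: too-large input clamps to UINT_MAX.
    return overflow ? UINT_MAX : result;
}
```

With this model, `toy_parse_uint("-1")` returns 0, matching the new behavior described in the changelog, whereas a conversion that wraps around would produce UINT_MAX.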