From 900a1cc94353b9202dcaee66b95d67e31331940e Mon Sep 17 00:00:00 2001
From: Arseny Kapoulkine
Date: Tue, 29 Aug 2017 20:46:30 -0700
Subject: docs: Clarify Unicode validation behavior

It has always been the case that pugixml does not perform Unicode
validation or name/tag Unicode character class validation, but it
wasn't very obvious from documentation.

Fixes #162
---
 docs/manual.adoc | 3 ++-
 docs/manual.html | 7 +++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

(limited to 'docs')

diff --git a/docs/manual.adoc b/docs/manual.adoc
index 7f4fc8b..b901a54 100644
--- a/docs/manual.adoc
+++ b/docs/manual.adoc
@@ -811,12 +811,13 @@ There is only one non-conformant behavior when dealing with valid XML documents:
 As for rejecting invalid XML documents, there are a number of incompatibilities with W3C specification, including:
 
 * Multiple attributes of the same node can have equal names.
-* All non-ASCII characters are treated in the same way as symbols of English alphabet, so some invalid tag names are not rejected.
+* Tag and attribute names are not fully validated for consisting of allowed characters, so some invalid tags are not rejected
 * Attribute values which contain `<` are not rejected.
 * Invalid entity/character references are not rejected and are instead left as is.
 * Comment values can contain `--`.
 * XML data is not required to begin with document declaration; additionally, document declaration can appear after comments and other nodes.
 * Invalid document type declarations are silently ignored in some cases.
+* Unicode validation is not performed so invalid UTF sequences are not rejected.
 
 [[access]]
 == Accessing document data
diff --git a/docs/manual.html b/docs/manual.html
index 627f570..1bed481 100644
--- a/docs/manual.html
+++ b/docs/manual.html
@@ -1941,7 +1941,7 @@ The current behavior for Unicode conversion is to skip all invalid UTF sequences
   • Multiple attributes of the same node can have equal names.
-  • All non-ASCII characters are treated in the same way as symbols of English alphabet, so some invalid tag names are not rejected.
+  • Tag and attribute names are not fully validated for consisting of allowed characters, so some invalid tags are not rejected
   • Attribute values which contain < are not rejected.
@@ -1958,6 +1958,9 @@ The current behavior for Unicode conversion is to skip all invalid UTF sequences
   • Invalid document type declarations are silently ignored in some cases.
+  • Unicode validation is not performed so invalid UTF sequences are not rejected.
@@ -5672,7 +5675,7 @@ If exceptions are disabled, then in the event of parsing failure the query is in
-- 
cgit v1.2.3
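
Reviewer note (not part of the patch): the added bullet says invalid UTF sequences are not rejected by the parser, so a caller that needs strict input would have to check it before parsing. The following is only a minimal sketch of such a pre-check, assuming the input is UTF-8. `is_valid_utf8` is a hypothetical helper written for this note; it is not a pugixml function and does not reflect how pugixml handles encodings internally.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (NOT part of pugixml): returns true if `s` is
// well-formed UTF-8. Rejects stray continuation bytes, truncated
// sequences, overlong encodings, surrogates (U+D800..U+DFFF) and
// code points above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n)
    {
        const unsigned char c = p[i];
        std::size_t len = 0;      // total bytes in this sequence
        unsigned int cp = 0;      // decoded code point
        unsigned int min_cp = 0;  // smallest code point allowed for this length

        if (c < 0x80) { ++i; continue; }  // ASCII fast path
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min_cp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min_cp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence truncated at end of input

        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode range

        i += len;
    }

    return true;
}
```

A caller could run this over the raw buffer and refuse to parse when it returns false. It covers only the Unicode validation bullet; the name character class checks mentioned in the other changed bullet are a separate matter.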