# saxes

A sax-style non-validating parser for XML.

Saxes is a fork of [sax](https://github.com/isaacs/sax-js) 1.2.4. All mentions
of sax in this project's documentation are references to sax 1.2.4.

Designed with [node](http://nodejs.org/) in mind, but should work fine in the
browser or other CommonJS implementations.

Saxes does not support Node versions older than 10.

## Notable Differences from Sax.

* Saxes aims to be much stricter than sax with regard to XML
  well-formedness. Sax, even in its so-called "strict mode", is not strict. It
  silently accepts structures that are not well-formed XML. Projects that need
  better compliance with well-formedness constraints cannot use sax as-is.

  Consequently, saxes does not support HTML, or pseudo-XML, or bad XML. Saxes
  will report well-formedness errors in all these cases but it won't try to
  extract data from malformed documents like sax does.

* Saxes is much faster than sax, mostly because of a substantial redesign
  of the internal parsing logic. The speed improvement is not merely due to
  removing features that were supported by sax. That helped a bit, but saxes
  adds some expensive checks in its aim for conformance with the XML
  specification. Redesigning the parsing logic is what accounts for most of the
  performance improvement.

* Saxes does not aim to support antiquated platforms. We will not pollute the
  source or the default build with support for antiquated platforms. If you want
  support for IE 11, you are welcome to produce a PR that adds a *new build*
  transpiled to ES5.

* Saxes handles errors differently from sax: it provides a default onerror
  handler which throws. You can replace it with your own handler if you want. If
  your handler does nothing, there is no `resume` method to call.

* There's no `Stream` API. A revamped API may be introduced later.
  (It is still a "streaming parser" in the general sense that you write a
  character stream to it.)

* Saxes does not have facilities for limiting the size of the data chunks
  passed to event handlers. See the FAQ entry for more details.

## Conformance

Saxes supports:

* [XML 1.0 fifth edition](https://www.w3.org/TR/2008/REC-xml-20081126/)
* [XML 1.1 second edition](https://www.w3.org/TR/2006/REC-xml11-20060816/)
* [Namespaces in XML 1.0 (Third Edition)](https://www.w3.org/TR/2009/REC-xml-names-20091208/).
* [Namespaces in XML 1.1 (Second Edition)](https://www.w3.org/TR/2006/REC-xml-names11-20060816/).

## Limitations

This is a non-validating parser so it only verifies whether the document is
well-formed. We do aim to raise errors for all malformed constructs
encountered. However, this parser does not thoroughly parse the contents of
DTDs. So most malformedness errors caused by errors **in DTDs** cannot be
reported.

## Regarding `<!DOCTYPE` and `<!ENTITY`

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything with
that. If you want to listen to the `doctype` event, and then fetch the
doctypes, and read the entities and add them to `parser.ENTITIES`, then be my
guest.

## Documentation

The source code contains JSDOC comments. Use them. What follows is a brief
summary of what is available. The final authority is the source code.

**PAY CLOSE ATTENTION TO WHAT IS PUBLIC AND WHAT IS PRIVATE.**

The move to TypeScript makes it so that everything is now formally private,
protected, or public.

If you use anything not public, that's at your own peril.

If there's a mistake in the documentation, raise an issue. If you just assume,
you may assume incorrectly.

## Summary Usage Information

### Example

```javascript
var saxes = require("./lib/saxes"),
  parser = new saxes.SaxesParser();

parser.on("error", function (e) {
  // an error happened.
});
parser.on("text", function (t) {
  // got some text.  t is the string of text.
});
parser.on("opentag", function (node) {
  // opened a tag.  node has "name" and "attributes"
});
parser.on("end", function () {
  // parser stream is done, and ready to have more stuff written to it.
});

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();
```

### Constructor Arguments

Settings supported:

* `xmlns` - Boolean. If `true`, then namespaces are supported. Default
  is `false`.

* `position` - Boolean. If `false`, then don't track line/col/position. Unset is
  treated as `true`. Default is unset. Currently, setting this to `false` only
  results in a cosmetic change: the errors reported do not contain position
  information. sax-js would literally turn off the position-computing logic if
  this flag was set to false. The notion was that it would optimize
  execution. In saxes at least it turns out that continually testing this flag
  causes a cost that offsets the benefits of turning off this logic.

* `fileName` - String. Set a file name for error reporting. This is useful only
  when tracking positions. You may leave it unset.

* `fragment` - Boolean. If `true`, parse the XML as an XML fragment. Default is
  `false`.

* `additionalNamespaces` - A plain object whose key, value pairs define
  namespaces known before parsing the XML file. It is not legal to pass
  bindings for the namespaces `"xml"` or `"xmlns"`.

* `defaultXMLVersion` - The default version of the XML specification to use if
  the document contains no XML declaration. If the document does contain an XML
  declaration, then this setting is ignored.
  Must be `"1.0"` or `"1.1"`. The default is `"1.0"`.

* `forceXMLVersion` - Boolean. A flag indicating whether to force the XML
  version used for parsing to the value of ``defaultXMLVersion``. When this flag
  is ``true``, ``defaultXMLVersion`` must be specified. If unspecified, the
  default value of this flag is ``false``.

  Example: suppose you are parsing a document that has an XML declaration
  specifying XML version 1.1.

  If you set ``defaultXMLVersion`` to ``"1.0"`` without setting
  ``forceXMLVersion`` then the XML declaration will override the value of
  ``defaultXMLVersion`` and the document will be parsed according to XML 1.1.

  If you set ``defaultXMLVersion`` to ``"1.0"`` and set ``forceXMLVersion`` to
  ``true``, then the XML declaration will be ignored and the document will be
  parsed according to XML 1.0.

### Methods

`write` - Write bytes onto the stream. You don't have to pass the whole document
in one `write` call. You can read your source chunk by chunk and call `write`
with each chunk.

`close` - Close the stream. Once closed, no more data may be written until it is
done processing the buffer, which is signaled by the `end` event.

### Properties

The parser has the following properties:

`line`, `column`, `columnIndex`, `position` - Indications of the position in the
XML document where the parser currently is looking. The `columnIndex` property
counts columns as if indexing into a JavaScript string, whereas the `column`
property counts Unicode characters.

`closed` - Boolean indicating whether or not the parser can be written to. If
it's `true`, then wait for the `ready` event to write again.

`opt` - Any options passed into the constructor.

`xmlDecl` - The XML declaration for this document. It contains the fields
`version`, `encoding` and `standalone`.
They are all `undefined` before
encountering the XML declaration. If they are undefined after the XML
declaration, the corresponding value was not set by the declaration. There is no
event associated with the XML declaration. In a well-formed document, the XML
declaration may be preceded only by an optional BOM. So by the time any event
generated by the parser happens, the declaration has been processed if present
at all. Otherwise, you have a malformed document, and as stated above, you
cannot rely on the parser data!

### Error Handling

The parser continues to parse even upon encountering errors, and does its best
to continue reporting errors. You should heed all errors reported. After an
error, however, saxes may interpret your document incorrectly. For instance
``<foo a=bc="d"/>`` is invalid XML. Did you mean to have ``<foo a="bc=d"/>`` or
``<foo a="b" c="d"/>`` or some other variation? For the sake of continuing to
provide errors, saxes will continue parsing the document, but the structure it
reports may be incorrect. It is only after the errors are fixed in the document
that saxes can provide a reliable interpretation of the document.

That leaves you with two rules of thumb when using saxes:

* Pay attention to the errors that saxes reports. The default `onerror` handler
  throws, so by default, you cannot miss errors.

* **ONCE AN ERROR HAS BEEN ENCOUNTERED, STOP RELYING ON THE EVENT HANDLERS OTHER
  THAN `onerror`.** As explained above, when saxes runs into a well-formedness
  problem, it makes a guess in order to continue reporting more errors. The guess
  may be wrong.

### Events

To listen to an event, override `on<eventname>`. The list of supported events
is also in the exported `EVENTS` array.

See the JSDOC comments in the source code for a description of each supported
event.
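
The second rule of thumb under Error Handling can be enforced mechanically. The
sketch below is one possible approach and is not part of saxes:
`guardAfterError` is a hypothetical helper introduced here for illustration. It
wraps a set of handlers so that, once the error handler has run, every other
handler becomes a no-op.

```javascript
// Hypothetical helper, not part of saxes: wraps a map of event handlers so
// that handlers other than "error" stop being called after the first error.
function guardAfterError(handlers) {
  let failed = false;
  const guarded = {};
  for (const name of Object.keys(handlers)) {
    if (name === "error") {
      guarded[name] = function (err) {
        failed = true;
        return handlers.error.call(this, err);
      };
    } else {
      guarded[name] = function (...args) {
        if (failed) {
          return undefined; // Data seen after an error is only a guess.
        }
        return handlers[name].apply(this, args);
      };
    }
  }
  return guarded;
}

// Demonstration with plain function calls standing in for parser events.
const seen = [];
const guarded = guardAfterError({
  text: (t) => seen.push("text:" + t),
  error: (e) => seen.push("error:" + e.message),
});
guarded.text("before");
guarded.error(new Error("malformed"));
guarded.text("after"); // Ignored: an error has already been reported.
console.log(seen);
```

With a real parser, each guarded handler would be registered the same way as in
the example under Summary Usage Information, for instance
`parser.on("text", guarded.text)`.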

### Parsing XML Fragments

The XML specification does not define any method by which to parse XML
fragments. However, there are usage scenarios in which it is desirable to parse
fragments. In order to allow this, saxes provides three initialization options.

If you pass the option `fragment: true` to the parser constructor, the parser
will expect an XML fragment. It essentially starts with a parsing state
equivalent to the one it would be in if `parser.write("<foo>")` had been called
right after initialization. In other words, it expects content which is
acceptable inside an element. This also turns off well-formedness checks that
are inappropriate when parsing a fragment.

The option `additionalNamespaces` allows you to define additional prefix-to-URI
bindings known before parsing starts. You would use this over `resolvePrefix` if
you have at the ready a series of namespace bindings to use.

The option `resolvePrefix` allows you to pass a function which saxes will use if
it is unable to resolve a namespace prefix by itself. You would use this over
`additionalNamespaces` in a context where getting a complete list of defined
namespaces is onerous.

Note that you can use `additionalNamespaces` and `resolvePrefix` together if you
want. `additionalNamespaces` applies before `resolvePrefix`.

The options `additionalNamespaces` and `resolvePrefix` are really meant to be
used for parsing fragments. However, saxes won't prevent you from using them
with `fragment: false`. Note that if you do this, your document may parse
without errors and yet be malformed because the document can refer to namespaces
which are not defined *in* the document.

Of course, `additionalNamespaces` and `resolvePrefix` are used only if `xmlns`
is `true`. If you are parsing a fragment that does not use namespaces, there's
no point in setting these options.
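
As a minimal sketch, the three options described above might be combined as
follows. The namespace URIs, the `lookupPrefix` function, and the `lazilyKnown`
map are illustrative. The sketch also assumes that `resolvePrefix` receives the
prefix as a string and returns the URI, or `undefined` for an unknown prefix;
check the JSDOC comments for the exact contract.

```javascript
// Bindings known up front go in `additionalNamespaces`.
const knownNamespaces = {
  svg: "http://www.w3.org/2000/svg",
};

// Hypothetical fallback lookup, standing in for application state that is
// onerous to enumerate in full. Assumed contract: takes a prefix, returns
// the URI or undefined.
const lazilyKnown = new Map([
  ["xlink", "http://www.w3.org/1999/xlink"],
]);
function lookupPrefix(prefix) {
  return lazilyKnown.get(prefix);
}

const fragmentOptions = {
  xmlns: true, // Without this, the two namespace options are not used.
  fragment: true,
  additionalNamespaces: knownNamespaces,
  resolvePrefix: lookupPrefix,
};

console.log(fragmentOptions.resolvePrefix("xlink"));
console.log(fragmentOptions.resolvePrefix("unknown"));
```

The object would then be passed to the constructor, as in
`new saxes.SaxesParser(fragmentOptions)`.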

### Performance Tips

* saxes works faster on files that use newlines (``\u000A``) as end of line
  markers than files that use other end of line markers (like ``\r`` or
  ``\r\n``). The XML specification requires that conformant applications behave
  as if all characters that are to be treated as end of line characters are
  converted to ``\u000A`` prior to parsing. The optimal code path for saxes is a
  file in which all end of line characters are already ``\u000A``.

* Don't split Unicode strings you feed to saxes across surrogates. When you
  naively split a string in JavaScript, you run the risk of splitting a Unicode
  character into two surrogates. For example, in the following code ``a`` and
  ``b`` each contain half of a single Unicode character: ``const a =
  "\u{1F4A9}"[0]; const b = "\u{1F4A9}"[1]``. If you feed such split surrogates
  to versions of saxes prior to 4, you'd get errors. Saxes version 4 and over
  are able to detect when a chunk of data ends with a surrogate and carry over
  the surrogate to the next chunk. However, this operation entails slicing and
  concatenating strings. If you can feed your data in a way that does not split
  surrogates, you should do it. (Obviously, feeding all the data at once with a
  single write is fastest.)

* Don't set event handlers you don't need. Saxes has always aimed to avoid doing
  work that will just be tossed away but future improvements hope to do this
  more aggressively. One way saxes knows whether or not some data is needed is
  by checking whether a handler has been set for a specific event.

## FAQ

Q. Why has saxes dropped support for limiting the size of data chunks passed to
event handlers?

A. With sax you could set ``MAX_BUFFER_LENGTH`` to cause the parser to limit the
size of data chunks passed to event handlers.
So if you ran into a span of text
above the limit, multiple ``text`` events with smaller data chunks were fired
instead of a single event with a large chunk.

However, that functionality had some problematic characteristics. It had an
arbitrary default value. It was library-wide so all parsers created from a
single instance of the ``sax`` library shared it. This could potentially cause
conflicts among libraries running in the same VM but using sax for different
purposes.

These issues could have been easily fixed, but there were larger issues. The
buffer limit arbitrarily applied to some events but not others. It would split
``text``, ``cdata`` and ``script`` events. However, if a ``comment``,
``doctype``, ``attribute`` or ``processing instruction`` were more than the
limit, the parser would generate an error and you were left picking up the
pieces.

It was not intuitive to use. You'd think setting the limit to 1K would prevent
chunks bigger than 1K from being passed to event handlers. But that was not the
case. A comment in the source code told you that you might go over the limit if
you passed large chunks to ``write``. So if you want a 1K limit, don't pass 64K
chunks to ``write``. Fair enough. You know what limit you want so you can
control the size of the data you pass to ``write``. So you limit the chunks to
``write`` to 1K at a time. Even if you do this, your event handlers may get data
chunks that are 2K in size. Suppose on the previous ``write`` the parser has
just finished processing an open tag, so it is ready for text. Your ``write``
passes 1K of text. You are not above the limit yet, so no event is generated
yet. The next ``write`` passes another 1K of text. It so happens that sax checks
buffer limits only once per ``write``, after the chunk of data has been
processed. Now you've hit the limit and you get a ``text`` event with 2K of
data.
So even if you limit your ``write`` calls to the buffer limit you've set,
you may still get events with chunks at twice the buffer size limit you've
specified.

We may consider reinstating an equivalent functionality, provided that it
addresses the issues above and does not cause a huge performance drop for
use-case scenarios that don't need it.
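
In the meantime, an application that needs bounded chunks can split the data
itself, either before calling ``write`` or inside its own ``text`` handler. The
following is a minimal sketch and is not part of saxes: ``splitChunks`` is a
hypothetical helper. It cuts a string into pieces of at most ``limit`` UTF-16
code units and never cuts between the two halves of a surrogate pair, in line
with the advice under Performance Tips.

```javascript
// Hypothetical helper, not part of saxes.
function splitChunks(text, limit) {
  if (!Number.isInteger(limit) || limit < 2) {
    // A limit of 1 could never hold a surrogate pair in one piece.
    throw new RangeError("limit must be an integer of at least 2");
  }
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + limit, text.length);
    // If the cut would fall right after a high surrogate, back up by one
    // code unit so the pair stays together in the next chunk.
    const last = text.charCodeAt(end - 1);
    if (end < text.length && last >= 0xd800 && last <= 0xdbff) {
      end -= 1;
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}

const pieces = splitChunks("ab\u{1F4A9}cd", 3);
console.log(pieces);
```

Each piece could then be passed to ``write`` in turn, or handed to downstream
code from within a handler.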