# saxes

A sax-style non-validating parser for XML.

Saxes is a fork of [sax](https://github.com/isaacs/sax-js) 1.2.4. All mentions
of sax in this project's documentation are references to sax 1.2.4.

Designed with [node](http://nodejs.org/) in mind, but should work fine in the
browser or other CommonJS implementations.

Saxes does not support Node versions older than 10.

## Notable Differences from Sax

* Saxes aims to be much stricter than sax with regard to XML
  well-formedness. Sax, even in its so-called "strict mode", is not strict. It
  silently accepts structures that are not well-formed XML. Projects that need
  better compliance with well-formedness constraints cannot use sax as-is.

  Consequently, saxes does not support HTML, or pseudo-XML, or bad XML. Saxes
  will report well-formedness errors in all these cases but it won't try to
  extract data from malformed documents like sax does.

* Saxes is much faster than sax, mostly because of a substantial redesign of
  the internal parsing logic. The speed improvement is not merely due to
  removing features that were supported by sax. That helped a bit, but saxes
  adds some expensive checks in its aim for conformance with the XML
  specification. Redesigning the parsing logic is what accounts for most of the
  performance improvement.

* Saxes does not aim to support antiquated platforms. We will not pollute the
  source or the default build with support for antiquated platforms. If you want
  support for IE 11, you are welcome to produce a PR that adds a *new build*
  transpiled to ES5.

* Saxes handles errors differently from sax: it provides a default `onerror`
  handler which throws. You can replace it with your own handler if you want.
  Unlike sax, there is no `resume` method to call after an error, even if your
  handler does nothing.

* There's no `Stream` API. A revamped API may be introduced later. (It is still
  a "streaming parser" in the general sense that you write a character stream to
  it.)

* Saxes does not have facilities for limiting the size of the data chunks
  passed to event handlers. See the FAQ entry for more details.

## Conformance

Saxes supports:

* [XML 1.0 fifth edition](https://www.w3.org/TR/2008/REC-xml-20081126/)
* [XML 1.1 second edition](https://www.w3.org/TR/2006/REC-xml11-20060816/)
* [Namespaces in XML 1.0 (Third Edition)](https://www.w3.org/TR/2009/REC-xml-names-20091208/)
* [Namespaces in XML 1.1 (Second Edition)](https://www.w3.org/TR/2006/REC-xml-names11-20060816/)

## Limitations

This is a non-validating parser so it only verifies whether the document is
well-formed. We do aim to raise errors for all malformed constructs
encountered. However, this parser does not thoroughly parse the contents of
DTDs. So most malformedness errors caused by errors **in DTDs** cannot be
reported.

## Regarding `<!DOCTYPE` and `<!ENTITY`

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything with
that. If you want to listen to the `doctype` event, and then fetch the
doctypes, and read the entities and add them to `parser.ENTITIES`, then be my
guest.
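If you already know which extra entities your documents use, adding them might look like the sketch below. `registerEntities` is a hypothetical helper, not part of saxes. It assumes that `parser.ENTITIES` is a plain object keyed by entity name (without the `&` and `;`), which is what the paragraph above suggests; check the source before relying on it.

```javascript
// Hypothetical helper, not part of saxes: copies extra entity definitions
// into parser.ENTITIES.
// Assumption: parser.ENTITIES is a plain, writable object that maps entity
// names (without "&" and ";") to their replacement text.
function registerEntities(parser, entities) {
  for (const [name, replacement] of Object.entries(entities)) {
    parser.ENTITIES[name] = replacement;
  }
  return parser;
}
```

You would call it before the first `write`, for example `registerEntities(parser, { nbsp: "\u00A0" })`.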

## Documentation

The source code contains JSDOC comments. Use them. What follows is a brief
summary of what is available. The final authority is the source code.

**PAY CLOSE ATTENTION TO WHAT IS PUBLIC AND WHAT IS PRIVATE.**

The move to TypeScript makes it so that everything is now formally private,
protected, or public.

If you use anything not public, that's at your own peril.

If there's a mistake in the documentation, raise an issue. If you just assume,
you may assume incorrectly.

## Summary Usage Information

### Example

```javascript
// This path works from a checkout of this repository. When saxes is
// installed as a dependency, use require("saxes") instead.
const saxes = require("./lib/saxes");
const parser = new saxes.SaxesParser();

parser.on("error", function (e) {
  // an error happened.
});
parser.on("text", function (t) {
  // got some text.  t is the string of text.
});
parser.on("opentag", function (node) {
  // opened a tag.  node has "name" and "attributes"
});
parser.on("end", function () {
  // parser stream is done, and ready to have more stuff written to it.
});

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();
```

### Constructor Arguments

Settings supported:

* `xmlns` - Boolean. If `true`, then namespaces are supported. Default
  is `false`.

* `position` - Boolean. If `false`, then don't track line/col/position. Unset is
  treated as `true`. Default is unset. Currently, setting this to `false` only
  results in a cosmetic change: the errors reported do not contain position
  information. sax-js would literally turn off the position-computing logic if
  this flag was set to false. The notion was that it would optimize
  execution. In saxes, at least, continually testing this flag turns out to
  cost enough to offset the benefit of turning off this logic.

* `fileName` - String. Set a file name for error reporting. This is useful only
  when tracking positions. You may leave it unset.

* `fragment` - Boolean. If `true`, parse the XML as an XML fragment. Default is
  `false`.

* `additionalNamespaces` - A plain object whose key-value pairs define
  prefix-to-URI namespace bindings known before parsing the XML file. It is not
  legal to pass bindings for the prefixes `"xml"` or `"xmlns"`.

* `defaultXMLVersion` - The default version of the XML specification to use if
  the document contains no XML declaration. If the document does contain an XML
  declaration, then this setting is ignored. Must be `"1.0"` or `"1.1"`. The
  default is `"1.0"`.

* `forceXMLVersion` - Boolean. A flag indicating whether to force the XML
  version used for parsing to the value of ``defaultXMLVersion``. When this flag
  is ``true``, ``defaultXMLVersion`` must be specified. If unspecified, the
  default value of this flag is ``false``.

  Example: suppose you are parsing a document that has an XML declaration
  specifying XML version 1.1.

  If you set ``defaultXMLVersion`` to ``"1.0"`` without setting
  ``forceXMLVersion`` then the XML declaration will override the value of
  ``defaultXMLVersion`` and the document will be parsed according to XML 1.1.

  If you set ``defaultXMLVersion`` to ``"1.0"`` and set ``forceXMLVersion`` to
  ``true``, then the XML declaration will be ignored and the document will be
  parsed according to XML 1.0.
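The constraints in the list above can be restated as a small pre-flight check. This is only a sketch: `checkSaxesOptions` and `exampleOptions` are hypothetical names, the function is not part of saxes, and it covers only the three rules spelled out above.

```javascript
// Hypothetical pre-flight check, not part of saxes. It restates the rules
// listed above and returns a list of human-readable problems.
function checkSaxesOptions(opt) {
  const problems = [];
  const version = opt.defaultXMLVersion;
  if (version !== undefined && version !== "1.0" && version !== "1.1") {
    problems.push('defaultXMLVersion must be "1.0" or "1.1"');
  }
  if (opt.forceXMLVersion === true && version === undefined) {
    problems.push("forceXMLVersion requires defaultXMLVersion");
  }
  for (const prefix of Object.keys(opt.additionalNamespaces || {})) {
    if (prefix === "xml" || prefix === "xmlns") {
      problems.push(`additionalNamespaces must not bind "${prefix}"`);
    }
  }
  return problems;
}

// Options that enable namespaces and force XML 1.0 parsing even when the
// document declares version 1.1.
const exampleOptions = {
  xmlns: true,
  fileName: "example.xml", // hypothetical file name, used in error messages
  defaultXMLVersion: "1.0",
  forceXMLVersion: true,
};
```

An object like `exampleOptions` is what you would pass to the constructor, as in `new saxes.SaxesParser(exampleOptions)`.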

### Methods

`write` - Write a chunk of the document (a string) to the parser. You don't
have to pass the whole document in one `write` call. You can read your source
chunk by chunk and call `write` with each chunk.

`close` - Close the stream. Once closed, no more data may be written until it is
done processing the buffer, which is signaled by the `end` event.
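As a minimal sketch of chunked writing, the helper below feeds a string to a parser in fixed-size pieces and then closes it. `writeInChunks` is a hypothetical name; the function relies only on the `write` and `close` methods described above.

```javascript
// Hypothetical helper: feeds `text` to the parser in pieces of at most
// `chunkSize` UTF-16 code units (chunkSize must be at least 1), then closes
// the parser.
function writeInChunks(parser, text, chunkSize) {
  for (let start = 0; start < text.length; start += chunkSize) {
    parser.write(text.slice(start, start + chunkSize));
  }
  parser.close();
}
```

In real code the chunks would more likely come from a file or network read than from slicing a string that is already in memory.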

### Properties

The parser has the following properties:

`line`, `column`, `columnIndex`, `position` - Indications of the position in the
XML document where the parser currently is looking. The `columnIndex` property
counts columns as if indexing into a JavaScript string, whereas the `column`
property counts Unicode characters.

`closed` - Boolean indicating whether or not the parser can be written to. If
it's `true`, then wait for the `ready` event to write again.

`opt` - Any options passed into the constructor.

`xmlDecl` - The XML declaration for this document. It contains the fields
`version`, `encoding` and `standalone`. They are all `undefined` before
encountering the XML declaration. If they are undefined after the XML
declaration, the corresponding value was not set by the declaration. There is no
event associated with the XML declaration. In a well-formed document, the XML
declaration may be preceded only by an optional BOM. So by the time any event
generated by the parser happens, the declaration has been processed if present
at all. Otherwise, you have a malformed document, and as stated above, you
cannot rely on the parser data!
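As an illustration of reading these properties, the sketch below builds a `file:line:column` label from inside an event handler. `formatLocation` is a hypothetical helper, and the `"<input>"` fallback is an arbitrary choice for parsers created without `fileName`.

```javascript
// Hypothetical helper: describes where the parser currently is, using the
// `opt`, `line` and `column` properties listed above.
function formatLocation(parser) {
  // fileName is optional, so fall back to a placeholder when it is unset.
  const file = (parser.opt && parser.opt.fileName) || "<input>";
  return `${file}:${parser.line}:${parser.column}`;
}
```

Inside a handler you could log `formatLocation(parser)` next to the event data to see where in the document the event was generated.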

### Error Handling

The parser continues to parse even upon encountering errors, and does its best
to continue reporting errors. You should heed all errors reported. After an
error, however, saxes may interpret your document incorrectly. For instance
``<foo a=bc="d"/>`` is not well-formed XML. Did you mean to have
``<foo a="bc=d"/>`` or ``<foo a="b" c="d"/>`` or some other variation? For the
sake of continuing to provide errors, saxes will continue parsing the document,
but the structure it reports may be incorrect. It is only after the errors are
fixed in the document that saxes can provide a reliable interpretation of the
document.

That leaves you with two rules of thumb when using saxes:

* Pay attention to the errors that saxes reports. The default `onerror` handler
  throws, so by default, you cannot miss errors.

* **ONCE AN ERROR HAS BEEN ENCOUNTERED, STOP RELYING ON THE EVENT HANDLERS OTHER
  THAN `onerror`.** As explained above, when saxes runs into a well-formedness
  problem, it makes a guess in order to continue reporting more errors. The guess
  may be wrong.
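One way to follow both rules is to collect every error while ignoring all other events once the first error has been seen. The sketch below is an assumption-laden illustration, not a saxes feature: `guardHandlers` is a hypothetical helper that relies only on the `parser.on(name, handler)` call used in the example earlier in this document.

```javascript
// Hypothetical helper: registers an error collector plus the given handlers,
// wrapped so that they stop firing after the first error.
function guardHandlers(parser, handlers) {
  const errors = [];
  // Replacing the error handler means errors no longer throw; they are
  // collected here instead.
  parser.on("error", (err) => {
    errors.push(err);
  });
  for (const [name, handler] of Object.entries(handlers)) {
    parser.on(name, (...args) => {
      // After an error, the data passed to the other events may be a guess.
      if (errors.length === 0) {
        handler(...args);
      }
    });
  }
  return errors;
}
```

After `close`, an empty `errors` array means the data your handlers received can be trusted; otherwise, report the errors and discard what was collected.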

### Events

To listen to an event, register a handler with `parser.on(eventName, handler)`,
as in the example above. The list of supported events is also in the exported
`EVENTS` array.

See the JSDOC comments in the source code for a description of each supported
event.

### Parsing XML Fragments

The XML specification does not define any method by which to parse XML
fragments. However, there are usage scenarios in which it is desirable to parse
fragments. In order to allow this, saxes provides three initialization options.

If you pass the option `fragment: true` to the parser constructor, the parser
will expect an XML fragment. It essentially starts with a parsing state
equivalent to the one it would be in if `parser.write("<foo>")` had been called
right after initialization. In other words, it expects content which is
acceptable inside an element. This also turns off well-formedness checks that
are inappropriate when parsing a fragment.

The option `additionalNamespaces` allows you to define additional prefix-to-URI
bindings known before parsing starts. You would use this over `resolvePrefix` if
you have at the ready a series of namespace bindings to use.

The option `resolvePrefix` allows you to pass a function which saxes will use if
it is unable to resolve a namespace prefix by itself. You would use this over
`additionalNamespaces` in a context where getting a complete list of defined
namespaces is onerous.

Note that you can use `additionalNamespaces` and `resolvePrefix` together if you
want. `additionalNamespaces` applies before `resolvePrefix`.

The options `additionalNamespaces` and `resolvePrefix` are really meant to be
used for parsing fragments. However, saxes won't prevent you from using them
with `fragment: false`. Note that if you do this, your document may parse
without errors and yet be malformed because the document can refer to namespaces
which are not defined *in* the document.

Of course, `additionalNamespaces` and `resolvePrefix` are used only if `xmlns`
is `true`. If you are parsing a fragment that does not use namespaces, there's
no point in setting these options.
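Putting the three options together might look like the sketch below. `fragmentOptions` is a hypothetical helper. The `resolvePrefix` signature used here (it receives a prefix and returns a URI string, or `undefined` when the prefix is unknown) is an assumption; confirm it against the JSDOC comments in the source.

```javascript
// Hypothetical helper: builds constructor options for parsing a
// namespace-aware fragment.
//   knownBindings: plain object mapping prefixes to URIs, known up front.
//   lookup: a Map consulted lazily for prefixes not in knownBindings.
function fragmentOptions(knownBindings, lookup) {
  return {
    xmlns: true, // the two namespace options are ignored without this
    fragment: true,
    additionalNamespaces: knownBindings,
    // Assumed signature: (prefix) => URI string, or undefined if unknown.
    resolvePrefix: (prefix) => lookup.get(prefix),
  };
}
```

The returned object would be passed to the constructor, for example `new saxes.SaxesParser(fragmentOptions(bindings, lookup))`, before writing fragment content such as `<svg:g/>some text`.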

### Performance Tips

* saxes works faster on files that use newlines (``\u000A``) as end of line
  markers than on files that use other end of line markers (like ``\r`` or
  ``\r\n``). The XML specification requires that conformant applications behave
  as if all characters that are to be treated as end of line characters are
  converted to ``\u000A`` prior to parsing. The optimal code path for saxes is a
  file in which all end of line characters are already ``\u000A``.

* Don't split Unicode strings you feed to saxes across surrogates. When you
  naively split a string in JavaScript, you run the risk of splitting a Unicode
  character into two surrogates. For example, in the following code ``a`` and
  ``b`` each contain half of a single Unicode character:
  ``const a = "\u{1F4A9}"[0]; const b = "\u{1F4A9}"[1]``.
  If you feed such split surrogates to versions of saxes prior to 4, you get
  errors. Saxes versions 4 and later are able to detect when a chunk of data
  ends with a surrogate and carry over the surrogate to the next chunk. However,
  this operation entails slicing and concatenating strings. If you can feed your
  data in a way that does not split surrogates, you should do it. (Obviously,
  feeding all the data at once with a single write is fastest.)

* Don't set event handlers you don't need. Saxes has always aimed to avoid doing
  work that will just be tossed away, and future improvements may do this more
  aggressively. One way saxes knows whether or not some data is needed is by
  checking whether a handler has been set for a specific event.
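The first two tips can be applied before the data ever reaches the parser. The helpers below are a sketch with hypothetical names (`normalizeNewlines`, `chunksOnCodePoints`), not part of saxes. The newline helper covers only the XML 1.0 end-of-line characters, and whether pre-normalizing pays off for your input is something to measure.

```javascript
// Hypothetical helper: converts "\r\n" and lone "\r" to "\n" (XML 1.0 rules
// only). Apply it to the whole text, not to individual chunks, because a
// "\r\n" pair split across two chunks would become two newlines.
function normalizeNewlines(text) {
  return text.replace(/\r\n?/g, "\n");
}

// Hypothetical helper: yields pieces of about `chunkSize` UTF-16 code units
// (chunkSize must be at least 1) without separating a surrogate pair.
function* chunksOnCodePoints(text, chunkSize) {
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    const last = text.charCodeAt(end - 1);
    // If the chunk would end on a high surrogate, take the low surrogate too.
    if (end < text.length && last >= 0xd800 && last <= 0xdbff) {
      end += 1;
    }
    yield text.slice(start, end);
    start = end;
  }
}
```

Each yielded piece can be passed straight to `write`. Extending a chunk by one code unit, rather than shortening it, keeps the loop making progress even when `chunkSize` is 1.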

## FAQ

Q. Why has saxes dropped support for limiting the size of data chunks passed to
event handlers?

A. With sax you could set ``MAX_BUFFER_LENGTH`` to cause the parser to limit the
size of data chunks passed to event handlers. So if you ran into a span of text
above the limit, multiple ``text`` events with smaller data chunks were fired
instead of a single event with a large chunk.

However, that functionality had some problematic characteristics. It had an
arbitrary default value. It was library-wide so all parsers created from a
single instance of the ``sax`` library shared it. This could potentially cause
conflicts among libraries running in the same VM but using sax for different
purposes.

These issues could have been easily fixed, but there were larger issues. The
buffer limit arbitrarily applied to some events but not others. It would split
``text``, ``cdata`` and ``script`` events. However, if a ``comment``,
``doctype``, ``attribute`` or ``processing instruction`` was larger than the
limit, the parser would generate an error and you were left picking up the
pieces.

It was not intuitive to use. You'd think setting the limit to 1K would prevent
chunks bigger than 1K from being passed to event handlers. But that was not the
case. A comment in the source code told you that you might go over the limit if
you passed large chunks to ``write``. So if you want a 1K limit, don't pass 64K
chunks to ``write``. Fair enough. You know what limit you want so you can
control the size of the data you pass to ``write``. So you limit the chunks to
``write`` to 1K at a time.

Even if you do this, your event handlers may get data chunks that are 2K in
size. Suppose on the previous ``write`` the parser has just finished processing
an open tag, so it is ready for text. Your ``write`` passes 1K of text. You are
not above the limit yet, so no event is generated yet. The next ``write`` passes
another 1K of text. It so happens that sax checks buffer limits only once per
``write``, after the chunk of data has been processed. Now you've hit the limit
and you get a ``text`` event with 2K of data. So even if you limit your
``write`` calls to the buffer limit you've set, you may still get events with
chunks at twice the buffer size limit you've specified.

We may consider reinstating an equivalent functionality, provided that it
addresses the issues above and does not cause a huge performance drop for
use-case scenarios that don't need it.
323  use-case scenarios that don't need it.