# sax js

A sax-style parser for XML and HTML.

Designed with [node](http://nodejs.org/) in mind, but should work fine in
the browser or other CommonJS implementations.

## What This Is

* A very simple tool to parse through an XML string.
* A stepping stone to a streaming HTML parser.
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML
  docs.

## What This Is (probably) Not

* An HTML Parser - That's a fine goal, but this isn't it.  It's just
  XML.
* A DOM Builder - You can use it to build an object model out of XML,
  but it doesn't do that out of the box.
* XSLT - No DOM = no querying.
* 100% Compliant with (some other SAX implementation) - Most SAX
  implementations are in Java and do a lot more than this does.
* An XML Validator - It does a little validation when in strict mode, but
  not much.
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic
  masochism.
* A DTD-aware Thing - Fetching DTDs is a much bigger job.

## Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything
with that. If you want to listen to the `ondoctype` event, and then fetch
the doctypes, and read the entities and add them to `parser.ENTITIES`, then
be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass
through unmolested.

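As a rough sketch of the do-it-yourself approach described above, the handler below scans the doctype string for simple internal `<!ENTITY name "value">` declarations and copies them onto `parser.ENTITIES`. The helper name `registerInternalEntities` and the regular expression are assumptions made for this example and are not part of sax-js. Parameter entities, external entities, and fetching a remote DTD are all out of scope.

```javascript
// Hypothetical helper (not part of sax-js): copy simple internal entity
// declarations from the doctype string onto parser.ENTITIES.
function registerInternalEntities(parser) {
  // Matches <!ENTITY name "value"> or <!ENTITY name 'value'>.
  // Parameter entities (<!ENTITY % ...>) and SYSTEM/PUBLIC entities
  // do not match and are therefore skipped.
  var decl = /<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g;

  parser.ondoctype = function (doctype) {
    var match;
    while ((match = decl.exec(doctype)) !== null) {
      var value = match[2] !== undefined ? match[2] : match[3];
      parser.ENTITIES[match[1]] = value;
    }
  };
  return parser;
}
```

With the real library this would presumably be called on a freshly created parser, before the first `write`, so that the handler is in place when the `<!DOCTYPE` is reached.
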
## Usage

```javascript
var sax = require("./lib/sax"),
  strict = true, // set to false for html-mode
  parser = sax.parser(strict);

parser.onerror = function (e) {
  // an error happened.
};
parser.ontext = function (t) {
  // got some text.  t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag.  node has "name" and "attributes"
};
parser.onattribute = function (attr) {
  // an attribute.  attr has "name" and "value"
};
parser.onend = function () {
  // parser stream is done, and ready to have more stuff written to it.
};

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

// stream usage
// takes the same options as the parser
var fs = require("fs")
var options = {} // option bag, see "Arguments" below
var saxStream = require("sax").createStream(strict, options)
saxStream.on("error", function (e) {
  // unhandled errors will throw, since this is a proper node
  // event emitter.
  console.error("error!", e)
  // clear the error
  this._parser.error = null
  this._parser.resume()
})
saxStream.on("opentag", function (node) {
  // same object as above
})
// pipe is supported, and it's readable/writable
// same chunks coming in also go out.
fs.createReadStream("file.xml")
  .pipe(saxStream)
  .pipe(fs.createWriteStream("file-copy.xml"))
```


## Arguments

Pass the following arguments to the parser function.  All are optional.

`strict` - Boolean. Whether or not to be a jerk. Default: `false`.

`opt` - Object bag of settings regarding string formatting.  All default to `false`.

Settings supported:

* `trim` - Boolean. Whether or not to trim text and comment nodes.
* `normalize` - Boolean. If true, then turn any whitespace into a single
  space.
* `lowercase` - Boolean. If true, then lowercase tag names and attribute names
  in loose mode, rather than uppercasing them.
* `xmlns` - Boolean. If true, then namespaces are supported.
* `position` - Boolean. If false, then don't track line/col/position.
* `strictEntities` - Boolean. If true, only parse [predefined XML
  entities](http://www.w3.org/TR/REC-xml/#sec-predefined-ent)
  (`&amp;`, `&apos;`, `&gt;`, `&lt;`, and `&quot;`)

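To make the effect of `trim` and `normalize` concrete, here is a small stand-alone illustration of what those two settings do to a text node, as described in the list above. `applyTextOptions` is a hypothetical helper written for this example. It mirrors the documented behavior and is not necessarily how the parser implements it internally.

```javascript
// Illustration only: mimic the documented effect of the `trim` and
// `normalize` settings on a single text or comment node.
function applyTextOptions(opt, text) {
  if (opt.trim) text = text.trim();                    // drop leading/trailing whitespace
  if (opt.normalize) text = text.replace(/\s+/g, " "); // collapse each whitespace run
  return text;
}

var raw = "  Hello,\n   world  ";
var trimmed = applyTextOptions({ trim: true }, raw);
var both = applyTextOptions({ trim: true, normalize: true }, raw);
```

With the real parser, the same settings would be passed as the second argument, for example `sax.parser(false, { trim: true, normalize: true })`.
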
## Methods

`write` - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.

`close` - Close the stream. Once closed, no more data may be written until
it is done processing the buffer, which is signaled by the `end` event.

`resume` - To gracefully handle errors, assign a listener to the `error`
event. Then, when the error is taken care of, you can call `resume` to
continue parsing. Otherwise, the parser will not continue while in an error
state.

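A minimal sketch of the recover-and-continue pattern described under `resume`: record the error, clear `parser.error`, then call `resume`. The helper name `collectErrors` is an assumption made for this example.

```javascript
// Hypothetical helper: keep parsing after errors, remembering each message.
function collectErrors(parser) {
  var errors = [];
  parser.onerror = function (e) {
    errors.push(e.message);
    parser.error = null; // clear the stored error first
    parser.resume();     // then let parsing continue
  };
  return errors;
}
```

A caller would attach this before the first `write`, then inspect the returned array once the `end` event has fired.
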
## Members

At all times, the parser object will have the following members:

`line`, `column`, `position` - Indications of the position in the XML
document where the parser currently is looking.

`startTagPosition` - Indicates the position where the current tag starts.

`closed` - Boolean indicating whether or not the parser can be written to.
If it's `true`, then wait for the `ready` event to write again.

`strict` - Boolean indicating whether or not the parser is a jerk.

`opt` - Any options passed into the constructor.

`tag` - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.

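As an illustration of the position members, the sketch below records where the parser was looking each time an opening tag was reported. `recordTagLocations` is a hypothetical helper. Whether `line` and `column` are zero- or one-based is not stated here, so the values are stored exactly as the parser reports them.

```javascript
// Hypothetical helper: remember the parser's position for every open tag.
function recordTagLocations(parser) {
  var locations = [];
  parser.onopentag = function (node) {
    locations.push({
      name: node.name,
      line: parser.line,              // where the parser is currently looking
      column: parser.column,
      start: parser.startTagPosition  // where the current tag started
    });
  };
  return locations;
}
```

Note that this relies on position tracking being enabled, so it would not be useful with `position: false`.
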
## Events

All events emit with a single argument. To listen to an event, assign a
function to `on<eventname>`. Functions get executed in the this-context of
the parser object. The list of supported events is also in the exported
`EVENTS` array.

When using the stream interface, assign handlers using the EventEmitter
`on` function in the normal fashion.

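Because handlers are plain `on<eventname>` properties, one way to trace a parse is to loop over a list of event names and attach the same logger to each. In the sketch below the list is passed in explicitly. With the library loaded it would presumably be the exported `EVENTS` array mentioned above. `traceEvents` is a hypothetical name, and since it also replaces any `onerror` handler without clearing the error, it is meant for debugging only.

```javascript
// Hypothetical helper: log every event in `eventNames` with its argument.
function traceEvents(parser, eventNames) {
  var log = [];
  eventNames.forEach(function (name) {
    // Each event is delivered to the property named "on" + event name.
    parser["on" + name] = function (arg) {
      log.push({ event: name, arg: arg });
    };
  });
  return log;
}
```
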
`error` - Indication that something bad happened. The error will be hanging
out on `parser.error`, and must be deleted before parsing can continue. By
listening to this event, you can keep an eye on that kind of stuff. Note:
this happens *much* more in strict mode. Argument: instance of `Error`.

`text` - Text node. Argument: string of text.

`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.

`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument:
object with `name` and `body` members. Attributes are not parsed, as
processing instructions have implementation dependent semantics.

`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>`
would trigger this kind of event. This is a weird thing to support, so it
might go away at some point. SAX isn't intended to be used to parse SGML,
after all.

`opentagstart` - Emitted immediately when the tag name is available,
but before any attributes are encountered.  Argument: object with a
`name` field and an empty `attributes` set.  Note that this is the
same object that will later be emitted in the `opentag` event.

`opentag` - An opening tag. Argument: object with `name` and `attributes`.
In non-strict mode, tag names are uppercased, unless the `lowercase`
option is set.  If the `xmlns` option is set, then it will contain
namespace binding information on the `ns` member, and will have a
`local`, `prefix`, and `uri` member.

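As a small sketch of what the namespace members on `opentag` make possible, the helper below tallies elements per namespace URI. It assumes the parser was created with the `xmlns` option so that each node carries `uri`. The name `countByNamespace` and the `"(no namespace)"` label are made up for this example.

```javascript
// Hypothetical helper: count opened elements per namespace URI.
// Assumes the `xmlns` option is set, so each node has a `uri` member.
function countByNamespace(parser) {
  var counts = {};
  parser.onopentag = function (node) {
    var key = node.uri || "(no namespace)";
    counts[key] = (counts[key] || 0) + 1;
  };
  return counts;
}
```
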
`closetag` - A closing tag. In loose mode, tags are auto-closed if their
parent closes. In strict mode, well-formedness is enforced. Note that
self-closing tags will have `closetag` emitted immediately after `opentag`.
Argument: tag name.

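The `opentag`, `text`, and `closetag` events are enough to build a simple object model by hand. The sketch below keeps a stack of open elements and nests each new element or text node under the element on top of the stack. `buildTree` and the shape of the resulting objects (`name`, `attributes`, `children`, with a synthetic `#document` root) are assumptions chosen for this example, not something the library provides.

```javascript
// Hypothetical helper: build a plain nested object from parser events.
function buildTree(parser) {
  var root = { name: "#document", attributes: {}, children: [] };
  var stack = [root]; // elements that are currently open

  parser.onopentag = function (node) {
    var element = { name: node.name, attributes: node.attributes, children: [] };
    stack[stack.length - 1].children.push(element);
    stack.push(element);
  };
  parser.ontext = function (text) {
    stack[stack.length - 1].children.push(text);
  };
  parser.onclosetag = function () {
    // Self-closing tags emit closetag right after opentag, so for
    // well-formed input one pop per closetag keeps the stack balanced.
    if (stack.length > 1) stack.pop();
  };
  return root;
}
```
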
`attribute` - An attribute node.  Argument: object with `name` and `value`.
In non-strict mode, attribute names are uppercased, unless the `lowercase`
option is set.  If the `xmlns` option is set, it will also contain namespace
information.

`comment` - A comment node.  Argument: the string of the comment.

`opencdata` - The opening tag of a `<![CDATA[` block.

`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get
quite large, this event may fire multiple times for a single block, if it
is broken up into multiple `write()`s. Argument: the string of random
character data.

`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.

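Since `cdata` can fire several times for one block, a consumer that wants the whole block has to stitch the pieces together between `opencdata` and `closecdata`. A minimal sketch, where `collectCdata` and its `onBlock` callback are names invented for this example:

```javascript
// Hypothetical helper: join the pieces of each CDATA block into one string
// and hand the result to `onBlock` when the block closes.
function collectCdata(parser, onBlock) {
  var parts = null; // null while no CDATA block is open

  parser.onopencdata = function () {
    parts = [];
  };
  parser.oncdata = function (chunk) {
    if (parts) parts.push(chunk);
  };
  parser.onclosecdata = function () {
    if (parts) onBlock(parts.join(""));
    parts = null;
  };
}
```
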
`opennamespace` - If the `xmlns` option is set, then this event will
signal the start of a new namespace binding.

`closenamespace` - If the `xmlns` option is set, then this event will
signal the end of a namespace binding.

`end` - Indication that the closed stream has ended.

`ready` - Indication that the stream has reset, and is ready to be written
to.

`script` - In non-strict mode, `<script>` tags trigger a `"script"`
event, and their contents are not checked for special xml characters.
If you pass `noscript: true` in the options, then this behavior is
suppressed.

## Reporting Problems

It's best to write a failing test if you find an issue.  I will always
accept pull requests with failing tests if they demonstrate intended
behavior, but it is very hard to figure out what issue you're describing
without a test.  Writing a test is also the best way for you yourself
to figure out if you really understand the issue you think you have with
sax-js.