<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
  node</a></li>
  <li><a href="#Extracting1">Extracting information for the
  attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entities substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
  operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation results in a document loaded
completely in memory and exposed as a tree of nodes, all available at the
same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the amount of memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex, is not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* Returns 0 on success, -1 if the file cannot be opened or parsed. */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except Exception:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>
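
<p>As a rough sketch of how the Read() loop might be packaged for reuse, it
can be wrapped in a Python generator. The <code>iter_nodes()</code> name is
an assumption of this example and is not part of the libxml2 bindings; the
code relies only on the Read() return values described above (1 for a node,
0 at the end of the document, anything else for an error):</p>

```python
def iter_nodes(reader):
    """Yield the reader once per node, in document order.

    `reader` is assumed to behave like the libxml2 text reader:
    Read() returns 1 for a node, 0 at the end, another value on error.
    The same reader object is yielded each time; it is a cursor, so
    query it before asking for the next node.
    """
    ret = reader.Read()
    while ret == 1:
        yield reader
        ret = reader.Read()
    if ret != 0:
        # Hypothetical error policy chosen for this sketch.
        raise RuntimeError("failed to parse")
```

<p>With a reader obtained from <code>libxml2.newTextReaderFilename()</code>,
a caller could then write
<code>for node in iter_nodes(reader): processNode(node)</code>.</p>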

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starts at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
  yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: check if the current node is empty, this is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
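
<p>The numeric NodeType codes are easier to read in debugging output once
they are mapped to names. The following is a minimal sketch covering only the
codes listed above; the table, the labels and the
<code>node_type_name()</code> helper are assumptions of this example, not
part of the libxml2 bindings:</p>

```python
# Labels for the NodeType codes enumerated in the list above.
# The label strings are this example's own choice.
NODE_TYPE_NAMES = {
    1: "Element",
    2: "Attribute",
    3: "Text",
    4: "CDATA",
    5: "EntityReference",
    6: "EntityDeclaration",
    7: "ProcessingInstruction",
    8: "Comment",
    9: "Document",
    10: "DocumentType",
    11: "DocumentFragment",
    12: "Notation",
    15: "EndElement",
}


def node_type_name(code):
    """Return a readable label for a NodeType code, or a fallback."""
    return NODE_TYPE_NAMES.get(code, "Unknown(%d)" % code)
```

<p>A processNode() routine could then print
<code>node_type_name(reader.NodeType())</code> instead of the raw
integer.</p>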

<p>Let's look first at a small example to put this into practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as child nodes are
    explored</li>
  <li>the text node child of the b element, of type 3 and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>
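
<p>The whitespace-only text node seen between b and c is often unwanted. As
an illustration, the sketch below gathers the values of text nodes (type 3)
and drops those that contain only whitespace. The
<code>collect_text()</code> name and its error policy are assumptions of this
example; it uses only the Read(), NodeType() and Value() methods shown
above:</p>

```python
def collect_text(reader):
    """Return the non-blank text node values seen by the reader, in order."""
    chunks = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 3:  # 3 = text node
            value = reader.Value()
            # Skip nodes holding only whitespace, such as the line
            # return between two elements.
            if value is not None and value.strip():
                chunks.append(value)
        ret = reader.Read()
    if ret != 0:
        # Hypothetical error policy chosen for this sketch.
        raise ValueError("failed to parse")
    return chunks
```

<p>For the test document above this would be expected to keep "some text"
and discard the line return, provided the reader reports both as type 3 as
in the output shown.</p>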

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This shows that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property makes it possible to detect their presence.
To check their content the API has special instructions. Basically two kinds
of operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    that case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>Their depth is the one of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>
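
<p>The cursor-moving calls can be combined into a small helper that gathers
all attributes of the current element and then returns the cursor to the
element itself. This is only a sketch: the
<code>element_attributes()</code> name is an assumption of this example, and
it assumes that MoveToNextAttribute() returns 1 when it has moved to an
attribute, so the loop compares against 1 rather than relying on
truthiness:</p>

```python
def element_attributes(reader):
    """Return a dict of qualified attribute name to value.

    Returns an empty dict when the reader is not positioned on a
    start element (NodeType 1).
    """
    attrs = {}
    if reader.NodeType() != 1:
        return attrs
    # Step through the attribute nodes of the current element.
    while reader.MoveToNextAttribute() == 1:
        attrs[reader.Name()] = reader.Value()
    # Put the cursor back on the element that carries the attributes.
    reader.MoveToElement()
    return attrs
```

<p>For the document <code>&lt;doc a="b"/&gt;</code> this would be expected
to give a dictionary with the single entry a mapped to b. Namespace
declarations would appear as well, since they are seen as attributes.</p>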

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to validate the parsed
document against its DTD progressively. This is simply the activation of the
associated feature of the parser used by the reader structure. There are a
few options available, defined as the enum xmlParserProperties in the
libxml/xmlreader.h header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added, those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>

<h2><a name="Entities">Entities substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>
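
<p>When several parser properties are set at once, it can help to check each
result. The sketch below assumes that the Python SetParserProp() method
passes through the C return value, where 0 indicates success; the
<code>set_parser_props()</code> name is an assumption of this example and is
not part of the bindings:</p>

```python
def set_parser_props(reader, props):
    """Apply (property, value) pairs and return the refused properties.

    `props` is a sequence of pairs such as
    [(libxml2.PARSER_VALIDATE, 1), (libxml2.PARSER_SUBST_ENTITIES, 1)].
    Assumes SetParserProp() returns 0 on success, as at the C level.
    """
    refused = []
    for prop, value in props:
        if reader.SetParserProp(prop, value) != 0:
            refused.append(prop)
    return refused
```

<p>An empty result would mean that every property was accepted. The helper
is probably best called right after creating the reader, before the first
Read().</p>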

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read with the xmlReader
against Relax-NG schemas. While the Relax-NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice it means
that, unless the schema for the top-level element content cannot be expressed
as a regexp, only chunks of the document need to be expanded while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to Read(), associate it with a Relax-NG schema, either
    the preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader like for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre><code>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</code></pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts the bibliography for the
"Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>import libxml2

f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res = ""
while reader.Read():
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones. Usually, however, the full subtree is not useful
anymore once it has been processed, and the Next() operation makes it
possible to skip it completely and proceed to the successor, or return 0 if
the end of the document is reached.</p>
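
<p>The Expand() and Next() pattern can be generalized as in the sketch below,
which serializes every element of a given name whose expanded subtree
satisfies an XPath test. The <code>serialize_matching()</code> name and its
parameters are assumptions of this example; it uses only the Read(), Next(),
NodeType(), Name() and Expand() reader methods and the xpathEval() and
serialize() node methods used above. Each node is serialized immediately
because it is only valid until the reader moves on:</p>

```python
def serialize_matching(reader, element_name, xpath_test):
    """Return serialized subtrees of matching elements, in document order."""
    found = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 1 and reader.Name() == element_name:
            node = reader.Expand()  # grow the subtree under the cursor
            if node is not None and node.xpathEval(xpath_test):
                # Serialize now: the node is invalid after the next move.
                found.append(node.serialize())
            ret = reader.Next()     # skip the subtree just examined
        else:
            ret = reader.Read()
    return found
```

<p>Called as <code>serialize_matching(reader, 'bibl', "@id = 'Aho'")</code>,
this would be expected to collect the same entry as the loop above, as a
list rather than a concatenated string.</p>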

<p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>