<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
<HEAD>
<!--
 HTMLParser Library $Name: v1_6_20060319 $ - A java-based parser for HTML
 http://sourceforge.org/projects/htmlparser
 Copyright (C) 2004 Somik Raha

 Revision Control Information

 $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/package.html,v $
 $Author: derrickoswald $
 $Date: 2005/04/12 11:27:41 $
 $Revision: 1.13 $

 This library is free software; you can redistribute it and/or
 modify it under the terms of the GNU Lesser General Public
 License as published by the Free Software Foundation; either
 version 2.1 of the License, or (at your option) any later version.

 This library is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 Lesser General Public License for more details.

 You should have received a copy of the GNU Lesser General Public
 License along with this library; if not, write to the Free Software
 Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
-->
<TITLE>Lexer Package</TITLE>
</HEAD>
<BODY>
The lexer package is the base-level I/O subsystem.
<P>The lexer package is responsible for reading characters from the HTML source
and identifying the node lexemes. For example, lexing the HTML below yields the
list of nodes that follows it. Lines and offsets are zero-based, and each end
offset is exclusive:</P>
<PRE>
&lt;html&gt;&lt;head&gt;&lt;title&gt;Humoresque&lt;/title&gt;&lt;/head&gt;
&lt;body bgcolor='silver'&gt;
Passengers will please refrain
from flushing toilets while the train
is standing in the station. I love you!
&lt;p&gt;
We encourage constipation
while the train is in the station
If the train can't go
then why should you.
&lt;/body&gt;
&lt;/html&gt;
</PRE>
<OL>
<LI>line 0, offset 0, to line 0, offset 6, html tag</LI>
<LI>line 0, offset 6, to line 0, offset 12, head tag</LI>
<LI>line 0, offset 12, to line 0, offset 19, title tag</LI>
<LI>line 0, offset 19, to line 0, offset 29, string node "Humoresque"</LI>
<LI>line 0, offset 29, to line 0, offset 37, end title tag</LI>
<LI>line 0, offset 37, to line 0, offset 44, end head tag</LI>
<LI>line 0, offset 44, to line 0, offset 45, string node "\n"</LI>
<LI>line 1, offset 0, to line 1, offset 23, body tag</LI>
<LI>line 1, offset 23, to line 4, offset 40, string node "\nPassengers...you!\n"</LI>
<LI>line 5, offset 0, to line 5, offset 3, paragraph tag</LI>
<LI>line 5, offset 3, to line 9, offset 21, string node "\nWe...you.\n"</LI>
<LI>line 10, offset 0, to line 10, offset 7, end body tag</LI>
<LI>line 10, offset 7, to line 10, offset 8, string node "\n"</LI>
<LI>line 11, offset 0, to line 11, offset 7, end html tag</LI>
<LI>line 11, offset 7, to line 11, offset 8, string node "\n"</LI>
</OL>
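<P>The way a lexer arrives at such a list can be sketched in a few lines of
Java. This is a minimal illustration only, not the package's <CODE>Lexer</CODE>:
the class <CODE>LexemeSketch</CODE>, its <CODE>Span</CODE> type and the
<CODE>split</CODE> method are hypothetical names, offsets are absolute rather
than line-relative, and quoted attribute values, comments and JSP tags are
deliberately ignored.</P>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; this is not the org.htmlparser.lexer.Lexer implementation. */
class LexemeSketch {

    /** A lexeme described only by offsets into the source, half-open [start, end). */
    static final class Span {
        final boolean tag; // true for a tag, false for a string node
        final int start;   // offset of the first character
        final int end;     // offset one past the last character

        Span(boolean tag, int start, int end) {
            this.tag = tag;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits the source into tag spans and string spans that cover it completely. */
    static List<Span> split(String source) {
        List<Span> spans = new ArrayList<>();
        int pos = 0;
        while (pos < source.length()) {
            boolean tag = source.charAt(pos) == '<';
            int end;
            if (tag) {
                // A tag runs up to and including the next '>' (assumption: no quoting rules).
                int close = source.indexOf('>', pos);
                end = (close < 0) ? source.length() : close + 1;
            } else {
                // A string node runs up to, but not including, the next '<'.
                int open = source.indexOf('<', pos);
                end = (open < 0) ? source.length() : open;
            }
            spans.add(new Span(tag, pos, end));
            pos = end; // the next span starts exactly where this one ended
        }
        return spans;
    }

    public static void main(String[] args) {
        String source = "<html><head><title>Humoresque</title></head>\n";
        for (Span s : split(source)) {
            System.out.println((s.tag ? "tag    " : "string ") + s.start + " to " + s.end);
        }
    }
}
```

<P>Because each span begins where the previous one ended, the spans cover the
input without gaps or overlaps.</P>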
<p>Stream, Source, Page and Lexer
<p>The package is arranged in four levels, <CODE>Stream</CODE>,
<CODE>Source</CODE>, <CODE>Page</CODE> and <CODE>Lexer</CODE>, in order from lowest to
highest.
A <CODE>Stream</CODE> is raw bytes from the URLConnection or file. It has no
intelligence. A <CODE>Source</CODE> is raw characters, so it knows about the
encoding scheme used and can be reset if a different encoding is detected after
the text has been partially read. A <CODE>Page</CODE> provides characters from the
source while maintaining an index of line numbers. It can be thought of
as an array of strings corresponding to the lines of the source file, but it doesn't
actually store any text, relying instead on the buffering within the
<CODE>Source</CODE>. The <CODE>Lexer</CODE> contains the actual lexeme parsing
code. It reads characters from the page, keeping track of its position with a
<CODE>Cursor</CODE>, and creates the array of nodes using various state
machines.
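<P>The idea of a page that indexes lines without storing any text can be
sketched as follows. This is an assumption-laden illustration, not the actual
<CODE>Page</CODE> class: <CODE>LineIndexSketch</CODE> and its methods are
hypothetical names, and only the newline character is treated as a line
terminator.</P>

```java
import java.util.Arrays;

/** Hypothetical sketch of a line index; this is not the org.htmlparser.lexer.Page class. */
class LineIndexSketch {
    // Absolute offset at which each line starts; entry 0 is always 0.
    private int[] lineStarts = new int[16];
    private int lineCount = 1;

    /** Scans the text once and records line starts. The text itself is not kept. */
    LineIndexSketch(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') { // assumption: '\n' alone ends a line
                addLineStart(i + 1);
            }
        }
    }

    private void addLineStart(int offset) {
        if (lineCount == lineStarts.length) {
            lineStarts = Arrays.copyOf(lineStarts, lineCount * 2);
        }
        lineStarts[lineCount++] = offset;
    }

    /** Returns the zero-based line containing the given absolute offset. */
    int row(int offset) {
        int i = Arrays.binarySearch(lineStarts, 0, lineCount, offset);
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // containing line is the one just before the insertion point.
        return (i >= 0) ? i : -i - 2;
    }

    /** Returns the zero-based offset within the line for the given absolute offset. */
    int column(int offset) {
        return offset - lineStarts[row(offset)];
    }

    public static void main(String[] args) {
        LineIndexSketch index = new LineIndexSketch("<html>\n<body bgcolor='silver'>\nhi");
        int offset = 30; // the newline that follows the body tag
        System.out.println("line " + index.row(offset) + ", offset " + index.column(offset));
    }
}
```

<P>Only integers are stored, so the text can remain in whatever buffer the lower
level already maintains.</P>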
<p>
The following design goals and 'invariants' hold within the package. Keep them in
mind if you are attempting to understand or modify it.
<DL>
<DT>Contiguous Nodes
<DD>Adjacent nodes have no characters between them. The list of nodes forms an
uninterrupted chain that, by start and end definitions, completely covers the
characters that were read from the HTML source.
<DT>Text Fidelity
<DD>In addition to providing complete coverage, the nodes do not initially contain
copies of the text; they simply hold offsets into a single large buffer
that contains the text read from the HTML source. Even within tags, the
attribute list can contain whitespace, so no whitespace or text formatting is
lost either outside or within tags. Upper and lower case text is
preserved.
<DT>Line Endings
<DD>End of line characters are just whitespace. There is no distinction
made between end of line characters (or pairs of characters on Windows) and
other whitespace. The text is not read in line by line, so nodes (tags) can easily span
multiple lines with no special processing. Line endings are not transformed
between platforms, i.e. Unix line endings are not converted to Windows line
endings by this level. Each node has a starting and ending location, which
the page can use to extract the text. To help format error and log messages,
the page can turn these offsets into row and column numbers. In general, ignore line
breaks in the source if at all possible.
<DT>State Machines
<DD>The Lexer has the following state machines:
<UL>
<LI>in text - parseString()</LI>
<LI>in comment - parseRemark()</LI>
<LI>in tag - parseTag()</LI>
<LI>in JSP tag - parseJsp()</LI>
</UL>
Another state machine, parseCDATA, is used by higher level code
(the script and style scanners) but not by the lexer itself.
<DT>Two Jars
<DD>For elementary operations at the node level, a minimalist jar file containing
only the lexer and base tag classes is split out from the larger <CODE>htmlparser.jar</CODE>.
In this way, simple parsing and output is handled with a jar file that is under
45 kilobytes, but anything beyond peephole manipulation, such as closing tag detection
and other semantic reasoning, will need the full set of scanners, nodes and ancillary
classes, which now stands at 210 kilobytes.
</DL>
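<P>How a lexer might choose among the four state machines listed above can be
sketched by peeking at the characters under the cursor. The dispatch rules
below are assumptions made for illustration and are not taken from the
<CODE>Lexer</CODE> source. <CODE>DispatchSketch</CODE>, <CODE>State</CODE> and
<CODE>classify</CODE> are hypothetical names.</P>

```java
/** Hypothetical sketch of state machine selection; not the actual Lexer logic. */
class DispatchSketch {

    /** One value per state machine named in the list above. */
    enum State {
        STRING, // in text, handled by parseString()
        REMARK, // in comment, handled by parseRemark()
        TAG,    // in tag, handled by parseTag()
        JSP     // in JSP tag, handled by parseJsp()
    }

    /** Picks a state machine for the lexeme that starts at pos (assumed rules). */
    static State classify(String source, int pos) {
        if (source.charAt(pos) != '<') {
            return State.STRING;
        }
        if (source.startsWith("<!--", pos)) {
            return State.REMARK;
        }
        if (source.startsWith("<%", pos)) {
            return State.JSP;
        }
        if (pos + 1 < source.length()) {
            char next = source.charAt(pos + 1);
            // Start tags, end tags and declarations such as <!DOCTYPE ...>.
            if (Character.isLetter(next) || next == '/' || next == '!') {
                return State.TAG;
            }
        }
        // Assumption: a '<' that does not open a tag is ordinary text.
        return State.STRING;
    }

    public static void main(String[] args) {
        String[] samples = { "Humoresque", "<body bgcolor='silver'>", "<!-- note -->", "<%= x %>" };
        for (String sample : samples) {
            System.out.println(classify(sample, 0) + "  " + sample);
        }
    }
}
```

<P>Whichever state machine is chosen, it consumes characters up to the end of
its lexeme, so the next node begins exactly where the previous one stopped.</P>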
</BODY>
</HTML>