<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
<HEAD>
<!--
HTMLParser Library $Name: v1_6_20060319 $ - A java-based parser for HTML
http://sourceforge.org/projects/htmlparser
Copyright (C) 2004 Somik Raha

Revision Control Information

$Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/package.html,v $
$Author: derrickoswald $
$Date: 2005/04/12 11:27:41 $
$Revision: 1.13 $

This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
-->
<TITLE>Lexer Package</TITLE>
</HEAD>
<BODY>
The lexer package is the base level I/O subsystem.
<P>The lexer package is responsible for reading characters from the HTML source
and identifying the node lexemes. For example, the HTML code below would return
the list of nodes shown:</P>
<PRE>
&lt;html&gt;&lt;head&gt;&lt;title&gt;Humoresque&lt;/title&gt;&lt;/head&gt;
&lt;body bgcolor='silver'&gt;
Passengers will please refrain
from flushing toilets while the train
is standing in the station. I love you!
&lt;p&gt;
We encourage constipation
while the train is in the station
If the train can't go
then why should you.
&lt;/body&gt;
&lt;/html&gt;
</PRE>
<OL>
<LI>line 0, offset 0, to line 0, offset 6, html tag</LI>
<LI>line 0, offset 6, to line 0, offset 12, head tag</LI>
<LI>line 0, offset 12, to line 0, offset 19, title tag</LI>
<LI>line 0, offset 19, to line 0, offset 29, string node "Humoresque"</LI>
<LI>line 0, offset 29, to line 0, offset 37, end title tag</LI>
<LI>line 0, offset 37, to line 0, offset 44, end head tag</LI>
<LI>line 0, offset 44, to line 0, offset 45, string node "\n"</LI>
<LI>line 1, offset 0, to line 1, offset 23, body tag</LI>
<LI>line 1, offset 23, to line 4, offset 40, string node "\nPassengers...you!\n"</LI>
<LI>line 5, offset 0, to line 5, offset 3, paragraph tag</LI>
<LI>line 5, offset 3, to line 9, offset 21, string node "\nWe...you.\n"</LI>
<LI>line 10, offset 0, to line 10, offset 7, end body tag</LI>
<LI>line 10, offset 7, to line 10, offset 8, string node "\n"</LI>
<LI>line 11, offset 0, to line 11, offset 7, end html tag</LI>
<LI>line 11, offset 7, to line 11, offset 8, string node "\n"</LI>
</OL>
<p>Stream, Source, Page and Lexer
<p>The package is arranged in four levels, <CODE>Stream</CODE>,
<CODE>Source</CODE>, <CODE>Page</CODE> and <CODE>Lexer</CODE>, in order from lowest to
highest.
A <CODE>Stream</CODE> is raw bytes from the URLConnection or file. It has no
intelligence. A <CODE>Source</CODE> is raw characters, hence it knows about the
encoding scheme used and can be reset if a different encoding is detected after
partially reading in the text. A <CODE>Page</CODE> provides characters from the
source while maintaining the index of line numbers, and hence can be thought of
as an array of strings corresponding to source file lines, but it doesn't
actually store any text, relying on the buffering within the
<CODE>Source</CODE> instead. The <CODE>Lexer</CODE> contains the actual lexeme parsing
code.
It reads characters from the page, keeping track of where it is with a
<CODE>Cursor</CODE>, and creates the array of nodes using various state
machines.
<p>
The following are some design goals and 'invariants' within the package, useful if you
are attempting to understand or modify it.
<DL>
<DT>Contiguous Nodes
<DD>Adjacent nodes have no characters between them. The list of nodes forms an
uninterrupted chain that, by start and end definitions, completely covers the
characters that were read from the HTML source.
<DT>Text Fidelity
<DD>Besides complete coverage, the nodes do not initially contain copies of
the text, but instead simply contain offsets into a single large buffer
that contains the text read from the HTML source. Even within tags, the
attributes list can contain whitespace, thus there is no lost whitespace or
text formatting either outside or within tags. Upper and lower case text is
preserved.
<DT>Line Endings
<DD>End of line characters are just whitespace. There is no distinction
made between end of line characters (or pairs of characters on Windows) and
other whitespace. The text is not read in line by line, so nodes (tags) can easily span
multiple lines with no special processing. Line endings are not transformed
between platforms, i.e. Unix line endings are not converted to Windows line
endings by this level. Each node has a starting and ending location, which
the page can use to extract the text. To facilitate formatting error and log messages,
the page can turn these offsets into row and column numbers. In general, ignore line
breaks in the source if at all possible.
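<P>The conversion from an offset to row and column numbers mentioned above can be
sketched as follows. This is only a minimal illustration, not the library's code: the
class <CODE>LineIndexSketch</CODE> and its methods are hypothetical names, offsets are
assumed to be absolute positions in the text, and a new line is assumed to start after
each '\n' character (so the '\r' of a Windows pair simply counts as the last character of
its line, and a lone '\r' is not treated as a line ending).</P>
<PRE>
```java
import java.util.Arrays;

/** Hypothetical sketch; this is not the library's Page class. */
final class LineIndexSketch {
    /** Absolute offset of the first character of each line, ascending. */
    private final int[] starts;

    LineIndexSketch(String text) {
        // Line 0 always starts at offset 0; every '\n' starts another line.
        int[] found = new int[text.length() + 1];
        int count = 1; // found[0] is already 0
        int next = text.indexOf('\n');
        while (next != -1) {
            found[count++] = next + 1;
            next = text.indexOf('\n', next + 1);
        }
        starts = Arrays.copyOf(found, count);
    }

    /** Zero-based row containing the absolute character offset. */
    int row(int offset) {
        int hit = Arrays.binarySearch(starts, offset);
        // A miss returns -(insertionPoint) - 1; the row is the line start just before it.
        return hit >= 0 ? hit : -hit - 2;
    }

    /** Zero-based column of the offset within its row. */
    int column(int offset) {
        return offset - starts[row(offset)];
    }
}
```
</PRE>
<P>For the sample document at the top of this page, such an index would map absolute
offset 45 to line 1, offset 0, which is the start of the body tag. Keeping only the line
start offsets fits the design described here, since no text is copied.</P>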
<DT>State Machines
<DD>The Lexer has the following state machines:
<UL>
<LI>in text - parseString()</LI>
<LI>in comment - parseRemark()</LI>
<LI>in tag - parseTag()</LI>
<LI>in JSP tag - parseJsp()</LI>
</UL>
There is another state machine -- parseCDATA -- used by higher level code
(script and style scanners), but this isn't actually used by the lexer.
<DT>Two Jars
<DD>For elementary operations at the node level, a minimalist jar file containing
only the lexer and base tag classes is split out from the larger <CODE>htmlparser.jar</CODE>.
In this way, simple parsing and output is handled with a jar file that is under
45 kilobytes, but anything beyond peephole manipulation, i.e. closing tag detection
and other semantic reasoning, will need the full set of scanners, nodes and ancillary
classes, which now stands at 210 kilobytes.
</DL>
</BODY>
</HTML>