A Lisp Based HTML Parser

This is a preliminary document which will be updated over time. Updates will be available for download.

Introduction and a simple example
LHTML parse output format
Case mode notes
Parsing HTML comments
Parsing <SCRIPT> and <STYLE> tags
Parsing SGML <! tags
Parsing Illegal and Deprecated Tags
Default Attribute Values
Parsing Interleaved Character Formatting Tags
parse-html reference
   methods
   phtml-internal

Introduction and a simple example

The parse-html generic function processes HTML input, returning a list of HTML tags, attributes, and text. Here is a simple example (we have added carriage returns in the string for readability):

(parse-html "<HTML>
             <HEAD>
             <TITLE>Example HTML input</TITLE>
             <BODY>
             <P>Here is some text with a <B>bold</B> word <br>and a 
                <A HREF=\"help.html\">link</A></P>
             </HTML>")

generates:

((:html (:head (:title "Example HTML input"))
  (:body (:p "Here is some text with a " (:b "bold") "word" :br "and a "
             ((:a :href "help.html") "link")))))

The output format is known as LHTML (Lisp HTML); it is the same format that the aserve htmlgen macro accepts. As the example shows, it is a nested collection of lists. The first element of each list is either a keyword naming an HTML tag, or a list whose first element is such a keyword and whose remaining elements are attribute names and values. (The link in the example is a case where the first element is a list.)

LHTML parse output format

LHTML is a list representation of HTML tags and content.

Each list member may be:

  1. a string containing text content, such as "Here is some text with a "
  2. a keyword package symbol representing an HTML tag with no associated attributes or content, such as :br.
  3. a list representing an HTML tag with associated attributes and/or content, such as (:b "bold") or ((:a :href "help.html") "link").

More on list members (item 3): if the HTML tag has no associated attributes, the first element of the list is a keyword package symbol representing the HTML tag, and the remaining elements represent the content. Each content element can be a string (text content), a keyword package symbol (an HTML tag with no attributes or content), or a list (a nested HTML tag with associated attributes and/or content). If the tag has attributes, the first element is instead a list containing the tag's keyword package symbol followed by two elements for each attribute: a keyword package symbol representing the attribute name, then a string holding the attribute value.
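
The three element shapes can be taken apart with a few small functions. The following is a minimal sketch; lhtml-tag, lhtml-attributes and lhtml-content are hypothetical helper names chosen for illustration and are not part of the parser.

```lisp
;; Hypothetical helpers for inspecting one LHTML item.
;; They are illustrative only and not provided by the parser.

(defun lhtml-tag (node)
  "Return the tag keyword of NODE, or NIL if NODE is a text string."
  (cond ((stringp node) nil)
        ((symbolp node) node)                       ; bare tag such as :br
        ((consp (first node)) (first (first node))) ; ((:a :href "x") ...)
        (t (first node))))                          ; (:b "bold")

(defun lhtml-attributes (node)
  "Return the attribute property list of NODE, or NIL if it has none."
  (if (and (consp node) (consp (first node)))
      (rest (first node))
      nil))

(defun lhtml-content (node)
  "Return the list of content items nested inside NODE."
  (if (consp node) (rest node) nil))
```

Because the attributes form a property list, the standard getf function can look up a single value: for the link ((:a :href "help.html") "link"), (getf (lhtml-attributes link) :href) returns "help.html".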

Case Mode and LHTML

If excl:*current-case-mode* is :CASE-INSENSITIVE-UPPER, keyword package symbols will be in upper case; otherwise, they will be in lower case.
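
When a tag name arrives as a string (for example, from a configuration file), comparing it with a tag symbol's name depends on the case mode. A minimal sketch of a case-independent test follows; tag-named-p is a hypothetical helper, not part of the parser.

```lisp
;; Hypothetical helper: compare a tag keyword with a tag name string
;; without depending on the case of the symbol's name.

(defun tag-named-p (tag name)
  "True if TAG is a symbol whose name matches the string NAME, ignoring case."
  ;; STRING-EQUAL accepts symbols as string designators and ignores case.
  (and (symbolp tag)
       (string-equal tag name)))
```

For example, (tag-named-p :br "BR") and (tag-named-p :br "br") are both true regardless of how the symbol's name is stored.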

HTML Comments

HTML comments are represented using a :comment symbol. For example,

(parse-html "<!-- this is a comment-->")

--> ((:comment " this is a comment"))
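
Since comments appear as ordinary (:comment ...) lists, they can be filtered out of a parse result with a short tree walk. The sketch below assumes its input is a list of LHTML items, as returned by parse-html; strip-comments is a hypothetical name.

```lisp
;; Hypothetical helper: remove every (:comment ...) item from LHTML.

(defun strip-comments (lhtml)
  "Return a copy of the LHTML list LHTML without any (:comment ...) items."
  (loop for item in lhtml
        unless (and (consp item) (eq (first item) :comment))
          collect (if (consp item)
                      ;; Keep the tag (or tag-plus-attributes list) as is
                      ;; and recurse into the content items only.
                      (cons (first item) (strip-comments (rest item)))
                      item)))
```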

HTML <SCRIPT> and <STYLE> tags

The content of <SCRIPT> and <STYLE> tags is not parsed; it is returned as text content.

For example,

(parse-html "<SCRIPT>this <B>will not</B> be parsed</SCRIPT>")

--> ((:script "this <B>will not</B> be parsed"))
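
One consequence is that code collecting the readable text of a page has to skip these items itself, because their content looks like any other string. A minimal sketch, using the hypothetical name visible-text and assuming the input is a list of LHTML items:

```lisp
;; Hypothetical helper: gather text strings from LHTML while skipping
;; the unparsed content of :script and :style items, and comments.

(defun visible-text (lhtml)
  "Concatenate the text strings in the LHTML list LHTML, skipping the
content of :script, :style and :comment items."
  (with-output-to-string (out)
    (labels ((tag-of (item)
               ;; The tag is either the first element or, when attributes
               ;; are present, the first element of the first element.
               (if (consp (first item))
                   (first (first item))
                   (first item)))
             (walk (items)
               (dolist (item items)
                 (cond ((stringp item) (write-string item out))
                       ((and (consp item)
                             (not (member (tag-of item)
                                          '(:script :style :comment))))
                        (walk (rest item)))))))
      (walk lhtml))))
```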

XML and SGML <! tags

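Because some HTML pages contain special XML/SGML tags, non-comment tags starting with '<!' are treated specially: the tag name, including the '!', becomes the keyword, and the rest of the tag is returned as a single string.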

(parse-html "<!doctype this is some text>")

--> ((:!doctype " this is some text"))

Illegal and Deprecated HTML

There is plenty of illegal and deprecated HTML on the web that popular browsers nonetheless successfully display. The parse-html parser is generous: it will not raise an error condition upon encountering most input. In particular, it does not maintain a list of legal HTML tags and will successfully parse nonsense input.

For example,

(parse-html "<this> <is> <some> <nonsense> <input>")

--> ((:this (:is (:some (:nonsense :input)))))

In some situations, you may prefer a two-pass parse that results in a parse where deep nesting related to unrecognized tags is minimized:

(let ((string "<this> <is> <some> <nonsense> </some> <input>"))
  (multiple-value-bind (res rogues)
      (parse-html string :collect-rogue-tags t)
    (declare (ignorable res))
    (parse-html string :no-body-tags rogues)))

--> (:this :is (:some (:nonsense)) :input)

See the descriptions of the collect-rogue-tags and no-body-tags keyword arguments to parse-html in the reference section below for more information.

Default Attribute values

As per the HTML 4.0 specification, attributes without specified values are given a lower case string value that matches the attribute name.

For example,

(parse-html "<P here ARE some attributes>")

--> (((:p :here "here" :are "are" :some "some" :attributes "attributes")))
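
Because a valueless attribute receives its own name as its value, <P here> and <P here="here"> produce the same LHTML and cannot be told apart afterwards. The sketch below looks up an attribute and tests for that pattern; attribute-value and flag-attribute-p are hypothetical names, not part of the parser.

```lisp
;; Hypothetical helpers for reading attributes from one LHTML element.

(defun attribute-value (element name)
  "Return the value of attribute NAME in the LHTML list ELEMENT, or NIL."
  ;; Attributes exist only when the first element is itself a list;
  ;; everything after the tag keyword in that list is a property list.
  (when (and (consp element) (consp (first element)))
    (getf (rest (first element)) name)))

(defun flag-attribute-p (element name)
  "True if NAME's value is its own lower-case name, as with <P here>."
  (let ((value (attribute-value element name)))
    (and (stringp value)
         (string= value (string-downcase (symbol-name name))))))
```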

Interleaved Character Formatting Tags

Existing HTML pages often have character format tags that are interleaved among other tags. Such interleaving is removed in a manner consistent with the HTML 4.0 specification.

For example,

(parse-html "<P>Here is <B>bold text<P>that spans</B>two paragraphs")

--> ((:p "Here is " (:b "bold text")) (:p (:b "that spans") "two paragraphs"))

parse-html Reference

parse-html [Generic function]

Arguments: input-source &key callbacks callback-only collect-rogue-tags no-body-tags

Returns LHTML output, as described above in this document.

parse-html Methods

Methods specializing on the required argument are defined for stream (reads input from the stream), string (parses the argument string), and t (described below).

parse-html (p stream) &key callbacks callback-only collect-rogue-tags no-body-tags
parse-html (str string) &key callbacks callback-only collect-rogue-tags no-body-tags
parse-html (file t) &key callbacks callback-only collect-rogue-tags no-body-tags

The t method assumes the argument is a pathname suitable for use with the with-open-file macro.

phtml-internal [Function]

Arguments: stream read-sequence-func callback-only callbacks collect-rogue-tags no-body-tags

This function may be used when more control is needed over how the HTML input is supplied. The read-sequence-func argument, if non-nil, should be a function object or a symbol naming a function. When phtml-internal requires another buffer of HTML input, it invokes that function with two arguments: the first is an internal character array used as the buffer, and the second is the stream argument passed to phtml-internal. The function must return the number of character array elements successfully stored in the buffer. If read-sequence-func is nil, phtml-internal invokes read-sequence to fill the buffer.
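
As an illustration of the contract described above, here is a minimal sketch of a read-sequence-func that fills the buffer with the standard read-sequence function and keeps a running character count. The names counting-read-sequence and *chars-read* are hypothetical and exist only for this example.

```lisp
;; Hypothetical example of a function suitable as read-sequence-func.
;; It takes the buffer and the stream, fills the buffer, and returns
;; the number of characters stored.

(defvar *chars-read* 0
  "Running total of characters handed to the parser (illustrative only).")

(defun counting-read-sequence (buffer stream)
  "Fill BUFFER from STREAM and return the number of characters stored."
  ;; READ-SEQUENCE returns the index of the first element it did not
  ;; update; starting from index 0, that is the number of characters read.
  (let ((count (read-sequence buffer stream)))
    (incf *chars-read* count)
    count))
```

Assuming the arguments are positional, as the argument list above suggests, such a function would be supplied as the second argument to phtml-internal, either as #'counting-read-sequence or as the symbol counting-read-sequence.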

Copyright (c) 2000 by Franz Inc. All rights reserved.