This is a preliminary document which will be updated over time. Updates will be available for downloading.
Introduction and a simple example
LHTML parse output format
Case mode notes
Parsing HTML comments
Parsing <SCRIPT> and <STYLE> tags
Parsing SGML <! tags
Parsing Illegal and Deprecated Tags
Default Attribute Values
Parsing Interleaved Character Formatting Tags
parse-html reference
methods
phtml-internal
The parse-html generic function processes HTML input, returning a list of HTML tags, attributes, and text. Here is a simple example (we have added carriage returns in the string for readability):
(parse-html "<HTML>
<HEAD>
<TITLE>Example HTML input</TITLE>
<BODY>
<P>Here is some text with a <B>bold</B> word
<br>and a <A HREF=\"help.html\">link</A></P>
</HTML>")
generates:
((:html
  (:head (:title "Example HTML input"))
  (:body
   (:p "Here is some text with a " (:b "bold") "word" :br "and a "
       ((:a :href "help.html") "link")))))
The output format is known as LHTML (Lisp HTML) format; it is the same format that the aserve htmlgen macro accepts. As the example shows, it is a nested collection of lists where the first element of each list is a keyword associated with an HTML marker, or a list whose first element is a keyword associated with an HTML marker and the other elements are attribute names and values. (The link is an example where the first element is a list.)
LHTML is a list representation of HTML tags and content.
Each list member may be:
1. a string containing text content, such as "Here is some text with a "
2. a keyword package symbol representing an HTML tag with no associated attributes or content, such as :br
3. a list representing an HTML tag with associated attributes and/or content, such as (:b "bold") or ((:a :href "help.html") "link")
More on the list (#3) member: if the HTML tag has no associated attributes, the first list member is a keyword package symbol representing the HTML tag, and the remaining members represent the content. Each content member can be a string (text content), a keyword package symbol (HTML tag with no attributes or content), or a list (nested HTML tag with associated attributes and/or content). If the tag has associated attributes, the first list member is itself a list containing a keyword package symbol for the tag followed by two members for each attribute: a keyword package symbol representing the attribute name, then a string holding the attribute value.
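The member kinds described above can be told apart with ordinary list operations. The following is a minimal sketch in portable Common Lisp; the helper names lhtml-tag, lhtml-attributes, and lhtml-content are hypothetical and are not part of the parser.

```lisp
;; Hypothetical helpers (not part of the parser) for taking apart
;; a single LHTML list member.

(defun lhtml-tag (node)
  "Return the tag keyword of NODE, or NIL if NODE is text content."
  (cond ((stringp node) nil)                          ; text content
        ((symbolp node) node)                         ; bare tag such as :br
        ((consp (first node)) (first (first node)))   ; tag with attributes
        (t (first node))))                            ; tag without attributes

(defun lhtml-attributes (node)
  "Return the attribute property list of NODE, or NIL if it has none."
  (if (and (consp node) (consp (first node)))
      (rest (first node))
      nil))

(defun lhtml-content (node)
  "Return the list of content members of NODE, or NIL if it has none."
  (if (consp node)
      (rest node)
      nil))
```

Because the attributes are stored as alternating names and values, getf can look one up: (getf (lhtml-attributes '((:a :href "help.html") "link")) :href) returns "help.html".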
If excl:*current-case-mode* is :CASE-INSENSITIVE-UPPER, keyword package symbols will be in upper case; otherwise, they will be in lower case.
HTML comments are represented using a :comment symbol. For example,
(parse-html "<!-- this is a comment-->") --> ((:comment " this is a comment"))
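Since comments appear as ordinary list members headed by :comment, they can be filtered out after parsing. The following is a minimal sketch in portable Common Lisp; strip-comments is a hypothetical helper name, not something the parser provides.

```lisp
;; Hypothetical helper (not part of the parser): return a copy of an
;; LHTML list with every (:comment ...) member removed at any depth.

(defun strip-comments (lhtml)
  "Return a copy of the list of LHTML members LHTML without comment members."
  (loop for item in lhtml
        ;; Skip members of the form (:comment ...).
        unless (and (consp item) (eq (first item) :comment))
          collect (if (consp item)
                      ;; Keep the tag (or tag-plus-attributes list) as is
                      ;; and filter the content members recursively.
                      (cons (first item) (strip-comments (rest item)))
                      item)))
```

For example, (strip-comments '((:comment " x") (:p "a" (:comment "b") :br))) returns ((:p "a" :br)).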
The content of <SCRIPT> and <STYLE> tags is not parsed; it is returned as text content.
For example,
(parse-html "<SCRIPT>this <B>will not</B> be parsed</SCRIPT>") --> ((:script "this <B>will not</B> be parsed"))
Since some HTML pages contain special XML/SGML tags, non-comment tags starting with '<!' are treated specially:
(parse-html "<!doctype this is some text>") --> ((:!doctype " this is some text"))
There is plenty of illegal and deprecated HTML on the web that popular browsers nonetheless successfully display. The parse-html parser is generous - it will not raise an error condition upon encountering most input. In particular, it does not maintain a list of legal HTML tags and will successfully parse nonsense input.
For example,
(parse-html "<this> <is> <some> <nonsense> <input>") --> ((:this (:is (:some (:nonsense :input)))))
In some situations, you may prefer a two-pass parse, which minimizes the deep nesting caused by unrecognized tags. The first pass collects the unrecognized (rogue) tags, and the second pass treats them as tags with no body:
(let ((string "<this> <is> <some> <nonsense> </some> <input>"))
  (multiple-value-bind (res rogues)
      (parse-html string :collect-rogue-tags t)
    (declare (ignorable res))
    (parse-html string :no-body-tags rogues)))
--> (:this :is (:some (:nonsense)) :input)
See the descriptions of the collect-rogue-tags and no-body-tags keyword arguments to parse-html in the reference section below for more information.
As per the HTML 4.0 specification, attributes without specified values are given a lower case string value that matches the attribute name.
For example,
(parse-html "<P here ARE some attributes>") --> (((:p :here "here" :are "are" :some "some" :attributes "attributes")))
Existing HTML pages often have character format tags that are interleaved among other tags. Such interleaving is removed in a manner consistent with the HTML 4.0 specification.
For example,
(parse-html "<P>Here is <B>bold text<P>that spans</B>two paragraphs") --> ((:p "Here is " (:b "bold text")) (:p (:b "that spans") "two paragraphs"))
Arguments: input-source &key callbacks callback-only collect-rogue-tags no-body-tags
Returns LHTML output, as described above in this document.
Methods specializing on the required argument are defined for stream (reads input from the stream) and string (parses the argument string).
parse-html (p stream) &key callbacks callback-only collect-rogue-tags no-body-tags
parse-html (str string) &key callbacks callback-only collect-rogue-tags no-body-tags
parse-html (file t) &key callbacks callback-only collect-rogue-tags no-body-tags
The t method assumes the argument is a pathname suitable for use with the with-open-file macro.
Arguments: stream read-sequence-func callback-only callbacks collect-rogue-tags no-body-tags
This function may be used when more control is needed for supplying the HTML input. The read-sequence-func argument, if non-nil, should be a function object or a symbol naming a function. When phtml-internal requires another buffer of HTML input, it will invoke the read-sequence-func function with two arguments: the first argument is an internal buffer character array and the second argument is the phtml-internal stream argument. If read-sequence-func is nil, phtml-internal will invoke read-sequence to fill the buffer. The read-sequence-func function must return the number of character array elements successfully stored in the buffer.
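A function meeting this contract might look like the following minimal sketch in portable Common Lisp. The names *chars-read* and counting-read-sequence are hypothetical; only the two-argument calling convention and the return value come from the description above.

```lisp
;; Hypothetical read-sequence-func: fills the buffer from the stream and
;; keeps a running total of the characters supplied. Illustrative only.

(defvar *chars-read* 0
  "Running total of characters stored into buffers so far.")

(defun counting-read-sequence (buffer stream)
  "Fill BUFFER from STREAM and return the number of elements stored."
  ;; READ-SEQUENCE returns the index of the first element it did not
  ;; update; starting from index 0, that is the number of elements stored.
  (let ((count (read-sequence buffer stream)))
    (incf *chars-read* count)
    count))
```

The function can be exercised on its own, without the parser: (with-input-from-string (s "<P>hi</P>") (counting-read-sequence (make-string 4) s)) returns 4. Given the argument list above, such a function would presumably be supplied as the second argument to phtml-internal.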
Copyright (c) 2000 by Franz Inc. All rights reserved.