A Lisp Based XML Parser

This is a preliminary document which will be updated over time. Updates will be available for downloading.

Introduction and a simple example
LXML parse output format
parse-xml non-validating parser properties
Usage notes
   parse-xml requires Modern Lisp's mixed case and international character support
   parse-xml and packages
   parse-xml, the XML Namespace specification, and packages
   The parse-xml function has been tested
      ACL does not support Unicode 4 byte scalar values
   debugging aids
XML Conformance test results
Compiling and Loading the parser
parse-xml reference
*debug-xml*
*debug-dtd*

Introduction and a simple example

The parse-xml generic function processes XML input, returning a list of XML tags, attributes, and text. Here is a simple example:

(parse-xml "<item1><item2 att1='one'/>this is some text</item1>")

-->((item1 ((item2 att1 "one")) "this is some text"))

The output format is known as LXML (Lisp XML) format.

LXML Format

LXML is a list representation of XML tags and content.

Each list member may be:

  1. a string containing text content, such as "Here is some text with a "
  2. a list representing an XML tag with associated attributes and/or content,
    such as (item1 "text") or ((item1 att1 "help.html") "link").
  3. XML comments and/or processing instructions; see the more detailed example below for
    further information.

More on members of type 2: if the XML tag has no associated attributes, the first list member is a symbol representing the XML tag, and the remaining members represent the content. Each content member can be a string (text content), a symbol (an XML tag with no attributes or content), or a list (a nested XML tag with associated attributes and/or content). If the tag has associated attributes, the first list member is instead a list containing the tag symbol followed by two members for each attribute: a symbol representing the attribute name, then a string holding the attribute value.
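The two element shapes described above can be handled uniformly with a few small accessors. The following is a minimal sketch in portable Common Lisp; the names lxml-tag, lxml-attributes and lxml-children are hypothetical helpers for illustration and are not part of the parser. They expect an element (a symbol or a list), not a text string.

```lisp
;; Hypothetical helpers for illustration; not part of the parser's API.

(defun lxml-tag (element)
  "Return the tag symbol of an LXML element."
  (cond ((symbolp element) element)            ; tag with no attributes or content
        ((consp (car element)) (caar element)) ; ((tag attr "value" ...) content ...)
        (t (car element))))                    ; (tag content ...)

(defun lxml-attributes (element)
  "Return the attributes of ELEMENT as a property list, or nil if there are none."
  (if (and (consp element) (consp (car element)))
      (cdar element)
      nil))

(defun lxml-children (element)
  "Return the content members of ELEMENT, or nil if there are none."
  (if (consp element)
      (cdr element)
      nil))
```

Because attributes are stored as alternating name and value members, the attribute list can be read with getf, for example (getf (lxml-attributes element) 'att1).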

Non Validating Parser Properties

parse-xml is a non-validating XML parser. It will detect non-well-formed XML input. When processing valid XML input, parse-xml will optionally produce the same output as a validating parser would, including the processing of an external DTD subset and external entity declarations.

By default, parse-xml outputs a DTD parse along with the parsed XML contents. The DTD parse may be optionally suppressed. The following example shows DTD parsed output components:

(defvar *xml-example-external-url*
   "<!ENTITY ext1 'this is some external entity %param1;'>")

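(defun example-callback (var-name token &optional public)
  (declare (ignorable token public))
  ;; var-name arrives as a URI object; reduce it to its path string
  (setf var-name (uri-path var-name))
  (if* (equal var-name "null") then nil
    else
      ;; the path names a variable in the user package whose value
      ;; is the external entity text
      (let ((string (eval (intern var-name (find-package :user)))))
        (make-string-input-stream string))))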

(defvar *xml-example-string*
"<?xml version='1.0' encoding='utf-8'?>
<!-- the following XML input is well-formed but its validity has not been checked ... -->
<?piexample this is an example processing instruction tag ?>
<!DOCTYPE example SYSTEM '*xml-example-external-url*' [
   <!ELEMENT item1 (item2* | (item3+ , item4))>
   <!ELEMENT item2 ANY>
   <!ELEMENT item3 (#PCDATA)>
   <!ELEMENT item4 (#PCDATA)>
   <!ATTLIST item1
      att1 CDATA #FIXED 'att1-default'
      att2 ID #REQUIRED
      att3 ( one | two | three ) 'one'
      att4 NOTATION ( four | five ) 'four' >
   <!ENTITY % param1 'text'>
   <!ENTITY nentity SYSTEM 'null' NDATA somedata>
   <!NOTATION notation SYSTEM 'notation-processor'>
   ]>
<item1 att2='1'><item3>&ext1;</item3></item1>")

(pprint (parse-xml *xml-example-string* :external-callback 'example-callback))

-->

((:xml :version "1.0" :encoding "utf-8")
  (:comment " the following XML input is well-formed but its validity has not been checked ... ")
  (:pi :piexample "this is an example processing instruction tag ")
  (:DOCTYPE :example
    (:[ (:ELEMENT :item1 (:choice (:* :item2) (:seq (:+ :item3) :item4))) 
        (:ELEMENT :item2 :ANY)
        (:ELEMENT :item3 :PCDATA) (:ELEMENT :item4 :PCDATA)
        (:ATTLIST item1 (att1 :CDATA :FIXED "att1-default") (att2 :ID :REQUIRED)
             (att3 (:enumeration :one :two :three) "one") 
             (att4 (:NOTATION :four :five) "four"))
        (:ENTITY :param1 :param "text") 
        (:ENTITY :nentity :SYSTEM "null" :NDATA :somedata)
        (:NOTATION :notation :SYSTEM "notation-processor"))
    (:external (:ENTITY :ext1 "this is some external entity text")))
   ((item1 att1 "att1-default" att2 "1" att3 "one" att4 "four") 
       (item3 "this is some external entity text")))
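In the output above, the declaration, comment, processing instruction and DOCTYPE items all begin with a keyword, while the element content does not. Assuming that pattern holds for the input at hand, the content can be separated from the rest with a small filter. This is a sketch in portable Common Lisp; lxml-content-items is a hypothetical name, not a parser function.

```lisp
;; Hypothetical helper for illustration; not part of the parser's API.
;; Assumption: every top-level item that is not element content is a
;; list whose first member is a keyword, as in the example output.

(defun lxml-content-items (parse-result)
  "Return the members of PARSE-RESULT that are not keyword-headed lists."
  (remove-if (lambda (item)
               (and (consp item) (keywordp (car item))))
             parse-result))
```

Note that this also drops any top-level comments or processing instructions that follow the root element, since they are keyword-headed as well.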

Usage Notes

The usage notes cover the following topics:

   1. parse-xml requires Modern Lisp's mixed case and international character support
   2. parse-xml and packages
   3. parse-xml, the XML Namespace specification, and packages
   4. The parse-xml function has been tested
         ACL does not support Unicode 4 byte scalar values
   5. debugging aids

  1. The parse-xml function has been compiled and tested only in a modern (case-sensitive) ACL. Its successful operation depends on both the mixed case support and wide character support found in modern ACL.
  2. The parser uses the keyword package for DTD tokens and other special XML tokens. Since element and attribute token symbols are usually interned in the current package, it is not recommended to execute parse-xml when the current package is the keyword package.
  3. The XML parser supports the XML Namespaces specification. The parser recognizes an "xmlns" attribute and attribute names starting with "xmlns:". As per the specification, the parser expects the associated value to be a URI string. The parser then associates each XML Namespace prefix with a Lisp package, either one provided via the parse-xml :uri-to-package option or, if necessary, one created on the fly. The following example demonstrates this behavior:
(defvar *xml-example-string4*
   "<bibliography
         xmlns:bib='http://www.bibliography.org/XML/bib.ns'
         xmlns='urn:royal-mail.gov.uk/XML/ns/postal.ns,1999'>
      <bib:book owner='Smith'>
         <bib:title>A Tale of Two Cities</bib:title>
         <bib:bibliography
              xmlns:bib='http://www.franz.com/XML/bib.ns'
              xmlns='urn:royal-mail2.gov.uk/XML/ns/postal.ns,1999'>
            <bib:library branch='Main'>UK Library</bib:library>
            <bib:date calendar='Julian'>1999</bib:date>
            </bib:bibliography>
         <bib:date calendar='Julian'>1999</bib:date>
         </bib:book>
    </bibliography>")

(defvar *uri-to-package* nil)
(setf *uri-to-package*
    (acons (parse-uri "http://www.bibliography.org/XML/bib.ns")
        (make-package "bib") *uri-to-package*))
(setf *uri-to-package*
    (acons (parse-uri "urn:royal-mail.gov.uk/XML/ns/postal.ns,1999")
        (make-package "royal") *uri-to-package*))
(setf *uri-to-package*
    (acons (parse-uri "http://www.franz.com/XML/bib.ns")
        (make-package "franz-ns") *uri-to-package*))
(pprint (multiple-value-list
         (parse-xml *xml-example-string4*
                    :uri-to-package *uri-to-package*)))

-->

((((bibliography |xmlns:bib| "http://www.bibliography.org/XML/bib.ns" xmlns
      "urn:royal-mail.gov.uk/XML/ns/postal.ns,1999")
      "
      "
     ((bib::book royal::owner "Smith") "
       " (bib::title "A Tale of Two Cities") "
       "
        ((bib::bibliography royal::|xmlns:bib| "http://www.franz.com/XML/bib.ns" royal::xmlns
            "urn:royal-mail2.gov.uk/XML/ns/postal.ns,1999")
          "
          " ((franz-ns::library net.xml.namespace.0::branch "Main") "UK Library") "
          " ((franz-ns::date net.xml.namespace.0::calendar "Julian") "1999") "
          ")
        "
        " ((bib::date royal::calendar "Julian") "1999") "
        ")
    "
    "))
((#<uri urn:royal-mail2.gov.ukXML/ns/postal.ns,1999> . #<The net.xml.namespace.0 package>)
(#<uri http://www.franz.com/XML/bib.ns> . #<The franz-ns package>)
(#<uri urn:royal-mail.gov.ukXML/ns/postal.ns,1999> . #<The royal package>)
(#<uri http://www.bibliography.org/XML/bib.ns> . #<The bib package>)))

In the absence of XML Namespace attributes, element and attribute symbols are interned in the current package. Note that this implies that attributes and elements referenced in DTD content will be interned in the current package.
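The namespace example above calls make-package directly, which signals an error if the package already exists, for instance when the example is evaluated a second time. The following sketch in portable Common Lisp shows one way to build the association list so that it can be re-evaluated. Both function names are hypothetical; the URI parsing function is passed in as an argument so that the sketch itself does not depend on ACL (in ACL it would presumably be #'parse-uri, as used above).

```lisp
;; Hypothetical helpers for illustration; not part of the parser's API.

(defun ensure-package (name)
  "Return the package named NAME, creating it if it does not exist yet."
  (or (find-package name)
      (make-package name)))

(defun make-uri-to-package-alist (uri-parser specs)
  "Build an alist of (parsed-uri . package) conses.
SPECS is a list of (uri-string package-name) lists; URI-PARSER is a
function that converts a URI string to the object the parser expects."
  (mapcar (lambda (spec)
            (destructuring-bind (uri-string package-name) spec
              (cons (funcall uri-parser uri-string)
                    (ensure-package package-name))))
          specs))

;; Possible use in ACL (assumes parse-uri is available):
;; (make-uri-to-package-alist
;;   #'parse-uri
;;   '(("http://www.bibliography.org/XML/bib.ns" "bib")
;;     ("http://www.franz.com/XML/bib.ns" "franz-ns")))
```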

  4. The parse-xml function has been tested using the OASIS conformance test suite (see details below). The test suite has wide coverage across possible XML and DTD syntax, but there may be some syntax paths that have not yet been tested or are not completely supported. Here is a list of currently known syntax parsing issues:
       1. ACL does not support 4 byte Unicode scalar values, so input containing such data will not be processed correctly. (Note, however, that parse-xml does correctly detect and process wide Unicode input.)
       2. An initial <?xml declaration in external entity files is skipped without a check being made to see if the <?xml declaration is itself incorrect.
  5. When investigating possible parser errors or examining more closely where the parser determined that the input was non-well-formed, the net.xml.parser internal symbols *debug-xml* and *debug-dtd* are useful. When not bound to nil, these variables cause lexical analysis and intermediate parsing results to be output to *standard-output*.
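Because the two debugging variables are internal symbols, one way to switch them on for a single call is to look them up by name and bind them dynamically. The sketch below is an illustration only: call-with-parser-debugging is a hypothetical helper, the package name "net.xml.parser" is taken from the text above, and the lowercase symbol names assume a modern (case-sensitive) ACL. If the package or symbols are not found, the function simply calls its argument without any bindings.

```lisp
;; Hypothetical helper for illustration; not part of the parser's API.

(defun call-with-parser-debugging (thunk)
  "Call THUNK with *debug-xml* and *debug-dtd* bound to t, if they exist."
  (let* ((pkg (find-package "net.xml.parser"))
         ;; collect whichever of the two internal symbols can be found
         (vars (and pkg
                    (remove nil
                            (list (find-symbol "*debug-xml*" pkg)
                                  (find-symbol "*debug-dtd*" pkg))))))
    ;; progv establishes dynamic bindings for symbols chosen at run time
    (progv vars (mapcar (constantly t) vars)
      (funcall thunk))))

;; Possible use once the parser is loaded:
;; (call-with-parser-debugging
;;   (lambda () (parse-xml "<item1>text</item1>")))
```

The bindings last only for the duration of the call, so debugging output does not remain enabled for later parses.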

XML Conformance Test Suite

Using the OASIS test suite (http://www.oasis-open.org), here are the current parse-xml results:

Compiling and Loading

Loading build.cl into a modern ACL session produces a pxml.fasl file that can subsequently be loaded into a modern ACL to provide XML parsing functionality.

parse-xml reference

parse-xml            [Generic function]

Arguments: input-source &key external-callback content-only general-entities parameter-entities uri-to-package

This generic function returns multiple values:

  1. LXML and parsed DTD output, as described above in this document.
  2. An association list containing the uri-to-package argument conses (if any) and conses associated with any XML Namespace packages created during the parse (see uri-to-package argument description, below).

The arguments and their effects are:

(defun file-callback (uri-object token &optional public)
  ;; the uri-object is an ACL URI object created from
  ;; the XML input. In this example, this function
  ;; assumes that all URIs will be file specifications.
  ;;
  ;; the token argument identifies what token is associated
  ;; with the external parse (for example, :DOCTYPE for the
  ;; external DTD subset)
  ;;
  ;; the public argument contains the associated PUBLIC string,
  ;; when present
  ;;
  (declare (ignorable token public))
  ;; an open stream is returned on success
  ;; a nil return value indicates that the external
  ;; parse should not occur
  ;; Note that parse-xml will close the open stream before
  ;; exiting
  (ignore-errors (open (uri-path uri-object))))

parse-xml Methods

(parse-xml (p stream) &key external-callback content-only 
            general-entities parameter-entities
            uri-to-package)

(parse-xml (str string) &key external-callback content-only 
            general-entities parameter-entities
            uri-to-package)

An easy way to parse a file containing XML input:

(with-open-file (p "example.xml")
    (parse-xml p :content-only t))

net.xml.parser unexported special variables:

*debug-xml*

When not bound to nil, generates XML lexical state and intermediary parse result debugging output.

*debug-dtd*

When not bound to nil, generates DTD lexical state and intermediary parse result debugging output.

Copyright (c) 2000 by Franz Inc. All rights reserved.