This page describes the XML parser in the library. The parser takes its source input as a string, and produces as output a hierarchical tree structure representing the document. There are also formatter classes that take the document structure and produce a string representation of the document as output.
Strings, rather than files, are used as input and output for two reasons. Firstly, it is more flexible - a file can be turned into a string easily, but not vice-versa (without using a temporary file). Secondly, it allows the parser to use the built-in string scanning functions, which increase parse speed. The downside of using strings is that the whole document has to be read into memory prior to parsing, so the parser may not be suitable for very large documents.
The parser is contained in the class xml.XmlParser, which has a parse method for parsing a string. It returns an xml.XmlDocument object, or fails if the input was not well-formed. Here is an example program.
import
   io(stop, write),
   lang(to_string),
   xml(XmlParser)
procedure main()
   local p, s, d
   p := XmlParser()
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?><simple></simple>"
   d := p.parse(s) | stop("Couldn't parse:", &why)
   write(to_string(d, 3))
end
If you have a file you need to parse, just load it into a string first using the io.Files.file_to_string() method.
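As a minimal sketch of that approach, the program below reads a file named on the command line into a string and parses it. It uses only the calls mentioned on this page; the command-line handling and the assumption that file_to_string() fails and sets &why on error are illustrative, not taken from the library documentation.

```icon
import
   io(Files, stop, write),
   lang(to_string),
   xml(XmlParser)

procedure main(a)
   local p, s, d
   # Assumed usage: exactly one argument, the path of the XML file
   *a = 1 | stop("Usage: parsefile FILE")
   # Load the whole file into a string (assumed to fail, setting &why, on error)
   s := Files.file_to_string(a[1]) | stop("Couldn't read file:", &why)
   p := XmlParser()
   d := p.parse(s) | stop("Couldn't parse:", &why)
   write(to_string(d, 3))
end
```

The program name "parsefile" in the usage message is hypothetical.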
XmlDocument and XmlElement classes
As noted above, the parser returns an XmlDocument instance. XmlDocument has methods to inspect the result of the parsing. Most notably, the method get_root_element() will return the xml.XmlElement which is the root of the element structure :-
e := d.get_root_element()
The XmlElement class has methods which enable you to search the children of the element, and to inspect the element attributes. For example, say that the element e represented the following structure :-
<top a1="val1" a2="val2">
<inner a3="val3">
Some text
</inner>
</top>
Then
e.get_name()                        # "top"
e.get_attribute("a1")               # "val1"
e.get_attribute("a2")               # "val2"
e.get_attribute("absent")           # fails
f := e.search_children("inner")     # sets f to another XmlElement, representing
                                    # inner. If there were several inner elements,
                                    # it would suspend them in sequence.
e.search_children("absent")         # fails
f.get_string_content()              # " Some text "
f.get_trimmed_string_content()      # "Some text"
Please see the API documentation for more details.
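Because search_children() suspends each matching child in turn, it can be driven with every to visit all of them. The following is a small sketch under that assumption, using only the methods shown above; the document content is made up for illustration.

```icon
import
   io(stop, write),
   xml(XmlParser)

procedure main()
   local p, d, e, f
   p := XmlParser()
   # Illustrative document with two <inner> children
   d := p.parse("<?xml version=\"1.0\"?>_
                 <top a1=\"val1\">_
                 <inner>First</inner>_
                 <inner>Second</inner>_
                 </top>") | stop("Couldn't parse:", &why)
   e := d.get_root_element()
   write(e.get_name(), " a1=", e.get_attribute("a1"))
   # Visit every <inner> child, in document order
   every f := e.search_children("inner") do
      write(f.get_trimmed_string_content())
end
```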
The parser fires three types of event during parsing :-
XmlParser.WARNING_EVENT - a warning message
XmlParser.VALIDITY_ERROR_EVENT - a message indicating a validity error
XmlParser.FATAL_ERROR_EVENT - a message indicating a fatal error, i.e. the string being parsed is not a well-formed XML document. Only one such message is ever fired.
A fatal error will mean that the parse method will fail. A validity error won’t cause the parse method to fail because a well-formed XML document can still be constructed and used. However, the parser will count the number of validity errors, and this count can be accessed in the XmlDocument’s validity_errors field. Warnings can safely be ignored - they just indicate that the source document could be improved in certain respects.
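A caller that cares about validity, and not just well-formedness, can check the count after a successful parse. The sketch below assumes validity_errors is readable as an ordinary integer field of the returned document, as described above.

```icon
import
   io(stop, write),
   xml(XmlParser)

procedure main()
   local p, s, d
   p := XmlParser()
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?><simple></simple>"
   # A fatal error makes parse() fail; validity errors do not
   d := p.parse(s) | stop("Couldn't parse:", &why)
   if d.validity_errors > 0 then
      write("Parsed, but with ", d.validity_errors, " validity error(s)")
   else
      write("Parsed with no validity errors")
end
```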
Here is an example that listens for and prints out events from the parser.
import
   io(stop, write),
   lang(to_string),
   xml(XmlParser)
# Event listener: prints the event type followed by the event's detail object
procedure eh(p, s, type)
   write(type, ":", to_string(p, 3))
end
procedure main()
   local p, s, d
   p := XmlParser()
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?>_
         <top a1=\"val1\" a2=\"val2\">_
         <inner a3=\"val3\">_
         Some text _
         </inner>_
         </top"      # the final '>' is deliberately missing, to provoke a fatal error
   p.connect(eh)
   d := p.parse(s) | stop("Couldn't parse:", &why)
   write("Successfully parsed")
end
The output is (in part) :-
validity error:object xml.ProblemDetail#1(
stack=
list#12[
object xml.Diversion#3(id="input";subject="<?xml version ... </top";pos=63)
]
msg="top has attributes but none were declared"
)
... more validity errors ...
fatal error:object xml.ProblemDetail#4(
stack=
list#106[
object xml.Diversion#7(id="input";subject="<?xml version ... </top";pos=107)
]
msg="'>' expected"
)
Couldn't parse:'>' expected
Note how each message detail is encapsulated in an xml.ProblemDetail instance.
The parser’s validation process can be turned off if desired, using the set_validate() method. This will increase parser speed, but may affect the result in terms of whitespace (see below).
During parsing, the parser sometimes needs to resolve external entities. Typically this is when an external DTD needs to be loaded, as in the doctype declaration
<!DOCTYPE web-app
PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN"
"http://java.sun.com/j2ee/dtds/web-app_2_2.dtd">
In order to resolve this, and obtain the external data, the parser uses xml.Resolver. A custom resolver class can be used as follows :-
import xml
class MyResolver(Resolver)
   public override resolve(base, external_id)
      local s, t
      s := external_id.get_public_id()   # eg -//Sun Microsystems, Inc.//DTD Web Application 2.2//EN
      t := external_id.get_system_id()   # eg http://java.sun.com/j2ee/dtds/web-app_2_2.dtd
      # Do something with t and s, to return a string representing the external entity
   end
end
...
p.set_resolver(MyResolver())
The parser uses a default resolver, xml.DefaultResolver, which should be sufficient for normal purposes. It resolves system ids beginning with file:// and http:// locally and over the network respectively. If the system id doesn’t begin with either of those strings, then it is treated as a local file path.
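One reason to replace the default resolver might be to avoid network access. The following is only a sketch of that idea: the class name, the local path "dtds/web-app_2_2.dtd", and the assumption that a resolver signals an unresolvable entity by failing are all hypothetical and should be checked against the API documentation.

```icon
import io(Files), xml

# Hypothetical resolver which serves one known DTD from a local copy
class LocalDtdResolver(Resolver)
   public override resolve(base, external_id)
      if external_id.get_system_id() == "http://java.sun.com/j2ee/dtds/web-app_2_2.dtd" then
         # Assumed location of a local copy of the DTD
         return Files.file_to_string("dtds/web-app_2_2.dtd")
      # Anything else falls through, so the method fails (assumed to mean "not resolved")
   end
end

procedure main()
   local p
   p := XmlParser()
   p.set_resolver(LocalDtdResolver())
   # ... parse documents with p as usual
end
```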
Output of document structures is done using a set of formatter classes. For XML documents, use the xml.XmlFormatter class :-
r := RamStream()       # A Stream to capture the result
f := XmlFormatter(r)   # Create a formatter outputting to the stream
f.format(doc)          # doc is the XmlDocument to format
s := r.done()          # Get the string from the stream
write(s)
The formatter can use any io.Stream as its output destination; an opened file would be a typical alternative to the RamStream used above. The default output destination, if no parameter is given to the formatter’s constructor, is FileStream.stdout.
Various options can be set on the formatter; for example
f.set_text_trim()
f.set_indent(3)
will output all the text content with whitespace trimmed, and all elements formatted with an indent of 3 spaces.
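Putting the pieces together, a complete parse-and-reformat round trip might look like the sketch below. It combines only the calls shown on this page; the input document is made up, and the exact layout of the output depends on the formatter, so none is shown here.

```icon
import
   io(RamStream, stop, write),
   xml(XmlParser, XmlFormatter)

procedure main()
   local p, d, r, f
   p := XmlParser()
   # Illustrative input with untidy whitespace in the text content
   d := p.parse("<?xml version=\"1.0\"?><top><inner>  Some text  </inner></top>") |
      stop("Couldn't parse:", &why)
   r := RamStream()        # capture the formatted output in memory
   f := XmlFormatter(r)
   f.set_text_trim()       # trim whitespace around text content
   f.set_indent(3)         # indent nested elements by 3 spaces
   f.format(d)
   write(r.done())
end
```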
Namespaces are fully supported as a post-processing step to normal parsing. This is on by default, but can be turned off by using p.set_do_namespaces(&no). Assuming namespaces are being processed, then the XmlElement class has extra methods which can be used to find child elements and attributes based on the global name, which is a pair of a URI and a local name. A global name is represented by an xml.GlobalName instance, which can be created with something like :-
gn := GlobalName("Local", "http://schemas.xmlsoap.org/soap/envelope/")
This global name can then be used to select the element “Prefix:Local” in the following example :-
<parent xmlns:Prefix="http://schemas.xmlsoap.org/soap/envelope/">
<Prefix:Local attr="123"/>
</parent>
The selection methods for elements and attributes using global names can be found in the XmlElement class. Please see the API docs for full details.
To test the parser, there is a script dotests.sh in the distribution directory which runs about 1700 test documents through the parser. The test documents come from various sources, and fall into one of three categories:-
parse() method to fail. See testnotwf.
testvalid.
testvalid.
There are three or four instances (all from one test suite) where I can’t agree with their definition of what is and isn’t well-formed/invalid. The XML spec can be maddeningly vague in some respects, so I am probably just not interpreting it right.
There are also a very small number of cases where a well formed but invalid document is not reported as invalid. There are no cases where a valid document is reported as invalid, or a well-formed document will not parse.
All the tests which cause the parser problems are commented out, with an explanatory comment, in the dotests.sh file.
testxml is a small program which can be used to see what the parser does with a particular document. Given an input filename, testxml will parse the document and output various sections showing the formatted version of the document, a complete display of the document’s structure, and the document’s constraints read from the DTD.
Just run “testxml -?” for a list of options.
The HTML parser, xml.HtmlParser, uses similar document structures for input and output as the XML parser. However, it is much simpler than the XML parser.
Code to create and use the HTML parser follows a similar form to the example for XML shown above :-
import
   io(write),
   lang(to_string),
   xml(HtmlParser)
procedure main()
   local p, s, d
   p := HtmlParser()
   s := "<html lang=\"en\">_
         Some text _
         <p>_
         Some more_
         </html>"
   d := p.parse(s)
   write(to_string(d, 3))
end
The parse method returns an xml.HtmlDocument object, which can then be inspected as needed.
In contrast to the XML parser, the parse method will never fail. In other words, even if something that isn’t remotely like HTML is given as input, it will still try to make sense of it. This is in recognition of the fact that much HTML out on the Web is malformed! A fussy parser would be of little use.
HtmlDocument and HtmlElement classes
As noted above, the parser returns an HtmlDocument instance. This works in a very similar way to the XmlDocument class used for XML parsing; in fact the two classes share a common base class. The most important method is get_root_element(), which will return the [xml.HtmlElement](libref/index.html?xml.HtmlElement.html) which is the root of the element structure :-
e := d.get_root_element()
HtmlElement is also related to XmlElement by way of a common base class, and they have the same methods for inspecting attributes and child elements. So, for example, say that the element e represented the following structure :-
<html lang="en">
Some html text
<p>
Some more
</html>
Then
e.get_name()                        # "HTML"
e.get_attribute("LANG")             # "en"
f := e.search_children("P")         # sets f to another HtmlElement, representing the <p>.
                                    # This has one child, namely the text content
                                    # between the <p> and the </html>
e.search_children("absent")         # fails
f.get_string_content()              # returns " Some more "
f.get_trimmed_string_content()      # returns "Some more"
Note that all element names and attribute names are converted to upper case during parsing (e.g. “html” -> “HTML”).
The search_tree method of the Element class is very useful if you want to get at an Element deep within the document. Please see the API documentation for more details.
Output of an HTML document is done with the xml.HtmlFormatter class, which again shares a common base class with its XML equivalent, XmlFormatter. For example :-
r := RamStream()       # A Stream to capture the result
f := HtmlFormatter(r)  # Create a formatter outputting to the stream
f.format(d)            # d is the HtmlDocument to format
s := r.done()          # Get the string from the stream
write(s)
testhtml is a small program which can be used to see what the parser does with a particular document. Given an input filename, testhtml will parse the document and output the formatted equivalent, and a complete display of the document’s structure.