This document briefly presents a high-level integration of XML into Objective-Caml, providing XML transducers based on a tree view. It gives a homogeneous view of XML within Objective-Caml code and lets users manipulate XML terms like any other terms, without constraint.
Keywords: Language extension, XML transducers and Embedded XML
XML tools for Objective-Caml are currently based on two standards. The first, SAX, is event based. The second, DOM, is tree based. Both offer only low-level management of XML instances.
In contrast, our approach, called OX, provides expressive features for XML manipulation. OX is a simple macro language that can be seen as an extension of Objective-Caml, adding an expressive layer dedicated to XML manipulation. XML manipulation here means decomposition and composition:
Decomposition requires a simple mechanism for extracting information even when the instance is complex. To provide such a mechanism we introduce a well-known concept from functional programming languages: pattern matching.
Composition is the ability to build XML terms easily. Languages such as PHP and JSP are essentially based on this concept, providing a language embedded in HTML. In both cases XML is the main language, so the constructions are intrinsic. Our approach is slightly different: we take Objective-Caml as the reference language and enrich it with XML. Nevertheless, as we will see, the methodology is in fact the same, except that in OX the syntax itself guarantees that XML terms are well formed, whereas in PHP or JSP this condition is not statically verified.
An OX program is an Objective-Caml program plus:
expr +:=
  | "XML" "{" xml "}"
  | "XML" "PATTERN" "{" pattern "}"
  | "XML" "match" xml "with" ("|" pattern* -> ocaml)*
The following grammar defines XML term syntax in Objective-Caml.
xml ::=
  | STRING
  | "<" name attribute* ">" xml* "</>"
  | LIDENT
  | "{" expr "}"
name =
  | (sname ":")? sname
sname ::=
  | XMLIDENT
  | "{" expr "}"
attribute ::=
  | name "=" xml
An XML term can be a string, a tag, a variable, or an Objective-Caml expression enclosed in braces.
Examples:
let xml_string = XML{ "Simple XML string" }
let mkHtml head body = XML{ <HTML> head body </> }
let mkHead head = XML{ <HEAD> head </> }
let mkBody bgc content = XML{ <BODY bgcolor=bgc> content </> }
let _ = mkHtml (mkHead XML{}) (mkBody XML{"#FFFFFF"} XML{<P>"simple html content"</>})
This code produces the following XML fragment:
<HTML><HEAD/><BODY bgcolor='#FFFFFF'><P>simple html content</P></BODY></HTML>
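For readers without an OX toolchain, the same composition can be approximated in plain Objective-Caml. The sketch below is only an illustration: the type xml, the serializer to_string and the helpers mk_html, mk_head and mk_body are hypothetical names, not part of OX, and no character escaping is performed.

```ocaml
(* Hypothetical tree view of XML terms; not OX's actual representation. *)
type xml =
  | Text of string
  | Elem of string * (string * string) list * xml list

(* Serialize a term with single-quoted attributes and <T/> for empty tags.
   This sketch does not escape special characters. *)
let rec to_string = function
  | Text s -> s
  | Elem (tag, attrs, children) ->
    let attrs =
      String.concat ""
        (List.map (fun (k, v) -> Printf.sprintf " %s='%s'" k v) attrs)
    in
    (match children with
     | [] -> Printf.sprintf "<%s%s/>" tag attrs
     | _ ->
       Printf.sprintf "<%s%s>%s</%s>" tag attrs
         (String.concat "" (List.map to_string children)) tag)

(* Plain-function counterparts of mkHtml, mkHead and mkBody. *)
let mk_html head body = Elem ("HTML", [], [head; body])
let mk_head content = Elem ("HEAD", [], content)
let mk_body bgc content = Elem ("BODY", [("bgcolor", bgc)], content)

let page =
  mk_html (mk_head [])
    (mk_body "#FFFFFF" [Elem ("P", [], [Text "simple html content"])])

let () = print_endline (to_string page)
```

As in OX, a term built this way is well formed by construction, because closing tags are generated rather than written by hand.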
pattern =
  | "CDATA" (1)
  | STRING (2)
  | "<" (spname ":")? spname pattribute* (as LIDENT)? ">" pattern* "</>" (3)
  | "_" (4)
  | pattern as LIDENT (5)
  | pattern "|" pattern (6)
  | pattern ("*"|"+"|"?") (7)
  | "(" pattern+ ")" (8)
spname =
  | XMLIDENT ("|" XMLIDENT)* (as LIDENT)?
  | "(" XMLIDENT ("|" XMLIDENT)* (as LIDENT)? ")"
  | "_" (as LIDENT)?
pattribute =
  | (pname ":")? pname "=" pattern
pname =
  | XMLIDENT
  | "_"
A pattern can be (1) any string, (2) a given string, (3) a tag, (4) an atom (string or tag), (5) a variable binding, (6) a disjunction, (7) a regular pattern, or (8) a block. Finally a tag name pattern can be a disjunction.
Pattern matching decomposes terms through a set of rules such as:
XML match term with
  | <TagName att1=val1 att2='V2'>_* as content</> -> (1)
  | <_ as name att1=val1 att2='V2'>_* as content</> -> (2)
  | <_ att1=val1 att2='V2'>_* as content</> -> (3)
  | (* empty *) -> (4)
  | CDATA -> (5)
Rules (1), (2) and (3) introduce tag matching with an attribute att2 with a value V2 and possibly an attribute att1. In rule (1) the tag name must be TagName; in rules (2) and (3) the tag name is ignored, but it is unified with name in (2). Rule (4) introduces the empty matching and finally rule (5) introduces the cdata matching.
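The five rules above can be approximated with ordinary Objective-Caml pattern matching over a tree type. This is a rough sketch under stated assumptions: the type xml and the function classify are hypothetical, the matched term is assumed to be a sequence of nodes, and the attribute test on att2 is expressed with a guard because plain patterns cannot look up an association list.

```ocaml
(* Hypothetical tree view of XML terms; not OX's actual representation. *)
type xml =
  | Text of string
  | Elem of string * (string * string) list * xml list

(* Classify a sequence of terms, mirroring rules (1) to (5).
   Rules (2) and (3) differ only in whether the tag name is bound,
   so they are merged into a single case here. *)
let classify (term : xml list) : string =
  match term with
  | [Elem ("TagName", attrs, _content)]
    when List.assoc_opt "att2" attrs = Some "V2" -> "rule 1: TagName"
  | [Elem (name, attrs, _content)]
    when List.assoc_opt "att2" attrs = Some "V2" -> "rule 2/3: " ^ name
  | [] -> "rule 4: empty"
  | [Text _] -> "rule 5: cdata"
  | _ -> "no rule applies"
```

The final catch-all case has no counterpart in the OX rule set; it is added here only so that the function is total.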
In order to provide expressive, reusable and flexible pattern matching we introduce disjunctive patterns.
A disjunction groups a set of possible terms into a single rule. This lets the user naturally group a set of tags that share the same interpretation.
XML match term with
  | <Section|Subsection|Subsubsection as tagname> _* as content </> -> ...
In this example the rule matches a Section, a Subsection or a Subsubsection. To retrieve the tag name that actually matched, an alias is defined by the variable tagname.
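Plain Objective-Caml offers a close analogue through or-patterns combined with an alias. The sketch below assumes a hypothetical tree type xml and a hypothetical function describe; it is meant only to show how the alias recovers the name that matched.

```ocaml
(* Hypothetical tree view of XML terms; not OX's actual representation. *)
type xml =
  | Text of string
  | Elem of string * (string * string) list * xml list

(* One rule handles three sectioning tags; the alias [tagname] keeps the
   name that actually matched, like "as tagname" in the OX rule. *)
let describe = function
  | Elem (("Section" | "Subsection" | "Subsubsection") as tagname, _, content) ->
    Printf.sprintf "%s with %d children" tagname (List.length content)
  | Elem (other, _, _) -> "unhandled tag " ^ other
  | Text _ -> "text"
```

For instance, describe (Elem ("Subsection", [], [Text "a"; Text "b"])) yields the string "Subsection with 2 children".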
The main difference from conventional tools is that the sequence itself is explicit, which lets the user manage this construction directly.
XML match term with
  | head -> (1)
  | head tail -> (2)
  | head _? as tail -> (3)
  | head _+ as tail -> (4)
  | head (_|_*) as tail -> (5)
  | head _* as tail -> (6)
Such a structure is decomposed from left to right. These patterns also match atomic structures, unifying tail with the empty term where specified, i.e. in cases (3) and (6). The main problem with this approach is the risk of an infinite loop when a pattern matching is partial, which produces a stack overflow exception at run time.
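A left-to-right decomposition of this kind resembles head/tail recursion over a list. The following sketch assumes a hypothetical tree type xml and a hypothetical function text_of; it is not OX code. Every shape of sequence is covered and each recursive call receives a strictly smaller term, so the traversal terminates.

```ocaml
(* Hypothetical tree view of XML terms; not OX's actual representation. *)
type xml =
  | Text of string
  | Elem of string * (string * string) list * xml list

(* Concatenate all character data, walking the sequence left to right. *)
let rec text_of (seq : xml list) : string =
  match seq with
  | [] -> ""                                (* empty sequence *)
  | Text s :: tail -> s ^ text_of tail      (* head is character data *)
  | Elem (_, _, content) :: tail ->         (* head is a tag: descend, then go on *)
    text_of content ^ text_of tail
```

Because the match is total, the Objective-Caml compiler reports no missing case; omitting a case would trigger its non-exhaustiveness warning at compile time.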
Sequence matching using regular patterns is introduced, as in XDuce, allowing full control over decomposition. Thus the following code
XML match term with
  _* as pre <table>table</> _* as post -> ...
matches sequences containing at least one table tag.
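The effect of this rule can be imitated in plain Objective-Caml by scanning the sequence for a table tag. The sketch below is an approximation: the type xml and the function split_at_table are hypothetical, and splitting at the first table is an assumption, since the text does not say which occurrence OX selects when several are present.

```ocaml
(* Hypothetical tree view of XML terms; not OX's actual representation. *)
type xml =
  | Text of string
  | Elem of string * (string * string) list * xml list

(* Return Some (pre, table_content, post) for the first <table> found in
   the sequence, or None when the sequence contains no table tag. *)
let split_at_table (seq : xml list) : (xml list * xml list * xml list) option =
  let rec go pre = function
    | [] -> None
    | Elem ("table", _, content) :: post -> Some (List.rev pre, content, post)
    | x :: rest -> go (x :: pre) rest
  in
  go [] seq
```

Returning an option makes the failing case explicit, which plays the role of the rule simply not matching in OX.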
This approach gives a homogeneous way to manipulate XML directly inside programming languages such as Objective-Caml. Pattern matching provides pattern-based information extraction, while embedded XML provides proper construction.