This document briefly presents a high-level integration of XML into Objective-Caml, providing XML transducers based on a tree view. This gives a homogeneous view of XML within Objective-Caml code and lets users manipulate XML terms like any other values, without constraint.

Keywords: Language extension, XML transducers and Embedded XML

Introduction

XML tools for Objective-Caml are currently based on two standards. The first one is SAX, which is event based. The second one is DOM, which is tree based. Both offer only low-level management of XML instances.

In contrast, our approach, called OX, provides expressive features for XML manipulation. OX is a simple macro language that can be seen as an extension of Objective-Caml. It brings an expressive level dedicated to XML manipulation. XML manipulation here means decomposition and composition:

Decomposition must offer a simple mechanism for extracting information even when the instance is complex. To provide such a mechanism we introduce a concept well known in functional programming languages: pattern matching.

Composition is the capability of building XML terms easily. Languages such as PHP and JSP are essentially based on this concept: they provide a language embedded in HTML. In both cases XML is the main language, so the constructions are intrinsic. Our approach is slightly different, as we consider Objective-Caml to be the reference language, enriched with XML. Nevertheless, as we will see, the methodology is in fact the same, except that in OX the syntax guarantees that XML terms are well formed, whereas in PHP or JSP this condition is not statically verified.

EBNF like grammar

An OX program is an Objective-Caml program extended with the following expressions:

     
    expr +:=
    | "XML" "{" xml "}"
    | "XML" "PATTERN" "{" pattern "}"
    | "XML" "match" xml "with" ("|" pattern* -> ocaml)*

    

XML term

The following grammar defines the syntax of XML terms in Objective-Caml.

     
    xml ::=
    | STRING
    | "<" name attribute* ">" xml* "</>"
    | LIDENT
    | "{" expr "}"

    name ::=
    | (sname ":")? sname

    sname ::=
    | XMLIDENT
    | "{" expr "}"

    attribute ::=
    | name "=" xml

    

An XML term can be a string, a tag, an Objective-Caml identifier, or an Objective-Caml expression enclosed in braces.

Examples:


     let xml_string         = XML{ "Simple XML string" }
     let mkHtml head body   = XML{ <HTML> head body </> }
     let mkHead head        = XML{ <HEAD> head </> }
     let mkBody bgc content = XML{ <BODY bgcolor=bgc> content </> }
     let _ =
         mkHtml (mkHead XML{}) 
                (mkBody XML{"#FFFFFF"} XML{<P>"simple html content"</>})

    
This code produces the following XML fragment:
     <HTML><HEAD/><BODY bgcolor='#FFFFFF'><P>simple html content</P></BODY></HTML>
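
To make the result concrete without OX itself, the same construction can be sketched in plain Objective-Caml. The `xml` type, the builder names `mk_html`, `mk_head`, `mk_body` and the `to_string` serialiser are assumptions made for illustration; the text does not describe the representation OX actually uses.

```ocaml
(* Hypothetical tree representation; OX's real internal type is not given. *)
type xml =
  | Text of string                                     (* character data *)
  | Tag of string * (string * string) list * xml list  (* name, attributes, children *)

(* Builders mirroring mkHtml, mkHead and mkBody from the example above. *)
let mk_html head body = Tag ("HTML", [], [head; body])
let mk_head content = Tag ("HEAD", [], content)
let mk_body bgc content = Tag ("BODY", [("bgcolor", bgc)], content)

(* Serialise a tree. Empty tags are written as <NAME/>; no escaping is done. *)
let rec to_string = function
  | Text s -> s
  | Tag (name, attrs, children) ->
    let attrs =
      String.concat ""
        (List.map (fun (k, v) -> Printf.sprintf " %s='%s'" k v) attrs)
    in
    (match children with
     | [] -> Printf.sprintf "<%s%s/>" name attrs
     | _ ->
       Printf.sprintf "<%s%s>%s</%s>" name attrs
         (String.concat "" (List.map to_string children))
         name)

let page =
  mk_html (mk_head [])
    (mk_body "#FFFFFF" [Tag ("P", [], [Text "simple html content"])])

let () = print_endline (to_string page)
```

In this sketch nothing prevents building an ill-formed attribute value or tag name at run time, whereas the OX syntax is meant to rule out ill-formed terms statically.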

    

Patterns

       
    pattern =
    | "CDATA"                                                        (1)
    | STRING                                                         (2)
    | "<" (spname ":")? spname pattribute* (as LIDENT)? ">" 
             pattern* 
      "</>"                                                          (3)
    | "_"                                                            (4)
    | pattern as LIDENT                                              (5)
    | pattern "|" pattern                                            (6)
    | pattern ("*"|"+"|"?")                                          (7)
    | "(" pattern+ ")"                                               (8)

    spname =
    | XMLIDENT ("|" XMLIDENT)* (as LIDENT)?
    | "(" XMLIDENT ("|" XMLIDENT)* (as LIDENT)? ")"
    | "_"  (as LIDENT)?

    pattribute = 
    | (pname ":")? pname "=" pattern

    pname =
    | XMLIDENT 
    | "_"

    

A pattern can be (1) any string, (2) a given string, (3) a tag, (4) an atom (string or tag), (5) a variable binding, (6) a disjunction, (7) a regular pattern, or (8) a block. Finally, a tag name pattern can itself be a disjunction.

Matching atomic terms

Pattern matching decomposes terms through a set of rules such as:

      
  XML match term with
  | <TagName att1=val1 att2='V2'>_* as content</> ->   (1)
  | <_ as name att1=val1 att2='V2'>_* as content</> -> (2)
  | <_ att1=val1 att2='V2'>_* as content</> ->         (3)
  | (* empty *) ->                                     (4)
  | CDATA ->                                           (5)

     
Rules (1), (2) and (3) match a tag that has an attribute att2 with value V2 and possibly an attribute att1. In rule (1) the tag name must be TagName; in rules (2) and (3) any tag name is accepted, and rule (2) additionally unifies it with name. Rule (4) matches the empty term, and rule (5) matches character data (CDATA).
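
For comparison, the rules above can be approximated with ordinary Objective-Caml pattern matching. This is only a sketch: the `xml` type and the `describe` function are hypothetical, a term is assumed to be a list of nodes so that the empty term can be expressed, and attribute tests are written as guards because plain patterns cannot look up an attribute by name.

```ocaml
(* Hypothetical tree representation, not the one used by OX. *)
type xml =
  | Text of string
  | Tag of string * (string * string) list * xml list

(* Classify a term, loosely following rules (1), (2)/(3), (4) and (5). *)
let describe (term : xml list) : string =
  match term with
  | [Tag ("TagName", attrs, _)] when List.assoc_opt "att2" attrs = Some "V2" ->
    "TagName with att2='V2'"                 (* rule (1): fixed tag name *)
  | [Tag (name, attrs, _)] when List.assoc_opt "att2" attrs = Some "V2" ->
    "tag " ^ name ^ " with att2='V2'"        (* rules (2)/(3): any tag name *)
  | [] -> "empty"                            (* rule (4): empty term *)
  | [Text _] -> "cdata"                      (* rule (5): character data *)
  | _ -> "no rule applies"
```

The guards make the attribute conditions explicit, which is the bookkeeping the OX tag pattern is designed to hide.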

To make pattern matching expressive, reusable and flexible, we introduce disjunctive patterns.

A disjunction groups a set of alternative terms in a single rule. The user can thus naturally group a set of tags that share the same interpretation.

      
  XML match term with
  | <Section|Subsection|Subsubsection as tagname> _* as content </> ->
        ...

     

In this example the rule matches a Section, a Subsection or a Subsubsection. To retrieve the tag name actually matched, an alias is defined through the variable tagname.
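
Objective-Caml's own or-patterns and `as` aliases behave in the same way on ordinary values, which may help in reading the OX rule. The sketch below assumes a hypothetical `xml` type and a hypothetical function `sectioning`; neither comes from OX.

```ocaml
(* Hypothetical tree representation, not the one used by OX. *)
type xml =
  | Text of string
  | Tag of string * (string * string) list * xml list

(* Return the matched tag name and the number of children for sectioning tags. *)
let sectioning = function
  | Tag (("Section" | "Subsection" | "Subsubsection") as tagname, _, content) ->
    Some (tagname, List.length content)
  | _ -> None
```

As in the OX rule, the alias is the only way to know which alternative of the disjunction was actually matched.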

Matching sequence

The main difference from conventional tools is that the sequence itself is explicit. This lets the user manage this construction directly.

      
    XML match term with
    | head ->                   (1)
    | head tail  ->             (2)
    | head _? as tail ->        (3)
    | head _+ as tail ->        (4)
    | head (_|_*) as tail ->    (5)
    | head _* as tail ->        (6)

     

Such a structure is decomposed from left to right. These patterns also match atomic structures, unifying tail with the empty term where the pattern allows it, i.e. in cases (3) and (6). The main problem with this approach is that a partial pattern matching can introduce an infinite loop, which produces a stack overflow exception at execution time.
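
A left-to-right decomposition of a sequence can be sketched in plain Objective-Caml with list patterns. The `xml` type and the function `text_of` are hypothetical and only illustrate the head/tail style. The match covers the empty sequence explicitly and recurses only on strictly smaller arguments, which is one way to keep such a traversal terminating.

```ocaml
(* Hypothetical tree representation, not the one used by OX. *)
type xml =
  | Text of string
  | Tag of string * (string * string) list * xml list

(* Collect all character data of a sequence, from left to right. *)
let rec text_of (seq : xml list) : string =
  match seq with
  | [] -> ""                                         (* empty sequence *)
  | Text s :: tail -> s ^ text_of tail               (* head is a string *)
  | Tag (_, _, children) :: tail ->                  (* head is a tag *)
    text_of children ^ text_of tail
```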

Sequence matching with regular patterns is introduced, as in XDuce, allowing full control over decomposition. Thus the following code

      
  XML match term with
  | _* as pre <table>table</> _* as post -> ...

     
matches sequences containing at least one table tag.
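
Without regular patterns, the same split has to be programmed by hand. The sketch below does so in plain Objective-Caml, assuming a hypothetical `xml` type and a hypothetical function `split_at_table`. It stops at the first table tag; the text does not say which occurrence OX selects when a sequence contains several, so that choice is an assumption.

```ocaml
(* Hypothetical tree representation, not the one used by OX. *)
type xml =
  | Text of string
  | Tag of string * (string * string) list * xml list

(* Split a sequence around its first <table> tag.
   Returns Some (pre, table content, post), or None if there is no table. *)
let split_at_table (seq : xml list) =
  let rec go pre = function
    | [] -> None
    | Tag ("table", _, content) :: post -> Some (List.rev pre, content, post)
    | x :: rest -> go (x :: pre) rest      (* accumulate the prefix in reverse *)
  in
  go [] seq
```

The explicit accumulator and the final `List.rev` are exactly the plumbing that the single OX rule expresses declaratively.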

Conclusion and Perspectives

Such an approach gives a homogeneous way to manipulate XML directly inside programming languages like Objective-Caml. Pattern matching provides information extraction based on patterns, while embedded XML provides the construction of well-formed terms.