Overview Tutorial

This document briefly presents a high-level integration of XML into Java, providing XML transducers based on a tree view. It gives a homogeneous view of XML within Java code and lets users manipulate XML terms like any other values, without constraint.

Keywords: Language extension and XML transducers

Introduction

Java XML tools are currently based on two standards. The first is SAX, which is event-based. The second, DOM, is tree-based. Both offer only low-level handling of XML instances.

In contrast, our approach, called JEM, provides expressive features for XML manipulation. JEM is a simple macro language that can be seen as a Java extension. It adds an expressive layer dedicated to XML manipulation, based on a decomposition mechanism.

This decomposition provides a simple mechanism for extracting information from complex instances. To provide it, we introduce a well-known concept from functional programming languages: pattern matching.

EBNF-like grammar

A JEM program is a Java program plus dedicated XML pattern matching blocks.

Statements

     
    JavaStatement +:=
        match ( expression ) { 
           (case pattern { JavaStatement })*
           (default { JavaStatement })?
        }
    

Patterns


    pattern =
        spattern (*|+|?)? (as IDENT)? (|? pattern)?
       
    spattern =
        EMPTY                                                        (1)
        ANY                                                          (2)
        CDATA                                                        (3)
        REGEXP? STRING                                               (4)
        < pattname pattribute* IDENT? [ pattern* ]>                  (5)
        ( pattern )                                                  (6)

    pattname =
        XMLIDENT (| XMLIDENT)* (as IDENT)?
        _ (as IDENT)?

    pattribute = 
        XMLIDENT = (REGEXP? STRING | CDATA) (as IDENT)?

    

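A simple pattern can be (1) the empty term, (2) any element (string or tag), (3) a string, (4) a regular-expression or plain string literal, (5) a tag, or (6) a parenthesized block. Patterns can be followed by other patterns to form a sequence, or separated by | to form a disjunction. Finally, a tag name pattern can itself be a disjunction.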

Matching atomic terms

Pattern matching decomposes terms through a set of rules such as:

      
  match (expression) {
     case <TagName att1=val1 att2='V2' [ANY* as content]>  { (1) }
     case <_ as name att1=val1 att2='V2' [ANY* as content]> { (2) }
     case <_ att1=val1 att2='V2' [ANY* as content]>         { (3) }
     case EMPTY                                            { (4) }
     case CDATA                                            { (5) }
  }
     
Rules (1), (2) and (3) match a tag that has an attribute att2 with the value V2 and possibly an attribute att1. In rule (1) the tag name must be TagName; in rules (2) and (3) any tag name is accepted, and in (2) it is unified with the variable name. Rule (4) matches the empty term, and rule (5) matches a string.
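For comparison, here is a minimal sketch of what rule (1) might look like when written by hand against the standard DOM API. It is an illustration only, not the code JEM generates: the class AtomicMatch and the helpers matchRule1 and parseRoot are hypothetical names, and the att1 constraint is left out because the text describes it as optional.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class AtomicMatch {
    // Hand-written counterpart of rule (1): the node must be a tag named
    // TagName whose attribute att2 equals "V2". On success the children
    // play the role of the variable `content`.
    // Assumption: att1 is not checked, since the text calls it optional.
    static Optional<NodeList> matchRule1(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return Optional.empty();
        }
        Element e = (Element) node;
        if (!"TagName".equals(e.getTagName())) {
            return Optional.empty();
        }
        // getAttribute returns "" when the attribute is absent.
        if (!"V2".equals(e.getAttribute("att2"))) {
            return Optional.empty();
        }
        return Optional.of(e.getChildNodes());
    }

    // Hypothetical helper: parses a string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        Element ok = parseRoot("<TagName att1='x' att2='V2'><a/><b/></TagName>");
        Element ko = parseRoot("<Other att2='V2'/>");
        System.out.println(matchRule1(ok).map(NodeList::getLength)); // Optional[2]
        System.out.println(matchRule1(ko).isPresent());              // false
    }
}
```

Each test that the pattern states in one line becomes an explicit check here, which is the kind of low-level handling the match block is meant to hide.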

In order to provide expressive, reusable and flexible pattern matching we introduce disjunctive patterns.

A disjunction groups a set of alternative terms into a single rule. The user can thus naturally group a set of tags and give them the same interpretation.

      
  match (expression) {
     case <Section|Subsection|Subsubsection as tagname [ANY*]> { ... }
  }
     

In this example the rule matches a Section, a Subsection or a Subsubsection. To retrieve the tag name that was actually matched, the alias tagname is bound to it.
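As a rough plain-Java equivalent of the disjunctive rule, the sketch below tests membership in a set of accepted names and returns the matched name, which plays the role of the alias tagname. The class HeadingMatch and its methods are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

class HeadingMatch {
    // The three alternatives of the disjunctive tag name pattern.
    private static final Set<String> NAMES =
            Set.of("Section", "Subsection", "Subsubsection");

    // Returns the actual tag name when it is one of the alternatives,
    // mirroring `as tagname`; returns empty when the rule does not apply.
    static Optional<String> matchHeading(Element e) {
        String tagname = e.getTagName();
        return NAMES.contains(tagname) ? Optional.of(tagname) : Optional.empty();
    }

    // Hypothetical helper: parses a string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchHeading(parseRoot("<Subsection>text</Subsection>"))); // Optional[Subsection]
        System.out.println(matchHeading(parseRoot("<Paragraph/>")));                  // Optional.empty
    }
}
```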

Matching sequence

The main difference from conventional tools is that the sequence itself is explicit, so the user can handle it directly. Extracting information that depends on sibling fragments becomes possible, which provides an expressive layer for information analysis and structural decomposition.

      
    match (expression) {
       case head                    { (1) }
       case head tail               { (2) }
       case head ANY? as tail       { (3) }
       case head ANY+ as tail       { (4) }
       case head (ANY|ANY*) as tail { (5) }
       case head ANY* as tail       { (6) }
       default                      { (7) }
    }
     

Such a structure is decomposed from left to right. These patterns also match atomic structures, unifying tail with the empty term where the pattern allows it, i.e. in cases (3), (5) and (6). Finally, anything not matched by a case is handled by the default branch (7). As with a switch statement, if there is no default, nothing is done in that situation.
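The effect of the quantifiers on tail can be sketched in plain Java over a list of items. This is only a model of the behaviour described above, with sequences simplified to lists of strings; the class HeadTail and the method matchTail are hypothetical, and the bounds (0,1), (1,max) and (0,max) stand for ANY?, ANY+ and ANY* respectively.

```java
import java.util.List;
import java.util.Optional;

class HeadTail {
    // Models `case head <quantified ANY> as tail`: the first item is the
    // head, and the rest is bound to tail if its length fits the bounds.
    // An empty Optional means the case does not match.
    static Optional<List<String>> matchTail(List<String> seq, int minTail, int maxTail) {
        if (seq.isEmpty()) {
            return Optional.empty(); // no item available for the head
        }
        List<String> tail = seq.subList(1, seq.size());
        if (tail.size() < minTail || tail.size() > maxTail) {
            return Optional.empty();
        }
        return Optional.of(tail);
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE;
        // ANY* accepts an atomic structure and binds tail to the empty term.
        System.out.println(matchTail(List.of("a"), 0, max));           // Optional[[]]
        // ANY+ needs at least one item after the head.
        System.out.println(matchTail(List.of("a"), 1, max));           // Optional.empty
        System.out.println(matchTail(List.of("a", "b", "c"), 1, max)); // Optional[[b, c]]
        // ANY? accepts at most one item after the head.
        System.out.println(matchTail(List.of("a", "b", "c"), 0, 1));   // Optional.empty
    }
}
```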

Sequence matching with regular operators is supported, as in XDuce, giving full control over the decomposition. Thus the following code

      
    match (expression) {
       case ANY* as pre <table[ANY*]> ANY* as post { ... }
    }
     
matches sequences containing at least one table tag.
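A hand-written approximation of this rule over a list of tag names might look as follows. The class TableSplit and the method splitAtTable are hypothetical, and the sketch assumes the split happens at the first table; the text does not say which occurrence JEM selects when several are present.

```java
import java.util.List;
import java.util.Optional;

class TableSplit {
    // Result of a successful match: the items before and after the table,
    // corresponding to the variables `pre` and `post`.
    static final class Split {
        final List<String> pre;
        final List<String> post;

        Split(List<String> pre, List<String> post) {
            this.pre = pre;
            this.post = post;
        }
    }

    // Assumption: the first "table" is used as the split point.
    static Optional<Split> splitAtTable(List<String> tags) {
        int i = tags.indexOf("table");
        if (i < 0) {
            return Optional.empty(); // no table tag, so the rule does not match
        }
        return Optional.of(new Split(tags.subList(0, i), tags.subList(i + 1, tags.size())));
    }

    public static void main(String[] args) {
        Split s = splitAtTable(List.of("p", "table", "p", "ul")).get();
        System.out.println(s.pre + " | " + s.post);                       // [p] | [p, ul]
        System.out.println(splitAtTable(List.of("p", "ul")).isPresent()); // false
    }
}
```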

Conclusion and Perspectives

This approach gives a homogeneous way to manipulate XML directly inside programming languages like Java. Matching enables pattern-based information extraction, while embedded XML supports proper construction of XML terms.