Extract Expressions

The extract expression provides a mechanism to locate and extract data from elements in the input XML document stream.

When we define a mapping rule R, we intend it to match an element E in the input stream. An extract expression defined inside R allows R to locate and extract data from E or from E’s descendants.

The extract expression is represented in xfMap by the <extract> element. Its expr attribute holds the value of its expression string:

<extract expr="..."/>

The expression string allows the following to be extracted from the input XML document:

  1. the matched element’s text content – when the expression string is solely: ‘.’
  2. any of the matched element’s attribute values – when the expression string is of the form:
    ‘@’attributeName
  3. any of the matched element’s descendant text content – when the expression string is of the form:
    ‘./’immediateChild(‘/’descendants)+
  4. any of the matched element’s descendant attribute values – when the expression string is of the form:
    ‘./’immediateChild(‘/’descendants)+‘[@’attributeName‘]’

Note: immediateChild and descendants in forms 3 and 4 above are QNames. Therefore, as with match expressions, any prefixes used in these QNames must be bound by namespace declarations on the xfMap’s root element, i.e., the <xfMap> element.
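
As a rough illustration, the four forms might appear as follows inside a mapping rule. The element and attribute names (pfx:order, pfx:customer, pfx:name, id, lang) and the namespace URI are hypothetical and chosen only for this sketch; the surrounding elements of the rule are elided.

<xfMap xmlns:pfx="my-example-uri">
	...
	<mapping match="pfx:order">
		...
			<!-- form 1: text content of the matched pfx:order element -->
			<extract expr="."/>
		...
			<!-- form 2: value of the id attribute of pfx:order -->
			<extract expr="@id"/>
		...
			<!-- form 3: text content of the descendant pfx:customer/pfx:name -->
			<extract expr="./pfx:customer/pfx:name"/>
		...
			<!-- form 4: value of the lang attribute of that same descendant -->
			<extract expr="./pfx:customer/pfx:name[@lang]"/>
	</mapping>
	...
</xfMap>

Note that the pfx prefix used in forms 3 and 4 is bound on the <xfMap> root element, as the note above requires.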

An element may contain multiple child elements with the same name at the same level. In that case, the first such child encountered is the one from which data is extracted. To extract the value of the second, third, or nth child element with the same name, an index may be suffixed to the QName.

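Each immediateChild or descendants QName in the extract expression may be followed by an index, a positive integer enclosed within ‘{’ and ‘}’. The index is not the element’s position among all of its parent’s children; it is the count of elements with that particular name within the parent. For example, ‘{2}’ selects the second child with that name, regardless of how many differently named siblings precede it.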
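
The following sketch shows an index applied to more than one QName in the same expression. The input fragment, its element names and the namespace URI are hypothetical; the extracted values given in the comments are what the rules above would lead one to expect, not verified output.

<pfx:list xmlns:pfx="my-example-uri">
	<pfx:group>
		<pfx:item>1a</pfx:item>
		<pfx:item>1b</pfx:item>
	</pfx:group>
	<pfx:note>a differently named sibling</pfx:note>
	<pfx:group>
		<pfx:item>2a</pfx:item>
		<pfx:item>2b</pfx:item>
	</pfx:group>
</pfx:list>

Inside a mapping rule that matches pfx:list:

<!-- no index: the first pfx:group and its first pfx:item, expected value "1a" -->
<extract expr="./pfx:group/pfx:item"/>
<!-- second pfx:group (the third child of pfx:list overall), second pfx:item, expected value "2b" -->
<extract expr="./pfx:group{2}/pfx:item{2}"/>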

The example below illustrates the usage of the extract expression. Consider the following input XML document fragment:

<pfx:Test xmlns:pfx="my-test-uri">
	<pfx:myElement a1="val1" a2="val2" ... an="valN">
		this is the text content.
	</pfx:myElement>
	<pfx:myOtherElement>
		<pfx:someChild>the child value</pfx:someChild>
	</pfx:myOtherElement>
	<pfx:a>first-a</pfx:a>
	<pfx:b>first-b</pfx:b>
	<pfx:a>second-a</pfx:a>
</pfx:Test>

First, we define a mapping rule R in the xfMap document that matches the <pfx:myElement> element. R may contain any number of extract expressions, e0, e1, ..., en, in its definition. (We’ll ignore how R is defined; for now we only need to know that some elements in R use these extract expressions.)

<xfMap xmlns:pfx="my-test-uri">
	...
	<!-- call this mapping rule R -->
	<mapping match="pfx:myElement">
		...
			<!-- call this e0 -->
			<extract expr="."/>
		...
			<!-- call this e1 -->
			<extract expr="@a1"/>
		...
			<!-- call this en -->
			<extract expr="@an"/>
		...
			<!-- call this c0 -->
			<extract expr="./pfx:someChild"/>
		...
			<!-- call this c1 -->
			<extract expr="./pfx:a{2}"/>
	</mapping>
	...
</xfMap>

The expression string in e0, “.”, refers to the text content of <pfx:myElement>; therefore e0 extracts “this is the text content.”

The expression strings in e1, ..., en refer to the values of the attributes a1, ..., an; therefore e1, ..., en extract val1, ..., valN, respectively.

The expression string in c0, “./pfx:someChild”, refers to the text content of the <pfx:someChild> element, therefore c0 extracts “the child value”.

The expression string in c1, “./pfx:a{2}”, refers to the text content of the second <pfx:a> element, therefore c1 extracts “second-a”.

A default value may be specified for the extract expression, to be used when the data pointed to by the expression string is not present in the input XML document stream. This default value is given by the default attribute of the <extract> element:

<extract expr="..." default="some default value"/>
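
As a small sketch based on the example document above, suppose <pfx:myElement> carries no attribute named missing (a hypothetical attribute name used only here). The default would then be expected to take the place of the absent value:

<mapping match="pfx:myElement">
	...
		<!-- "missing" is a hypothetical attribute assumed to be absent from pfx:myElement,
		     so the default value "n/a" would be used instead -->
		<extract expr="@missing" default="n/a"/>
	...
</mapping>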

The extract expression may also specify the optional as-xml, preserve-cdata, escape-characters, declare-namespaces and write-xml-header attributes, as in the following:

<extract expr="..."
         as-xml="[true|false]"
         preserve-cdata="[true|false]"
         escape-characters="[true|false]"
         declare-namespaces="[true|false]"
         write-xml-header="[true|false]"/>

When the as-xml attribute is set to true, the target of the extract expression is extracted as an XML fragment with the target as its root. When retrieving this XML fragment, preserve-cdata defaults to true.

When the preserve-cdata attribute is set to true, the extract expression is handled as usual, except that CDATA sections are not ignored. That is, the opening and closing CDATA delimiters are treated as text.

The escape-characters attribute defaults to false. When set to true, the reader escapes characters that coincide with XML markup, e.g., “<” and “&” are escaped to “&lt;” and “&amp;”, respectively. Note that it is not necessary to escape these characters when the data is used outside the context of an XML document.
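
The effect can be sketched with a hypothetical input element, pfx:formula, which is not part of the earlier example. The values in the comments are what the description above would lead one to expect, not verified output.

<!-- hypothetical input element; its parsed text content is: a < b & c -->
<pfx:formula>a &lt; b &amp; c</pfx:formula>

Inside a mapping rule that matches pfx:formula:

<!-- expected extracted value: a < b & c -->
<extract expr="." escape-characters="false"/>
<!-- expected extracted value: a &lt; b &amp; c -->
<extract expr="." escape-characters="true"/>

The escaped form is the one that can be embedded in another XML document without breaking its markup.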

The declare-namespaces attribute defaults to false and is only applicable when as-xml is set to true, i.e., when the extract expression is being used to map an XML subtree from the source document into an XML fragment. Because of XML namespace scoping, the XML fragments mapped from the source document may not have all of their prefixes bound. Setting this attribute to true instructs the extract expression to add any missing namespace declarations to the resulting XML fragments. These namespace-valid fragments may then be consumed by other XML processes, e.g., XSLT and XQuery processors.
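
To make the scoping problem concrete, consider the example document from earlier, where the pfx prefix is declared only on <pfx:Test>. The sketch below assumes a mapping rule that matches pfx:myOtherElement; the exact serialization of the resulting fragments is an assumption and is shown only to indicate where the declaration would be added.

<mapping match="pfx:myOtherElement">
	...
		<!-- expected fragment, with the pfx prefix left unbound:
		     <pfx:someChild>the child value</pfx:someChild> -->
		<extract expr="./pfx:someChild" as-xml="true"/>
	...
		<!-- expected fragment, with the missing declaration added:
		     <pfx:someChild xmlns:pfx="my-test-uri">the child value</pfx:someChild> -->
		<extract expr="./pfx:someChild" as-xml="true" declare-namespaces="true"/>
	...
</mapping>

Only the second fragment could be parsed on its own by a namespace-aware XML processor.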

The write-xml-header attribute defaults to false and is only applicable when as-xml is set to true, i.e., when the extract expression is being used to map an XML subtree from the source document into an XML fragment.