Regular Expressions

A regular expression describes strings of characters. It's a pattern that matches certain strings and doesn't match others.

Different Types of Regular Expression

POSIX defines two types of regular expression (RE): extended REs (EREs) and basic REs (BREs).

This implementation adds a third type – advanced REs (AREs) – which are basically EREs with some significant extensions.

This topic primarily describes AREs. BREs mostly exist for backward compatibility in some old programs; they will be discussed at the end. POSIX EREs are almost an exact subset of AREs. Features of AREs that are not present in EREs will be indicated.

Tcl regular expressions are implemented using the package written by Henry Spencer, based on the 1003.2 spec and some (not quite all) of the Perl5 extensions. Much of the description of regular expressions is copied verbatim from his manual entry.

Regular Expression Syntax

An ARE is one or more branches, separated by |, matching anything that matches any of the branches.

A branch is zero or more constraints or quantified atoms, concatenated. It matches a match for the first, followed by a match for the second, and so on; an empty branch matches the empty string.
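The following is a minimal sketch of branches and alternation. It uses Python's re module as a stand-in, which is an assumption: it is not the Tcl ARE engine described in this topic, although these particular constructs behave the same way in both. The names two_branches and optional_a are illustrative only.

```python
import re

# Two branches separated by |: the RE matches anything either branch matches.
two_branches = re.compile(r"cat|dog")

matches_cat = two_branches.fullmatch("cat") is not None
matches_dog = two_branches.fullmatch("dog") is not None
matches_cow = two_branches.fullmatch("cow") is not None

# The second branch here is empty, and an empty branch matches the empty string,
# so this RE matches either "a" or "".
optional_a = re.compile(r"a|")

matches_a = optional_a.fullmatch("a") is not None
matches_empty = optional_a.fullmatch("") is not None

print(matches_cat, matches_dog, matches_cow)  # True True False
print(matches_a, matches_empty)               # True True
```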

A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches a match for the atom. The quantifiers, and what a so-quantified atom matches, are:

*    

a sequence of 0 or more matches of the atom

+

a sequence of 1 or more matches of the atom

?

a sequence of 0 or 1 matches of the atom

{m}

a sequence of exactly m matches of the atom

{m,} 

a sequence of m or more matches of the atom

{m,n} 

a sequence of m through n (inclusive) matches of the atom; m may not exceed n

*? +? ?? {m}? {m,}? {m,n}?

non-greedy quantifiers, which match the same possibilities, but prefer the smallest number rather than the largest number of matches (see Matching)

The forms using { and } are known as bounds. The numbers m and n are unsigned decimal integers with permissible values from 0 to 255 inclusive.
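The quantifiers and bounds above can be sketched as follows. This example assumes Python's re module as a stand-in for the Tcl ARE engine; the simple cases shown behave the same in both, but more complex mixes of greedy and non-greedy quantifiers may not. The variable names are illustrative only.

```python
import re

text = "aaaa"

# A greedy quantifier prefers the largest number of matches of the atom.
greedy = re.match(r"a+", text).group()

# A non-greedy quantifier prefers the smallest number of matches.
lazy = re.match(r"a+?", text).group()

# Bounds: {m} exactly m, {m,} m or more, {m,n} m through n inclusive.
exactly_two = re.fullmatch(r"a{2}", "aa") is not None
two_or_more = re.fullmatch(r"a{2,}", "aaaaa") is not None
two_to_three = re.fullmatch(r"a{2,3}", "aaaa") is not None  # four is too many

# The same bound, greedy and non-greedy, applied to "aaaa".
bounded_greedy = re.match(r"a{2,3}", text).group()
bounded_lazy = re.match(r"a{2,3}?", text).group()

print(greedy, lazy)                            # aaaa a
print(exactly_two, two_or_more, two_to_three)  # True True False
print(bounded_greedy, bounded_lazy)            # aaa aa
```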

An atom is one of:

(re)   (where re is any regular expression)   

matches a match for re, with the match noted for possible reporting

(?:re)  

as previous, but does no reporting (a "non-capturing" set of parentheses)

() 

matches an empty string, noted for possible reporting

(?:)  

matches an empty string, without reporting

[chars]   

a bracket expression, matching any one of the chars (see Bracket Expressions for more detail)

.   

matches any single character

\k   (where k is a non-alphanumeric character)

matches that character taken as an ordinary character, e.g. \\ matches a backslash character

\c   (where c is alphanumeric, possibly followed by other characters)

an escape (AREs only), see Escapes

{   

when followed by a character other than a digit, matches the left-brace character {; when followed by a digit, it is the beginning of a bound (see above)

x   

where x is a single character with no other significance, matches that character.
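As a rough illustration of the atoms listed above, the sketch below contrasts capturing and non-capturing parentheses and shows a bracket expression, the . atom, and a backslash escape of a non-alphanumeric character. It assumes Python's re module rather than the Tcl ARE engine, so treat it as indicative only; the sample strings and variable names are made up for the example.

```python
import re

# (re) notes its match for reporting; (?:re) groups without reporting.
# Only the first and third number are captured here.
m = re.fullmatch(r"([0-9]+)-(?:[0-9]+)-([0-9]+)", "12-34-56")
captured = m.groups()

# [chars] matches any one of the listed characters.
bracket_hit = re.fullmatch(r"gr[ae]y", "grey") is not None
bracket_miss = re.fullmatch(r"gr[ae]y", "groy") is not None

# . matches any single character, so exactly one character must be present.
dot_hit = re.fullmatch(r"a.c", "abc") is not None
dot_miss = re.fullmatch(r"a.c", "ac") is not None

# \k, with k non-alphanumeric, matches k as an ordinary character.
literal_dot_hit = re.fullmatch(r"a\.c", "a.c") is not None
literal_dot_miss = re.fullmatch(r"a\.c", "abc") is not None
backslash_hit = re.fullmatch(r"\\", "\\") is not None

print(captured)                           # ('12', '56')
print(bracket_hit, bracket_miss)          # True False
print(dot_hit, dot_miss)                  # True False
print(literal_dot_hit, literal_dot_miss)  # True False
print(backslash_hit)                      # True
```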

A constraint matches an empty string when specific conditions are met. A constraint may not be followed by a quantifier. The simple constraints are as follows; some more constraints are described in Escapes.

^   

matches at the beginning of a line

$   

matches at the end of a line

(?=re)   

positive lookahead (AREs only), matches at any point where a substring matching re begins

(?!re)   

negative lookahead (AREs only), matches at any point where no substring matching re begins
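The simple constraints can be sketched as shown below. Each one matches an empty string, so a lookahead checks what follows without consuming it. The example assumes Python's re module as a stand-in for the Tcl ARE engine; the strings and variable names are illustrative only.

```python
import re

# ^ and $ constrain the match to the beginning and end of the line.
anchored_hit = re.search(r"^abc$", "abc") is not None
anchored_miss = re.search(r"^abc$", "xabc") is not None

# Positive lookahead: "foo" matches only where "bar" begins right after it.
# The lookahead itself consumes nothing, so the match ends after "foo".
positive = re.search(r"foo(?=bar)", "foobar")
positive_text = positive.group()
positive_end = positive.end()

# Negative lookahead: "foo" matches only where "bar" does NOT begin right after it.
negative_blocked = re.search(r"foo(?!bar)", "foobar") is None
negative_text = re.search(r"foo(?!bar)", "foobaz").group()

print(anchored_hit, anchored_miss)      # True False
print(positive_text, positive_end)      # foo 3
print(negative_blocked, negative_text)  # True foo
```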

Notes: