Bracket Expressions

A bracket expression is a list of characters enclosed in square brackets [ ]. It normally matches any single character from the list (however, see below). If the list begins with ^, it matches any single character (but see below) not from the rest of the list.

If two characters in the list are separated by -, this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g. [0-9] in ASCII matches any decimal digit. Two ranges may not share an endpoint, so, for example, a-c-e is illegal. Ranges are very collating-sequence-dependent, and portable programs should avoid relying on them.
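As a rough illustration of lists, negation, and ranges, the sketch below uses Python's `re` module as a stand-in for the FME/Tcl engine. The two engines are different implementations, but they agree on these three basics for ASCII input. The sample strings are made up for the example.

```python
import re

# Python's re module stands in for the FME/Tcl engine in this sketch.
vowel = re.compile(r"[aeiou]")       # any single character from the list
non_vowel = re.compile(r"[^aeiou]")  # leading ^: any single character NOT in the list
digit = re.compile(r"[0-9]")         # range: every character from 0 through 9

first_vowel = vowel.search("rhythm and blues").group()
first_non_vowel = non_vowel.search("aeiou-x").group()
all_digits = digit.findall("FME 2023.1")

print(first_vowel)      # a
print(first_non_vowel)  # -
print(all_digits)       # ['2', '0', '2', '3', '1']
```

Each bracket expression consumes exactly one character per match, which is why `findall` returns the digits one at a time.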

To include a literal ] or - in the list, the simplest method is to enclose it in [. and .] to make it a collating element (see below). Alternatively, make it the first character (following a possible ^), or (AREs only) precede it with \. Alternatively, for -, make it the last character, or the second endpoint of a range. To use a literal - as the first endpoint of a range, make it a collating element or (AREs only) precede it with \. With the exception of these, some combinations using [ (see next paragraphs), and escapes, all other special characters lose their special significance within a bracket expression.
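The placement rules for a literal ] or - can be tried out with the following sketch. It again assumes Python's `re` module as a stand-in; Python happens to accept the same "first character", "last character", and backslash forms, but it does not support the [. and .] collating-element form, so that one is not shown.

```python
import re

# Python's re module stands in for the FME/Tcl engine in this sketch.
close_first = re.compile(r"[]x]")         # ] as the first character is literal
close_after_caret = re.compile(r"[^]x]")  # ] right after a leading ^ is literal
dash_last = re.compile(r"[a-c-]")         # - as the last character is literal
escaped = re.compile(r"[\]\-]")           # backslash escapes (the AREs-only form)
specials = re.compile(r"[.*]")            # . and * lose their special meaning

found_close = close_first.findall("a]xb")           # [']', 'x']
found_other = close_after_caret.findall("a]xb")     # ['a', 'b']
found_dash = dash_last.findall("a-d")               # ['a', '-']
found_escaped = escaped.findall("x]-y")             # [']', '-']
found_specials = specials.findall("a.b*c")          # ['.', '*']
```

The last pattern shows the final sentence of the paragraph in action: inside the brackets, `.` and `*` are ordinary characters.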

Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if it were a single character, or a collating-sequence name for either) enclosed in [. and .] stands for the sequence of characters of that collating element. The sequence is a single element of the bracket expression's list. A bracket expression in a locale that has multi-character collating elements can thus match more than one character. So (insidiously), a bracket expression that starts with ^ can match multi-character collating elements even if none of them appear in the bracket expression! (Note: Tcl currently has no multi-character collating elements. This information is only for illustration.)

For example, assume the collating sequence includes a ch multi-character collating element. Then the RE [[.ch.]]*c (zero or more ch's followed by c) matches the first five characters of `chchcc'. Also, the RE [^c]b matches all of `chb' (because [^c] matches the multi-character ch).

Within a bracket expression, a collating element enclosed in [= and =] is an equivalence class, standing for the sequences of characters of all collating elements equivalent to that one, including itself. (If there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were `[.' and `.]'.) For example, if o and ô are the members of an equivalence class, then `[[=o=]]', `[[=ô=]]', and `[oô]' are all synonymous. An equivalence class may not be an endpoint of a range. (Note: Tcl currently implements only the Unicode locale. It doesn't define any equivalence classes. The examples above are just illustrations.)

Within a bracket expression, the name of a character class enclosed in [: and :] stands for the list of all characters (not all collating elements!) belonging to that class. Standard character classes are:

alpha – letter

upper – uppercase letter

lower – lowercase letter

digit – decimal digit

xdigit – hexadecimal digit

alnum – alphanumeric (letter or digit)

print – alphanumeric (same as alnum)

blank – space or tab character

space – character producing white space in displayed text

punct – punctuation character

graph – character with a visible representation

cntrl – control character

A locale may provide others. (Note that the current Tcl implementation has only one locale: the Unicode locale.) A character class may not be used as an endpoint of a range.
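To make the idea of "a class name stands for a list of characters" concrete, here is a minimal sketch in Python. Python's `re` module does not understand [: and :] at all, so the sketch expands most of the class names into explicit ASCII ranges before compiling. The table `POSIX_ASCII` and the helper `expand_classes` are hypothetical names invented for this example, and the ASCII ranges are only an approximation: the Unicode locale described above matches many more characters.

```python
import re

# Hypothetical lookup table: ASCII-only approximations of the class names.
# The real Unicode-aware classes cover far more characters than these ranges.
POSIX_ASCII = {
    "alpha": "a-zA-Z",
    "upper": "A-Z",
    "lower": "a-z",
    "digit": "0-9",
    "xdigit": "0-9A-Fa-f",
    "alnum": "0-9A-Za-z",
    "blank": " \\t",
    "space": " \\t\\n\\r\\f\\v",
    "punct": "!-/:-@\\[-`{-~",
    "graph": "!-~",
    "cntrl": "\\x00-\\x1f\\x7f",
}

def expand_classes(pattern: str) -> str:
    """Replace each [:name:] with its ASCII range list (hypothetical helper).

    Raises KeyError for a class name that is not in the table.
    """
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

identifier = expand_classes("[[:alpha:]_][[:alnum:]_]*")
print(identifier)  # [a-zA-Z_][0-9A-Za-z_]*

is_identifier = re.fullmatch(identifier, "attr_01") is not None
punctuation = re.findall(expand_classes("[[:punct:]]"), "a,b;c!")
```

Because a class name expands to a plain list, it can sit next to other list members (such as the underscore above), but it cannot serve as the endpoint of a range.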

There are two special cases of bracket expressions: the bracket expressions [[:<:]] and [[:>:]] are constraints, matching empty strings at the beginning and end of a word respectively. A word is defined as a sequence of word characters that is neither preceded nor followed by word characters. A word character is an alnum character or an underscore (_). These special bracket expressions are deprecated; users of AREs should use constraint escapes instead (see Escapes).
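The word definition above can be sketched with lookaround assertions. This is an illustration in Python, not the engine's own mechanism: `WORD_START` and `WORD_END` are hypothetical names, and word characters are restricted to ASCII letters, digits, and underscore for simplicity. Python's own `\b` is not used because it does not distinguish the start of a word from the end.

```python
import re

WORD_CHAR = "[A-Za-z0-9_]"  # ASCII-only assumption for this sketch

# Empty-string constraints, standing in for [[:<:]] and [[:>:]].
WORD_START = f"(?<!{WORD_CHAR})(?={WORD_CHAR})"  # no word char before, one after
WORD_END = f"(?<={WORD_CHAR})(?!{WORD_CHAR})"    # one word char before, none after

whole_word = re.compile(WORD_START + "road" + WORD_END)
text = "road railroad road_id roads road"
hits = [m.start() for m in whole_word.finditer(text)]

print(hits)  # [0, 28]
```

Only the first and last occurrences match: `railroad` fails the start constraint, while `road_id` and `roads` fail the end constraint because an underscore or letter follows.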