Search conditions / regular expressions (regexp)


Our search conditions are not plain text but regular expressions (regexp). A regular expression is a string (a single-line text) built on a simple principle: it is a sequence of elements, each stating which character(s) may appear, optionally followed by a statement of how often they may appear (a so-called „quantifier“). The default for a quantifier is 1. In addition, every character stands for itself, except for some special characters. This means you can still do a normal search-and-replace just as you did before you had heard of „regexp“: „Abc“ matches in „Abcdef“, and replacing „Abc“ by „XX“ in „Abcdef“ results in „XXdef“. (Beware! Regular expressions are case-sensitive, which means „Abc“ and „abc“ are different unless you specify a certain flag [see below!].)
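The literal search-and-replace described above can be tried out in plain Java. This is only a minimal sketch: the class and method names are made up for illustration, and „(?i)“ is the case-insensitive flag explained further down.

```java
import java.util.regex.Pattern;

class LiteralSearchDemo {
    // Replaces every occurrence of the pattern "Abc" by "XX".
    static String replaceAbc(String input) {
        return input.replaceAll("Abc", "XX");
    }

    // Reports whether the regular expression occurs anywhere in the input.
    static boolean occursIn(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(replaceAbc("Abcdef"));           // XXdef
        System.out.println(occursIn("abc", "Abcdef"));      // false (case-sensitive)
        System.out.println(occursIn("(?i)abc", "Abcdef"));  // true
    }
}
```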

Quantifiers

The most important quantifiers are „?“, „*“ and „+“. Their meaning is the following:

?

once or not at all

*

arbitrarily often: zero or more times

+

arbitrarily often, but minimum one time, i.e. one or more times

Examples:

Searching for „a?bc“ in „bc“ results in a hit, because „a“ may occur but need not. „a*bc“ matches „bc“, but also „abc“ and „aaabc“. „a+bc“ does not match „bc“, but it matches „abc“ and „aaabc“.
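The three quantifiers can be compared side by side with a small Java sketch. Note one assumption: the hypothetical helper below uses Pattern.matches, which requires the whole input to match, whereas a search dialog looks for a hit anywhere in the text.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True only if the complete input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String[] patterns = {"a?bc", "a*bc", "a+bc"};
        String[] inputs = {"bc", "abc", "aaabc"};
        for (String regex : patterns) {
            for (String input : inputs) {
                System.out.println(regex + " on " + input + ": "
                        + matchesWhole(regex, input));
            }
        }
    }
}
```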


The expression {min,max} is more general and exact. „?“ corresponds to {0,1}, „*“ to {0,} and „+“ to {1,}. {0,3} means „at most three times“ (the lower bound cannot be omitted in Java patterns, so {,3} is not valid) and {3,} means „at least three times“. {5} matches only „exactly five times“.


Examples:

„a{1,3}bc“ does not match „bc“, but it matches „abc“ and „aaabc“. „a{5}bc“ matches only if the text contains „aaaaabc“. Searching for „a{5}bc“ and replacing it by „YY“ yields „XYYX“ when applied to „XaaaaabcX“.
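The counted quantifiers can be checked the same way. The following is a small sketch with made-up names; matchesWhole again tests the whole input, while replaceFiveA performs a search-and-replace.

```java
import java.util.regex.Pattern;

class CountedQuantifierDemo {
    // True only if the complete input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Replaces exactly five "a" followed by "bc" with "YY".
    static String replaceFiveA(String input) {
        return input.replaceAll("a{5}bc", "YY");
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("a{1,3}bc", "bc"));     // false
        System.out.println(matchesWhole("a{1,3}bc", "aaabc"));  // true
        System.out.println(replaceFiveA("XaaaaabcX"));          // XYYX
    }
}
```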

Greedy versus Non-Greedy (Reluctant) Matching

Normally the search is greedy, which means that the longest string that satisfies the search condition is taken. By appending a „?“ to a quantifier you get „reluctant“ matching, which means that the shortest string that still fits is taken.


Example:

„.*?foo“ replaced by „AB“ applied to „xfooxxxxxxfoo“ yields „ABAB“ and not „AB“.
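The difference between the two modes is easy to see when the same replacement is run with both variants. A minimal Java sketch (class and method names are illustrative only):

```java
class GreedyDemo {
    // Greedy: ".*" grabs as much as possible, so there is one long match.
    static String replaceGreedy(String input) {
        return input.replaceAll(".*foo", "AB");
    }

    // Reluctant: ".*?" grabs as little as possible, so each "foo" ends a match.
    static String replaceReluctant(String input) {
        return input.replaceAll(".*?foo", "AB");
    }

    public static void main(String[] args) {
        System.out.println(replaceGreedy("xfooxxxxxxfoo"));     // AB
        System.out.println(replaceReluctant("xfooxxxxxxfoo"));  // ABAB
    }
}
```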

Character Classes

Matching becomes really interesting once you use character classes. In the simplest case you write them in square brackets, e.g. „[abc]“ or „[0-9]“. „[abc]+“ matches strings such as „bc“, „abc“ or „aaabc“. And „[0-9]+,[0-9]+“ matches any German decimal number that has at least one digit after the decimal comma, e.g. „1,0“.

Character Classes For Advanced Learners

„[^abc]“

means „not a, b or c“; replacing „[^abc]+“ in „bbbY“ by „X“ would result in „bbbX“.

„[a-zA-Z]“

a through z or A through Z, inclusive (range)

„[a-d[m-p]]“

a through d, or m through p: [a-dm-p] (union)

„[a-z&&[def]]“

d, e, or f (intersection)

„[a-z&&[^bc]]“

a through z, except for b and c: [ad-z] (subtraction)

„[a-z&&[^m-p]]“

a through z, and not m through p: [a-lq-z] (subtraction)
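The character classes listed above, including negation, union, intersection and subtraction, can be verified with a few calls. This is a sketch only; the helper names are made up, and matchesWhole tests the complete input.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True only if the complete input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Replaces every run of characters other than a, b or c with "X".
    static String replaceNonAbc(String input) {
        return input.replaceAll("[^abc]+", "X");
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("[0-9]+,[0-9]+", "1,0"));  // true
        System.out.println(replaceNonAbc("bbbY"));                 // bbbX
        System.out.println(matchesWhole("[a-d[m-p]]", "n"));       // true (union)
        System.out.println(matchesWhole("[a-z&&[def]]", "a"));     // false (intersection)
        System.out.println(matchesWhole("[a-z&&[^bc]]", "b"));     // false (subtraction)
    }
}
```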

Predefined

For single characters or whole character classes there are predefined expressions which in most cases begin with a „\“ (backslash).

Predefined Characters

\\

The backslash

\0n

The character with octal value 0n (0 <= n <= 7)

\0nn

The character with octal value 0nn (0 <= n <= 7)

\0mnn

The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)

\xhh

The character with hexadecimal value 0xhh

\uhhhh

The character with hexadecimal value 0xhhhh

\t

The tab character ('\u0009')

\n

The newline (line feed) character ('\u000A')

\r

The carriage-return character ('\u000D')

\f

The form-feed character ('\u000C')

\a

The alert (bell) character ('\u0007')

\e

The escape character ('\u001B')

\cx

The control character corresponding to x. „\cA“ means „press the CTRL key at the same time as you press A“; write the letter in upper case. (Example: „\cI“, or „\011“, would mean the tab character; „\cJ“ is the line feed.)
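Several of the notations above denote the same character. As a small sketch (class and method names are illustrative), the following Java code checks that five different spellings all match the tab character. Keep in mind that inside a Java string literal every backslash of the pattern has to be doubled.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True if the given regular expression matches a single tab character.
    static boolean matchesTab(String regex) {
        return Pattern.matches(regex, "\t");
    }

    // True if the given regular expression matches a single line feed.
    static boolean matchesLineFeed(String regex) {
        return Pattern.matches(regex, "\n");
    }

    public static void main(String[] args) {
        // Tab written as a named escape, hexadecimal, Unicode, octal and control escape.
        String[] tabPatterns = {"\\t", "\\x09", "\\u0009", "\\011", "\\cI"};
        for (String regex : tabPatterns) {
            System.out.println(regex + ": " + matchesTab(regex));
        }
        System.out.println("\\cJ: " + matchesLineFeed("\\cJ"));
    }
}
```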

Predefined Character Classes

.

Any character (may or may not match line terminators, depending on the „Dotall-flag“)

\d

Digit: [0-9] (Beware! English decimal numbers can contain a „.“ (decimal point, dot); the search string would then be something like „\d+\.?\d*“.)

\D

A non-digit: [^0-9]

\s

„Space“: A whitespace character: [ \t\n\x0B\f\r]

\S

A non-whitespace character: [^\s]

\w

„Word“. A word character: [a-zA-Z_0-9] (Variable names normally consist of these characters. German texts can also contain umlauts, which are not included!)

\W

A non-word character: [^\w]
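A short Java sketch of the predefined classes, with made-up helper names. The word „größe“ is written with Unicode escapes in the code so that the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // True only if the complete input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Collapses every run of whitespace into a single space.
    static String collapseWhitespace(String input) {
        return input.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("\\d+\\.?\\d*", "3.14"));   // true
        System.out.println(matchesWhole("\\w+", "my_var1"));        // true
        // The umlaut and the sharp s are not word characters by default.
        System.out.println(matchesWhole("\\w+", "gr\u00f6\u00dfe")); // false
        System.out.println(collapseWhitespace("a \t\n b"));          // a b
    }
}
```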

POSIX character classes (US-ASCII only. Beware in e.g. Germany or France, because of umlauts, accents etc.!)

\p{Lower}

A lower-case alphabetic character: [a-z]

\p{Upper}

An upper-case alphabetic character: [A-Z]

\p{ASCII}

All ASCII characters: [\x00-\x7F]

\p{Alpha}

An alphabetic character: [\p{Lower}\p{Upper}]

\p{Digit}

A decimal digit: [0-9]

\p{Alnum}

An alphanumeric character: [\p{Alpha}\p{Digit}]

\p{Punct}

Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

\p{Graph}

A visible character: [\p{Alnum}\p{Punct}]

\p{Print}

A printable character: [\p{Graph}\x20]

\p{Blank}

A space or a tab: [ \t]

\p{Cntrl}

A control character: [\x00-\x1F\x7F]

\p{XDigit}

A hexadecimal digit: [0-9a-fA-F]

\p{Space}

A whitespace character: [ \t\n\x0B\f\r]

java.lang.Character classes (simple java character type)

\p{javaLowerCase}

Equivalent to java.lang.Character.isLowerCase()

\p{javaUpperCase}

Equivalent to java.lang.Character.isUpperCase()

\p{javaWhitespace}

Equivalent to java.lang.Character.isWhitespace()

\p{javaMirrored}

Equivalent to java.lang.Character.isMirrored()
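The warning about umlauts can be made concrete by comparing a POSIX class with the corresponding java.lang.Character class. The following is only a sketch with illustrative names; „ä“ and „Ä“ are written as Unicode escapes in the code.

```java
import java.util.regex.Pattern;

class PosixVersusJavaDemo {
    // True only if the complete input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String smallUmlaut = "\u00e4";  // lower-case a with diaeresis
        String bigUmlaut = "\u00c4";    // upper-case A with diaeresis
        // The POSIX classes only know US-ASCII letters.
        System.out.println(matchesWhole("\\p{Lower}", "a"));                 // true
        System.out.println(matchesWhole("\\p{Lower}", smallUmlaut));         // false
        System.out.println(matchesWhole("\\p{Upper}", bigUmlaut));           // false
        // The java.lang.Character classes follow Character.isLowerCase() etc.
        System.out.println(matchesWhole("\\p{javaLowerCase}", smallUmlaut)); // true
        System.out.println(matchesWhole("\\p{javaUpperCase}", bigUmlaut));   // true
    }
}
```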

Boundary matchers

^

The beginning of a line

$

The end of a line

\b

A word boundary (Beware! The underscore „_“ counts as a word character, so there is no word boundary between a letter and „_“!)

\B

A non-word boundary

\A

The beginning of the input

\G

The end of the previous match

\Z

The end of the input but for the final terminator, if any

\z

The end of the input
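Two of the boundary matchers are sketched below in Java: the word boundary, including the underscore case, and the difference between „\Z“ and „\z“. The class name, method names and sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Replaces "cat" only where it stands as a whole word.
    static String replaceWholeWord(String input) {
        return input.replaceAll("\\bcat\\b", "X");
    }

    // "abc" at the end of the input, a final line terminator is tolerated.
    static boolean endsBeforeFinalTerminator(String input) {
        return Pattern.compile("abc\\Z").matcher(input).find();
    }

    // "abc" at the very end of the input, nothing may follow.
    static boolean endsAtVeryEnd(String input) {
        return Pattern.compile("abc\\z").matcher(input).find();
    }

    public static void main(String[] args) {
        // "concat" and "cat_1" stay untouched: no word boundary around "cat" there.
        System.out.println(replaceWholeWord("cat concat cat_1"));  // X concat cat_1
        System.out.println(endsBeforeFinalTerminator("abc\n"));    // true
        System.out.println(endsAtVeryEnd("abc\n"));                // false
    }
}
```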

Special constructs:

(?:X)

X, as a „non-capturing group“. Normally round brackets form (sub-)expressions that you can refer to via „$1“ through „$9“ („$0“ refers to the whole expression); a non-capturing group groups without getting such a number. Example: We want to replace „abcXXXdef“ by „abcdef“. We could do that, e.g., by searching for „(abc).{3}(def)“ and replacing it by „$1$2“. „abc“ and „def“ could just as well be something variable.

(?idmsux-idmsux)

Matches nothing, but turns the match flags i, d, m, s, u, x on (flags before the „-“) or off (flags after the „-“).
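The group references described under „(?:X)“ can be tried out with String.replaceAll. This is a minimal sketch with illustrative names; it also shows that a non-capturing group does not get a number, so the following group becomes „$1“.

```java
class GroupDemo {
    // Keeps the two captured parts and drops the three characters in between.
    static String dropMiddle(String input) {
        return input.replaceAll("(abc).{3}(def)", "$1$2");
    }

    // "(?:abc)" is not numbered, so "$1" now refers to "(def)".
    static String keepSecondPart(String input) {
        return input.replaceAll("(?:abc).{3}(def)", "$1");
    }

    // "$0" stands for the whole match.
    static String bracketB(String input) {
        return input.replaceAll("b", "[$0]");
    }

    public static void main(String[] args) {
        System.out.println(dropMiddle("abcXXXdef"));      // abcdef
        System.out.println(keepSecondPart("abcXXXdef"));  // def
        System.out.println(bracketB("abc"));              // a[b]c
    }
}
```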

Meaning of the Flags:

i

case-insensitive

By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the unicode flag (u) in conjunction with this flag. This combination should be used for languages like German or French.

d

Unix-Lines

In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $.

m

multiline

In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence. Don't turn multiline mode on merely because you are searching for text that spans several lines! Use „\s+“ to match any sequence of whitespace characters including line breaks (independent of platform / operating system).

s

dotall

In dotall mode, the expression . matches any character, including a line terminator. (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)

u

unicode

Enables Unicode-aware case folding (should be used, e.g., in Germany or France together with flag i). By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

x

comments

In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.
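The effect of the individual flags can be observed by switching them on inside the pattern. The following Java code is only a sketch with made-up names; „ä“ and „Ä“ are written as Unicode escapes.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // True only if the complete input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // With the multiline flag, "^" matches at the start of every line.
    static String markLineStarts(String input) {
        return input.replaceAll("(?m)^", ">");
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("(?i)abc", "ABC"));          // true
        // Flag i alone only folds US-ASCII letters; i and u together fold umlauts too.
        System.out.println(matchesWhole("(?i)\u00e4", "\u00c4"));    // false
        System.out.println(matchesWhole("(?iu)\u00e4", "\u00c4"));   // true
        // Without flag s the dot does not match the line feed.
        System.out.println(matchesWhole(".*", "a\nb"));              // false
        System.out.println(matchesWhole("(?s).*", "a\nb"));          // true
        // Flag x ignores whitespace and comments inside the pattern.
        System.out.println(matchesWhole("(?x) a b c  # three letters", "abc"));  // true
        System.out.println(markLineStarts("a\nb"));
    }
}
```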

Important:

„\\“

We need this under Windows for directory (folder) names. (It is easier to enclose the whole directory name in „\Q“ and „\E“ („quotation“ and „end“), e.g. „\QC:\mydirectory\...\eclipse_workspace\E“. Because the enclosed string could itself contain „\E“, you should use the menu item „Input Assistance“ [on 'Find and replace'].)

„\s+“

or (still better) „\p{javaWhitespace}+“ denotes whitespace of arbitrary length

„\w+“

A variable name (might contain an underscore character, but, e.g., no German umlauts)

„\.“

is a (literal) .

„.*“

is an arbitrary string (might be empty)

„\p{XDigit}{2,4}“

or „[0-9a-fA-F]{2,4}“ designates hexadecimal numbers of two to four digits. You can find errors in hexadecimal numbers, e.g., by using „\s+[0-9a-fA-F]{3}[ \t\n\x0B\f\r\.;]+“.
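For the directory names mentioned under „\\“, Java code can let Pattern.quote do the quoting with „\Q“ and „\E“. The following is a minimal sketch: the path „C:\temp\demo“ and all names are hypothetical and chosen for illustration only. It also shows what goes wrong without quoting, because „\t“ and „\d“ are then read as tab and digit.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical Windows path, used for illustration only.
    static final String PATH = "C:\\temp\\demo";

    // Pattern.quote wraps the text so that it is taken literally.
    static boolean findsQuoted(String text) {
        return Pattern.compile(Pattern.quote(PATH)).matcher(text).find();
    }

    // The same literal search with the quoting markers written by hand.
    static boolean findsHandQuoted(String text) {
        return Pattern.compile("\\QC:\\temp\\demo\\E").matcher(text).find();
    }

    // Unquoted, the backslash sequences are interpreted as regexp escapes.
    static boolean findsUnquoted(String text) {
        return Pattern.compile(PATH).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "cd C:\\temp\\demo";
        System.out.println(findsQuoted(text));      // true
        System.out.println(findsHandQuoted(text));  // true
        System.out.println(findsUnquoted(text));    // false
        System.out.println(Pattern.matches("\\p{XDigit}{2,4}", "1aF"));  // true
    }
}
```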


If you want to know more, search the internet, especially for „java patterns regexp“, or have a look at „http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html“.

If I have written nonsense somewhere in this document (which is likely, because I am still a beginner with regexp myself), please mail to: eumel100@web.de



Harald K., E-Mail: eumel100@web.de