Our search conditions are not plain text but regular
expressions (regexp). A regular expression is a string
(a single-line text) built on one principle: it is a sequence
of items, each of which denotes what kind of character(s) may
appear, followed by a denotation of how often they may appear
(a so-called „quantifier“). The default quantifier is 1.
Moreover, each character stands for itself (except for some
special characters). This means, first of all, that you can do
a normal search-and-replace just as you did before you heard
anything about „regexp“: „Abc“ matches in „Abcdef“, and
replacing „Abc“ by „XX“ in „Abcdef“ results in „XXdef“.
Beware: regular expressions are case-sensitive, so „Abc“ and
„abc“ are different unless you set the case-insensitive flag
(see the flag „i“ below).
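The plain search-and-replace described above can be sketched in Java, assuming the search dialog behaves like java.util.regex. The class and method names are made up for this illustration.

```java
import java.util.regex.Pattern;

class LiteralReplaceDemo {
    // Replaces every match of the regex in the input (case-sensitive by default).
    static String replace(String input, String regex, String replacement) {
        return input.replaceAll(regex, replacement);
    }

    // Same replacement, but with the case-insensitive flag switched on.
    static String replaceIgnoreCase(String input, String regex, String replacement) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(input)
                      .replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(replace("Abcdef", "Abc", "XX"));           // XXdef
        System.out.println(replace("abcdef", "Abc", "XX"));           // abcdef (case differs, no hit)
        System.out.println(replaceIgnoreCase("abcdef", "Abc", "XX")); // XXdef
    }
}
```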
The most important quantifiers are „?“, „*“ and „+“. Their meaning is as follows:
? | once or not at all
* | arbitrarily often: zero or more times
+ | arbitrarily often, but at least once, i.e. one or more times
Searching for „a?bc“ in „bc“ results in a
hit, because „a“ may occur but need not. „a*bc“
matches „bc“, but also „abc“ and „aaabc“.
„a+bc“ does not match „bc“,
but it does match „abc“ and „aaabc“.
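The three quantifiers can be tried out with a few lines of Java. This is only a sketch: the helper `found` is a hypothetical name, and it uses `Matcher.find()` because a search dialog looks for a hit anywhere in the text.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True if the regex is found anywhere in the text (a "hit").
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("a?bc", "bc"));    // true:  "a" is optional
        System.out.println(found("a*bc", "aaabc")); // true:  zero or more "a"
        System.out.println(found("a+bc", "bc"));    // false: at least one "a" required
        System.out.println(found("a+bc", "abc"));   // true
    }
}
```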
More general and exact is the expression {min,max}. „?“ corresponds to {0,1}, „*“ to {0,} and „+“ to {1,}. {0,3} means „at most three times“ and {3,} means „at least three times“, while {5} matches only „exactly five times“. (Write the braces without spaces. Java's regex engine does not accept an omitted minimum such as {,3}; write {0,3} instead.)
„a{1,3}bc“ does not match „bc“, but it does match „abc“ and „aaabc“. „a{5}bc“ matches only if the text contains „aaaaabc“. Searching for „a{5}bc“ and replacing it by „YY“ yields „XYYX“ when applied to „XaaaaabcX“.
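The counted quantifiers can be checked the same way. The following is a small sketch with hypothetical helper names; it writes „at most three“ as {0,3}.

```java
import java.util.regex.Pattern;

class CountedQuantifierDemo {
    // True if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Replaces every hit of the regex in the text.
    static String replace(String regex, String text, String replacement) {
        return text.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        System.out.println(found("a{1,3}bc", "bc"));              // false
        System.out.println(found("a{1,3}bc", "aaabc"));           // true
        System.out.println(found("a{5}bc", "aaaabc"));            // false: only four "a"
        System.out.println(replace("a{5}bc", "XaaaaabcX", "YY")); // XYYX
        System.out.println("aaa".matches("a{0,3}"));              // true:  at most three
        System.out.println("aaaa".matches("a{0,3}"));             // false: four is too many
    }
}
```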
Normally the search is greedy, which means that the longest string satisfying the search condition is taken. By appending a „?“ to the quantifier you get „reluctant“ matching, which means that the shortest string that still fits is taken.
„.*?foo“ replaced by „AB“ and applied to „xfooxxxxxxfoo“ yields „ABAB“ (two short hits), whereas the greedy „.*foo“ would swallow the whole text and yield „AB“.
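The difference between greedy and reluctant matching shows up directly in a replacement. A minimal sketch (the class name is made up for the illustration):

```java
class GreedyReluctantDemo {
    // Replaces every hit of the regex in the text.
    static String replace(String regex, String text, String replacement) {
        return text.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: ".*" grabs as much as possible, so there is one hit over the whole text.
        System.out.println(replace(".*foo", text, "AB"));  // AB
        // Reluctant: ".*?" grabs as little as possible, so there are two hits.
        System.out.println(replace(".*?foo", text, "AB")); // ABAB
    }
}
```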
Matching becomes really interesting once you use character classes. In the simplest case you write them in square brackets, e.g. „[abc]“ or „[0-9]“. „[abc]+“ matches strings such as „bc“, „abc“ or „aaabc“. And „[0-9]+,[0-9]+“ matches any German decimal number that has at least one digit after the decimal comma, e.g. „1,0“.
„[^abc]“ | any character except a, b or c (negation); replacing „[^abc]+“ in „bbbY“ by „X“ results in „bbbX“
„[a-zA-Z]“ | a through z or A through Z, inclusive (range)
„[a-d[m-p]]“ | a through d, or m through p: [a-dm-p] (union)
„[a-z&&[def]]“ | d, e, or f (intersection)
„[a-z&&[^bc]]“ | a through z, except for b and c: [ad-z] (subtraction)
„[a-z&&[^m-p]]“ | a through z, and not m through p: [a-lq-z] (subtraction)
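The character classes above can be verified with a short Java sketch. The helper name `wholeMatch` is an assumption for the illustration; it checks whether the complete text fits the expression.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // True if the complete text (not just a part of it) fits the regex.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("[abc]+", "aaabc"));        // true
        System.out.println(wholeMatch("[0-9]+,[0-9]+", "1,0"));   // true:  German decimal number
        System.out.println(wholeMatch("[0-9]+,[0-9]+", "1,"));    // false: no digit after the comma
        System.out.println("bbbY".replaceAll("[^abc]+", "X"));    // bbbX  (negation)
        System.out.println(wholeMatch("[a-d[m-p]]", "n"));        // true  (union)
        System.out.println(wholeMatch("[a-z&&[def]]", "a"));      // false (intersection)
        System.out.println(wholeMatch("[a-z&&[^bc]]", "b"));      // false (subtraction)
        System.out.println(wholeMatch("[a-z&&[^m-p]]", "q"));     // true  (subtraction)
    }
}
```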
For single characters or whole character classes there are predefined expressions which in most cases begin with a „\“ (backslash).
\\ | The backslash
\0n | The character with octal value 0n (0 <= n <= 7)
\0nn | The character with octal value 0nn (0 <= n <= 7)
\0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh | The character with hexadecimal value 0xhh
\uhhhh | The character with hexadecimal value 0xhhhh
\t | The tab character ('\u0009')
\n | The newline (line feed) character ('\u000A')
\r | The carriage-return character ('\u000D')
\f | The form-feed character ('\u000C')
\a | The alert (bell) character ('\u0007')
\e | The escape character ('\u001B')
\cx | The control character corresponding to x. „\cA“ means „hold down the CTRL key while pressing A“; in Java, write the letter in upper case. (Example: „\cI“, or „\011“, is the tab character.)
. | Any character (may or may not match line terminators, depending on the „dotall“ flag)
\d | A digit: [0-9] (Beware: an English decimal number can contain a „.“ (decimal point), so the search string would be roughly „\d+\.?\d*“.)
\D | A non-digit: [^0-9]
\s | „Space“: a whitespace character: [ \t\n\x0B\f\r]
\S | A non-whitespace character: [^\s]
\w | „Word“: a word character: [a-zA-Z_0-9] (Variable names normally consist of these. German texts can also contain umlauts, which are not included!)
\W | A non-word character: [^\w]
\p{Lower} | A lower-case alphabetic character: [a-z]
\p{Upper} | An upper-case alphabetic character: [A-Z]
\p{ASCII} | All ASCII characters: [\x00-\x7F]
\p{Alpha} | An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} | A decimal digit: [0-9]
\p{Alnum} | An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} | Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} | A visible character: [\p{Alnum}\p{Punct}]
\p{Print} | A printable character: [\p{Graph}\x20]
\p{Blank} | A space or a tab: [ \t]
\p{Cntrl} | A control character: [\x00-\x1F\x7F]
\p{XDigit} | A hexadecimal digit: [0-9a-fA-F]
\p{Space} | A whitespace character: [ \t\n\x0B\f\r]
\p{javaLowerCase} | Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase} | Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace} | Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored} | Equivalent to java.lang.Character.isMirrored()
^ | The beginning of a line
$ | The end of a line
\b | A word boundary (Beware: „_“ counts as a word character, so there is no boundary between a letter and „_“!)
\B | A non-word boundary
\A | The beginning of the input
\G | The end of the previous match
\Z | The end of the input but for the final terminator, if any
\z | The end of the input
(?:X) | X, as a „non-capturing group“. Normally round brackets form (sub-)expressions that you can refer to via „$1“ through „$9“ („$0“ refers to the whole match); a non-capturing group is not numbered. Example: we want to replace „abcXXXdef“ by „abcdef“. We could do that e.g. by searching for „(abc).{3}(def)“ and replacing it by „$1$2“. „abc“ and „def“ could just as well be variable expressions.
(?idmsux-idmsux) | Nothing, but turns the match flags i d m s u x on (before the „-“) or off (after it).
i | case-insensitive | By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the unicode flag (u) in conjunction with this flag; this should be done for languages like German or French.
d | Unix lines | In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $.
m | multiline | In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence. Multiline mode is not meant for finding text that spans several lines: for that, use „\s+“ to match any run of whitespace characters including line feeds (independent of platform / operating system).
s | dotall | In dotall mode, the expression . matches any character, including a line terminator. (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)
u | unicode | Enables Unicode-aware case folding; use it together with flag i, e.g. for German or French texts. Without it, case-insensitive matching only covers characters in the US-ASCII charset.
x | comments | In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.
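Capturing groups and inline flags can be tried out as follows. This is a sketch only, with a made-up class name; the non-ASCII letters are written as \u escapes so that the file encoding does not matter.

```java
class GroupAndFlagDemo {
    // Replaces every hit of the regex in the text; "$1", "$2" refer to captured groups.
    static String replace(String regex, String text, String replacement) {
        return text.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        // Groups: keep "abc" and "def", drop the three characters in between.
        System.out.println(replace("(abc).{3}(def)", "abcXXXdef", "$1$2")); // abcdef

        // Flag i: case-insensitive, ASCII only unless u is added.
        System.out.println("ABC".matches("(?i)abc"));                // true
        System.out.println("\u00C4PFEL".matches("(?i)\u00E4pfel"));  // false
        System.out.println("\u00C4PFEL".matches("(?iu)\u00E4pfel")); // true

        // Flag s: "." also matches a line terminator.
        System.out.println("a\nb".matches("a.b"));     // false
        System.out.println("a\nb".matches("(?s)a.b")); // true

        // Flag m: "^" matches at the beginning of every line, not only of the input.
        System.out.println(replace("(?m)^", "a\nb", "> ")); // "> a" and "> b" on two lines
    }
}
```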
„\\“ | We need this under Windows for directory (folder) names. (It is easier to enclose the whole directory name in „\Q“ and „\E“ („quotation“ and „end“), e.g. „\QC:\mydirectory\...\eclipse_workspace\E“. Because the enclosed string may itself contain „\E“, you should use the menu item „Input Assistance“ [in 'Find and replace'].)
„\s+“ | or (still better) „\p{javaWhitespace}+“ denotes whitespace of arbitrary length
„\w+“ | A variable name (may contain an underscore character, but e.g. no German umlauts)
„\.“ | is a (literal) .
„.*“ | is an arbitrary string (may be empty)
„\p{XDigit}{2,4}“ | or „[0-9a-fA-F]{2,4}“ designates hexadecimal numbers. You can find errors in hexadecimal numbers (here: three-digit ones) e.g. by using „\s+[0-9a-fA-F]{3}[ \t\n\x0B\f\r\.;]+“.
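In Java code, the quoting with „\Q“ and „\E“ is available as `Pattern.quote`, which also copes with a text that itself contains „\E“. The sketch below uses a made-up directory name and hypothetical helper names, and additionally tries out two of the expressions from the table.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotingDemo {
    // Replaces a literal text (no regex meaning at all) by another literal text.
    static String replaceLiteral(String text, String literal, String replacement) {
        return text.replaceAll(Pattern.quote(literal), Matcher.quoteReplacement(replacement));
    }

    // Returns the first hit of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical Windows path; in Java source each backslash is doubled.
        String line = "open C:\\work\\demo\\file.txt";
        System.out.println(replaceLiteral(line, "C:\\work\\demo", "D:\\backup"));
        // prints: open D:\backup\file.txt

        // A literal that itself contains \E is handled as well.
        System.out.println(replaceLiteral("x a\\Eb y", "a\\Eb", "Z")); // x Z y

        // \w+ stops at the first umlaut.
        System.out.println(firstMatch("\\w+", "Gr\u00F6\u00DFe")); // Gr

        // English decimal number.
        System.out.println(firstMatch("\\d+\\.?\\d*", "pi is 3.14!")); // 3.14
    }
}
```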
If you want to know more, search the internet, e.g. for „java patterns regexp“, or look at „http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html“.
If I have written nonsense somewhere in this document (which is likely, because I am still a beginner with regexp myself), please send a mail to eumel100@web.de