Regular Expressions

--for matching, substitution, or splitting:

matching: m/whatever/. The leading m is not needed with // delimiters, but it is required with any other delimiter, as in m!whatever!. In scalar context a match returns true (1) or false (the empty string); it is usually used as part of if (m/whatever/) or while (m/whatever/g). Modifiers are added after the closing delimiter: /i = case insensitive, /g = global, finds every match (as in the while loop example). Usage: $string =~ m/this/; causes matching to be done on $string. Note the ~ following the equal sign in the binding operator.
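
A minimal sketch of the match operator in scalar context, with the /i and /g modifiers and an alternative delimiter. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $string = "The cat sat on the Cat mat";

# Scalar context: true if the pattern is found anywhere in $string.
my $found = ($string =~ m/cat/) ? "yes" : "no";

# /i ignores case; /g in a while loop steps through every match.
my $count = 0;
while ($string =~ m/cat/ig) {
    $count++;
}

# Any delimiter works as long as the leading m is present.
my $has_mat = ($string =~ m!mat!) ? 1 : 0;

print "found=$found count=$count has_mat=$has_mat\n";
# prints: found=yes count=2 has_mat=1
```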

substitution: s/this/that/ replaces this with that. The substitution operator modifies the string it is working on, and returns the number of substitutions made. Other delimiters can be used, as for match. The lefthand operand is a regex, but the righthand operand is essentially a double-quoted string without the quote marks: it follows the same rules of variable interpolation and metacharacters as a double-quoted string. There are some modifiers: /i = case insensitive, /g = global (does every match or substitution), /e = evaluate the replacement as Perl code and substitute the result (substitution only). Usage: $string =~ s/this/that/; same as for the match operator.
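
A short sketch of s/// with and without /g, with variable interpolation on the right-hand side, and with /e. The strings and variable names are illustrative only.

```perl
use strict;
use warnings;

# Without /g only the first match is replaced; s/// returns the count.
my $string  = "this and this";
my $n_first = ($string =~ s/this/that/);    # 1 substitution

# With /g every match is replaced.
my $text  = "good food";
my $n_all = ($text =~ s/o/e/g);             # 4 substitutions

# The right-hand side interpolates variables like a double-quoted string.
my $animal = "dog";
my $pet    = "my cat";
$pet =~ s/cat/$animal/;

# With /e the right-hand side is evaluated as Perl code.
my $order = "3 apples and 5 pears";
$order =~ s/(\d+)/$1 * 2/ge;

print "$string | $text | $pet | $order\n";
# prints: that and this | geed feed | my dog | 6 apples and 10 pears
```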

splitting: the string (a scalar variable) is split into a list (array) of substrings. The splits occur at the point matched by the regex, and the matched characters are NOT included in the resulting substrings. For example, if $string = "as,df,gh"; and @arr = split /,/ , $string; @arr will get 3 elements, $arr[0] = as, $arr[1] = df, and $arr[2] = gh. Note that the split operator doesn't use the binding operator (=~). It also doesn't automatically eliminate white space.
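
A small sketch of split, including the whitespace point: spaces around the separator stay in the pieces unless the pattern itself consumes them. The second input string is invented for the example.

```perl
use strict;
use warnings;

my $string = "as,df,gh";
my @arr = split /,/, $string;                   # ("as", "df", "gh")

# Whitespace is kept unless the pattern matches it too.
my @raw     = split /,/,       "as, df , gh";   # ("as", " df ", " gh")
my @trimmed = split /\s*,\s*/, "as, df , gh";   # ("as", "df", "gh")

print join("|", @arr), "\n";       # as|df|gh
print join("|", @raw), "\n";       # as| df | gh
print join("|", @trimmed), "\n";   # as|df|gh
```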

-- variable interpolation is allowed: you can create the regex as a separate variable, then put it into the operator. For example, $regex = "cat"; print "CAT" if ($string =~ /$regex/); will work.
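
A sketch of interpolating a variable into a regex. Two related features not covered in the notes above are shown as additions: \Q...\E, which makes the interpolated text literal, and qr//, which stores a compiled pattern. The sample strings are invented.

```perl
use strict;
use warnings;

my $regex  = "cat";
my $string = "scatter";
my $plain  = ($string =~ /$regex/) ? 1 : 0;        # 1

# Metacharacters inside the variable are still special ...
my $dotted  = "c.t";
my $as_meta = ("cut" =~ /$dotted/) ? 1 : 0;        # 1: the . matches "u"

# ... unless the variable is quoted with \Q...\E.
my $as_text = ("cut" =~ /\Q$dotted\E/) ? 1 : 0;    # 0: needs a literal "c.t"

# qr// stores a compiled regex, modifiers included, in a variable.
my $word  = qr/\bcat\b/i;
my $whole = ("My Cat sleeps" =~ $word) ? 1 : 0;    # 1

print "$plain $as_meta $as_text $whole\n";         # 1 1 0 1
```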

--matches work on literal characters, e.g. /cat/ matches "scat", but a regex can also use:

assertions about positions such as beginning and end of the string and word boundaries,

character classes which allow any of several alternative characters to work in the match

quantifiers which look for variable numbers of an element in a row

groupings which are characters within parentheses treated as a unit for quantifiers

alternatives which try several possibilities for a match.

metacharacters that stand for something other than the literal character (like \n for newline)

--Assertions: take up no width, just specify a position.

^ start of line /^cat/ matches "cats" but not "scat"

$ end of line /cat$/ matches "scat" but not "cats"

\b between word and non-word chars /\bcat\b/ matches "cat" but not "scat" or "cats"
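
The three assertions can be compared side by side on a small word list (the list is made up for the example):

```perl
use strict;
use warnings;

my @words = ("cat", "cats", "scat", "the cat sat");

my @at_start = grep { /^cat/ }     @words;   # "cat", "cats"
my @at_end   = grep { /cat$/ }     @words;   # "cat", "scat"
my @whole    = grep { /\bcat\b/ }  @words;   # "cat", "the cat sat"

print join("|", @at_start), "\n";   # cat|cats
print join("|", @at_end), "\n";     # cat|scat
print join("|", @whole), "\n";      # cat|the cat sat
```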

--Character classes: all characters within square brackets [ ] are treated as alternatives to match a single character in a string: [aeiou] will match a single lowercase vowel, for example. If a caret ^ is the first character within the brackets, the class is negated: [^aeiou] will match any single character that is NOT a lowercase vowel: consonants, but also digits, spaces, punctuation, and uppercase letters. There are also some pre-defined character classes that can be used without the square brackets (although they can be put within square brackets if this is useful).

\w word char: [a-zA-Z0-9_]

\W non-word char [^a-zA-Z0-9_]

\d digit [0-9]

\D non-digit [^0-9]

\s white space [ \t\n\r\f]

\S non-space [^ \t\n\r\f]
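
A quick sketch of bracketed and pre-defined classes, collecting every matching character with /g in list context. The sample strings are invented.

```perl
use strict;
use warnings;

my $s = "Order 66, aisle B_7!";

my @vowels  = $s =~ /[aeiou]/g;   # e a i e (the capital O is not in the class)
my @digits  = $s =~ /\d/g;        # 6 6 7
my @nonword = $s =~ /\W/g;        # 3 spaces, the comma, the !  (_ is a word char)

# A negated class matches anything not listed, not only consonants.
my @others = "a1 b" =~ /[^aeiou]/g;   # "1", " ", "b"

print scalar(@vowels), " ", scalar(@digits), " ",
      scalar(@nonword), " ", scalar(@others), "\n";   # 4 3 5 3
```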

--Quantifiers: are used after a character, character class, or group to indicate how many are needed. Many quantifiers give a range: in this case the regex engine is greedy and takes the maximum number that it can. Thus, if $string = "caaaaaa", m/ca+/ will match caaaaaa, not ca. However, if you follow the quantifier with a ?, it will cause a lazy match: m/ca+?/ will match "ca" instead of "caaaaaa". Be careful with *, because it can match zero characters: m/.*/ matches any string, even an empty one. With $string = "good food", $string =~ s/o*/e/ means "substitute zero or more o's with e". This actually matches at the first position, since the empty string in front of the g is in fact zero o's, so the result is "egood food". s/o+/e/ needs one or more o's and gives "ged food" (the whole run "oo" is replaced by a single e); to get "geed feed", replace each o separately with s/o/e/g.

? 0 or 1; the preceding char is optional; {0,1}

* 0 or more; {0,} (so it can also match nothing at all)

+ 1 or more; {1,}

{m,n} matches at least m and at most n of this char

{m,} matches at least m

{,n} matches at most n (accepted only from Perl 5.34 on; write {0,n} in older versions)

{m} matches exactly m times
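
A sketch contrasting greedy and lazy matching, the zero-width pitfall of *, and counted quantifiers. Variable names are illustrative; the (my $copy = "...") =~ s/// form just keeps each result in its own variable.

```perl
use strict;
use warnings;

my $string = "caaaaaa";
my ($greedy) = $string =~ m/(ca+)/;     # "caaaaaa"
my ($lazy)   = $string =~ m/(ca+?)/;    # "ca"

# o* is satisfied by zero o's, so it matches the empty string before the g.
(my $star = "good food") =~ s/o*/e/;    # "egood food"
# o+ needs at least one o and replaces the whole run "oo" with one e.
(my $plus = "good food") =~ s/o+/e/;    # "ged food"
# Replacing each o separately needs a single-character pattern and /g.
(my $each = "good food") =~ s/o/e/g;    # "geed feed"

my $exactly3 = ("caaat"  =~ /^ca{3}t$/)   ? 1 : 0;   # 1
my $upto3    = ("caaaat" =~ /^ca{1,3}t$/) ? 1 : 0;   # 0: four a's is too many

print "$greedy $lazy | $star | $plus | $each | $exactly3 $upto3\n";
# prints: caaaaaa ca | egood food | ged food | geed feed | 1 0
```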

--Grouping: the characters within parentheses are grouped, so quantifiers after the parentheses apply to the whole group: m/(cat){3}/ matches catcatcat, while m/cat{3}/ matches cattt. More on parentheses below.

--Alternatives: groups of characters (usually within parentheses) separated by a vertical bar | are alternatives. The regex engine tries each alternative, from left to right, looking for a match. Note that these are groups of characters, not single characters. Use character classes for single characters, as this is much faster: [abc] is more like a filter than actual alternatives, and character classes don't generate backtracking. At each position the first alternative that succeeds (NOT the longest) is reported. Alternatives are not greedy.

"Three tournaments won" =~ m/(to|tour|tournaments)/ matches "to" because it is the leftmost alternative.

"Three tournaments won" =~ m/(tournaments|tour|to)/ matches "tournaments" because it is the leftmost alternative.

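The effect of alternative order can be checked directly; this sketch reuses the "Three tournaments won" string from above, with made-up variable names.

```perl
use strict;
use warnings;

my $s = "Three tournaments won";

# The first alternative that succeeds wins, not the longest one.
my ($short_first) = $s =~ m/(to|tour|tournaments)/;   # "to"
my ($long_first)  = $s =~ m/(tournaments|tour|to)/;   # "tournaments"

# For single characters a class gives the same answer as alternation.
my ($by_class) = $s =~ m/([nw])/;    # "n"
my ($by_alt)   = $s =~ m/(n|w)/;     # "n"

print "$short_first $long_first $by_class $by_alt\n";
# prints: to tournaments n n
```
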
--Metacharacters: stand for something other than the literal characters: e.g. \n is newline, \t is tab. If you want to use the literal characters, they must be "escaped" by putting a backslash \ in front of them. Thus, \\ is a literal backslash. Different metacharacters are used inside character classes [ ] than outside.

--in regex:

^ start of line

$ end of line

\b between word and non-word chars

\B NOT a boundary between a word and a non-word character

. matches any char except \n

\w word char: [a-zA-Z0-9_]

\W non-word char [^a-zA-Z0-9_]

\d digit [0-9]

\D non-digit [^0-9]

\s white space [ \t\n\r\f]

\S non-space [^ \t\n\r\f]

\n, \t, \r, \f, newline, tab, carriage return, formfeed

| alternative

( ) grouping

[ ] character class

{ } quantifier

+,?,* quantifiers

/ only if used as a delimiter

--in character class:

[^ caret as first char in class: negates the class; ^ anywhere else is just a caret

- hyphen between two chars indicates the range (in ASCII sequence), e.g. 0-9

\w, \d, \s word, digit, whitespace, as above. The negated forms \W, \D, \S also work inside a class, e.g. [^\W\d_] for letters.

\n, \t, \r, \f, as above

] the closing delimiter, must be escaped \] to use it as part of the class

 

--More on parentheses: have two effects:

1. groups things, as in (cat)+ means catcatcat, while cat+ means cattt

2. remembers what matches its contents, as variables numbered $1, $2, ...

--The memory counts variables based on order of leftmost parentheses: (the ((cat) (runs))) captures

$1 = the cat runs; $2 = cat runs; $3 = cat; $4 = runs.

--if you are still within the regex, the variables are \1, \2, etc. This works within a match, such as m/(ATG)CCGG\1/, which matches ATGCCGGATG. It also works within the left side of a substitution, but if you need it on the right side, you must use $1: s/(ATG)CCGG\1/$1TTAA$1/ will replace the sequence above with ATGTTAAATG.

--If several alternatives are captured by parentheses, only the one that actually matches will be captured: $string = "My pet is a cat"; $string =~ m/\b(cat|dog)\b/; print $1; will print "cat".

--the $1, $2, etc. variables are scoped to the smallest enclosing block in curly braces { }; outside that block they go back to whatever they held before. A failed match leaves them unchanged, so check that the match succeeded before using them.
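
A sketch of both jobs of parentheses, grouping and capturing, plus a backreference inside the pattern and $1 in the replacement. It reuses the strings from the notes above; the variable names are made up.

```perl
use strict;
use warnings;

# Grouping: the quantifier applies to the whole group or to one char.
my $group_ok = ("catcatcat" =~ /^(cat)+$/) ? 1 : 0;   # 1
my $char_ok  = ("cattt"     =~ /^cat+$/)   ? 1 : 0;   # 1

# Capturing: groups are numbered by the position of their opening parenthesis.
my @groups = "the cat runs" =~ m/(the ((cat) (runs)))/;
# @groups is ("the cat runs", "cat runs", "cat", "runs")

# \1 inside the pattern, $1 in the replacement.
my $dna    = "ATGCCGGATG";
my $repeat = ($dna =~ m/(ATG)CCGG\1/) ? 1 : 0;        # 1
(my $edited = $dna) =~ s/(ATG)CCGG\1/$1TTAA$1/;       # "ATGTTAAATG"

# With alternatives, the capture holds whichever one matched.
my ($pet) = "My pet is a cat" =~ m/\b(cat|dog)\b/;    # "cat"

print join("|", @groups), "\n";   # the cat runs|cat runs|cat|runs
print "$edited $pet\n";           # ATGTTAAATG cat
```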

--the regex engine is an NFA (nondeterministic finite automaton), and its backtracking behavior makes it both powerful and tricky to use.

--basic operation of the engine:

1. start at the left end of the string and try to match every char in the regex in exact order:

"The cat ran" =~ m/cat/ Matches "cat" after 4 failures at the chars before cat.

2. As soon as a match succeeds, stop: "The cat cat ran" =~ m/cat/ stops at the first cat and never reaches the second one.

3. check all possible alternatives before declaring failure. (This can take a lot of time.)

4. keep track of all alternatives that might be tried; if a match fails on one alternative, go back to the previous choice and take the other alternative and follow it until either the match succeeds or it fails and the engine backtracks to the alternative before that. This is a last in, first out (LIFO) arrangement, a stack.

 

--Some specifics:

1. The engine starts at the leftmost position in the string and tries to match from there. If all possibilities fail, it moves to the next position and tries all possibilities from there. Failure only occurs if all positions have been tried.

2. Alternatives in parentheses are tried in left to right order, and all of them are tried at every position before moving on to the next position.

3. Matches must satisfy every assertion (position specified by ^, $, etc.) or quantified atom (character or character class with quantifier) in the regex, in their order, so that each position is matched before the engine moves on to the next position.

4. Quantifiers match by starting at the first char and moving to the right as long as each char matches. When it reaches the first non-match, it backtracks to the last matching char and reports the match. This is "greedy" behavior: the maximal number of matches is taken.

 

 

Examples:

1. matching letters: problem is, \w also matches digits and underscore. /^[A-Za-z]+$/ requires that every char in the entire string be a letter: /^[ACGTacgt]+$/ for DNA. Or /^[^\W\d_]+$/

2. Words in general: have apostrophe (') and hyphen (-) in them, such as o'clock and cat's and pre-evaluation. Also, there are some common mixed number/letter expressions, e.g. 1st for first. Also, words can be found within words: "there goes the cat" =~ m/the/ matches inside the first word ("there"), not the third, but =~ m/\bthe\b/ matches the word "the" only.

3. Trim leading and trailing whitespace from a string: s/^\s*(.*?)\s*$/$1/. Note the anchors and the lazy evaluation: a greedy .* would swallow the trailing whitespace into $1.

4. Strip the path from a full-path filename to the bare filename. s!.*/!! Removes everything up to and including the final /. Note use of alternative delimiters.

5. time of day: 11:30. [01][0-9]:[0-5][0-9] won't work well: it accepts 00:30 and 19:59 and rejects 9:30. (1[012]|[1-9]):[0-5][0-9] will. Note that spaces inside the regex are matched literally unless the /x modifier is used.

6. Quantifier: "The cat ran" =~ m/he/ vs m/.*he/. The latter first runs .* through every char to the end of the string and then tries to match "he". Then it backtracks one char at a time, trying to match the "he" each time, until it gets back to the second char in the string. So, it's a lot slower.

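7. Exponential increase in possibilities: using two unlimited quantifiers, one nested inside the other, as in /(a+)*/. When the overall match fails (e.g. /^(a+)*b/ against a long run of a's with no b), the engine can end up trying every way of dividing the a's between the inner + and the outer *, so the work grows exponentially with the length of the run.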

8. Things within a delimiter: e.g. <a href=http://www.bios.niu.edu> You want what's within the tag, everything between <a and >, so m/<a(.*)>/ But, because .* is greedy, this runs on past the tag's own > to the last > on the line. So, try m/<a([^>]*)>/ That is, zero or more chars that are not >.

9. temperatures input from STDIN: can start with + or - or nothing, immediately followed by some digits, optionally followed by a decimal point and more digits, followed by F or C: m/^[+-]?[0-9]+(\.[\d]*)?[CF]$/ To capture the number in $1 and the C or F in $3 (the optional fraction is $2): m/^([+-]?[0-9]+(\.[\d]*)?)([CF])$/

10. Some things are difficult to do with regexes: nesting delimiters, for example. Also, try finding a palindromic restriction site: starts with m/[ACGT]{3}/, then need to complement it and look for complement at each position.

11. Find all sequences within the EcoB restriction site, which is TGANNNNNNNNGCT: while ($seq =~ m/TGA([ACGT]{8})GCT/g) { push @arr, $1; } saves the eight variable bases of each site into an array.
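
Several of the examples above (3, 4, 5, 8, 9, and 11) can be tried in one short script. This is a sketch under stated assumptions: the input strings, the text after the <a ...> tag, the test sequence, and all variable names are invented, and the ^ and $ anchors on the time pattern are an addition so that the whole string must be a time.

```perl
use strict;
use warnings;

# 3. Trim leading and trailing whitespace.
(my $trimmed = "   some text \t") =~ s/^\s*(.*?)\s*$/$1/;   # "some text"

# 4. Strip the path: greedy .* runs to the final slash.
(my $file = "/usr/local/bin/perl") =~ s!.*/!!;              # "perl"

# 5. Twelve-hour time of day, anchored at both ends (an assumption).
my $time_re = qr/^(1[012]|[1-9]):[0-5][0-9]$/;
my @valid = grep { $_ =~ $time_re } ("11:30", "9:05", "13:00", "0:15", "12:60");
# @valid is ("11:30", "9:05")

# 8. Greedy .* runs to the last >; [^>]* stops at the first one.
my $html = '<a href=http://www.bios.niu.edu>NIU</a> <b>bold</b>';
my ($too_much) = $html =~ m/<a(.*)>/;       # ends with "<b>bold</b"
my ($inside)   = $html =~ m/<a([^>]*)>/;    # " href=http://www.bios.niu.edu"

# 9. Temperature: $1 = number, $2 = optional fraction, $3 = unit.
my ($number, $fraction, $unit) = "-40.5C" =~ m/^([+-]?[0-9]+(\.[\d]*)?)([CF])$/;

# 11. Collect the variable bases of every EcoB-like site (hypothetical sequence).
my $seq = join "", "GG", "TGA", "ACGTACGT", "GCT", "AA", "TGA", "TTTTCCCC", "GCT", "GG";
my @arr;
while ($seq =~ m/TGA([ACGT]{8})GCT/g) {
    push @arr, $1;
}

print "$trimmed | $file | @valid | $inside | $number $unit | @arr\n";
```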

Matching, substitution, and assignment: Note that =~ has a higher precedence than =, so parentheses are needed to get what you want.

In a list context, m// returns the matched sequences that are in the parentheses. @arr = ($string =~ m/(ATG|CTG|GTG)GGC/g); will save every ATG or CTG or GTG that is followed by GGC into the array (without /g, only the first one is saved). The "pos" function keeps track of where the last /g match ended in the sequence. Thus, while ($seq =~ m/(ATG|CTG|GTG)GGC/g ) { $site = pos $seq; push @sitearr, $site; } will list the position just past the end of each of the sites in the sequence.
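
A sketch of list-context matching and pos on a made-up sequence. pos reports the offset just past the end of the latest /g match; the start offset is taken here from $-[0], an element of Perl's built-in @- array, which is an addition to the notes above.

```perl
use strict;
use warnings;

# Hypothetical sequence with two sites, at offsets 2 and 10.
my $seq = join "", "AA", "ATGGGC", "TT", "CTGGGC", "AA";

# List context with /g: one captured codon per match.
my @codons = ($seq =~ m/(ATG|CTG|GTG)GGC/g);   # ("ATG", "CTG")

# Scalar context with /g: step through the matches one at a time.
my (@ends, @starts);
while ($seq =~ m/(ATG|CTG|GTG)GGC/g) {
    push @ends,   pos($seq);   # offset just past the end of the match
    push @starts, $-[0];       # offset where the match begins
}

print "@codons | @starts | @ends\n";   # ATG CTG | 2 10 | 8 16
```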

Another useful construction: ($a = $b) =~ s/this/that/; First, $b is copied into $a, then $a is substituted. $b is not changed by this at all. If you change the parentheses: $a = ($b =~ s/this/that/ ); or don't use them at all, substitution is done on $b, and $a gets the number of substitutions done (s/// is being used in a scalar context).
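
The two parenthesizations can be compared directly. This sketch uses the names $original, $copy, $source, and $count instead of $a and $b, since $a and $b are also used by Perl's sort.

```perl
use strict;
use warnings;

# Copy first, then substitute in the copy: the original is untouched.
my $original = "this is this";
(my $copy = $original) =~ s/this/that/g;

# Substitute in place and keep the count.
my $source = "this is this";
my $count  = ($source =~ s/this/that/g);

print "$original | $copy | $source | $count\n";
# prints: this is this | that is that | that is that | 2
```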

Translate: tr///. Doesn't use regex: just substitutes one letter for another: tr/ACGT/TGCA/ complements the sequence. tr/ACGT// (also tr/ACGT/ACGT/) changes nothing, but returns the number of matching chars (an easy way to count): $num = $string =~ tr/ACGT/ACGT/;

$string =~ tr/ACGTN/ACGT/d removes all N's: with the /d modifier, any chars in the left side that don't have a counterpart in the right side are deleted. Without /d, the last char of the right side is repeated to fill out the list, so N would become T.
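
A sketch of tr/// for complementing, counting, and deleting, on a made-up sequence. Deleting is done with the /d modifier; the last line shows what happens when the right side is shorter and /d is left off.

```perl
use strict;
use warnings;

my $dna = "AACGTTNNGA";

# Complement each base; N has no entry in the list and is left alone.
(my $comp = $dna) =~ tr/ACGT/TGCA/;       # "TTGCAANNCT"

# An empty right side changes nothing and returns the number of hits.
my $count = ($dna =~ tr/ACGT//);          # 8

# /d deletes listed chars that have no counterpart on the right.
(my $no_n = $dna) =~ tr/N//d;             # "AACGTTGA"

# Without /d the last right-hand char is repeated, so N turns into T.
(my $n_to_t = $dna) =~ tr/ACGTN/ACGT/;    # "AACGTTTTGA"

print "$comp $count $no_n $n_to_t\n";
# prints: TTGCAANNCT 8 AACGTTGA AACGTTTTGA
```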