Thursday, August 20, 2009

Learn Perl Easy, Part 4

Filehandles

You can create your own filehandles using the open function, read and/or write to them, and then clean up using close.
open

open opens a file for reading and/or writing, and associates a filehandle with it. You can choose any name for the filehandle, but the convention is to make it all caps. In the examples, we use FILEHANDLE.
open a file for reading: open FILEHANDLE,"cosmids.fasta"
alternative form: open FILEHANDLE,"<cosmids.fasta"

open a file for writing: open FILEHANDLE,">cosmids.fasta"

open a file for appending: open FILEHANDLE,">>cosmids.fasta"

open a file for reading and writing: open FILEHANDLE,"+<cosmids.fasta"

Catching Open Failures

It's common for open to fail. Maybe the file doesn't exist, or you don't have permissions to read or create it. Always check open's return value, which is TRUE if the operation succeeded, FALSE otherwise:

$result = open COSMIDS,"cosmids.fasta";
die "Can't open cosmids file: $!\n" unless $result;

When an error occurs, the $! variable holds a string describing the error, such as "No such file or directory".

There is a compact idiom for accomplishing this in one step:

open COSMIDS,"cosmids.fasta" or die "Can't open cosmids file: $!\n";
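Newer Perl code often writes the same check with a lexical filehandle and the three-argument form of open, which keeps the mode separate from the file name. The following is only a sketch of that style: the helper name first_line and the scratch file are assumptions made so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the first line of a file, or die with the
# reason stored in $! if the file cannot be opened.
sub first_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    my $line = <$fh>;
    close $fh or warn "Errors while closing $path: $!";
    return $line;
}

# Create a small scratch file so the example is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print $tmp_fh ">cosmid1\nGATTACA\n";
close $tmp_fh or die "Can't close $tmp_name: $!\n";

my $header = first_line($tmp_name);
print "First line: $header";

# Opening a file that does not exist fails; eval traps the die.
my $ok = eval { first_line("$tmp_name.missing"); 1 };
print "Open failed as expected: $@" unless $ok;
```

The mode ('<', '>', '>>' or '+<') is the same set of prefixes listed earlier, just passed as its own argument.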

Using a Filehandle

Once you've created a filehandle, you can read from it or write to it, just as if it were STDIN or STDOUT. This code reads from file "text.in" and copies lines to "text.out":

open IN,"text.in" or die "Can't open input file: $!\n";
open OUT,">text.out" or die "Can't open output file: $!\n";

while ($line = <IN>) {
print OUT $line;
}

Closing a Filehandle

When you are done with a filehandle, you should close it. This will also happen automatically when your program ends, or if you reuse the same filehandle name.

close IN or warn "Errors while closing filehandle: $!";

Some errors, like filesystem full, only occur when you close the filehandle, so you should check for errors in the same way you do when you open a filehandle.

The Magic of <>

The bare <> function when used without any explicit filehandle is magical. It reads from each of the files on the command line as if they were one single large file. If no file is given on the command line, then <> reads from standard input.

This sounds weird, but it is extremely useful.
A Practical Example of <>

Count the number of lines and bytes in a series of files. If no file is specified, count from standard input (like wc does).

Code:

#!/usr/local/bin/perl
# file: wc.pl
($bytes,$lines) = (0,0);

while (<>) {
$bytes += length($_);
$lines++;
}

print "LINES: $lines\n";
print "BYTES: $bytes\n";

Output:

(~/grant) 79% wc.pl progress.txt
LINES: 102
BYTES: 5688

(~/grant) 80% wc.pl progress.txt resources.txt specific_aims
LINES: 481
BYTES: 24733


Globals and Functions that Affect I/O

Several built-in globals affect input and output:
$/ The input record separator. The value of this global is used by <> to determine where the end of a line is. Normally "\n".

$\ The output record separator. Whatever this is set to will appear at the end of everything printed by print. Normally empty.

$, The output field separator. Appears between all items printed with the print function. Normally empty.

$" The output list separator. Interpolated between all items of an array when an array is interpolated into a double-quoted string. Normally a space.

$. The line count. When reading from <>, this will be set to the line number of the "virtual file".
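To see the three output globals side by side, here is a small sketch. It prints into an in-memory filehandle (an assumption made purely so the result can be inspected) and uses local so that the globals return to their defaults afterwards.

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');
my $out = '';

# Print into an in-memory filehandle so the result can be inspected.
open(my $fh, '>', \$out) or die "Can't open in-memory file: $!\n";
{
    # local restores the previous values when this block ends.
    local $, = '-';      # placed between the items given to print
    local $\ = "!\n";    # appended after every print
    local $" = '+';      # placed between array elements inside "..."
    print $fh @bases;    # writes A-C-G-T!
    print $fh "@bases";  # writes A+C+G+T!
}
close $fh or die "Can't close in-memory file: $!\n";

print $out;
```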

Example use of Input Record Separator

Say you have a text file containing records in the following interesting format:


>gi|5340860|gb|AI793144.1|AI793144 on36f02.y5 NCI_CGAP_Lu5 Homo sapiens cDNA clone
CAAACAGCCCCCGATAACGCTACGTGAGCTGGGCCCTGGGCCTGAGGCAGAAAACGGACGGAAGAAAAGG
TCTGGCCGGAGATGGGTCTCACTCTGTCACCCAGACTGGAGTGCAGTGAGTGGTGCGATCATAGCTTACT
GCAGCCTGAAACTCCTGGGCTCAAGTGATCTTCTCGCCTCAGCCTCCTGAGTAGCTGGAGCTACAGGAAT
GAGCATAGATGAACAATGTTGCATCACGCTTGACATCACCGGNGCTTCTTTCCAGTGTGGATTTGCTCAT
GTAAAATGAGGTGTGAGCTCTGCCTGAAAGCTTTTCCATATGCATCACATTTGCAGGGCTTTTCTCCAGT
GTGGGTTCTTTGGTGTCTCAAAAGATGTGAGCTGTTACTGAAAGCTTTCCCACACACATCACACTCATAG
GGCTTCTCTCTACCGTGGATTCGCTGGTGTCCAACAAGAGCTGAACTGTATCTGAAGGCCTTTCCACGCT
TGTCACATTCATATAGTTTCTTTCCACTGTGGATTNTCTGGTGACAGAAGAGGCCCAAGCACTAGCTAAA
GCTNTTCCCTCACTCACTACACTGCTATGGCTTCTCTTCAGTATGAACTCTGATGTTGTCTCAGATATGA
ACTCAGAGAGGATNTCCCACAATCATTACACTGGTATGGTTCCTTTTCGTGTGAGTTCTCTGGTGTCNAA
ATACATCTGAGCTGTGATGAAAGAACTTNCCACACTCACTACATTGGGAAGG

>gi|4306680|gb|AI451833.1|AI451833 mx13e08.y1 Soares mouse NML Mus musculus cDNA clone
TGAATGTATGCAGTGCGGAAAGACATTCACTTCTGGCCACTGTGCCAGAAGACATTTAGGGACTCACAGT
GGAGCCTGGCCTTACAAATGTGAAGTGTGTGGGAAAGCTTATCCCTACGTCTATTCCCTTCGAAACCACA
AAAAAAGTCACAACGAAGAAAAACTTTATGAATGTAAACAATGTGGGAAAGCCTTTAAATACATTTCTTC
CTTACGCAACCACGAGACTACTCACACTGGAGAGAAGCCCTATGAATGTAAGGAATGTGGGAAAGCCTTT
AGTTGTTCCAGTTACATTCAAAATCACATGAGAACACACAAAAGGCAGTCCTATGAATGTAAGGAGTGTG
GTAAGGTGTTCTCATATTCCAAAAGTCTTCGGAGACACATGACTACACATAGTTAATTAGAGAGGGATAG
TTNTAAGTATAATTTAAATATATAAAAGAGCTCTACACATTCTAGCTCCTCATTAAGAAACAAAAAATTT
CACACTGGAAAACGAGCCTATGAATGCAGTATGTGTGCCAAAGTCTCAGTACATGCCACAGT

>gi|3400733|gb|AI074089.1|AI074089 oq97c08.x1 NCI_CGAP_Co12 Homo sapiens cDNA clone
GAATCTTCTGGGTCCTCTTTATTAAGAGCCCTCTGCCTTCCCAGGGGAGGGAAGCAAATCCTTCAGGGCC
CCCAGAGTTCCTGCACCCCATATCATGGGTGAGTCCTACCAGCCACAGAGCCACCCGTCACCGTGGAGAG
GCTTAAGCTGCACTCAGAGCTCCCCCCGGGCATGCCGAATGTAGTGTTGATGCAGCCCTGCTTCCTGAGC
AAAGTCCTGACCGCACTCTGTGCAGGCGAAGGTGCCAGGAGGGGCACGGACCTCATGCATCTGGCGGTGC
CGCCTCAGAGAAACAGCCTGCCCAAAGGTCTTGCCACAGTCAGGACAAGGGAAGGTGGGCTGGGCAGTAG
TGGTTGCAACCGGCAGGGTGGGCTTGGCGGCTGGACCGTGGCTGCGCTGGTGGGTGATTAGGGCTTTGGA
...

If you use standard <>, you will get a line at a time, and have to figure out where one record ends and a new one starts. However, if you set the input record separator to ">", then each time you read a "line", you will read all the way to the next ">" symbol. Throw away the first record (which is empty), keep the others.

#!/usr/local/bin/perl
# file: get_fasta_records.pl

$/ = '>';

<>; # throw away the first record (will be empty)

while (<>) {
chomp;
# split up lines of the record. The first line
# is the sequence ID. The second and subsequent lines
# are the sequence
my ($id,@sequence) = split "\n";
my $sequence = join '',@sequence; # reassemble the sequence
}

Special Uses of the Input Record Separator

The input record separator has two special cases.
Paragraph Mode

If the input record separator ($/) is set to the empty string ("") it goes into paragraph mode. Each <> will read up to the next blank line. Multiple blank lines will be skipped over. This is good for reading text separated into paragraphs.
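A minimal sketch of paragraph mode follows. The sample text and the in-memory filehandle are made up for illustration; with a real file you would open it by name instead.

```perl
use strict;
use warnings;

my $text = "First paragraph,\nline two.\n\n\n\nSecond paragraph.\n";

# Read from a string through an in-memory filehandle.
open(my $fh, '<', \$text) or die "Can't open in-memory file: $!\n";

my @paragraphs;
{
    local $/ = "";              # paragraph mode, restored at the end of the block
    while (my $para = <$fh>) {
        chomp $para;            # in paragraph mode chomp removes all trailing newlines
        push @paragraphs, $para;
    }
}
close $fh;

print scalar(@paragraphs), " paragraphs\n";
```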
Slurp Mode

If the input record separator is set to the undefined value (undef) then it goes into slurp mode. The <> operator will read its entire input into a single scalar.

Here's how to read the entire file cosmids.fasta into a scalar variable:

open IN,"cosmids.fasta" or die "Can't open cosmids.fasta: $!\n";
$/ = undef;

$data = <IN>; # data slurp
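Because $/ is global, setting it to undef affects every later read in the program. A common way around this is to limit the change with local, as in this sketch. The helper name slurp and the in-memory sample data are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole contents of an open filehandle.
sub slurp {
    my ($fh) = @_;
    local $/;              # undef inside this sub only; restored on return
    return scalar <$fh>;
}

my $fasta = ">seq1\nGATTACA\n>seq2\nCCGG\n";
open(my $in, '<', \$fasta) or die "Can't open in-memory file: $!\n";
my $data = slurp($in);
close $in;

print length($data), " characters read\n";
print "Separator is still a newline\n" if $/ eq "\n";
```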


Regular Expressions

A regular expression is a string template against which you can match a piece of text. They are something like shell wildcard expressions, but much more powerful.
Examples of Regular Expressions

This bit of code loops through each line of a file, finds all lines containing an EcoRI site, and bumps up a counter:

Code:

#!/usr/bin/perl -w
#file: EcoRI1.pl

use strict;

my $filename = "example.fasta";
open (FASTA, $filename) or die "Can't open $filename: $!\n";
my $sites;

while (my $line = <FASTA>) {
chomp $line;

if ($line =~ /GAATTC/){
print "Found an EcoRI site!\n";
$sites++;
}
}

if ($sites){
print "$sites EcoRI sites total\n";
}else{
print "No EcoRI sites were found\n";
}

#note: if $sites is declared inside while loop you would not be able to
#print it outside the loop

Output:

~]$ ./EcoRI1.pl
Found an EcoRI site!
Found an EcoRI site!
.
.
.
Found an EcoRI site!
Found an EcoRI site!
34 EcoRI sites total


This Works Too!
Code:

#file:EcoRI2.pl

while (<FASTA>) {
chomp;
if ($_ =~ /GAATTC/){
print "Found an EcoRI site!\n";
$sites++;
}
}

Output:

~]$ ./EcoRI1.pl
Found an EcoRI site!
Found an EcoRI site!
.
.
.
Found an EcoRI site!
Found an EcoRI site!
34 EcoRI sites total


This Also Works
Code:

#file:EcoRI.pl

while (<FASTA>) {
chomp;
if (/GAATTC/){
print "Found an EcoRI site!\n";
$sites++;
}
}

By default, a regular expression examines $_ and returns TRUE if it matches, FALSE otherwise.

Output:

~]$ ./EcoRI1.pl
Found an EcoRI site!
Found an EcoRI site!
.
.
.
Found an EcoRI site!
Found an EcoRI site!
34 EcoRI sites total

This does the same thing, but counts one type of methylation site (Pu-C-X-G) instead:
Code:

#file:methy.pl

while (<FASTA>) {
chomp;

if (/[GA]C.?G/){ #What Happens If Your File Is Not All In CAPS
#print "Found a Methylation Site!\n";
$sites++;
}
}
if ($sites){
print "$sites Methylation Sites total\n";
}else{
print "No Methylation Sites were found\n";
}



Output:

~]$ ./methy.pl
723 Methylation Sites total

Regular Expression Variable

A regular expression is normally delimited by two slashes ("/"). Everything between the slashes is a pattern to match. Patterns can be made up of the following Atoms:

1. Ordinary characters: a-z, A-Z, 0-9 and some punctuation. These match themselves.

2. The "." character, which matches any single character except the newline.

3. A bracket list of characters, such as [AaGgCcTtNn], [A-F0-9], or [^A-Z] (the last means anything BUT A-Z).

4. Certain predefined character sets:
\d The digits [0-9]
\w A word character [A-Za-z_0-9]
\s White space [ \t\n\r]
\D A non-digit
\W A non-word character
\S Non-whitespace

5. Anchors:
^ Matches the beginning of the string
$ Matches the end of the string
\b Matches a word boundary (between a \w and a \W)

Examples:

* /g..t/ matches "gaat", "goat", and "gotta get a goat" (twice)

* /g[gatc][gatc]t/ matches "gaat", "gttt", "gatt", and "gotta get an agatt" (once)

* /\d\d\d-\d\d\d\d/ matches 376-8380, and 5128-8181, but not 055-98-2818.

* /^\d\d\d-\d\d\d\d/ matches 376-8380 and 376-83801, but not 5128-8181.

* /^\d\d\d-\d\d\d\d$/ matches only strings that consist of a seven-digit telephone number such as 376-8380, with nothing before or after it.

* /\bcat/ matches "cat", "catsup" and "more catsup please" but not "scat".

* /\bcat\b/ matches only text containing the word "cat".
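The claims in the list above are easy to check for yourself. This sketch tests a few of them; the variable names are arbitrary.

```perl
use strict;
use warnings;

# Count how many times /g..t/ matches in one string.
my $goat_matches = () = 'gotta get a goat' =~ /g..t/g;

# The same phone pattern with and without anchors.
my $loose    = '5128-8181' =~ /\d\d\d-\d\d\d\d/   ? 1 : 0;
my $anchored = '5128-8181' =~ /^\d\d\d-\d\d\d\d/  ? 1 : 0;
my $exact    = '376-83801' =~ /^\d\d\d-\d\d\d\d$/ ? 1 : 0;

# Word boundaries.
my $prefix = 'more catsup please' =~ /\bcat/   ? 1 : 0;
my $word   = 'more catsup please' =~ /\bcat\b/ ? 1 : 0;
my $scat   = 'scat'               =~ /\bcat/   ? 1 : 0;

print "g..t matched $goat_matches times\n";
print "loose=$loose anchored=$anchored exact=$exact\n";
print "prefix=$prefix word=$word scat=$scat\n";
```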

Quantifiers

By default, an atom matches once. This can be modified by following the atom with a quantifier:
? atom matches zero times or exactly once
* atom matches zero or more times
+ atom matches one or more times
{3} atom matches exactly three times
{2,4} atom matches between two and four times, inclusive
{4,} atom matches at least four times

Examples:

* /goa?t/ matches "goat" and "got". Also any text that contains these words.
* /g.+t/ matches "goat", "goot", and "grant", among others.
* /g.*t/ matches "gt", "goat", "goot", and "grant", among others.
* /^\d{3}-\d{4}$/ matches US telephone numbers (no extra text allowed).
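As a quick way to compare the quantifiers, this sketch filters one word list with several patterns. The word list is made up, and ^ and $ are added so that each pattern has to cover the whole word.

```perl
use strict;
use warnings;

my @words = qw(gt got goat goot grant);

my @optional  = grep { /^goa?t$/ }   @words;  # "a" zero times or once
my @one_plus  = grep { /^g.+t$/ }    @words;  # at least one character between g and t
my @zero_plus = grep { /^g.*t$/ }    @words;  # any number of characters, including none
my @two_os    = grep { /^go{2}t$/ }  @words;  # exactly two "o"s

print "goa?t:  @optional\n";
print "g.+t:   @one_plus\n";
print "g.*t:   @zero_plus\n";
print "go{2}t: @two_os\n";
```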

Alternatives and Grouping

A set of alternative patterns can be specified with the | symbol:

/wolf|sheep/; # matches "wolf" or "sheep"

/big bad (wolf|sheep)/; # matches "big bad wolf" or "big bad sheep"

You can combine parentheses and quantifiers to quantify entire subpatterns:

/Who's afraid of the big (bad )?wolf\?/;
# matches "Who's afraid of the big bad wolf?" and
# "Who's afraid of the big wolf?"

This also shows how to literally match the special characters -- put a backslash (\) in front of them.
Specifying the String to Match

Regular expressions will attempt to match $_ by default. To specify another string variable, use the =~ (binding) operator:

$h = "Who's afraid of Virginia Woolf?";
print "I'm afraid!\n" if $h =~ /Woo?lf/;

There's also an equivalent "not match" operator !~, which reverses the sense of the match:

$h = "Who's afraid of Virginia Woolf?";
print "I'm not afraid!\n" if $h !~ /Woo?lf/;

Using a Different Delimiter

If you want to match slashes in the pattern, you can backslash them:

$file = '/usr/local/blast/cosmids.fasta';
print "local file" if $file =~ /^\/usr\/local/;

This is ugly, so you can specify any match delimiter with the m (match) operator:

$file = '/usr/local/blast/cosmids.fasta';
print "local file" if $file =~ m!^/usr/local!;

The punctuation character that follows the m becomes the delimiter. In fact // is just an abbreviation for m//. Almost any punctuation character will work:

* m!^/usr/local!
* m#^/usr/local#
* m@^/usr/local@
* m,^/usr/local,
* m{^/usr/local}
* m[^/usr/local]

The last two examples show that you can use left-right bracket pairs as well.
Matching with a Variable Pattern

You can use a scalar variable for all or part of a regular expression. For example:

$pattern = '/usr/local';
print "matches" if $file =~ /^$pattern/;

See the o flag for important information about using variables inside patterns.
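One thing to watch for: characters such as "." inside the variable keep their special regular expression meaning. Perl's \Q...\E escape quotes them. A small sketch, where the $suffix variable and the test strings are made up and \z (end of string) is used as the anchor:

```perl
use strict;
use warnings;

my $file    = '/usr/local/blast/cosmids.fasta';
my $pattern = '/usr/local';
my $is_local = $file =~ /^$pattern/ ? 1 : 0;

# The "." in $suffix is a metacharacter unless it is quoted with \Q...\E.
my $suffix   = '.fasta';
my $unquoted = 'cosmids_fasta' =~ /$suffix\z/     ? 1 : 0;  # "." matches "_"
my $quoted   = 'cosmids_fasta' =~ /\Q$suffix\E\z/ ? 1 : 0;  # needs a literal "."

print "is_local=$is_local unquoted=$unquoted quoted=$quoted\n";
```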

Subpatterns

You can extract and manipulate subpatterns in regular expressions.

To designate a subpattern, surround its part of the pattern with parentheses (same as with the grouping operator). This example has just one subpattern, (.+) :

/Who's afraid of the big bad w(.+)f/

Matching Subpatterns

Once a subpattern matches, you can refer to it later within the same regular expression. The first subpattern becomes \1, the second \2, the third \3, and so on.

while (<>) {
chomp;
print "I'm scared!\n" if /Who's afraid of the big bad w(.)\1f/
}

This loop will print "I'm scared!" for the following matching lines:

* Who's afraid of the big bad woof
* Who's afraid of the big bad weef
* Who's afraid of the big bad waaf

but not

* Who's afraid of the big bad wolf
* Who's afraid of the big bad wife

In a similar vein, /\b(\w+)s love \1 food\b/ will match "dogs love dog food", but not "dogs love monkey food".
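Both backreference patterns can be tried on fixed strings instead of standard input, as in this sketch (the arrays simply hold the examples from the text):

```perl
use strict;
use warnings;

my @lines = (
    "Who's afraid of the big bad woof",
    "Who's afraid of the big bad wolf",
    "Who's afraid of the big bad waaf",
    "Who's afraid of the big bad wife",
);
# \1 must repeat whatever single character (.) matched.
my @scary = grep { /Who's afraid of the big bad w(.)\1f/ } @lines;

my @claims = ('dogs love dog food', 'dogs love monkey food');
# \1 must repeat the animal name captured by (\w+).
my @consistent = grep { /\b(\w+)s love \1 food\b/ } @claims;

print "$_\n" for @scary, @consistent;
```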
Using Subpatterns Outside the Regular Expression Match

Outside the regular expression match statement, the matched subpatterns (if any) can be found in the variables $1, $2, $3, and so forth.

Example. Extract 50 base pairs upstream and 25 base pairs downstream of the TATTAT consensus transcription start site:


while (<>) {
chomp;
next unless /(.{50})TATTAT(.{25})/;
my $upstream = $1;
my $downstream = $2;
}

Extracting Subpatterns Using Arrays

If you assign a regular expression match to an array, it will return a list of all the subpatterns that matched. Alternative implementation of previous example:


while (<>) {
chomp;
my ($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
}

If the regular expression doesn't match at all, then it returns an empty list. Since an empty list is FALSE, you can use it in a logical test:


while (<>) {
chomp;
next unless my($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
print "upstream = $upstream\n";
print "downstream = $downstream\n";
}
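The same list assignment on a fixed string, as a sketch. The short sequence is made up, and the counts 4 and 3 stand in for the 50 and 25 used above so the result is easy to read.

```perl
use strict;
use warnings;

# Short made-up sequence; 4 and 3 stand in for the 50 and 25 used above.
my $dna = 'GGCCGCGCTATTATAAGG';

my ($upstream, $downstream) = $dna =~ /(.{4})TATTAT(.{3})/;

# A failed match returns an empty list, so the array stays empty.
my @no_match = 'ACGTACGT' =~ /(.{4})TATTAT(.{3})/;

print "upstream = $upstream\n";
print "downstream = $downstream\n";
print "failed match returned ", scalar(@no_match), " items\n";
```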


Grouping without Making Subpatterns

Because parentheses are used both for grouping (a|ab|c) and for matching subpatterns, you may match subpatterns that you don't want to. To avoid this, group with (?:pattern):

/big bad (?:wolf|sheep)/;

# matches "big bad wolf" or "big bad sheep",
# but doesn't extract a subpattern.
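The difference shows up when you collect the subpatterns. In this sketch (the sentence and variable names are made up) the first pattern returns two values, the second only one:

```perl
use strict;
use warnings;

my $text = 'the big bad wolf ate 3 sheep';

# Capturing group: the animal comes back as the first subpattern.
my ($animal, $count) = $text =~ /big bad (wolf|sheep) ate (\d+)/;

# Non-capturing group: only the number is returned.
my ($only_count) = $text =~ /big bad (?:wolf|sheep) ate (\d+)/;

print "animal=$animal count=$count only_count=$only_count\n";
```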

Subpatterns and Greediness

By default, regular expressions are "greedy". They try to match as much as they can. For example:

$h = 'The fox ate my box of doughnuts';
$h =~ /(f.+x)/;
$subpattern = $1;

Because of the greediness of the match, $subpattern will contain "fox ate my box" rather than just "fox".

To match the minimum number of times, put a ? after the quantifier, like this:

$h = 'The fox ate my box of doughnuts';
$h =~ /(f.+?x)/;
$subpattern = $1;

Now $subpattern will contain "fox". This is called lazy matching.

Lazy matching works with any quantifier, such as +?, *? and {2,50}?.


String Substitution

String substitution allows you to replace a pattern or character range with another one using the s/// and tr/// functions.
The s/// Function

s/// has two parts: the regular expression and the string to replace it with: s/expression/replacement/.

$h = "Who's afraid of the big bad wolf?";
$i = "He had a wife.";

$h =~ s/w.+f/goat/; # yields "Who's afraid of the big bad goat?"
$i =~ s/w.+f/goat/; # yields "He had a goate."

If you extract pattern matches, you can use them in the replacement part of the substitution:

$h = "Who's afraid of the big bad wolf?";

$h =~ s/(\w+) (\w+) wolf/$2 $1 wolf/;
# yields "Who's afraid of the bad big wolf?"

Default Substitution Variable

If you don't bind a variable with =~, then s/// operates on $_ just as the match does.
Using a Variable in the Substitution Part

Yes you can:

$h = "Who's afraid of the big bad wolf?";
$animal = 'hyena';
$h =~ s/(\w+) (\w+) wolf/$2 $1 $animal/;
# yields "Who's afraid of the bad big hyena?"

Using Different Delimiters

The s/// function can use alternative delimiters, including parentheses and bracket pairs. For example:

$h = "Who's afraid of the big bad wolf?";

$h =~ s!(\w+) (\w+) wolf!$2 $1 wolf!; # using ! as delimiter

$h =~ s{(\w+) (\w+) wolf}{$2 $1 wolf}; # using {} as delimiter

Translating Character Ranges

The tr/// function allows you to translate one set of characters into another. Specify the source set in the first part of the function, and the destination set in the second part:

$h = "Who's afraid of the big bad wolf?";
$h =~ tr/ao/AO/; # yields "WhO's AfrAid Of the big bAd wOlf?";

Like s///, the tr/// function operates on $_ if not otherwise specified.

tr/// returns the number of characters transformed, which is sometimes handy for counting the number of a particular character without actually changing the string.

This example counts N's in a series of DNA sequences:

Code:


while (<>) {
chomp; # assume one sequence per line
my $count = tr/Nn/Nn/;
print "Sequence $. contains $count Ns\n"; # $. is the current line number
}

Output:

(~) 50% count_Ns.pl sequence_list.txt
Sequence 1 contains 0 Ns
Sequence 2 contains 3 Ns
Sequence 3 contains 1 Ns
Sequence 4 contains 0 Ns
...
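Two further uses of tr/// on DNA, as a sketch with a made-up sequence: counting with an empty replacement list, which leaves the string unchanged, and translating each base to its complement after reversing the string.

```perl
use strict;
use warnings;

my $seq = 'GATTNACANNcctg';

# With an empty replacement list tr/// only counts; $seq is not modified.
my $n_count  = ($seq =~ tr/Nn//);
my $gc_count = ($seq =~ tr/GCgc//);

# Reverse the string, then map each base to its complement.
(my $revcomp = reverse $seq) =~ tr/ACGTacgt/TGCAtgca/;

print "Ns: $n_count, G+C: $gc_count\n";
print "reverse complement: $revcomp\n";
```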


Regular Expression Options

Regular expression matches and substitutions have a whole set of options which you can toggle on by appending one or more of the i, m, s, g, e or x modifiers to the end of the operation. See Programming Perl Page 153 for more information. An example:

$string = 'Big Bad WOLF!';
print "There's a wolf in the closet!" if $string =~ /wolf/i;
# i is used for a case insensitive match

i Case insensitive match.

g Global match (see below).

e Evaluate right side of s/// as an expression.

o Only compile variable patterns once (see below).

m Treat string as multiple lines. ^ and $ will match at start and end of internal lines, as well as at beginning and end of whole string. Use \A and \Z to match beginning and end of whole string when this is turned on.

s Treat string as a single line. "." will match any character at all, including newline.

x Allow extra whitespace and comments in pattern.
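A short sketch that exercises several of these modifiers on made-up strings; each variable ends up 1 or 0 depending on whether the match succeeded.

```perl
use strict;
use warnings;

my $string  = 'Big Bad WOLF!';
my $i_match = $string =~ /wolf/i ? 1 : 0;          # i: ignore case

my $two_lines = "id1\nid2";
my $m_match = $two_lines =~ /^id2$/m   ? 1 : 0;    # m: ^ and $ match at internal lines
my $no_m    = $two_lines =~ /^id2$/    ? 1 : 0;
my $s_match = $two_lines =~ /id1.id2/s ? 1 : 0;    # s: "." also matches newline
my $no_s    = $two_lines =~ /id1.id2/  ? 1 : 0;

# e: the replacement is evaluated as Perl code; g: apply it to every match.
(my $doubled = 'exons: 3 introns: 2') =~ s/(\d+)/$1 * 2/ge;

# x: whitespace and comments inside the pattern are ignored.
my $x_match = 0;
$x_match = 1 if '376-8380' =~ /
    ^ \d{3}     # three digits at the start
    -           # a literal dash
    \d{4} $/x;

print "i=$i_match m=$m_match/$no_m s=$s_match/$no_s x=$x_match\n";
print "$doubled\n";
```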
Global Matches

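Adding the g modifier to the pattern causes the match to be global. Called in a scalar context (such as the condition of an if or while statement), each evaluation finds the next match, starting where the previous one left off, and returns FALSE once there are no more matches.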

This will match all codons in a DNA sequence, printing them out on separate lines:

Code:

$sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';
while ( $sequence =~ /(.{3})/g ) {
print $1,"\n";
}

Output:

GTT
GCC
TGA
AAT
GGC
GGA
ACC
TTG

If you perform a global match in a list context (e.g. assign its result to an array), then you get a list of all the subpatterns that matched from left to right. This code fragment gets arrays of codons in three reading frames:

@frame1 = $sequence =~ /(.{3})/g;
@frame2 = substr($sequence,1) =~ /(.{3})/g;
@frame3 = substr($sequence,2) =~ /(.{3})/g;
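The three assignments can also be written as a loop over the offsets. This is just one possible arrangement; the @frames array of array references is an assumption for illustration.

```perl
use strict;
use warnings;

my $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';

# One array reference of codons per reading frame (offsets 0, 1 and 2).
my @frames;
for my $offset (0 .. 2) {
    push @frames, [ substr($sequence, $offset) =~ /(.{3})/g ];
}

for my $i (0 .. $#frames) {
    print "frame ", $i + 1, ": @{ $frames[$i] }\n";
}
```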

The position of the most recent match can be determined by using the pos function.
Code:

#file:pos.pl
my $seq = "XXGGATCCXX";

if ( $seq =~ /(GGATCC)/gi ){
my $pos = pos($seq);
print "Our Sequence: $seq\n";
print '$pos = ', "1st position after the match: $pos\n";
print '$pos - length($1) = 1st position of the match: ',($pos-length($1)),"\n";
print '($pos - length($1))-1 = 1st position before the match: ',($pos-length($1)-1),"\n";
}

Output:

~]$ ./pos.pl
Our Sequence: XXGGATCCXX
$pos = 1st position after the match: 8
$pos - length($1) = 1st position of the match: 2
($pos - length($1))-1 = 1st position before the match: 1
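Perl also records the offsets of the last successful match in the built-in arrays @- and @+, which saves the subtraction. A sketch that lists every GGATCC site in a made-up sequence:

```perl
use strict;
use warnings;

my $seq = 'XXGGATCCXXGGATCCX';
my @sites;

while ($seq =~ /GGATCC/g) {
    # $-[0] is the offset where the match starts, $+[0] the offset just
    # past its end; the latter is the same value pos($seq) reports.
    push @sites, [ $-[0], $+[0], pos($seq) ];
}

print "match from $_->[0] up to (not including) $_->[1]\n" for @sites;
```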

Variable Interpolation and the "o" Modifier

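If you use a variable inside a pattern template, as in /$pattern/, be aware that there is a small performance penalty each time Perl encounters a pattern it hasn't seen before. If $pattern doesn't change over the life of the program, you can use the o ("once") modifier to tell Perl that the variable won't change, so the pattern is compiled only once. Note that with o, later changes to $pattern are silently ignored. Recent versions of Perl recompile an interpolated pattern only when the variable's value actually changes, so the speedup is usually small: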

$codon = '.{3}';
@frame1 = $sequence =~ /($codon)/og;
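An alternative that is common in current Perl code is to precompile the pattern with the qr// operator and interpolate the resulting object. The sketch below shows that approach rather than the o modifier; the $stop pattern is an extra example that does not come from the text above.

```perl
use strict;
use warnings;

my $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';

my $codon = qr/.{3}/;          # compiled regular expression object
my $stop  = qr/TAA|TAG|TGA/;   # stop codons; illustrative only

my @frame1 = $sequence =~ /($codon)/g;

# \A and \z anchor the match to the whole codon.
my @stops = grep { /\A$stop\z/ } @frame1;

print "codons: @frame1\n";
print "stop codons: @stops\n";
```

A qr// object keeps its own grouping when interpolated, so the alternation in $stop does not leak into the surrounding pattern.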

Testing Your Regular Expressions

To be sure that you are getting what you think you want, you can use the following "magic" Perl automatic match variables: $& (the text that matched), $` (the text before the match), and $' (the text after the match).
Code:

#file:matchTest.pl

if ("Hello there, neighbor" =~ /\s(\w+),/){
print "That actually matched '$&'.\n";
print "That was ($`) ($&) ($').\n";
}

Output:

That actually matched ' there,'.
That was (Hello) ( there,) ( neighbor).

