Difference between revisions of "Perl"

(Removed Ubuntu support)
 
(31 intermediate revisions by 4 users not shown)
Line 301: Line 301:
  
 
  print "sum of 2+5=" . add(2,5) . "\n";
 
  print "sum of 2+5=" . add(2,5) . "\n";
 +
</pre>
 +
</code>
 +
 +
The command line arguments to the Perl script (excluding the script name) are stored in the array ''@ARGV''. For example, the following code prints out the number of command line arguments:
 +
 +
<code><pre>
 +
print "There are " . scalar(@ARGV) . " arguments\n";
 
</pre>
 
</pre>
 
</code>
 
</code>
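As a slightly longer illustration, the sketch below lists each argument along with its position in ''@ARGV''. The helper name ''describe_args'' is made up for this example and is not part of Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# describe_args is a hypothetical helper (not part of Perl): it returns
# one line of text per argument so the caller decides how to print them.
sub describe_args {
    my @args = @_;
    my @lines;
    for my $i (0 .. $#args) {
        push @lines, "argument $i is '$args[$i]'";
    }
    return @lines;
}

print "There are " . scalar(@ARGV) . " arguments\n";
print "$_\n" for describe_args(@ARGV);
```

Running the script with the arguments ''in.txt out.txt'' would report two arguments and then list them starting from index 0.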
  
 
=Regular Expressions=
 
=Regular Expressions=
 +
 +
Perl is widely known for excellence in text processing, and regular expressions are one of the big factors behind this fame. There is a lot to learn about regular expressions; what follows is only a very brief introduction that serves as a starting point. Please refer to the perldoc [http://perldoc.perl.org/perlretut.html tutorial] (from which the following is extracted) and [http://perldoc.perl.org/perlre.html reference] for more complete and in-depth coverage.
 +
 +
A regular expression (or simply regex) can be thought of as a ''pattern'', composed of characters and symbols, which expresses what you want to search for in a body of text. A regex is used in Perl functions such as string matching, search and replace, and splitting. In the following, we will first look at some typical ways of defining a regex. Then we will give examples of how regexes are used in Perl functions.
 +
 +
==Defining Regular Expressions==
 +
 +
A regex is marked by ''//''. The simplest regex is just a word, or more generally, a string of characters. Such a regex simply means that the ''pattern'' to be matched is that word.
 +
 +
<code><pre>
 +
/World/                      # matches the string 'World' wherever it appears (case sensitive)
 +
/Hello World /                # matches the whole phrase, including the space at the end
 +
</pre></code>
 +
 +
Not all characters can be used 'as is' in a match. Some characters, called metacharacters, are reserved for use in regex notation. The metacharacters are
 +
 +
<code><pre>
 +
{}[]()^$.|*+?\
 +
</pre></code>
 +
 +
A metacharacter can be used in the regex by putting a backslash before it:
 +
 +
<code><pre>
 +
/\/usr\/bin\/perl/            # matches the string '/usr/bin/perl'
 +
</pre></code>
 +
 +
The backslash also introduces escape sequences for non-printable ASCII characters. Common examples are \t for a tab, \n for a newline, and \r for a carriage return. A related shorthand, \s, matches any single whitespace character (a space, tab, or newline, among others).
 +
 +
A '''character class''' allows a set of possible characters, rather than just a single character, to match at a particular point in a regex. Character classes are denoted by brackets [...], with the set of characters to be possibly matched inside. A "-" can be used to specify a range of characters. Here are some examples:
 +
 +
<code><pre>
 +
/[yY][eE][sS]/                # matches 'yes' in a case-insensitive way (e.g., 'Yes', 'YES', 'yes', etc.)
 +
/item[012345]/                # matches 'item0' or ... or 'item5'
 +
/item[0-5]/                  # does the same thing
 +
/[0-9a-fA-F]/                # matches a hexadecimal digit
 +
</pre></code>
 +
 +
Different character strings can be matched with the '''alternation''' metacharacter '|'. For example:
 +
 +
<code><pre>
 +
/cat|dog|bird/                # matches 'cat', or 'dog', or 'bird'
 +
</pre></code>
 +
 +
We can '''group''' parts of a regex by enclosing them in parentheses '()'. For example:
 +
 +
<code><pre>
 +
/House(cat|keeper)/          # matches 'Housecat' or 'Housekeeper'
 +
</pre></code>
 +
 +
One of the powerful features of regex is matching '''repetitions'''. The quantifier metacharacters ?, * , + , and {} allow us to determine the number of repeats of a portion of a regex we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:
 +
 +
<code><pre>
 +
/a?/    # matches 'a' 1 or 0 times
 +
/a*/    # matches 'a' 0 or more times, i.e., any number of times
 +
/a+/    # matches 'a' 1 or more times, i.e., at least once
 +
/a{3,5}/ # matches at least 3 times, but not more than 5 times.
 +
/a{3,}/  # matches 3 or more times
 +
/a{3}/  # matches exactly 3 times
 +
</pre></code>
 +
 +
Here are more examples:
 +
 +
<code><pre>
 +
/^[a-zA-Z]+\.pl$/            # matches any .pl filename that consists of only alphabet characters (^ and $ anchor the match to the start and end of the string; \. is a literal dot)
 +
/[a-z]+\s+\d*/                # matches a lowercase word, at least some space, and any number of digits
 +
</pre></code>
 +
 +
==Using Regular Expressions==
 +
 +
A regex is used almost any time string matching is needed. To start, consider '''testing''' whether a string contains a match for a pattern. The operator =~ associates the string with the regex match and produces a true value if the regex matched, or false if the regex did not match. Perl will always match at the earliest possible point in the string. For example:
 +
 +
<code><pre>
 +
print "It matches\n" if "Hello World" =~ /World/;    # prints "It matches"
 +
print "It matches\n" if "Hello World" =~ /o/;        # prints "It matches" (it matches the 'o' in 'Hello')
 +
print "It matches\n" if "Hello World" =~ / o/;        # does not match
 +
</pre></code>
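The following sketch applies several of the patterns defined earlier with ''=~''. The helper ''matches'' and the test strings are invented for illustration, and ''qr//'' (not covered above) is used only to store each regex in a variable.

```perl
use strict;
use warnings;

# matches() is a hypothetical helper: it returns 1 if $regex matches
# anywhere in $string, and 0 otherwise.
sub matches {
    my ($string, $regex) = @_;
    return ($string =~ $regex) ? 1 : 0;
}

# Each entry pairs a made-up test string with a pattern from this section.
my @checks = (
    [ 'YES',         qr/[yY][eE][sS]/      ],   # character classes
    [ 'item7',       qr/item[0-5]/         ],   # 7 is outside the range 0-5
    [ 'my bird',     qr/cat|dog|bird/      ],   # alternation
    [ 'Housekeeper', qr/House(cat|keeper)/ ],   # grouping
    [ 'aa',          qr/a{3,5}/            ],   # needs at least three a's
);

for my $check (@checks) {
    my ($string, $regex) = @$check;
    my $verdict = matches($string, $regex) ? "matches" : "does not match";
    print "'$string' $verdict\n";
}
```

The second and last strings do not match: 'item7' falls outside the character class, and 'aa' has too few repetitions for the quantifier.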
 +
 +
'''Search and replace''' is performed using the format =~ s/regex/replacement/modifiers. The replacement is a Perl double quoted string that replaces whatever the regex matched in the string. For example:
 +
 +
<code><pre>
 +
$x = "I batted 4 for 4";
 +
$x =~ s/4/four/;                        # $x now contains "I batted four for 4"
 +
$x = "I batted 4 for 4";
 +
$x =~ s/4/four/g;                        # $x now contains "I batted four for four"
 +
</pre></code>
 +
 +
In the above example, the global modifier g asks Perl to search and replace all occurrences of the regex in the string. If there is a match, s/// returns the number of substitutions made, otherwise it returns false.
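The return value of ''s///'' can be captured to find out how many replacements were made. This is a minimal sketch; the function name ''spell_out_fours'' is made up for the example.

```perl
use strict;
use warnings;

# spell_out_fours is a hypothetical helper: it replaces every '4' with
# 'four' and returns the new text together with the replacement count.
sub spell_out_fours {
    my ($text) = @_;
    # s///g returns the number of substitutions, or a false value if none.
    my $count = ($text =~ s/4/four/g);
    return ($text, $count ? $count : 0);
}

my ($new, $n) = spell_out_fours("I batted 4 for 4");
print "$new ($n substitutions)\n";   # I batted four for four (2 substitutions)
```

Because the substitution works on a copy inside the function, the caller's original string is left unchanged.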
 +
 +
Another use of regex is to '''split''' a string into a set of strings marked by delimiters, where the delimiters are specified by a given regex. This is done using the split function. For example, to split a string into words, do:
 +
 +
<code><pre>
 +
$x = "Calvin and Hobbes";
 +
@word = split /\s+/, $x;                # $word[0] = 'Calvin'
 +
                                        # $word[1] = 'and'
 +
                                        # $word[2] = 'Hobbes'
 +
</pre></code>
 +
 +
As seen above, the split function returns the separated strings as an array. If the empty regex // is used, the string is split into individual characters.
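Both behaviours can be sketched as follows. The helper names and the sample strings are assumptions made for this example; the second pattern uses \s* so that spaces around each comma are treated as part of the delimiter.

```perl
use strict;
use warnings;

# chars_of is a hypothetical helper: the empty regex splits a string
# into its individual characters.
sub chars_of {
    my ($string) = @_;
    return split //, $string;
}

# fields_of is a hypothetical helper: it splits on a comma together
# with any whitespace around it.
sub fields_of {
    my ($line) = @_;
    return split /\s*,\s*/, $line;
}

my @chars  = chars_of("Perl");
my @fields = fields_of("Calvin, Hobbes ,Susie");
print join("-", @chars), "\n";     # P-e-r-l
print join("|", @fields), "\n";    # Calvin|Hobbes|Susie
```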
 +
 
=Perl Modules=
 
=Perl Modules=
Perl modules are libraries that other people developed for several useful task. [http://www.cpan.org] is the portal that contains most of the modules. It also has manuals for each module (most of which contain examples). While most of the modules are perl programs themselves, some may require compilation of libraries written using other languages.  
+
Perl modules are simply Perl libraries that provide new or enhanced functionality.  Although some modules are distributed as part of the core Perl distribution, most of them are written and maintained by other Perl users or developers who needed some specific functionality that was missing. The [http://www.cpan.org CPAN] repository is the portal for finding and getting modules. CPAN also contains manuals for the modules, most of which contain examples. Generally speaking, the modules are all perl programs, although sometimes they require libraries from other languages.
 +
 
 +
The typical location for perl modules is ''/usr/lib/perl/XXX'' where XXX is the version of perl.  You can define other locations by setting the ''PERL5LIB'' environment variable.
  
The  typical location for perl modules is ''/usr/lib/perl/XXX'' where XXX is the version of perl. You can define other locations by setting ''PERL5LIB'' environmental variable.
 
 
  export PERL5LIB=directory1:directory2
 
  export PERL5LIB=directory1:directory2
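Inside a script, the directories Perl searches for modules are held in the array ''@INC'', which includes any directories listed in PERL5LIB. The sketch below prints that list and adds one more directory with the ''use lib'' pragma. The path ''/tmp/my_modules'' and the helper ''searches_in'' are hypothetical and used only for illustration.

```perl
use strict;
use warnings;

# '/tmp/my_modules' is a hypothetical directory; 'use lib' adds it to
# the front of the module search path for this script only.
use lib '/tmp/my_modules';

# searches_in is a hypothetical helper: it returns a true value if
# $dir is one of the directories Perl searches for modules.
sub searches_in {
    my ($dir) = @_;
    return scalar grep { $_ eq $dir } @INC;
}

print "Perl looks for modules in:\n";
print "  $_\n" for @INC;
```

Setting PERL5LIB affects every script run from that shell, while ''use lib'' affects only the script that contains it.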
  
A module is described in ''.pm'' file located in one of the module directories (either default or through PERL5LIB)  and accessible in a perl script through ''use'' command
+
A module is contained in a ''.pm'' file located in one of the module directories (either the default or one added through PERL5LIB). Modules are accessed in a perl script by including them via the ''use'' command.  Most modules are grouped into common namespaces. For example, the ''DBI'' module provides a generic database interface, and the database-specific drivers for it live under the ''DBD'' namespace.  One of those drivers is ''DBD::mysql''.  To include that module (assuming you have it installed on your system) in your script, you would add this line:
  use PERL-MODULE
+
 
 +
  use DBD::mysql;
 +
 
 +
If a module you want is not installed on your system, you need to install it either manually or through the automated ''cpan'' command.  The latter approach takes a bit of set up, but it makes installing new modules much simpler.
  
If your modules is not installed in the system, you need to install it either by manually or through automated ''cpan'' command.
 
 
==Manual Module Installation==
 
==Manual Module Installation==
After you downloaded your module from CPAN, untar it (or unzip) and change to the directory containing the source code. You will notice that there is a Makefile.PL. That is a perl script that generates a makefile for your perl installation. Run this
+
Search the CPAN website for the module you want, then download the module.  Untar it (or unzip it) and change to the directory containing the source code. You will notice that there is a file named ''Makefile.PL''. It's a perl script that generates a makefile for your perl installation. Run this:
 +
 
 
  perl Makefile.PL
 
  perl Makefile.PL
  
This generates the actual makefile. If you want to install it to a non-default module location, set PREFIX variable
+
This generates the actual makefile you need to build the module. If you want to install the module to a non-default module location, pass the PREFIX argument to ''Makefile.PL'':
 +
 
 
  perl Makefile.PL PREFIX=my_module_directory
 
  perl Makefile.PL PREFIX=my_module_directory
  
Once you have the makefile ready, you can make  module, test it and install it. Most Perl modules requires a test to verify that it is working. If a critical component of the test fails, you may have to force it to install.
+
Once you have the makefile, you can build the module, test it, and install it. Most Perl modules require testing to ensure they are functioning correctly on your system, so doing the tests is part of the build process. If a critical component of the test fails, you may have to resolve any problems before moving on.
  
 
  make
 
  make
Line 328: Line 439:
 
  make install
 
  make install
  
 +
Some modules require prerequisites which you need to install yourself.  In some cases, the dependencies can easily spiral out of control so that it becomes very difficult to install everything manually, which is why the CPAN approach is preferred.
  
Some modules require prerequisites which you need to install yourself. Some cases, such dependencies could branch easily so it may be hard to install everything  manually. Perl provides a tool that makes module installation easier as we will see next.
 
 
==Automated Module Installation==
 
==Automated Module Installation==
''cpan'' is a perl script that uses perl module Cpan. It is a tool that installs a module with minimum user interaction and auto-installs prerequisites.  
+
''cpan'' is a perl script, shipped with perl, that uses the CPAN perl module. It is a tool that installs new modules with minimal user interaction and automatically installs prerequisites. To install new modules system-wide, you'll need to run ''cpan'' as root.  So, run:
  cpan
+
will start it. If it is the first time you are running ''cpan'', it will ask for configuration options. You can select the automated configuration and it will fill out the options for you. Alternatively, you can go with manual selection and specify each option by hand. The last option asks for one or more CPAN mirror locations. Make sure you have selected ''North America->United States'' and select a nearby mirror. I usually go with ''perl.com''  as my main mirror and add a few in case ''perl.com'' is down. Don't worry, if you are not happy with your configuration, you can modify it  later.
+
  sudo cpan
 +
 
 +
If it is the first time you are running ''cpan'', you will need to answer some configuration questions. When prompted, you should select the automated configuration to get things going without any extra hassle.  You will then get to the ''cpan'' prompt, where you can run commands to search for and install new modules.
  
Once you are in ''cpan'' prompt, you can  interact with it
 
 
  '''cpan>'''
 
  '''cpan>'''
  
help command gives you the available options. You can search a module with ''i'' command
+
The first thing you should probably install is ''Bundle::CPAN'', which installs a bunch of packages that make cpan easier to use.  To do that, at the ''cpan>'' prompt, run:
  ''cpan>''' i /KEYWORDS/
+
 
 +
'''cpan>''' install Bundle::CPAN
 +
 
 +
Note that as it installs, it will probably ask if it's ok to install additional packages and such along the way.  It's generally ok to accept whatever it asks when you are prompted.  The ''help'' command shows you all the available commands and options. You can search for a module with the ''i'' command:
 +
 
 +
  '''cpan>''' i /KEYWORDS/
 +
 
 +
You can install a module with the ''install'' command:
  
and you can install a module with ''install'' command
+
'''cpan>''' install MODULE
''cpan>''' install /MODULENAME/
 
  
''install'' command will download the module, compile and test it and finally install it if everything goes well. The downloaded programs are stored under ''~/.cpan '' directory (unless you specified another one during configuration). Although ''cpan'' will download the other prerequistid perl modules, if you have missing library in your system (such as graphics) that your module requires, cpan installation may fail. If it happens, you can go to the build directory ''~/.cpan/build/YOUR-MODULE-NAME'' and try to fix the problem. ''look'' command will also take you to the compilation directory. You can leave ''cpan'' with ''q'' command.
+
''install'' downloads the module, compiles and tests it, and finally installs it if everything went well. The downloaded modules are stored under your ''~/.cpan '' directory (unless you specified another one during configuration), in case you need to look at source code or anything related to the module. Although ''cpan'' will download any perl modules that are needed by the module you actually want to install, it might fail if any of the modules rely on non-perl resources elsewhere on the system (e.g., graphics libraries). If that happens, you can go to the build directory, ''~/.cpan/build/MODULE'', and try to fix the problem.
  
 
=Connecting MySQL with Perl=
 
=Connecting MySQL with Perl=
  
In order to connect a MySQL you need to have DBI module and DBI::mysql installed in your system. Following guide has a good description of DBI interface in general [http://www.perl.com/pub/a/1999/10/DBI.html]
+
In order to connect to a mysql server, you need to have the DBI module and DBD::mysql installed on your system. The DBI is a generic database interface that is relatively easy to use, and a good overall description of the DBI is [http://www.perl.com/pub/a/1999/10/DBI.html here].
 +
 
 +
Once you have the correct modules installed, you can connect to a mysql server with the function:
  
Then you can connect a mysql database with the command
 
 
  DBI->connect(DATABASE_DESCRIPTION,USERNAME,PASS);
 
  DBI->connect(DATABASE_DESCRIPTION,USERNAME,PASS);
The databse description contains the type of the database (so that DBI knows how to connect), the name of the database and the host name. It returns a handle that you will use for this connection.
 
  
 +
The database description contains the type of the database (so that the DBI knows how to connect), the name of the database and the host name.  It returns a handle that you will use to execute SQL statements on that database.  Here is an example that connects to the grades database on localhost:
  
<code>
+
<code><pre>
<pre>
+
use DBI;
use DBI;
+
$dbh=DBI->connect("DBI:mysql:database=grades",'dbuser','dbpass') or die "error connecting to database"; #note that $dbh is your database handle
$database="company;host=www.bayazit.net";
+
</pre></code>
$datasource="dbi:mysql";
 
$user="boss";
 
$pass="employer";
 
$dbh=DBI->connect("$datasource:$database",$user,$pass) || die "error connecting"; #note $dbh is your database handle
 
</pre>
 
</code>
 
  
Once you have the connection established you can send SQL queries to your database server. First you need to prepare your SQL with ''prepare'' command on the database handle.
+
Once you have the connection established, you can send SQL queries to the database server. First you need to prepare your SQL statement with the ''prepare'' function on the database handle:
  
$dbh->prepare("TYPE YOUR SQL HERE");
+
$stmth = $dbh->prepare("SQL STATEMENT");
  
''prepare'' command actually returns a statement handle ($sth in our example below) that you will use for the database operations.
+
''prepare'' returns a statement handle that you will use when executing or working with that statement. Here is an example:
  
  $sth=$dbh->prepare("insert into employee values ('Alaaddin'','Arabian Nights','Adventurer')");
+
  $sth=$dbh->prepare("insert into employee values ('Aladdin','Arabian Nights','Adventurer')");
  
You have to be careful about the your quotation marks. An easier way is to use perl qq function (which is generalize quotation)
+
You have to be careful about quotation marks. It's usually simpler to use the ''qq'' operator (which is just a generalized double-quoting mechanism):
  
  $sql=qq(insert into employee values ('Alaaddin'','Arabian Nights','Adventurer'));
+
  $sql=qq(insert into employee values ('Aladdin','Arabian Nights','Adventurer'));
 
  $sth=$dbh->prepare($sql);
 
  $sth=$dbh->prepare($sql);
  
 +
Once you have a statement prepared, you can execute it with the ''execute'' function:
  
Once you have a statement prepared, you can execute it with execute command:
+
$sth->execute() or die "problem executing statement";
  
$sth->execute() ||die "Can't execute sql";
+
''prepare'' also lets you bind variables so that you don't have to repeat them.  For example, if you want to have a generic way to insert new values, you can prepare your statement like this:
  
''prepare'' command actually helps you bind variables so that you don't have to repeat them. For example, if you want to have a generic way to insert new values, you can prepare your statement
 
 
  $inserth=$dbh->prepare("insert into employee values (?,?,?)");
 
  $inserth=$dbh->prepare("insert into employee values (?,?,?)");
  
and during execution specify the arguments represented with ''?''
+
Then, to execute the statement, you supply the values to fill in for each ''?'':
$inserth->execute('Alaaddin'','Arabian Nights','Adventurer')|| die "Can't execute SQL";
 
  
 +
$inserth->execute('Aladdin','Arabian Nights','Adventurer') or die "problem executing statement";
  
If you are querying your database, the query results are also accessed through your statement handle. After execution, ''$sth->fetchrow_array'' returns a row from your results. This function returns an array, so you can access any attribute from this array. As ''fetchrow_array'' returns only one row, you can call it continuously until it fails to return a value.
+
When executing statements that return data, the results are also accessed through the statement handle. After execution, ''$sth->fetchrow_array'' returns a row from your results. This function returns an array, and you can access the columns in each sql row as members of the perl array. As ''fetchrow_array'' returns only one row, you call it continuously until it stops returning rows. For example:
  
<code>
+
<code><pre>
<pre>
+
$sql=qq(select name,jobtitle from employee);
sql="select name,jobtitle from employee";
+
$sth=$dbh->prepare($sql);
$sth=$dbh->prepare($sql);
+
$sth->execute() or die "problem executing statement";
$sth->execute()|| die "Can't execute my sql";
+
while(@row=$sth->fetchrow_array)
while(@attributes=$sth->fetchrow_array) {
+
{
print "$attributes[0] $attributes[1] \n";
+
  print "$row[0] $row[1] \n";
}
+
}
</pre>
+
</pre></code>
</code>
 
<code>
 
<pre>
 
#!/usr/bin/perl
 
use DBI;
 
$database="company;host=www.bayazit.net";
 
$datasource="dbi:mysql";
 
$user="boss";
 
$pass="employer";
 
$dbh=DBI->connect("$datasource:$database",$user,$pass) || die "error connecting";
 
 
 
$sql="insert into employee values ('aladdin','Arabian Nights','Adventurer')"; # sql for inserting
 
$sql="select name,jobtitle from employee";
 
$sth=$dbh->prepare($sql);
 
$sth->execute()|| die "Can't execute my sql";
 
while(@attributes=$sth->fetchrow_array) {
 
print "$attributes[0] $attributes[1] \n";
 
}
 
$sth->finish();
 
 
 
 
 
</pre>
 
</code>
 
  
 
=Parsing HTML with Perl=
 
=Parsing HTML with Perl=
  
Perl module [http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Element.pm HTML::Element] provides a structured way to represent HTML elements (starting with <tag>, have some attributes and finish with </tag>).  Any HTML document can be represented as a tree made up of HTML::Elements.
+
The Perl module [http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Element.pm HTML::Element] provides a structured way to represent HTML elements (starting with <tag>, having some attributes and children, and finishing with </tag>).  Any HTML document can be represented as a tree made up of HTML::Elements. For example, consider the following HTML page:
  
For example, consider the following HTML page
+
<code><pre>
<code>
 
<pre>
 
 
&lt;html>
 
&lt;html>
 
&lt;head>
 
&lt;head>
Line 456: Line 544:
 
&lt;/body>
 
&lt;/body>
 
&lt;/html>
 
&lt;/html>
</pre>
+
</pre></code>
</code>
 
  
 +
This document can be represented in a tree format:
  
This code can be represented in a tree format
+
<code><pre>
<code>
 
<pre>
 
 
&lt;html>  
 
&lt;html>  
 
   &lt;head>  
 
   &lt;head>  
Line 494: Line 580:
 
           &lt;td>
 
           &lt;td>
 
             "Team leader"
 
             "Team leader"
 +
</pre></code>
 +
 +
Notice the hierarchical relationships in the tree. For example, the ''<html>'' node has two direct children, ''<head>'' and ''<body>'', and these elements have their own children. The Perl module [http://search.cpan.org/~sburke/HTML-Tree/lib/HTML/Tree.pm HTML::Tree] builds this tree from an HTML document. You can find more about HTML::Trees [http://search.cpan.org/~sburke/HTML-Tree-3.18/lib/HTML/Tree/AboutTrees.pod here]. In order to use ''HTML::Tree''s, you have to use the ''HTML::TreeBuilder'' module.
  
</pre>
+
<code><pre>
</code>
+
use HTML::TreeBuilder;
 +
$tree=HTML::TreeBuilder->new;
 +
</pre></code>
  
Notice the hiararchical relation. For example, <html> node has two direct children, <head> and <body> and these siblings have their own children. Perl module [http://search.cpan.org/~sburke/HTML-Tree/lib/HTML/Tree.pm HTML::Tree] builds this tree. You can find more about HTML::Trees at [http://search.cpan.org/~sburke/HTML-Tree-3.18/lib/HTML/Tree/AboutTrees.pod]].
+
This will create a new tree for you. You can then use this tree to parse an HTML file. There are two ways to parse HTML, either directly with ''parse_file'', or by parsing individual strings with ''parse''.
In order to use ''HTML::Tree''s, you have to use ''HTML::TreeBuilder'' module.
 
use ''HTML::TreeBuilder'';
 
$tree=TML::TreeBuilder->new;
 
will create a new tree for you. You can then use this tree parse an html file. There are two ways to parse HTML, either directly parse a file with ''parse_file'' or parse a string with ''parse''.
 
  
 
  $tree->parse_file('downloaded.html');
 
  $tree->parse_file('downloaded.html');
  
You can dump the contents of an HTML::Tree by calling ''dump'' function.
+
You can print the contents of an HTML::Tree by calling the ''dump'' function.
  
 
  $tree->dump;
 
  $tree->dump;
  
You can call the parse function several times for different content, but before doing that you need to empty the existing parse data stored in the tree with ''delete'' function, otherwise, you may run out of memory.
+
You can call the parse function several times for different content, but before parsing new data, you need to empty the existing parsed data in the tree with the ''delete'' function.
 +
 
 
  $tree->delete;
 
  $tree->delete;
  
 
==HTML Scanning==
 
==HTML Scanning==
  
Lets say we are interested in capturing headlines for  science news at  [http://news.bbc.co.uk/2/hi/science/nature/default.stm BBC Science News]. Assuming you have downloaded this web page to ''bbc.html'', you can build a tree on it.
+
This section details an example that captures science headlines from an HTML document from [http://news.bbc.co.uk/2/hi/science/nature/default.stm BBC Science News]. Assuming you have downloaded the web page to ''bbc.html'', you can build a tree from it:
  
<code>
+
<code><pre>
<pre>
 
 
#!/usr/bin/perl
 
#!/usr/bin/perl
#
+
 
 
use HTML::TreeBuilder;
 
use HTML::TreeBuilder;
 
$tree=HTML::TreeBuilder->new;
 
$tree=HTML::TreeBuilder->new;
 
$tree->parse_file("bbc.html");
 
$tree->parse_file("bbc.html");
 +
</pre></code>
  
</pre>
+
Now you can use this tree to retrieve information from the document. The ''look_down'' function on an HTML::Tree or HTML::Element is used to return any child satisfying some criterion:
</code>
 
  
Now you can use this tree to retrieve information. Function ''look_down'' on an HTML::Tree or HTML::Element will return any child satisfying some criteria
 
 
  $tree->look_down(SOMEATTRIBUTE,ITSVALUE);
 
  $tree->look_down(SOMEATTRIBUTE,ITSVALUE);
  
For example, following code will return all  HTML::Elements that are actually  hyperlinks in the page and put it to an array.
+
For example, the following code will return the set of hyperlink HTML::Elements in the page, and put them into an array.
 +
 
 
  @links=$tree->look_down("_tag","a");
 
  @links=$tree->look_down("_tag","a");
  
left hand side of the call could be a variable instead of array,
+
The return value can be assigned to a scalar variable instead of an array:
 +
 
 
  $first_link=$tree->look_down("_tag","a");
 
  $first_link=$tree->look_down("_tag","a");
In this case, instead of returning all links, it returns the first hyperlink element found.
 
  
The power of look_down function comes from ability to specify further criteria in the following format:
+
In this case ''look_down'' returns the first hyperlink element found.
 +
 
 +
The power of the ''look_down'' function comes from the ability to specify further criteria like so:
  
 
  $tree->look_down(SOMEATTRIBUTE,ITSVALUE,  
 
  $tree->look_down(SOMEATTRIBUTE,ITSVALUE,  
 
       sub { SOME FUNCTION CODE HERE });
 
       sub { SOME FUNCTION CODE HERE });
  
This will return any element that has its SOMEATTRIBUTE has ITVALUE and the function specified in look_down returns 1. Usually that function evaluates other attributes of the element. The element that is being evaluated is referred as $_[0]. For example, the following code returns all hyperlinks that contains URLs to cse330 website.
+
This will return any element that has SOMEATTRIBUTE with value ITSVALUE and for which the function specified in ''look_down'' returns a true value. Usually this function evaluates other attributes of the element. The element that is being evaluated is referred to as $_[0]. For example, the following code returns all hyperlinks that contain URLs to cnn.com.
  
 
  @links=$tree->look_down("_tag","a",
 
  @links=$tree->look_down("_tag","a",
 
           sub {  
 
           sub {  
                 $_[0]->attr("href")=~m/www.cse330.org/
+
                 if($_[0]->attr("href") =~ /www\.cnn\.com/)
 +
                {
 +
                  return 1;
 +
                }
 +
                return 0;
 
               });
 
               });
  
Note that we are accessing ''href'' attribute and using a regular expression to match it to ''www.cse330.org''.
+
Note that we are accessing the ''href'' attribute of the element and using a regular expression to match it against ''www.cnn.com''.
  
Now, as we said we are interested in capturing the BBC news headlines, we need to look at bbc.html to see what are the characteristics of the HTML::Elements containing the headlines.
+
To actually find news headlines, we need to determine the characteristics of the bbc.html file to discern what the headlines look like. For example, this page has a main headline followed by a description:
  
For example, in the current website, the main headline has the following lines (headline followed by a description)
+
<code><pre>
<code>
 
<pre>
 
 
&lt;div class="mvb">
 
&lt;div class="mvb">
 
 
&lt;a class="tsh" href="/2/hi/health/7026443.stm">
 
&lt;a class="tsh" href="/2/hi/health/7026443.stm">
 
 
 
 
Chilli compound fires painkiller
 
Chilli compound fires painkiller
 
 
&lt;/a>
 
&lt;/a>
 
 
 
&lt;/div>
 
&lt;/div>
 
 
&lt;div class="mvb">
 
&lt;div class="mvb">
 
 
 
 
A chemical from chilli peppers may be able to kill pain without affecting touch or movement.
 
A chemical from chilli peppers may be able to kill pain without affecting touch or movement.
 
 
 
 
&lt;/div>
 
&lt;/div>
 
.......
 
.......
 +
 
&lt;div class="mvb">
 
&lt;div class="mvb">
 
&lt;a href="/2/hi/science/nature/7023731.stm">&lt;img src="http://newsimg.bbc.co.uk/media/images/44151000/jpg/_44151791_art_66pic.jpg" align="left" width="66" height="49" alt="Artists impression of Gryposaurus monumentensis (copyright: Larry Felder)" border="0" vspace="0" hspace="0">&lt;/a>
 
&lt;a href="/2/hi/science/nature/7023731.stm">&lt;img src="http://newsimg.bbc.co.uk/media/images/44151000/jpg/_44151791_art_66pic.jpg" align="left" width="66" height="49" alt="Artists impression of Gryposaurus monumentensis (copyright: Larry Felder)" border="0" vspace="0" hspace="0">&lt;/a>
 
&lt;img src="http://newsimg.bbc.co.uk/shared/img/o.gif" align="left" width="5" height="49" alt="" border="0" vspace="0" hspace="0">
 
&lt;img src="http://newsimg.bbc.co.uk/shared/img/o.gif" align="left" width="5" height="49" alt="" border="0" vspace="0" hspace="0">
 
 
 
 
&lt;a class="shl" href="/2/hi/science/nature/7023731.stm">
 
&lt;a class="shl" href="/2/hi/science/nature/7023731.stm">
 
 
 
 
Duck-billed dinosaur had big bite
 
Duck-billed dinosaur had big bite
 
 
&lt;/a>
 
&lt;/a>
 
 
&lt;br clear="all" />
 
&lt;br clear="all" />
 
 
 
 
 
&lt;/div>
 
&lt;/div>
 
&lt;div class="o">
 
&lt;div class="o">
 
 
 
 
A new species of duck-billed dinosaur that had up to 800 teeth is described by scientists.
 
A new species of duck-billed dinosaur that had up to 800 teeth is described by scientists.
 
 
 
 
&lt;/div>
 
&lt;/div>
 
+
.......
   
 
....
 
           
 
 
           
 
           
 
 
&lt;div class="mvb">
 
&lt;div class="mvb">
 
&lt;a href="/2/hi/technology/7024672.stm">&lt;img src="http://newsimg.bbc.co.uk/media/images/44152000/jpg/_44152097_tombstone_66pic.jpg" align="left" width="66" height="49" alt="Before and after tombstone" border="0" vspace="0" hspace="0">&lt;/a>
 
&lt;a href="/2/hi/technology/7024672.stm">&lt;img src="http://newsimg.bbc.co.uk/media/images/44152000/jpg/_44152097_tombstone_66pic.jpg" align="left" width="66" height="49" alt="Before and after tombstone" border="0" vspace="0" hspace="0">&lt;/a>
 
&lt;img src="http://newsimg.bbc.co.uk/shared/img/o.gif" align="left" width="5" height="49" alt="" border="0" vspace="0" hspace="0">
 
&lt;img src="http://newsimg.bbc.co.uk/shared/img/o.gif" align="left" width="5" height="49" alt="" border="0" vspace="0" hspace="0">
 
 
 
 
&lt;a class="shl" href="/2/hi/technology/7024672.stm">
 
&lt;a class="shl" href="/2/hi/technology/7024672.stm">
 
 
 
 
Scans reveal lost gravestone text
 
Scans reveal lost gravestone text
 
 
&lt;/a>
 
&lt;/a>
 
 
&lt;br clear="all" />
 
&lt;br clear="all" />
 
 
 
 
 
&lt;/div>
 
&lt;/div>
 
&lt;div class="o">
 
&lt;div class="o">
+
        Illegible words on church headstones could be read once more thanks to a new scan technology.
 
 
Illegible words on church headstones could be read once more thanks to a new scan technology.
 
 
 
 
 
&lt;/div>
 
&lt;/div>
 +
</pre></code>
  
 +
In this case, it's the hyperlinks that have class types ''tsh'' and ''shl'' that we are interested in, so we look down to find those class types:
  
</pre>
+
<code><pre>
</code>
+
@links=$tree->look_down('_tag','a',
 
 
Hence if we look down on the hyperlinks that have  class types ''tsh'' and ''shl'', we will get the hyperlinks that contain the headlines.
 
<code>
 
<pre>
 
 
 
 
 
@links=$tree->look_down('_tag','a',
 
        sub {
 
            $_[0]->attr('class') eq "tsh" ||
 
            $_[0]->attr('class') =~m/hl$/
 
            ;
 
        }
 
 
 
        );
 
</pre>
 
</code>
 
 
 
Then by printing out these hyperlink elements as text, we can get the headlines
 
<code>
 
<pre>
 
for($i=0;$i<scalar(@links);$i++) {
 
  print $links[$i]->as_text;
 
}
 
</pre>
 
</code>
 
 
 
 
 
However, these elements just give us the headlines but not the summaries. But no worries, if you look at the BBC html code, the headlines are inside a <div> element, and the summaries are in the next <div>. So we can reach those summaries by first moving to parent of the hyperlink and reaching the next sibling of that parent.
 
<code>
 
<pre>
 
#!/usr/bin/perl
 
#
 
 
 
 
 
 
 
use HTML::TreeBuilder;
 
$tree=HTML::TreeBuilder->new;
 
$tree->parse_file("bbc.html");
 
@links=$tree->look_down('_tag','a',
 
 
         sub {
 
         sub {
 
             $_[0]->attr('class') eq "tsh" ||
 
             $_[0]->attr('class') eq "tsh" ||
             $_[0]->attr('class') =~m/hl$/
+
             $_[0]->attr('class') =~m/hl$/;
            ;
+
         });
         }
+
</pre></code>
 
 
        );
 
for($i=0;$i<scalar(@links);$i++) {
 
  print $links[$i]->as_text,":";
 
  print $links[$i]->parent->right->as_text,"\n";
 
}
 
  
 +
Then by printing out these hyperlink elements as text, we can get the headlines:
 +
 +
<code><pre>
 +
foreach $link (@links)
 +
{
 +
  print $link->as_text;
 +
}
 +
</pre></code>
  
</pre>
+
However, these elements just give us the headlines, not the summaries. If you look at the BBC html document, the headlines are inside a ''div'' element.  The summaries are in the following ''div''.  So, we can reach those summaries by first moving to the parent of the hyperlink HTML::Element and then going to the next sibling of that parent:
</code>
 
  
you should be careful about these kind of operations, if, for example parent was undefined, the above code was going to be interrupted during run time. A better way is first to check if $links[$i]->parent was defined and then check if parent's right sibling was defined.
+
<code><pre>
 +
  foreach $link (@links)
 +
{
 +
  print $link->as_text;
 +
  print $link->parent->right->as_text;
 +
}
 +
</pre></code>
  
 +
In general, you have to be careful about these kind of operations.  If, for example, parent was undefined, the above code would be halted during execution with an error.  It's therefore important to always check if $link->parent is actually defined before following the pointer.  Similarly, you would need to check if $link->parent->right was defined.
  
Finally, you can use LWP::UserAgent to actually make a request to the website, get the contents and pass those contents to the tree, instead of downloading the a website and calling the parser on that file.
+
Finally, you can use the LWP::UserAgent module to make web requests to get the contents of a page and pass those contents to the tree all in one perl script. For example:
  
 +
<code><pre>
 +
#!/usr/bin/perl -w
  
<code>
 
<pre>
 
#!/usr/bin/perl
 
#
 
 
use LWP::UserAgent;
 
use LWP::UserAgent;
 +
use HTML::TreeBuilder;
  
 
$ua=LWP::UserAgent->new;
 
$ua=LWP::UserAgent->new;
 
$req=$ua->get("http://news.bbc.co.uk/2/hi/science/nature/default.stm");
 
$req=$ua->get("http://news.bbc.co.uk/2/hi/science/nature/default.stm");
use HTML::TreeBuilder;
+
 
 
$ua->agent('Mozilla/5.0'); #you can modify several internal parameters, such as browser identification
 
$ua->agent('Mozilla/5.0'); #you can modify several internal parameters, such as browser identification
  
 
$tree=HTML::TreeBuilder->new;
 
$tree=HTML::TreeBuilder->new;
 +
$tree->parse($req->as_string);
  
$tree->parse($req->as_string);
 
 
@links=$tree->look_down('_tag','a', #get all links
 
@links=$tree->look_down('_tag','a', #get all links
 
         sub {
 
         sub {
             $_[0]->attr('class') eq "tsh" || #that are tsh class
+
             $_[0]->attr('class') eq "tsh" || # class name is tsh
 
             ($_[0]->parent->attr('class') eq "arr" &&  
 
             ($_[0]->parent->attr('class') eq "arr" &&  
             $_[0]->attr('href')=~m/science/
+
             $_[0]->attr('href') =~ m/science/) || # or their url contain the keyword science and their parents belong to class arr
            )             ||# or their url contain the keyword science and their parents belong to class arr
+
             $_[0]->attr('class') =~ m/hl$/; # or their class name ends with hl
             $_[0]->attr('class') =~m/hl$/ #or their class name ends with hl
+
        });
            ;
+
 
        }
+
foreach $link (@links)
 +
{
 +
  print $link->as_text . ":";
  
        );
+
   if( !($link->parent->attr('class') eq "arr"))
for($i=0;$i<scalar(@links);$i++) {
+
  {
   print $links[$i]->as_text,":";
+
    print $link->parent->right->as_text . "\n";
if( !($links[$i]->parent->attr('class') eq "arr")) {print $links[$i]->parent->right->as_text,"\n";}
+
  }
   else {
+
   else
 +
  {
 
     print "Sideline \n";
 
     print "Sideline \n";
 
   }
 
   }
 
}
 
}
 +
</pre></code>
  
 +
Additional examples can be found in the Perl documentation for Tree Scanning [http://search.cpan.org/~sburke/HTML-Tree-3.18/lib/HTML/Tree/Scanning.pod here].
  
</pre>
+
[[Category:Module 4]]
</code>
+
[[Category:Former Content]]
 
 
 
 
Some other examples can be found at Perl documentation for Tree Scanning at [http://search.cpan.org/~sburke/HTML-Tree-3.18/lib/HTML/Tree/Scanning.pod]
 

Latest revision as of 16:42, 24 August 2017

Perl

Perl is an interpreted language with syntax similar to bash and PHP. It's a high level, general purpose language that is used primarily to automate repetitive tasks, and to parse and process text. Indeed, text processing was the original intended use for the language, although it has grown to be an extremely flexible language for many different types of processing.

Running Perl

Like many scripting languages, Perl programs can be run in several ways. The programs, or scripts, themselves are plain text files, usually with names ending in '.pl' to indicate that they are Perl scripts. Given a Perl script file, the most common ways to execute the script are shown below.

  • Pass the script as input to the 'perl' executable:
    perl -w SCRIPT
  • Execute the script directly by adding the following line to the top of the script:
    #!/usr/bin/perl -w
    

    Then, change the script to be executable, and finally execute the script:

    chmod a+x SCRIPT
    ./SCRIPT 
    

While flexibility is one of the major strengths of Perl, that flexibility can often lead to unintended consequences. To guard against this, it is good practice to explicitly enable warnings and strict behavior for your scripts. The -w above when running perl gets you part of the way there, and although it's not necessary to supply the -w, it is nearly always the right thing to do. You should also add the line:

use strict;

to all of your scripts, near the top, to enable strict error checking.
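
As a minimal sketch of how these pieces fit together (the variable names are illustrative and not part of the original examples), a script that enables both checks might begin like this:

```perl
#!/usr/bin/perl -w
use strict;      # require variables to be declared before they are used
use warnings;    # lexically scoped counterpart of the -w switch

# Under strict, every variable must be declared, for example with 'my'.
my $greeting = "Hello";
my $count    = 3;

my $line = "$greeting x $count";
print "$line\n";
```

Most of the short examples on this page omit my for brevity, so they would likely need declarations added before they run cleanly under use strict.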

Variables and Arrays

Standard variables start with a dollar sign, $, at the beginning of their names. Unlike shell scripts, the $ is always used, regardless of whether the variable is being set or accessed. As with other scripting languages, you don't specify the type of the variable; it will be treated as an integer, floating point number, string, etc., depending on how it is used in your script.

 #this is an integer
 $i=5;
 #this is a string
 $msg="my string";

Comments start with the number/sharp/hash sign, #, and extend until the end of the line. All non-comment statements end with a semicolon, ;. You can print variables with the print function:

print $msg;

Variables can be used inside strings, and the value of that variable will be included in the string. (Note that you can also use the printf function when you need specific formatting of the output).

 $i=5;
 print "The value of i is $i";
 

This will generate the string:

The value of i is 5

All variables are global by default, so you need to specify locality if you want local variables. This can be done with the local and my keywords. Generally, my is safer and faster, and is therefore used most often. If a variable is defined by my in a block, it will be local to the block. If a variable is defined by local, any subroutine call from that block will also have that variable defined. For example:

 $i=0;
 while($i<5)
 {
   my $squared = $i*$i;
   $cubed = $i*$i*$i;
   $i++;
 }
 print $squared;
 print $cubed;

In the above program segment, $squared is defined with my inside the while loop, so it is only defined inside that loop. $cubed is not given any special scope, so it is a global variable. Hence, after the loop, $squared will be undefined, but $cubed will have the value 64 (the last value it was assigned in the loop).
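
The difference between local and my described above can be sketched as follows. The subroutine and variable names are hypothetical, and the example uses our to declare the global explicitly, which the original text does not cover:

```perl
our $level = "global";              # package (global) variable

sub show_level { return $level; }   # always reads the global $level

sub with_local {
    local $level = "localized";     # temporarily replaces the global's value
    return show_level();            # the called subroutine sees "localized"
}

sub with_my {
    my $level = "lexical";          # a new variable visible only in this block
    return show_level();            # the called subroutine still sees the global
}

print with_local(), "\n";           # localized
print with_my(), "\n";              # global
print show_level(), "\n";           # global (local's change was undone)
```

This is why my is usually the safer choice: it cannot change what other subroutines see.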

Arrays

Arrays are represented with two different symbols, depending on whether you are referring to elements of the array, or the entire array. Array variable names start with @ when declaring them or when you are referring to the array itself. The normal $ is used when you want to access individual elements of the array. Declaring arrays is fairly easy. You can enter all elements one by one, or you can enter them together:

 $student[0]="Alice";
 $student[1]="Bruce";
 $student[2]="Charlie";
 $student[3]="Dan";
or alternatively
 @student=("Alice","Bruce","Charlie","Dan");

In general, lists of strings or variables enclosed in parentheses, (value1,value2,...,valuen), are interpreted as arrays. You can access the contents with an array index:

 print $student[2];

You can also print the whole array:

print @student;

The above will concatenate all the elements together without a delimiter. Alternatively, you can print the array inside double quotes:

print "@student";

which will print the elements separated by spaces.

The current size of an array can be accessed with the scalar() function:

$sz=scalar(@student);

You can also use special syntax to access the index of the last element in an array like so:

$last_index = $#student;

Note that you start the array variable with $# instead of @ or just $. In the example above, scalar(@student) is 4, and $#student is 3.

You can declare a sequence with

(START..END);

START and END can be numbers or characters:

@newarray = (1 .. 20);
@newchars = ('l' .. 't');

There is also support for adding and removing elements.

  • push() - adds an element to the end of an array.
  • unshift() - adds an element to the beginning of an array.
  • pop() - removes the last element of an array.
  • shift() - removes the first element of an array.

push() and unshift() take an array and a variable as arguments:

push @ARRAY, VARIABLE

VARIABLE can actually be a single variable or another array:

push @student, "Eve";
push @student, @newstudents;
push @student, ("Frank","Gabriel");

pop() and shift() take only the array as an argument:

$laststudent = pop(@student);
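
To see the four functions working together, here is a small sketch (the @queue array is an illustrative name, not from the original examples):

```perl
my @queue = ("Bruce", "Charlie");

push @queue, "Dan";          # ("Bruce", "Charlie", "Dan")
unshift @queue, "Alice";     # ("Alice", "Bruce", "Charlie", "Dan")

my $last  = pop @queue;      # removes and returns "Dan"
my $first = shift @queue;    # removes and returns "Alice"

print "removed $first and $last; remaining: @queue\n";
```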

You can use the slice operation to retrieve portions of an array.

@newarray = @student[1..3]; # elements 1-3
@newarray = @student[1..3,5,7,9..15]; # elements 1-3,5,7,9-15
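
As a quick sketch combining sequences, slices, and $#, using a hypothetical @letters array:

```perl
my @letters = ('a' .. 'j');            # ten elements, indices 0 through 9

my @middle = @letters[2..4];           # c d e
my @picked = @letters[0, 3, 7..9];     # a d h i j
my $last_index = $#letters;            # 9
my $size = scalar(@letters);           # 10

print "@middle | @picked | $last_index | $size\n";
```

Note that a slice is written with @ rather than $, because it yields a list of elements rather than a single one.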

The splice() function lets you replace arbitrary elements within an array. Here is the general syntax:

splice(@ARRAY,STARTING_INDEX,LENGTH,@REPLACING-ARRAY);
 

Here is an example that replaces element 3 with Jerry and element 4 with Lena:

splice(@student,3,2,("Jerry","Lena"));
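
The following sketch shows the effect on a concrete array. It assumes a six-element @student so that indices 3 and 4 both exist; splice also returns the elements it removed:

```perl
my @student = ("Alice", "Bruce", "Charlie", "Dan", "Eve", "Frank");

# Replace 2 elements starting at index 3; splice returns what it removed.
my @removed = splice(@student, 3, 2, ("Jerry", "Lena"));

print "removed: @removed\n";   # removed: Dan Eve
print "now: @student\n";       # now: Alice Bruce Charlie Jerry Lena Frank
```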

The split() operation creates an array from a string. It takes a delimiter and a string as arguments. For example:

@array=split(':',"test:array:1:2:3");

The above will produce an array with 5 elements (test,array,1,2,3).

The join() function does the reverse, i.e., it creates a string from the elements of an array. It takes a delimiter and an array as arguments. For example:

$studentNames=join(',',@student);

The above will form one string with every element from @student separated by a comma.
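
Because split and join are inverses of each other, they are often used as a pair, for example to change the delimiter of a record. A minimal sketch with illustrative variable names:

```perl
my $record = "test:array:1:2:3";

my @fields = split(':', $record);   # break the string apart at each colon
my $count  = scalar(@fields);       # 5
my $csv    = join(',', @fields);    # glue the pieces back together with commas

print "$count fields: $csv\n";      # 5 fields: test,array,1,2,3
```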

Perl also has hash arrays (hashes), which are associative arrays that map keys to values. Whereas arrays are represented with the @ symbol, hash arrays are represented by the % symbol.

%ages = ("Leela", 25,
        "Fry", 28,
        "Bender", 4,
        "Lord Nibbler", "Unknown");

Elements within a hash array are accessed by prefacing the hash array name with $ and then using curly braces for the index. For example:

$ages{"Leela"};	# Returns 25
$ages{"Fry"};		# Returns 28
$ages{"Bender"};	# Returns 4
$ages{"Lord Nibbler"};	# Returns "Unknown"
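
Beyond looking up single values, a common pattern is to add entries and walk over all the keys. The sketch below uses the => operator (an alternative to the comma that reads more clearly in hash lists) and the built-in keys, sort, and exists functions. The added entry and its value are made up for illustration:

```perl
my %ages = ("Leela" => 25, "Fry" => 28, "Bender" => 4);

$ages{"Zoidberg"} = 86;                 # add a new key/value pair (hypothetical value)

my $report = "";
foreach my $name (sort keys %ages) {    # keys come back unordered, so sort them
    $report .= "$name is $ages{$name}\n";
}
print $report;

my $has_fry = exists $ages{"Fry"} ? 1 : 0;   # 1: the key is present
my $has_amy = exists $ages{"Amy"} ? 1 : 0;   # 0: no such key
```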

Conditional Statements

if statements work as they do in most other languages. The general format is

if ( CONDITION )
{
  #some statements
}
elsif ( ANOTHER_CONDITION )
{
  #whatever
}
else
{
  #other stuff
}

The standard comparison operators work as usual, including ==, <, >, etc. The only thing to be aware of is that you should use eq and ne to compare strings instead of == and !=.

if($string eq "some string")
{
  print "strings are equal\n";
}
elsif($int == 5)
{
  print "integer is five\n";
}
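
To illustrate why the string operators matter, the sketch below compares two different strings both ways. The numeric operator converts each non-numeric string to 0, so it wrongly reports them as equal. The variable names are illustrative, and the no warnings line only silences the warning Perl would otherwise print for this deliberate mistake:

```perl
my $a_str = "apple";
my $b_str = "banana";

my $numeric_equal;
{
    no warnings 'numeric';   # suppress the "isn't numeric" warning for this demo
    $numeric_equal = ($a_str == $b_str) ? 1 : 0;   # both strings become 0
}
my $string_equal   = ($a_str eq $b_str) ? 1 : 0;
my $string_differs = ($a_str ne $b_str) ? 1 : 0;

print "== says $numeric_equal, eq says $string_equal, ne says $string_differs\n";
```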

Loops

There are three main types of loops. for and while loops work similarly to other languages. Here are the general forms:

for(initialization;condition;loop_increment)
{
}
 
while(condition)
{
}

Here are simple examples:

 $sum=0;
 $i=1;
 while($i<=10)
 {
   $sum=$sum+$i;
   $i++;
 }
 print "sum is $sum\n";

 for($j=0; $j<5; ++$j)
 {
   print "j is $j\n";
 }

next and last are special commands that will jump to the next loop iteration, or to the end of the loop, respectively. For example:

 for($j=0; $j<5; ++$j)
 {
   if($j == 2) { next; }
   if($j == 4) { last; }
   print "j is $j\n";
 }

The above code will output:

j is 0
j is 1
j is 3

Finally, there is the foreach loop. foreach loops are most useful when iterating through arrays or hashes. Here is the general form:

foreach $iterator (@array)
{
}

For example, to print out all the values in an array, you could do this:

 @names = ('Bugs', 'Daffy', 'Marvin');
 foreach $name (@names)
 {
   print "$name\n";
 } 

Functions

Perl functions are identified with the keyword sub followed by the function name. Arguments passed to functions are stored in a special array @_. You can either access all of the arguments by assigning them to local variables via @_ or you can use the shift function. Here is the general form for a function:

 sub functionname
 {
   my ($arg1,$arg2,...,$argn)=@_;
 }

Or:

 sub functionname
 {
   my $arg1 = shift;
   my $arg2 = shift;
   ...
   my $argn = shift;
 }

Here is a simple function to add two numbers and a call to that function (note that functions can be defined above or below where they are called):

 sub add
 {
   my ($x,$y) = @_;
   return ($x+$y);
 }  

 print "sum of 2+5=" . add(2,5) . "\n";
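
Two further sketches show the argument-handling styles described above: one copies all of @_ so that it accepts any number of arguments, and the other uses shift with a fallback value. The function names average and greet are hypothetical:

```perl
sub average {
    my @numbers = @_;                 # copy all arguments
    return 0 unless @numbers;         # avoid dividing by zero for an empty list
    my $total = 0;
    foreach my $n (@numbers) { $total += $n; }
    return $total / scalar(@numbers);
}

sub greet {
    my $name     = shift;             # shift works on @_ by default inside a sub
    my $greeting = shift || "Hello";  # fall back to a default when omitted
    return "$greeting, $name!";
}

print average(2, 4, 9), "\n";         # 5
print greet("Alice"), "\n";           # Hello, Alice!
print greet("Bob", "Hi"), "\n";       # Hi, Bob!
```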

The command line arguments to the Perl script (excluding the script name) are stored in the array @ARGV. For example, the following code prints out the number of command line arguments:

 print "There are " . scalar(@ARGV) . " arguments\n";
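
Since @ARGV is an ordinary array, it can be indexed and looped over like any other. The sketch below wraps the loop in a hypothetical describe_args function so the logic can be reused with any list:

```perl
sub describe_args {
    my @args = @_;
    my @lines;
    for my $i (0 .. $#args) {          # $#args is the index of the last element
        push @lines, "argument $i is $args[$i]";
    }
    return @lines;
}

# Print one line per command line argument (prints nothing if there are none).
print "$_\n" foreach describe_args(@ARGV);
```

Running the script as ./SCRIPT foo bar would then print one line for foo and one for bar.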

Regular Expressions

Perl is widely known for excellence in text processing, and regular expressions are one of the big factors behind this fame. There is a lot to learn about regular expressions, and here is only a very brief introduction that serves as a starting point. Please refer to the CPAN tutorial (from which the following is extracted) and reference for more complete and in-depth coverage.

A regular expression (or simply regex) can be thought of as a pattern, composed of characters and symbols, which expresses what you want to search in a body of text. A regex is used in Perl functions such as string matching, search and replace, and splitting. In the following, we will first look at some typical ways of defining regex. Then we will give examples of how they are used in Perl functions.

Defining Regular Expressions

A regex is marked by //. The simplest regex is just a word, or more generally, a string of characters. Such a regex simply means that the pattern to be matched is that word.

/World/                       # matches the whole word 'World' (case sensitive)
/Hello World /                # matches the whole phrase with a space at the end

Not all characters can be used 'as is' in a match. Some characters, called metacharacters, are reserved for use in regex notation. The metacharacters are

{}[]()^$.|*+?\

A metacharacter can be used in the regex by putting a backslash before it:

/\/usr\/bin\/perl/            # matches the word '/usr/bin/perl' 

The backslash also introduces escape sequences for non-printable ASCII characters and for common character classes. Common examples are \t for a tab, \n for a newline, \r for a carriage return, \s for any whitespace character, and \d for a digit.

A character class allows a set of possible characters, rather than just a single character, to match at a particular point in a regex. Character classes are denoted by brackets [...] , with the set of characters to be possibly matched inside. A "-" can be used to specify range of characters. Here are some examples:

/[yY][eE][sS]/                # matches 'yes' in a case-insensitive way (e.g., 'Yes', 'YES', 'yes', etc.)
/item[012345]/                # matches 'item0' or ... or 'item5'
/item[0-5]/                   # does the same thing
/[0-9a-fA-F]/                 # matches a hexadecimal digit

Different character strings can be matched with the alternation metacharacter '|'. For example:

/cat|dog|bird/                # matches 'cat', or 'dog', or 'bird'

We can group parts of a regex by enclosing them in parentheses '()'. For example:

/House(cat|keeper)/           # matches 'Housecat' or 'Housekeeper'

One of the powerful features of regex is matching repetitions. The quantifier metacharacters ?, * , + , and {} allow us to determine the number of repeats of a portion of a regex we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:

/a?/     # matches 'a' 1 or 0 times
/a*/     # matches 'a' 0 or more times, i.e., any number of times
/a+/     # matches 'a' 1 or more times, i.e., at least once
/a{3,5}/ # matches at least 3 times, but not more than 5 times.
/a{3,}/  # matches 3 or more times
/a{3}/   # matches exactly 3 times

Here are more examples:

/[a-zA-Z]+\.pl/               # matches a .pl filename whose name consists only of alphabetic characters
/[a-z]+\s+\d*/                # matches a lowercase word, at least some space, and any number of digits

Using Regular Expressions

Regexes are used almost any time string matching is needed. To start, consider testing whether a string contains a match for a pattern (i.e., a regex). The operator =~ associates the string with the regex match and produces a true value if the regex matched, or false if the regex did not match. Perl will always match at the earliest possible point in the string. For example:

print "It matches\n" if "Hello World" =~ /World/;     # prints "It matches"
print "It matches\n" if "Hello World" =~ /o/;         # prints "It matches" (it matches the 'o' in 'Hello')
print "It matches\n" if "Hello World" =~ / o/;        # does not match
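
The same operator can be used to try out the patterns from the previous section. In the sketch below (test strings chosen for illustration), each match result is stored as 1 or 0. The last two lines show why a literal dot should be escaped: an unescaped . is a metacharacter that matches any character.

```perl
my $yes     = ("YeS"         =~ /[yY][eE][sS]/)      ? 1 : 0;   # 1
my $item    = ("item7"       =~ /item[0-5]/)         ? 1 : 0;   # 0: 7 is outside 0-5
my $house   = ("Housekeeper" =~ /House(cat|keeper)/) ? 1 : 0;   # 1
my $triple  = ("caaat"       =~ /a{3}/)              ? 1 : 0;   # 1
my $loose   = ("helloXpl"    =~ /[a-zA-Z]+.pl/)      ? 1 : 0;   # 1: '.' matched 'X'
my $escaped = ("helloXpl"    =~ /[a-zA-Z]+\.pl/)     ? 1 : 0;   # 0: needs a real '.'

print "$yes $item $house $triple $loose $escaped\n";
```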

Search and replace is performed using the format =~ s/regex/replacement/modifiers. The replacement is a Perl double quoted string that replaces in the string whatever is matched with the regex. For example:

$x = "I batted 4 for 4";
$x =~ s/4/four/;                         # $x now contains "I batted four for 4"
$x = "I batted 4 for 4";
$x =~ s/4/four/g;                        # $x now contains "I batted four for four"

In the above example, the global modifier g asks Perl to search and replace all occurrences of the regex in the string. If there is a match, s/// returns the number of substitutions made, otherwise it returns false.
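
The return value can be captured to find out how many replacements were made. A small sketch with illustrative variable names:

```perl
my $text = "I batted 4 for 4";

my $count = ($text =~ s/4/four/g);           # 2 substitutions were made
my $none  = ($text =~ s/9/nine/g) ? 1 : 0;   # 0: nothing matched, $text unchanged

print "$text ($count replaced)\n";           # I batted four for four (2 replaced)
```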

Another use of regex is to split a string into a set of strings marked by delimiters, where the delimiters are specified by a given regex. This is done using the split function. For example, to split a string into words, do:

$x = "Calvin and Hobbes";
@word = split /\s+/, $x;                 # $word[0] = 'Calvin'
                                         # $word[1] = 'and'
                                         # $word[2] = 'Hobbes'

As seen above, the split function returns the separated strings as an array. If the empty regex // is used, the string is split into individual characters.
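
Two more sketches of split with a regex delimiter, using made-up input strings: the first uses the empty regex to get individual characters, and the second splits on commas while discarding any surrounding whitespace.

```perl
my @chars  = split //, "Perl";                  # ('P', 'e', 'r', 'l')
my @fields = split /\s*,\s*/, "a , b,c ,d";     # ('a', 'b', 'c', 'd')

print scalar(@chars), " characters: @chars\n";  # 4 characters: P e r l
print scalar(@fields), " fields: @fields\n";    # 4 fields: a b c d
```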

Perl Modules

Perl modules are simply Perl libraries that provide new or enhanced functionality. Although some modules are distributed as part of the core Perl distribution, most of them are written and maintained by other Perl users or developers who needed some specific functionality that was missing. The CPAN repository is the portal for finding and getting modules. CPAN also contains manuals for the modules, most of which contain examples. Generally speaking, the modules are all perl programs, although sometimes they require libraries from other languages.

The typical location for perl modules is /usr/lib/perl/XXX where XXX is the version of perl. You can define other locations by setting the PERL5LIB environment variable.

export PERL5LIB=directory1:directory2

A module is contained in a .pm file located in one of the module directories (either the default or one listed in PERL5LIB). Modules are accessed in a perl script by including them via the use command. Most modules are grouped into common namespaces. For example, the DBD namespace contains the database driver modules used by the DBI database interface. One of those modules is the mysql driver. To include that module (assuming you have it installed on your system) in your script, you would add this line:

use DBD::mysql;

If a module you want is not installed on your system, you need to install it either manually or through the automated cpan command. The latter approach takes a bit of setup, but it makes installing new modules much simpler.

Manual Module Installation

Search the CPAN website for the module you want, then download the module. Untar it (or unzip it) and change to the directory containing the source code. You will notice that there is a file named Makefile.PL. It's a perl script that generates a makefile for your perl installation. Run this:

perl Makefile.PL

This generates the actual makefile you need to build the module. If you want to install the module to a non-default module location, pass the PREFIX argument to Makefile.PL:

perl Makefile.PL PREFIX=my_module_directory

Once you have the makefile, you can build the module, test it, and install it. Most Perl modules require testing to ensure they are functioning correctly on your system, so doing the tests is part of the build process. If a critical component of the test fails, you may have to resolve any problems before moving on.

make
make test
make install

Some modules require prerequisites which you need to install yourself. In some cases, the dependencies can easily spiral out of control so that it becomes very difficult to install everything manually, which is why the CPAN approach is preferred.

Automated Module Installation

cpan is a perl script that comes with the perl package that uses the CPAN perl module. It is a tool that installs new modules with minimum user interaction and that auto installs prerequisites. To install new modules system-wide, you'll need to run cpan as root. So, run:

sudo cpan

If it is the first time you are running cpan, you will need to answer some configuration questions. When prompted, you should select the automated configuration to get things going without any extra hassle. You will then get to the cpan prompt, where you can run commands to search for and install new modules.

cpan>

The first thing you should probably install is Bundle::CPAN, which installs a bunch of packages that make cpan easier to use. To do that, at the cpan> prompt, run:

cpan> install Bundle::CPAN

Note that as it installs, it will probably ask if it's ok to install additional packages and such along the way. It's generally ok to accept whatever it asks when you are prompted. The help command shows you all the available commands and options. You can search for a module with the i command:

cpan> i /KEYWORDS/

You can install a module with the install command:

cpan> install MODULE

install downloads the module, compiles and tests it, and finally installs it if everything went well. The downloaded modules are stored under your ~/.cpan directory (unless you specified another one during configuration), in case you need to look at source code or anything related to the module. Although cpan will download any perl modules that are needed by the module you actually want to install, it might fail if any of the modules rely on non-perl resources elsewhere on the system (e.g., graphics libraries). If that happens, you can go to the build directory, ~/.cpan/build/MODULE, and try to fix the problem.

Connecting MySQL with Perl

In order to connect to a mysql server, you need to have the DBI module and DBD::mysql installed on your system. The DBI is a generic database interface that is relatively easy to use, and a good overall description of the DBI is here.

Once you have the correct modules installed, you can connect to a mysql server with the function:

DBI->connect(DATABASE_DESCRIPTION,USERNAME,PASS);

The database description contains the type of the database (so that the DBI knows how to connect), the name of the database and the host name. It returns a handle that you will use to execute SQL statements on that database. Here is an example that connects to the grades database on localhost:

 use DBI;
 $dbh=DBI->connect("DBI:mysql:database=grades",'dbuser','dbpass') or die "error connecting to database";  #note that $dbh is your database handle

Once you have the connection established, you can send SQL queries to the database server. First you need to prepare your SQL statement with the prepare function on the database handle:

$stmth = $dbh->prepare("SQL STATEMENT");

prepare returns a statement handle that you will use when executing or working with that statement. Here is an example:

$sth=$dbh->prepare("insert into employee values ('Aladdin','Arabian Nights','Adventurer')");

You have to be careful about quotation marks. It's usually simpler to use the qq function (which is just a generalized quotation mechanism):

$sql=qq(insert into employee values ('Aladdin','Arabian Nights','Adventurer'));
$sth=$dbh->prepare($sql);

Once you have a statement prepared, you can execute it with the execute function:

$sth->execute() or die "problem executing statement";

prepare also lets you bind variables so that you don't have to repeat them. For example, if you want to have a generic way to insert new values, you can prepare your statement like this:

$inserth=$dbh->prepare("insert into employee values (?,?,?)");

Then, to execute the statement, you supply the values to fill in for each ?:

$inserth->execute('Aladdin','Arabian Nights','Adventurer') or die "problem executing statement";

When executing statements that return data, the results are also accessed through the statement handle. After execution, $sth->fetchrow_array returns a row from your results. This function returns an array, and you can access the columns in each sql row as members of the perl array. As fetchrow_array returns only one row at a time, you call it repeatedly until it stops returning rows. For example:

 $sql=qq(select name,jobtitle from employee);
 $sth=$dbh->prepare($sql);
 $sth->execute() or die "problem executing statement";
 while(@row=$sth->fetchrow_array)
 {
   print "$row[0] $row[1] \n";
 }
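
Putting the steps above together, the following sketch wraps a bound insert and a select loop in one subroutine. It assumes that $dbh is a handle already returned by DBI->connect and that the employee table has three columns. The subroutine name and the array-reference syntax (@$row) are illustrative additions, not part of the original examples:

```perl
# Assumes $dbh came from DBI->connect and that an 'employee' table with
# three columns (including name and jobtitle) already exists.
sub add_and_list_employees {
    my ($dbh, @new_rows) = @_;        # each new row is a reference to a 3-element array

    # Prepare once, then execute once per row with different bound values.
    my $insert = $dbh->prepare("insert into employee values (?,?,?)");
    foreach my $row (@new_rows) {
        $insert->execute(@$row) or die "problem executing insert";
    }

    my $select = $dbh->prepare("select name,jobtitle from employee");
    $select->execute() or die "problem executing select";

    my @lines;
    while (my @row = $select->fetchrow_array) {   # empty list ends the loop
        push @lines, "$row[0] $row[1]";
    }
    return @lines;
}
```

A call might look like add_and_list_employees($dbh, ['Aladdin','Arabian Nights','Adventurer']), which returns one formatted string per row in the table.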

Parsing HTML with Perl

The Perl module HTML::Element provides a structured way to represent HTML elements (starting with <tag>, having some attributes and children, and finishing with </tag>). Any HTML document can be represented as a tree made up of HTML::Elements. For example, consider the following HTML page:

<html>
<head>
<title>List of Employees</title>
</head>
<body>
<center>
<h1>Employees</h1> <br>

<table>
<tr>
   <td>Name</td><td>Department</td><td>Title</td>
</tr>
<tr>
   <td>Alice</td><td>Wonderland</td><td>Lost traveler</td>
</tr>
<tr>
   <td>Peter</td><td>Neverland</td><td>Team leader</td>
</tr>
</table>
</center>

</body>
</html>

This document can be represented in a tree format:

<html> 
  <head> 
    <title> 
      "List of Employees"
  <body> 
    <center> 
      <h1> 
        "Employees"
      <br> 
      <table> 
        <tr> 
          <td> 
            "Name"
          <td> 
            "Department"
          <td> 
            "Title"
        <tr> 
          <td> 
            "Alice"
          <td> 
            "Wonderland"
          <td> 
            "Lost traveler"
        <tr> 
          <td> 
            "Peter"
          <td> 
            "Neverland"
          <td>
            "Team leader"

Notice the hierarchical relationships in the tree. For example, the <html> node has two direct children, <head> and <body>, and these elements have their own children. The Perl module HTML::Tree builds this tree from an HTML document. You can find more about HTML::Trees here. In order to use HTML::Trees, you have to use the HTML::TreeBuilder module.

 use HTML::TreeBuilder;
 $tree=HTML::TreeBuilder->new;

This will create a new tree for you. You can then use this tree to parse an HTML file. There are two ways to parse HTML, either directly with parse_file, or by parsing individual strings with parse.

$tree->parse_file('downloaded.html');

You can print the contents of an HTML::Tree by calling the dump function.

$tree->dump;

You can call the parse function several times for different content, but before parsing new data, you need to empty the existing parsed data in the tree with the delete function.

$tree->delete;

HTML Scanning

This section details an example that captures science headlines from an HTML document from BBC Science News. Assuming you have downloaded the web page to bbc.html, you can build a tree from it:

#!/usr/bin/perl

use HTML::TreeBuilder;
$tree=HTML::TreeBuilder->new;
$tree->parse_file("bbc.html");

Now you can use this tree to retrieve information from the document. The look_down function on an HTML::Tree or HTML::Element is used to return any child satisfying some criterion:

$tree->look_down(SOMEATTRIBUTE,ITSVALUE);

For example, the following code will return the set of hyperlink HTML::Elements in the page, and put them into an array.

@links=$tree->look_down("_tag","a");

The return value can be a variable instead of array:

$first_link=$tree->look_down("_tag","a");

In this case, look_down returns the first hyperlink element found.

The power of the look_down function comes from the ability to specify further criteria like so:

$tree->look_down(SOMEATTRIBUTE,ITSVALUE, 
      sub { SOME FUNCTION CODE HERE });

This will return any element that has SOMEATTRIBUTE with value ITSVALUE and for which the function specified in look_down returns a true value. Usually this function evaluates other attributes of the element. The element that is being evaluated is referred to as $_[0]. For example, the following code returns all hyperlinks that contain URLs to cnn.com.

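@links=$tree->look_down("_tag","a",
          sub { 
                my $href=$_[0]->attr("href"); # undef for anchors without an href
                if(defined $href && $href =~ /www\.cnn\.com/) # dots escaped so they match literally
                {
                  return 1;
                }
                return 0;
              });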

Note that we are accessing the href attribute of the element and using a regular expression to match it against www.cnn.com.
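When the test is just a pattern match on a single attribute, look_down also accepts a compiled regular expression (qr/.../) as the attribute value, which avoids writing a function. The sketch below uses a made-up fragment. A regex criterion only matches elements that actually have the attribute, so the third anchor, which has no href, is skipped.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative fragment: one CNN link, one BBC link, one anchor without href.
my $tree = HTML::TreeBuilder->new_from_content(
      '<a href="http://www.cnn.com/world">CNN</a>'
    . '<a href="http://news.bbc.co.uk/">BBC</a>'
    . '<a name="top">no href</a>'
);

# Attribute name followed by a regex instead of a literal value.
my @cnn_links = $tree->look_down('_tag', 'a', 'href', qr/www\.cnn\.com/);

my @texts = map { $_->as_text } @cnn_links;
print "$_\n" for @texts;

$tree->delete;
```

A function is still the right tool when the test involves more than one attribute or the element's position in the tree.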

To actually find news headlines, we need to determine the characteristics of the bbc.html file to discern what the headlines look like. For example, this page has a main headline followed by a description:

	<div class="mvb">
		<a class="tsh" href="/2/hi/health/7026443.stm">
			Chilli compound fires painkiller
		</a>
</div>
	<div class="mvb">
			A chemical from chilli peppers may be able to kill pain without affecting touch or movement.
</div>
.......

	<div class="mvb">
		<a href="/2/hi/science/nature/7023731.stm"><img src="http://newsimg.bbc.co.uk/media/images/44151000/jpg/_44151791_art_66pic.jpg" align="left" width="66" height="49" alt="Artists impression of Gryposaurus monumentensis (copyright: Larry Felder)" border="0" vspace="0" hspace="0"></a>
		<img src="http://newsimg.bbc.co.uk/shared/img/o.gif" align="left" width="5" height="49" alt="" border="0" vspace="0" hspace="0">
		<a class="shl" href="/2/hi/science/nature/7023731.stm">
			Duck-billed dinosaur had big bite
		</a>
			<br clear="all" />
	</div>
	<div class="o">
			A new species of duck-billed dinosaur that had up to 800 teeth is described by scientists.
</div>
.......
	<div class="mvb">
		<a href="/2/hi/technology/7024672.stm"><img src="http://newsimg.bbc.co.uk/media/images/44152000/jpg/_44152097_tombstone_66pic.jpg" align="left" width="66" height="49" alt="Before and after tombstone" border="0" vspace="0" hspace="0"></a>
		<img src="http://newsimg.bbc.co.uk/shared/img/o.gif" align="left" width="5" height="49" alt="" border="0" vspace="0" hspace="0">
		<a class="shl" href="/2/hi/technology/7024672.stm">
			Scans reveal lost gravestone text
		</a>
			<br clear="all" />
	</div>
	<div class="o">
		        Illegible words on church headstones could be read once more thanks to a new scan technology.
</div>

In this case, the headlines are the hyperlinks whose class attribute is tsh or shl, so we look down for anchors whose class is tsh or ends in hl:

 @links=$tree->look_down('_tag','a',
         sub {
            my $class=$_[0]->attr('class') || ''; # attr returns undef for anchors without a class
            $class eq "tsh" ||
            $class =~m/hl$/;
         });

Then by printing out these hyperlink elements as text, we can get the headlines:

 foreach $link (@links)
 {
   print $link->as_text . "\n"; # one headline per line
 }

However, these elements just give us the headlines, not the summaries. If you look at the BBC HTML document, each headline is inside a div element, and its summary is in the div that follows. So, we can reach a summary by first moving to the parent of the hyperlink HTML::Element and then going to the next sibling of that parent:

 foreach $link (@links)
 {
   print $link->as_text . ": ";
   print $link->parent->right->as_text . "\n";
 }

In general, you have to be careful with these kinds of operations. If, for example, parent returned undef, the above code would halt with an error as soon as a method was called on the result. It is therefore important to check that $link->parent is defined before following the pointer. Similarly, you need to check that $link->parent->right is defined, and that it is an element rather than a plain text string, before calling as_text on it.
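One way to structure these checks is sketched below. The helper name summary_for and the HTML fragment are hypothetical. The fragment imitates the BBC layout, with the second headline deliberately left without a following div. The sketch relies on two behaviours of HTML::Element: right returns undef in scalar context when there is no following sibling, and text children are plain strings rather than objects, so ref is used to tell them apart.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative fragment: the second headline has no summary div after it.
my $tree = HTML::TreeBuilder->new_from_content(
      '<div class="mvb"><a class="tsh" href="/a">Headline A</a></div>'
    . '<div class="mvb">Summary A</div>'
    . '<div class="mvb"><a class="shl" href="/b">Headline B</a></div>'
);

# Hypothetical helper: returns the text of the element following the
# link's parent, or undef if any step of the walk is missing.
sub summary_for {
    my ($link) = @_;
    my $parent = $link->parent or return;   # no parent: give up
    my $next   = $parent->right;            # scalar context: next sibling or undef
    return unless ref $next;                # undef, or a plain text string
    return $next->as_text;
}

my @summaries;
for my $link ($tree->look_down('_tag', 'a')) {
    my $summary = summary_for($link);
    push @summaries, defined $summary ? $summary : '(no summary)';
    print $link->as_text, ': ', $summaries[-1], "\n";
}

$tree->delete;
```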

Finally, you can use the LWP::UserAgent module to request a page from the web and pass its contents to the tree, all in one Perl script. For example:

#!/usr/bin/perl -w

use LWP::UserAgent;
use HTML::TreeBuilder;

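$ua=LWP::UserAgent->new;
$ua->agent('Mozilla/5.0'); #set request parameters, such as browser identification, before sending the request
$response=$ua->get("http://news.bbc.co.uk/2/hi/science/nature/default.stm");
die "Request failed: " . $response->status_line . "\n" unless $response->is_success;

$tree=HTML::TreeBuilder->new;
$tree->parse($response->decoded_content); #the page body only; as_string would also include the HTTP headers
$tree->eof; #tell the parser that the document is complete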

@links=$tree->look_down('_tag','a', #get all links
         sub {
            my $class=$_[0]->attr('class') || ''; #attr returns undef for a missing attribute
            my $parent_class=$_[0]->parent->attr('class') || '';
            my $href=$_[0]->attr('href') || '';
            $class eq "tsh" || # class name is tsh
            ($parent_class eq "arr" && 
             $href =~ m/science/) || # or their URL contains the keyword science and their parent belongs to class arr
            $class =~ m/hl$/; # or their class name ends with hl
         });

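foreach $link (@links)
{
  print $link->as_text . ":";

  if( ($link->parent->attr('class') || '') ne "arr")
  {
    $next=$link->parent->right; #following sibling, if any
    print ref($next) ? $next->as_text . "\n" : "\n"; #skip missing or text-only siblings
  }
  else
  {
    print "Sideline \n";
  }
}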

Additional examples can be found in the tree scanning article in the HTML::Tree documentation (HTML::Tree::Scanning).