Difference between revisions of "Perl"

From CSE330 Wiki
 
=Teams for this module =
 
 
Team 0: Adam Michael Basloe, Andrew David Kanyer
 
 
Team 1: Michael Rene Browning, Mark Evan Davis
 
 
Team 2: Andrew Nemec Bort, Gail Crystal Burks
 
 
Team 3: Vanetia Nikole Cannon, Young Kook Park
 
 
Team 4: William Cannon Fargo,  Andrew Tateh Shaw
 
 
Team 5:  Jonathan Matthew Wald, Philip Jon Melzer
 
 
Team 6: Natalie Nikolayevna Sklobovskaya, Michael Frances Fahey
 
 
Team 7: Benjamin Kozac Reiter, Jacqueline Rose Steege
 
 
Team 8: Jonathan Stephen Kirst, Paul Manfred Heider, John Thomas Pizzini
 

Revision as of 13:08, 25 October 2009

Perl

Perl is an interpreted language. Its syntax is similar to Bash and PHP. It is a very powerful tool for automating tasks that are too repetitive to do by hand. It has a very large set of modules for almost any task you can imagine, and it supports regular expressions, which makes it very easy to process text.

Running Perl

Like many scripting languages, Perl programs can be run in several ways:

  • as an input file to the perl executable
    perl INPUTFILE
  • directly, as an executable script whose first line names the interpreter
    #!/usr/bin/perl
    ......
    your code here
    ....
    

    then make the file executable and run it

    chmod a+x perlfile
    ./perlfile 
    
  • by typing code into the perl interpreter's standard input
    $ perl
    type-your-code-here....
    end-with-ctrl-d.....
    

As we will see, Perl is very flexible, and this flexibility can tempt the programmer into writing sloppy programs. It is usually better to make perl strict by using

use strict;

so that Perl will complain about possible errors on your side (like using a variable that was never declared).
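
The effect of use strict can be sketched as follows. This is a minimal illustration, not part of the original page: the variable names are made up, and the misspelled variable is compiled inside a string eval only so that the script can report the error instead of stopping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $count = 5;    # declared with my, so strict accepts it

# Compile a snippet that assigns to an undeclared variable
# ($cuont is a deliberate misspelling of $count). Under strict
# the snippet fails to compile and eval stores the message in $@.
my $result = eval q{ use strict; $cuont = 6; 1 };
my $error  = $@;

print "count is $count\n";
print "strict rejected the typo: $error" unless $result;
```

Without use strict, the misspelled assignment would silently create a second, unrelated global variable.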

Variables and Arrays

Scalar variables require $ at the beginning of their names, whether you are setting them or reading them. You don't need to specify the type of a variable; it is determined by the value you assign.

 #this is an integer
 $i=5;
 #this is a string
 $msg="my string";

Notice that each statement ends with ; as in most other high-level languages. You can print variables with the print function.

print $msg;

You can put a variable inside a double-quoted string, and the value of that variable will be included in the string. (Note that you can also use the printf function, which enables formatting.)


 $i=5;
 print "The value of i is $i";
 

This will generate the string

The value of i is 5

By default all variables are global, so you need to say so explicitly if you want local variables. This can be done with the my and local keywords. Generally, my is recommended, as it is safer and faster. If a variable is declared with my in a block, it is visible only inside that block. local is often confused with my, but in most cases my is what you want. If a variable is set with local, it gets a temporary value that is also seen by any subroutine called from that block.
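
The difference between my and local can be sketched as follows. The variable and subroutine names ($level, show, with_local, with_my) are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $level = "global";          # package (global) variable

sub show { return $level; }     # always reads the package variable

sub with_local {
    local $level = "local";     # temporary value, seen by called subs
    return show();
}

sub with_my {
    my $level = "my";           # new variable, invisible to show()
    return show();
}

my @seen = (with_local(), with_my(), show());
print join(", ", @seen), "\n";  # prints "local, global, global"
```

The value set with local is visible inside show() and is restored when with_local returns, while the my variable never leaves the block that declared it.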

Arrays

In Perl, arrays are written with @ when you refer to the whole array, but individual elements are accessed with $. Declaring arrays is very easy in Perl: you can either enter the elements one by one, or enter them all together:

$student[0]="Alice";
$student[1]="Bruce";
$student[2]="Charlie";
$student[3]="Dan";
or alternatively
@student=("Alice","Bruce","Charlie","Dan");

In general, a parenthesized list (value1,value2,...,valuen) can be assigned to an array. You can then access the contents with an array index

print $student[2];

Or you can print whole array with

print @student;

This will print all the elements without a delimiter. An alternative is to print the array inside a string:

print "@student";

which prints the elements with a space between each of them. The size of an array is available through the scalar function

$sz=scalar(@student);

You can declare a sequence with

(START..END);

START and END can be numbers or characters

@newarray=(1..20);
@newchars=('l'..'t');

Perl arrays also support adding and removing elements.

  • push() - adds an element to the end of an array.
  • unshift() - adds an element to the beginning of an array.
  • pop() - removes the last element of an array.
  • shift() - removes the first element of an array.

push and unshift take an array and a variable.

push @ARRAY, VARIABLE

The variable can be a single value or another array.

push @student, "Eve";
push @student,@newstudents;
push @student,("Frank","Gabriel");

pop and shift take the array as their argument and return the element they removed

$onestudent=pop(@student);
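
The four operations can be tried together in a short sketch. The student names other than those used above are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @student = ("Alice", "Bruce", "Charlie", "Dan");

push    @student, "Eve";        # add to the end
unshift @student, "Zed";        # add to the beginning

my $last  = pop   @student;     # removes and returns "Eve"
my $first = shift @student;     # removes and returns "Zed"

print "removed $first and $last\n";
print "left: @student\n";                    # Alice Bruce Charlie Dan
print "size: ", scalar(@student), "\n";      # 4
```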




You can get a portion of an array with the slice operation. A slice can combine ranges and single indices.

@newarray=@student[1..3];
@newarray=@student[1..3,10..15,20];


The splice function lets you replace arbitrary elements within the array

 splice(@ARRAY,STARTING_INDEX,LENGTH,@REPLACING-ARRAY)
 
 splice(@student,3,2,("Jerry","Lena"));
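
A small sketch of what splice returns and how it changes the array. The five-element array here is an assumption made so that two elements can actually be replaced.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @student = ("Alice", "Bruce", "Charlie", "Dan", "Eve");

# Starting at index 3, remove 2 elements and put two new ones in
# their place. splice returns the elements it removed.
my @removed = splice(@student, 3, 2, ("Jerry", "Lena"));

print "removed: @removed\n";    # Dan Eve
print "now: @student\n";        # Alice Bruce Charlie Jerry Lena

# A slice copies elements and leaves the original array untouched.
my @middle = @student[1..3];
print "slice: @middle\n";       # Bruce Charlie Jerry
```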


The split function creates an array from a string. It takes a delimiter and a string

 @array=split(':',"test:array:1:2:3");

and the join function does the reverse, i.e., it creates a string from the elements of an array. It takes a delimiter and an array

 $studentNames=join(',',@student);
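
As a quick check that the two functions are inverses, the sketch below splits a string and joins it back with a different delimiter. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line   = "test:array:1:2:3";
my @fields = split(/:/, $line);      # the delimiter is a pattern
my $csv    = join(',', @fields);     # rebuild with commas

print "number of fields: ", scalar(@fields), "\n";   # 5
print "second field: $fields[1]\n";                  # array
print "joined: $csv\n";                              # test,array,1,2,3
```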

Perl also has hashes (associative arrays). They are represented by % and their elements are accessed by string keys.

%ages = ("Michael Caine", 39,
        "Dirty Den", 34,
        "Angie", 27,
        "Willy", "21 in dog years",
        "The Queen Mother", 108);

Now we can find the age of people with the following expressions

$ages{"Michael Caine"};    # Returns 39
$ages{"Dirty Den"};        # Returns 34
$ages{"Angie"};            # Returns 27
$ages{"Willy"};            # Returns "21 in dog years"
$ages{"The Queen Mother"}; # Returns 108
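
Beyond looking up single keys, a hash is usually walked with keys. The following is a minimal sketch that uses a shortened version of the hash above; the @report array is only there to collect the output lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %ages = ("Michael Caine" => 39, "Dirty Den" => 34, "Angie" => 27);

$ages{"The Queen Mother"} = 108;    # add a new key
delete $ages{"Dirty Den"};          # remove a key

# keys returns the keys in no particular order, so sort them first
my @report;
foreach my $name (sort keys %ages) {
    push @report, "$name is $ages{$name}";
}
print "$_\n" for @report;

# exists tests for a key without creating it
print "No entry for Willy\n" unless exists $ages{"Willy"};
```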

Conditional Statements

if statements provide conditional execution. The general format is

if ( CONDITION ) {
    some statements
} else {
    other statements
}

You can use the standard numeric comparison operators such as ==, <, > etc. Strings are compared with eq, ne, lt, gt and so on.

if ( $string eq "this is my string") { #note that for strings, it is eq 
  print "strings are equal";
}

or

 if ($average_temperature>100) {
   print "No, no, there is no such a thing called global warming!!!!";
 }
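
A chain of conditions can be written with elsif. The sketch below is illustrative only: the classify subroutine and its thresholds are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map a temperature to a label.
sub classify {
    my ($temperature) = @_;
    if ($temperature > 100) {
        return "hot";
    } elsif ($temperature >= 60) {
        return "mild";
    } else {
        return "cold";
    }
}

my $name = "Alice";
print "numbers use ==\n"  if 5 == 5.0;          # numeric comparison
print "strings use eq\n"  if $name eq "Alice";  # string comparison
print classify(72), "\n";                       # mild
```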

Loops

You can use for or while loops. The formats are very similar to C.

for(initialization;condition;loop_increment){
}
 
while (condition){
}

For example, the following code prints the sum of the numbers from 1 to 10.

  $sum=0;
  $i=1;
  while($i<=10) {
    $sum=$sum+$i;
    $i++;
  }
 
 print "sum is $sum";
 


last and next are special commands that interrupt loop execution: last leaves the loop immediately, and next jumps to the next iteration. (They play the role that break and continue play in C.)
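
A minimal sketch of both commands in one loop. The loop bounds and the rule (add even numbers, stop after 8) are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sum = 0;
for my $i (1..10) {
    next if $i % 2;     # odd number: skip to the next iteration
    last if $i > 8;     # leave the loop once we pass 8
    $sum += $i;         # reached only for 2, 4, 6 and 8
}
print "sum of even numbers up to 8 is $sum\n";   # 20
```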

Functions

Perl functions are defined with the keyword sub followed by a function name, without an argument list. The arguments are passed in the special array @_.

sub functionname {

  my ($arg1,$arg2,...,$argn)=@_;
}


For example

 sub add {
     my ($x,$y)=@_;
     return ($x+$y);
 }

 print  "Sum of 2+5=",add(2,5);
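
Because the arguments arrive in @_, a function can accept any number of them or treat some as optional. The sketch below shows both; the function names total and greet are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Accepts any number of arguments by walking @_ directly.
sub total {
    my $sum = 0;
    $sum += $_ foreach @_;
    return $sum;
}

# The second argument is optional and gets a default value.
sub greet {
    my ($name, $greeting) = @_;
    $greeting = "Hello" unless defined $greeting;
    return "$greeting, $name!";
}

print total(2, 5), "\n";              # 7
print total(1..10), "\n";             # 55
print greet("Alice"), "\n";           # Hello, Alice!
print greet("Bruce", "Welcome"), "\n";
```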

Regular Expressions

Perl Modules

Perl modules are libraries that other people have developed for many useful tasks. [1] is the portal that contains most of the modules. It also has manuals for each module (most of which contain examples). While most of the modules are perl programs themselves, some may require compilation of libraries written in other languages.

The typical location for perl modules is /usr/lib/perl/XXX, where XXX is the version of perl. You can add other locations by setting the PERL5LIB environment variable.

export PERL5LIB=directory1:directory2

A module is defined in a .pm file located in one of the module directories (either the default ones or those listed in PERL5LIB) and is made available in a perl script with the use command

use PERL-MODULE
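
As a concrete sketch, the script below loads List::Util, a module that ships with perl itself, and imports two of its functions. The choice of module and the @scores data are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum max);   # load the module and import sum and max

my @scores = (72, 95, 88);
my $total  = sum(@scores);
my $best   = max(@scores);
print "total=$total best=$best\n";    # total=255 best=95

# %INC records which .pm file each loaded module came from
print "List::Util was loaded from $INC{'List/Util.pm'}\n";
```

The path printed on the last line depends on the installation; it shows which of the module directories the .pm file was found in.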

If a module is not installed on your system, you need to install it either manually or through the automated cpan command.

Manual Module Installation

After you have downloaded your module from CPAN, untar (or unzip) it and change to the directory containing the source code. You will notice that there is a Makefile.PL. This is a perl script that generates a makefile for your perl installation. Run it with

perl Makefile.PL

This generates the actual makefile. If you want to install the module to a non-default location, set the PREFIX variable

perl Makefile.PL PREFIX=my_module_directory

Once the makefile is ready, you can build the module, test it and install it. Most Perl modules come with tests that verify the module is working. If a critical part of the tests fails, you may have to force the installation.

make
make test
make install


Some modules have prerequisites that you need to install yourself. In some cases these dependencies branch out quickly, so it can be hard to install everything manually. Perl provides a tool that makes module installation easier, as we will see next.

Automated Module Installation

cpan is a perl script that uses the perl module CPAN. It is a tool that installs a module with minimal user interaction and installs prerequisites automatically.

cpan

will start it. The first time you run cpan, it will ask for configuration options. You can select the automated configuration and it will fill out the options for you. Alternatively, you can go with manual selection and specify each option by hand. The last option asks for one or more CPAN mirror locations. Make sure you select North America->United States and pick a nearby mirror. I usually go with perl.com as my main mirror and add a few more in case perl.com is down. Don't worry: if you are not happy with your configuration, you can modify it later.

Once you are at the cpan prompt, you can interact with it

cpan>

The help command lists the available commands. You can search for a module with the i command

cpan> i /KEYWORDS/

and you can install a module with the install command

cpan> install /MODULENAME/

The install command will download the module, compile and test it, and finally install it if everything goes well. The downloaded files are stored under the ~/.cpan directory (unless you specified another one during configuration). Although cpan will download the prerequisite perl modules, the installation may still fail if your system is missing a library (such as a graphics library) that the module requires. If that happens, you can go to the build directory ~/.cpan/build/YOUR-MODULE-NAME and try to fix the problem. The look command will also take you to the build directory. You can leave cpan with the q command.

Connecting MySQL with Perl

In order to connect to a MySQL database, you need to have the DBI module and the DBD::mysql driver installed on your system. The following guide has a good description of the DBI interface in general [2]

Then you can connect to a mysql database with the command

DBI->connect(DATABASE_DESCRIPTION,USERNAME,PASS);

The database description contains the type of the database (so that DBI knows how to connect), the name of the database and the host name. The call returns a handle that you will use for this connection.


use DBI;
$database="company;host=www.bayazit.net";
$datasource="dbi:mysql";
$user="boss";
$pass="employer";
$dbh=DBI->connect("$datasource:$database",$user,$pass) || die "error connecting"; #note $dbh is your database handle

Once you have the connection established, you can send SQL queries to your database server. First you need to prepare your SQL with the prepare command on the database handle.

$dbh->prepare("TYPE YOUR SQL HERE");

The prepare command returns a statement handle ($sth in our example below) that you will use for the database operations.

$sth=$dbh->prepare("insert into employee values ('Alaaddin','Arabian Nights','Adventurer')");

You have to be careful about your quotation marks. An easier way is to use the perl qq operator (a generalized double quote)

$sql=qq(insert into employee values ('Alaaddin','Arabian Nights','Adventurer'));
$sth=$dbh->prepare($sql);


Once you have a statement prepared, you can execute it with the execute command:

$sth->execute() ||die "Can't execute sql";

The prepare command also lets you use placeholders, so that you don't have to rebuild the statement for each set of values. For example, if you want a generic way to insert new values, you can prepare your statement

$inserth=$dbh->prepare("insert into employee values (?,?,?)");

and during execution specify the values for the arguments represented with ?

$inserth->execute('Alaaddin','Arabian Nights','Adventurer')|| die "Can't execute SQL";


If you are querying your database, the query results are also accessed through your statement handle. After execution, $sth->fetchrow_array returns one row of your results as an array, so you can access any attribute by its index. Because fetchrow_array returns only one row at a time, you call it repeatedly until it returns no more rows.

$sql="select name,jobtitle from employee";
$sth=$dbh->prepare($sql);
$sth->execute()|| die "Can't execute my sql";
while(@attributes=$sth->fetchrow_array) {
print "$attributes[0] $attributes[1] \n";
}

#!/usr/bin/perl
use DBI;
$database="company;host=www.bayazit.net";
$datasource="dbi:mysql";
$user="boss";
$pass="employer";
$dbh=DBI->connect("$datasource:$database",$user,$pass) || die "error connecting";

#$sql="insert into employee values ('aladdin','Arabian Nights','Adventurer')"; # sql for inserting (commented out, since the select below is the one that runs)
$sql="select name,jobtitle from employee";
$sth=$dbh->prepare($sql);
$sth->execute()|| die "Can't execute my sql";
while(@attributes=$sth->fetchrow_array) {
print "$attributes[0] $attributes[1] \n";
}
$sth->finish();


Parsing HTML with Perl

The Perl module HTML::Element provides a structured way to represent HTML elements (each starts with <tag>, has some attributes and finishes with </tag>). Any HTML document can be represented as a tree made up of HTML::Element objects.

For example, consider the following HTML page

<html>
<head>
<title>List of Employees</title>
</head>
<body>
<center>
<h1>Employees</h1> <br>

<table>
<tr>
   <td>Name</td><td>Department</td><td>Title</td>
</tr>
<tr>
   <td>Alice</td><td>Wonderland</td><td>Lost traveler</td>
</tr>
<tr>
   <td>Peter</td><td>Neverland</td><td>Team leader</td>
</tr>
</table>
</center>

</body>
</html>


This code can be represented in a tree format

<html> 
  <head> 
    <title> 
      "List of Employees"
  <body> 
    <center> 
      <h1> 
        "Employees"
      <br> 
      <table> 
        <tr> 
          <td> 
            "Name"
          <td> 
            "Department"
          <td> 
            "Title"
        <tr> 
          <td> 
            "Alice"
          <td> 
            "Wonderland"
          <td> 
            "Lost traveler"
        <tr> 
          <td> 
            "Peter"
          <td> 
            "Neverland"
          <td>
            "Team leader"

Notice the hierarchical relation. For example, the <html> node has two direct children, <head> and <body>, and these siblings have their own children. The Perl module HTML::Tree builds this tree. You can find more about HTML::Tree at [3]. In order to build a tree, you have to use the HTML::TreeBuilder module.

use HTML::TreeBuilder;
$tree=HTML::TreeBuilder->new;

will create a new tree for you. You can then use this tree to parse an HTML file. There are two ways to parse HTML: either parse a file directly with parse_file, or parse a string with parse.

$tree->parse_file('downloaded.html');

You can dump the contents of an HTML::Tree by calling the dump function.

$tree->dump;

You can parse several different documents, but before parsing new content you need to release the existing parse data stored in the tree with the delete function; otherwise you may run out of memory.

$tree->delete;

HTML Scanning

Let's say we are interested in capturing the headlines for science news at BBC Science News. Assuming you have downloaded this web page to bbc.html, you can build a tree from it.

#!/usr/bin/perl
#
use HTML::TreeBuilder;
$tree=HTML::TreeBuilder->new;
$tree->parse_file("bbc.html");

Now you can use this tree to retrieve information. The look_down function, called on an HTML::Tree or an HTML::Element, returns the descendants that satisfy some criteria

$tree->look_down(SOMEATTRIBUTE,ITSVALUE);

For example, the following code returns all HTML::Elements that are hyperlinks in the page and puts them into an array.

@links=$tree->look_down("_tag","a");

The left-hand side of the call can be a scalar variable instead of an array,

$first_link=$tree->look_down("_tag","a");

In this case, instead of returning all links, it returns the first hyperlink element found.

The power of the look_down function comes from the ability to specify further criteria in the following format:

$tree->look_down(SOMEATTRIBUTE,ITSVALUE, 
      sub { SOME FUNCTION CODE HERE });

This returns every element whose SOMEATTRIBUTE has the value ITSVALUE and for which the function given to look_down returns a true value. Usually that function evaluates other attributes of the element. The element being evaluated is referred to as $_[0]. For example, the following code returns all hyperlinks that contain URLs pointing to the cse330 website.

@links=$tree->look_down("_tag","a",
          sub { 
                $_[0]->attr("href")=~m/www\.cse330\.org/
              });

Note that we are accessing the href attribute and using a regular expression to match it against www.cse330.org (the dots are escaped so that they match literal dots).

Now, since we are interested in capturing the BBC news headlines, we need to look at bbc.html to see the characteristics of the HTML::Elements that contain the headlines.

For example, in the current website, the headlines appear in the following lines (each headline is followed by a description)

	<div class="mvb">
	
		<a class="tsh" href="/2/hi/health/7026443.stm">
			
			
			
			Chilli compound fires painkiller
			
		</a>
		
	
</div>
	
	<div class="mvb">
	
		
			
			A chemical from chilli peppers may be able to kill pain without affecting touch or movement.
			
		
			
</div>
.......
	<div class="mvb">
		<a href="/2/hi/science/nature/7023731.stm"><img src="http://newsimg.bbc.co.uk/media/images/44151000/jpg/_44151791_art_66pic.jpg" align="left" width="66" height="49" alt="Artists impression of Gryposaurus monumentensis (copyright: Larry Felder)" border="0" vspace="0" hspace="0"></a>
		<img src="http://newsimg.bbc.co.uk/shared/img/o.gif" align="left" width="5" height="49" alt="" border="0" vspace="0" hspace="0">
		
		
	
		<a class="shl" href="/2/hi/science/nature/7023731.stm">
			
			
			
			Duck-billed dinosaur had big bite
			
		</a>
		
			<br clear="all" />
		
	

		
	</div>
	<div class="o">
	
		
			
			A new species of duck-billed dinosaur that had up to 800 teeth is described by scientists.
			
		
			
</div>

    
....
            
		
            
            
	<div class="mvb">
		<a href="/2/hi/technology/7024672.stm"><img src="http://newsimg.bbc.co.uk/media/images/44152000/jpg/_44152097_tombstone_66pic.jpg" align="left" width="66" height="49" alt="Before and after tombstone" border="0" vspace="0" hspace="0"></a>
		<img src="http://newsimg.bbc.co.uk/shared/img/o.gif" align="left" width="5" height="49" alt="" border="0" vspace="0" hspace="0">
		
		
	
		<a class="shl" href="/2/hi/technology/7024672.stm">
			
			
			
			Scans reveal lost gravestone text
			
		</a>
		
			<br clear="all" />
		
	

		
	</div>
	<div class="o">
	
		
			
			Illegible words on church headstones could be read once more thanks to a new scan technology.
			
		
			
</div>


Hence, if we look down for the hyperlinks whose class is tsh or shl, we get the hyperlinks that contain the headlines.



@links=$tree->look_down('_tag','a',
         sub {
            $_[0]->attr('class') eq "tsh" ||
            $_[0]->attr('class') =~m/hl$/
            ;
         }

         );

Then, by printing these hyperlink elements as text, we get the headlines

for($i=0;$i<scalar(@links);$i++) {
  print $links[$i]->as_text;
}


However, these elements give us only the headlines, not the summaries. If you look at the BBC HTML code, each headline is inside a <div> element, and its summary is in the next <div>. So we can reach a summary by first moving to the parent of the hyperlink and then taking the next sibling of that parent.

#!/usr/bin/perl
#



use HTML::TreeBuilder;
$tree=HTML::TreeBuilder->new;
$tree->parse_file("bbc.html");
@links=$tree->look_down('_tag','a',
         sub {
            $_[0]->attr('class') eq "tsh" ||
            $_[0]->attr('class') =~m/hl$/
            ;
         }

         );
for($i=0;$i<scalar(@links);$i++) {
  print $links[$i]->as_text,":";
  print $links[$i]->parent->right->as_text,"\n";
}


You should be careful with this kind of operation: if, for example, the parent were undefined, the above code would stop with an error at run time. A better way is to first check that $links[$i]->parent is defined and then check that the parent's right sibling is defined.
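
The checks described above could look like the following sketch. It assumes HTML::TreeBuilder is installed, and it parses a small inline sample (invented for this example) instead of bbc.html so that it can be run on its own. The second headline deliberately has no following <div>.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample standing in for bbc.html
my $html = <<'HTML';
<html><body>
<div class="mvb"><a class="tsh" href="/a.stm">Headline one</a></div>
<div class="mvb">Summary one.</div>
<div class="mvb"><a class="shl" href="/b.stm">Headline two</a></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;                       # no more content to parse

my @links = $tree->look_down('_tag', 'a',
    sub {
        my $class = $_[0]->attr('class') // '';   # class may be missing
        $class eq 'tsh' || $class =~ m/hl$/;
    });

my @lines;
foreach my $link (@links) {
    my $parent  = $link->parent;
    my $sibling = defined $parent ? $parent->right : undef;
    # The sibling can be missing, or a plain text node (not an object)
    my $summary = (defined $sibling && ref $sibling)
                ? $sibling->as_text
                : "(no summary found)";
    push @lines, $link->as_text . ": " . $summary;
}
print "$_\n" for @lines;
$tree->delete;                    # release the tree's memory
```

The ref test guards against the case where the right sibling is plain text rather than an element, in which case as_text cannot be called on it.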


Finally, you can use LWP::UserAgent to make a request to the website, get the contents and pass those contents to the tree, instead of downloading the page to a file and calling the parser on that file.


#!/usr/bin/perl
#
use LWP::UserAgent;
use HTML::TreeBuilder;

$ua=LWP::UserAgent->new;
$ua->agent('Mozilla/5.0'); #you can modify several internal parameters, such as browser identification; set them before the request
$req=$ua->get("http://news.bbc.co.uk/2/hi/science/nature/default.stm"); #note that get returns the response
die "Can't fetch page: ".$req->status_line unless $req->is_success;

$tree=HTML::TreeBuilder->new;

$tree->parse($req->decoded_content); #parse only the page content, not the response headers
$tree->eof; #tell the parser that there is no more content
@links=$tree->look_down('_tag','a', #get all links
         sub {
            $_[0]->attr('class') eq "tsh" || #that are tsh class
            ($_[0]->parent->attr('class') eq "arr" && 
             $_[0]->attr('href')=~m/science/
             )             || #or whose parents belong to class arr and whose url contains the keyword science
            $_[0]->attr('class') =~m/hl$/ #or whose class name ends with hl
            ;
         }

         );
for($i=0;$i<scalar(@links);$i++) {
  print $links[$i]->as_text,":";
  if( !($links[$i]->parent->attr('class') eq "arr")) {print $links[$i]->parent->right->as_text,"\n";}
  else {
    print "Sideline \n";
  }
}



Some other examples can be found in the Perl documentation for Tree Scanning at [4] (http://search.cpan.org/~sburke/HTML-Tree-3.18/lib/HTML/Tree/Scanning.pod)