Wednesday 9 May 2012

Playing with the WWW::Mechanize Perl module

WWW::Mechanize, if you're not aware of it already, is a Perl module that acts as a basic programmable web browser.  It is a subclass of another Perl module, LWP::UserAgent, so it offers everything that module does and adds features on top.

The WWW::Mechanize module can be instructed to GET from and POST to web servers.  The retrieved content can be searched for links to follow, allowing click-through to other pages; searched for forms to fill in and submit; and searched, saved or used as data for further processing.
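To make the link and form searching a bit more concrete, here is a minimal sketch.  The HTML page, its file names and its field names are made up for illustration, and the page is written to a temporary file so that the example runs without a network connection.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical page, saved to a temporary .html file so the sketch runs offline.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><head><title>Demo</title></head><body>
<a href="notices.html">Noticeboard</a>
<a href="page2.html">next</a>
<form name="Form1" action="login.cgi" method="post">
<input type="text" name="username">
<input type="password" name="password">
<input type="submit" name="Proceed1" value="Sign in">
</form>
</body></html>
HTML
close $fh;

my $mech = WWW::Mechanize->new(noproxy => 1);
$mech->get(URI::file->new_abs($path));

# Every link on the page, as WWW::Mechanize::Link objects.
my @links = $mech->links;
printf "link: %-12s -> %s\n", $_->text, $_->url for @links;

# Look up a single link by its text without following it.
my $next = $mech->find_link(text => 'next');
print "next points to ", $next->url, "\n";

# Every form on the page, as HTML::Form objects, with its input names.
my @forms = $mech->forms;
for my $form (@forms) {
    my @names = grep { defined } map { $_->name } $form->inputs;
    print "form ", $form->attr('name'), ": @names\n";
}
```

Listing the links and forms like this is a handy first step against a real site, because it shows which link text, form name and field names to pass to follow_link and submit_form.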

Usage is pretty straightforward.  The code below logs into a website, follows the link with the text 'Noticeboard' and prints that page as text to standard output, follows the link on that page with the text 'next' and prints that page too, and finally follows the link with the text 'Sign out' to sign out of the site.

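#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

# Placeholder site and credentials; replace with real values.
my $url = 'http://www.mysite.com';
my $username = 'myusername';
my $password = 'mypassword';

# noproxy => 1 stops the object picking up proxy settings from the environment.
my $mech = WWW::Mechanize->new(noproxy => 1);

# Fetch the login page.
$mech->get($url);

# Fill in and submit the login form.  The form name and field names
# are specific to the site being scripted.
$mech->submit_form(
  form_name => 'Form1',
  fields    => {
    username => $username,
    password => $password,
    Proceed1 => 'Sign in'
  }
);

# format => 'text' strips the HTML markup; it needs HTML::TreeBuilder installed.
$mech->follow_link(text => 'Noticeboard');
print $mech->content(format => 'text');

$mech->follow_link(text => 'next');
print $mech->content(format => 'text');

# Sign out of the site.
$mech->follow_link(text => 'Sign out');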

One issue I found with the above code is that when the WWW::Mechanize browser is redirected, it doesn't update the Referer header.  This can be a problem if, for example, you are redirected to a page while attempting to log in to a site that checks that the referrer is a known site.  The issue has been raised with the developers, who are aware of it, and I am hopeful that a fix will be implemented soon.
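Until then, one possible workaround is to set the Referer header by hand with add_header.  This is only a sketch: it assumes the site is satisfied by one fixed Referer value, the URLs are hypothetical, and the request is answered by a canned local response so that the example runs offline.

```perl
use strict;
use warnings;
use HTTP::Response;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(noproxy => 1);

# Hypothetical value: the Referer that the target site expects to see.
my $expected_referer = 'http://www.mysite.com/login';

# Send this Referer with every subsequent request made by this object.
$mech->add_header(Referer => $expected_referer);

# Demonstration only: record the Referer of each outgoing request and
# answer locally instead of contacting a real server.
my @seen;
$mech->add_handler(request_send => sub {
    my ($request) = @_;
    push @seen, $request->header('Referer');
    return HTTP::Response->new(
        200, 'OK',
        ['Content-Type' => 'text/html'],
        '<html><body>ok</body></html>',
    );
});

$mech->get('http://www.mysite.com/noticeboard');
print "Referer sent: $seen[0]\n";

# Drop the override again so that the module goes back to setting
# the Referer itself.
$mech->delete_header('Referer');
```

The override applies to every request until it is deleted, so it is best limited to the requests that actually need it.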

If you run into problems with WWW::Mechanize, the following two lines can prove useful for debugging:

$mech->add_handler("request_send", sub { shift->dump; return });
$mech->add_handler("response_done", sub { shift->dump; return });

These cause the module to print every request it sends to the web server and every response it receives from it: the HTTP headers, followed by the start of any message body.
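Because the script above prints page content to standard output, it can help to send the debugging trace to standard error instead.  The sketch below does that, and marks each line with '> ' for requests and '< ' for responses using dump's prefix option.  The @trace array and the canned local response are only there for illustration, so that the example runs offline.

```perl
use strict;
use warnings;
use HTTP::Response;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(noproxy => 1);

# Keep a copy of the trace as well as printing it.
my @trace;

# dump() returns its text instead of printing it when the result is used.
$mech->add_handler(request_send => sub {
    my ($request) = @_;
    push @trace, $request->dump(prefix => '> ');
    print STDERR $trace[-1];
    return;    # an empty return lets the request carry on as normal
});
$mech->add_handler(response_done => sub {
    my ($response) = @_;
    push @trace, $response->dump(prefix => '< ');
    print STDERR $trace[-1];
    return;
});

# Demonstration only: answer locally instead of contacting a real server.
$mech->add_handler(request_send => sub {
    return HTTP::Response->new(
        200, 'OK',
        ['Content-Type' => 'text/html'],
        '<html><body>hello</body></html>',
    );
});

$mech->get('http://www.mysite.com/');
print $mech->content, "\n";
```

Run with standard error redirected to a file, this keeps the page output and the HTTP trace apart.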

You can find more information on WWW::Mechanize on CPAN.  The Mechanize.pm page lists all the methods by type for WWW::Mechanize.