Indexing the Web with Head-r

Head-r is a free Perl program that recursively follows links found on (HTML) Web pages served over HTTP, and performs HEAD requests on the links of interest to the user.

The intended use for this program is to create URI lists for later selective mirroring of file-hosting sites.

Synopsis
head-r [-v|--verbose] [-j|--bzip2|-z|--gzip] [--include-re=RE] [--exclude-re=RE] [--depth=N] [--info-re=RE] [--descend-re=RE] [-i|--input=FILE]... [-o|--output=FILE] [-P|--no-proxy] [-U|--user-agent=USER-AGENT] [-w|--wait=DELAY] [--] [URI]...

Basic usage
Arguably, the most important Head-r options are --info-re and --descend-re, which determine (by means of regular expressions) which URIs will be considered for mere HEAD requests, and which ones Head-r will try to get more URIs from.

Simplistic, no-recursion example
For the following example, we’ll use “.” – a regular expression that matches any non-empty string – to allow Head-r to make HEAD requests to both of the URIs given.

$ head-r --info-re=. \
      -- http://example.org/ http://example.net/
http://example.org/	1381334900	1	1270	200
http://example.net/	1381334903	1	1270	200

The fields are delimited with ASCII HT (also known as TAB) codes, and are as follows:
 * 1) URI;
 * 2) timestamp (in seconds since system-dependent epoch; see also Unix time);
 * 3) recursion depth used when considering this URI;
 * 4) the length of the response in octets (as per the Content-Length HTTP reply header);
 * 5) HTTP status code of the reply.
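For scripts that consume Head-r output, records of this shape are straightforward to parse. A minimal sketch in Python (the field names here are our own, chosen for illustration; Head-r itself does not name them):

```python
# Parse one line of Head-r output into its tab-separated fields.
# Field names below are illustrative, not part of Head-r's interface.

def parse_record(line):
    """Split a Head-r output line on ASCII HT into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = {"uri": fields[0]}
    if len(fields) >= 5:                # full record: a request was made
        record.update(
            timestamp=int(fields[1]),   # seconds since the epoch
            depth=int(fields[2]),       # recursion depth
            length=int(fields[3]),      # Content-Length, in octets
            status=int(fields[4]),      # HTTP status code
        )
    return record

example = "http://example.org/\t1381334900\t1\t1270\t200"
print(parse_record(example))
```

Lines consisting of a bare URI (see the later examples) yield a dict with only the “uri” key.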

Recurse once example
For the following example, we’ll also enable actual recursion (still at a maximum depth of 1) by using the --descend-re option.

$ head-r --info-re=. --descend-re=/\$ \
      -- http://example.org/ http://example.net/
http://example.org/	1381337824	1	1270	200
http://www.iana.org/domains/example	1381337829	0	200
http://example.net/	1381337830	1	1270	200

As can be seen, at http://example.org/ Head-r found another URI to consider: http://www.iana.org/domains/example, which it followed and issued a HEAD request for.

It’s easy to check that http://example.net/ also references the same URI. However, as Head-r remembers the URIs it processes (along with the recursion depth at that point), no second request was issued.
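The bookkeeping just described can be sketched with a set (a simplification: the real program also records the recursion depth for each URI, which this sketch ignores):

```python
# Head-r remembers which URIs it has already encountered, so a URI
# referenced from several pages is processed only once.

seen = set()

def enqueue(uri, queue):
    """Add a URI for processing unless it was already encountered."""
    if uri not in seen:
        seen.add(uri)
        queue.append(uri)

queue = []
for uri in ["http://example.org/",
            "http://www.iana.org/domains/example",
            "http://example.net/",
            "http://www.iana.org/domains/example"]:   # duplicate reference
    enqueue(uri, queue)

print(len(queue))   # 3: the duplicate was skipped
```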

Limiting HEAD requests
Consider now that the resource we’re about to recurse through references URIs that are of no interest to us. For the following example, we’ll use a more selective regular expression than the “.” we’ve used above.

$ head-r --{info,descend}-re=wikipedia\\.org/wiki/ \
      -- http://en.wikipedia.org/wiki/Main_Page
http://en.wikipedia.org/wiki/Main_Page	1381339589	1	61499	200
. . .
http://en.wikipedia.org/w/api.php?action=rsd
http://creativecommons.org/licenses/by-sa/3.0/
. . .
http://meta.wikimedia.org/
http://en.wikipedia.org/wiki/Wikipedia	1381339589	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381339589	0	124407	200
. . .

(Please note that we’ve just used the Bash brace expansion to pass the same regular expression to both --info-re and --descend-re. Be sure to adjust the command to the command line interpreter actually in use.)

In the output above, a number of URIs came without any of the usual information. These URIs were found by Head-r, but as they matched neither the “info” nor the “descend” regular expression specified, no action was taken on them. The URIs are still output, however, in case we later decide to adjust the regular expressions themselves.

Skipping unwanted URIs altogether
The --include-re and --exclude-re regular expressions are considered before all the other ones, and currently have the following semantics:
 * 1) the inclusion regular expression is applied first; the URI will be considered if it matches one;
 * 2) unless decided at the step above, the exclusion regular expression is then applied; the URI will not be considered if it matches one;
 * 3) unless decided by the rules above, the URI will be considered.

If none of these options are given, any URI will be considered by Head-r.
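The three rules above can be expressed directly in code; a sketch in Python (the helper function is ours, not part of Head-r):

```python
import re

def considered(uri, include_re=None, exclude_re=None):
    """Decide whether a URI would be considered, per the rules above.

    A value of None stands for an option that wasn't given.
    """
    if include_re is not None and re.search(include_re, uri):
        return True       # rule 1: the inclusion RE matches
    if exclude_re is not None and re.search(exclude_re, uri):
        return False      # rule 2: the exclusion RE matches
    return True           # rule 3: considered by default

# Mirroring the example below: include wiki pages, exclude everything else.
print(considered("http://en.wikipedia.org/wiki/Main_Page",
                 include_re=r"wikipedia\.org/wiki/", exclude_re="."))   # True
print(considered("http://creativecommons.org/licenses/by-sa/3.0/",
                 include_re=r"wikipedia\.org/wiki/", exclude_re="."))   # False
```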

The following example exploits these options to further limit the output of Head-r for the case above.

$ head-r --{include,descend}-re=wikipedia\\.org/wiki/ \
      --{info,exclude}-re=. \
      -- http://en.wikipedia.org/wiki/Main_Page
http://en.wikipedia.org/wiki/Main_Page	1381341336	1	61499	200
http://en.wikipedia.org/wiki/Wikipedia	1381341337	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381341337	0	124407	200
http://en.wikipedia.org/wiki/Encyclopedia	1381341337	0	151164	200
http://en.wikipedia.org/wiki/Wikipedia:Introduction	1381341337	0	50687	200
. . .

Saving state between sessions
Head-r is capable of reading its own output, both to avoid issuing duplicate HEAD requests and to discover the URIs of the resources to recurse into.
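A sketch of how such a state file can be consumed: collect the URIs of all complete records (those with timestamp, depth, length and status fields), so they can be skipped on the next run. This is our own illustration of the idea, not Head-r’s actual code:

```python
def load_state(lines):
    """Collect URIs for which a request was already made."""
    done = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:      # full record: request already issued
            done.add(fields[0])
    return done

# Two lines as they might appear in a saved output file: a full record,
# and a bare URI that matched no regular expression on the earlier run.
state = [
    "http://en.wikipedia.org/wiki/Main_Page\t1381417546\t1\t61499\t200",
    "http://meta.wikimedia.org/",
]
print(sorted(load_state(state)))   # ['http://en.wikipedia.org/wiki/Main_Page']
```

Bare URIs are deliberately not treated as done: they can still be acted upon once the regular expressions are adjusted.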

Restoring what was saved
Let us revisit one of our previous examples, which we’ll now alter to only issue HEAD requests to a couple of pages:

$ head-r --output=state.a \
      --info-re='/(Free_content|Wikipedia)$' \
      --descend-re=wikipedia\\.org/wiki/ \
      -- http://en.wikipedia.org/wiki/Main_Page
$ grep -E \\s < state.a
http://en.wikipedia.org/wiki/Main_Page	1381417546	1	61499	200
http://en.wikipedia.org/wiki/Wikipedia	1381417546	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381417546	0	124407	200
$

Now, why not include a few more pages, such as all the pages with names starting with “F”?

$ head-r \
      --input=state.a --output=state.b \
      --info-re=/wiki/F \
      --descend-re=wikipedia\\.org/wiki/
$ grep -E \\s < state.b
http://en.wikipedia.org/wiki/File:Diary_of_a_Nobody_first.jpg	1381417906	0	34344	200
http://en.wikipedia.org/wiki/File:Progradungula_otwayensis_cropped.png	1381417906	0	30604	200
http://en.wikipedia.org/wiki/File:AW_TW_PS.jpg	1381417907	0	33297	200
http://en.wikipedia.org/wiki/Fran%C3%A7ois_Englert	1381417907	0	87860	200
http://en.wikipedia.org/wiki/File:Washington_Monument_Dusk_Jan_2006.jpg	1381417907	0	83137	200
http://en.wikipedia.org/wiki/File:Walt_Disney_Concert_Hall,_LA,_CA,_jjron_22.03.2012.jpg	1381417907	0	67225	200
http://en.wikipedia.org/wiki/Frank_Gehry	1381417907	0	152838	200
$

Note that while our --info-re has obviously covered http://en.wikipedia.org/wiki/Free_content, no HEAD request was made to that page, as our --input file already had the relevant information.

Also, as all the URIs we wanted Head-r to consider were already listed in the input file, it was unnecessary to specify any URIs on the command line. When the URIs come from both command line arguments and --input files, those from the command line are considered first.

Compression
As recursing through large Web sites may result in large output lists, Head-r provides support for compression of output data.

The --bzip2 (-j) and --gzip (-z) options select the compression method to use for the output (either the file specified with --output, or standard output). Head-r, however, will exit with an error if compression is enabled and the output goes to a terminal device.

Head-r transparently decompresses the files given as inputs, thanks to the IO::Uncompress::AnyUncompress library.

Adjusting HTTP client behavior
There are two options which influence the behavior of the HTTP client used by Head-r: --wait (-w) and --user-agent (-U).

The --wait option specifies the amount of time, in seconds, to wait between two consecutive HTTP requests. The default is about 2.7 seconds.
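The pause can be pictured as a simple loop around whatever function issues the request; a sketch (fetch here is a stand-in, not a Head-r interface):

```python
import time

def throttled(uris, delay=2.7, fetch=print):
    """Call fetch() on each URI, pausing `delay` seconds between requests."""
    for i, uri in enumerate(uris):
        if i:                    # no pause before the very first request
            time.sleep(delay)
        fetch(uri)
```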

The --user-agent option specifies the value of the User-Agent header to use in HTTP requests, and may come in handy should the target server block access based on this header’s data. By default, the header is composed of Head-r’s own name and version, and the identity of the libwww-perl library used.
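For comparison, this is what overriding the User-Agent header of a HEAD request looks like in a plain Python HTTP client (illustrative only; the header value below is a made-up placeholder of the same general shape, not the string Head-r actually sends):

```python
import urllib.request

# Build a HEAD request with an explicit User-Agent header.
# "head-r/0.0 libwww-perl/0.0" is a placeholder, not a real version.
req = urllib.request.Request(
    "http://example.org/",
    method="HEAD",
    headers={"User-Agent": "head-r/0.0 libwww-perl/0.0"},
)
print(req.get_method())                 # HEAD
print(req.get_header("User-agent"))     # head-r/0.0 libwww-perl/0.0
```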

Bugs
Please consider reporting any bugs in the Head-r software not listed below via the CPAN RT, https://rt.cpan.org/Public/Dist/Display.html?Name=head-r. The bugs in this documentation should be reported to the respective Wikibooks Talk page – or you may actually fix them yourself!

As with any other automatic retrieval tool, it is possible to abuse Head-r to cause excessive load on third-party servers. The user is advised to consider the network environment when using the tool, especially when lowering the --wait setting or raising the maximum recursion --depth beyond reasonable values.

There’s currently no way to disable the robots.txt file processing.

The code only tries to retrieve URIs from content marked with the text/html media type, even though it seems as if support for application/xhtml+xml (and perhaps several other XML-based types, such as SVG) could be implemented rather easily.

The resource to retrieve URIs from is first loaded into memory in its entirety, while it should be possible to process it on the fly.

The handling of recursion depths retrieved from --input files may be somewhat unintuitive, and out of the user’s control. (Although it’s still possible to edit such files using third-party tools, such as AWK.)

The code implements a trivial work-around for the long-standing Net::HTTP bug #29468.

Availability
The latest stable version of the code is available from CPAN. See, for instance, the respective MetaCPAN page at https://metacpan.org/release/head-r.

The latest development version can be obtained from the Git repository, like so:

$ git clone -- \
      http://am-1.org/~ivan/archives/git/head-r-2013.git/ head-r

A Gitweb interface is available at http://am-1.org/~ivan/archives/git/gitweb.cgi?p=head-r-2013.git.

Author
Head-r is written by Ivan Shmakov.

Head-r is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This documentation is a free collaborative project going on at Wikibooks, and is available under the Creative Commons Attribution/Share-Alike License (CC BY-SA) version 3.0.