The Sourcerer's Apprentice

The adventures of David Heinemann in IT & software development

Archiving a Website With Wget


Lately I’ve been following ArchiveTeam, a group that saves historical parts of the Internet by archiving them before they get pulled down forever. They received significant coverage in 2010 when they archived a 900GB chunk of Geocities before Yahoo! gave it the axe. More recently, an anonymous geek (or group?) archived the 172 websites that the BBC decided to kill at the end of 2011. These groups inspire me to protect not only my own data, but other people’s as well.

I’ve archived websites in the past, for example, when the large CoreWar fansite KOTH.org announced its shutdown in early 2009 (luckily, the decision was later reversed). Back then I used HTTrack, but after reading ArchiveTeam’s website I switched to Wget. It’s much simpler to use and probably just as effective, if not more so. Wget is available out of the box on practically all UNIX systems, but Windows users like me will need to download the GNUWin32 version and will probably want to add it to the system path (a quick sketch of that follows below).
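For Windows users, one way to add Wget to the path from a command prompt is shown here. This is only a sketch: it assumes the default GNUWin32 install directory, which may differ on your machine.

setx PATH "%PATH%;C:\Program Files (x86)\GnuWin32\bin"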

I’ve already dug up a few of the great websites I used to visit as a kid and archived them, out of fear that they will one day be gone. Some of these sites are literally 15+ years old, and I’d hate to see them go.

Why not let Archive.org do the archiving?

The Internet Archive is a great site, offering not only archived websites but other media too. I appreciate the work they do, but I don’t like to rely on them. While they do an excellent job of grabbing website text, their archives often seem slow and incomplete; the service can take ages to serve up a single page, and you’re lucky if the entire website is available. This is understandable: the Archive stores an insane amount of data, probably hundreds of terabytes of text alone, in addition to video and audio, so I don’t blame them. Moreover, they aren’t responsible for archiving everything. It’s not their job to CTRL+C and CTRL+V the whole Internet. They are a non-profit organisation that archives what they can as a free service.

If you want a complete archive, a DIY job is the way to go. That way you can guarantee that you have a fully-functional local copy, complete with dependencies like images and stylesheets. Don’t be lazy and wait for somebody else to do it - what if nobody does?

Archiving a website

I won’t cover Wget in its entirety. This has already been done to a good extent by other sites (see below for a few). Instead, I’ll share the command I use to archive a single website.

wget -mpck --user-agent="" -e robots=off --wait 1 www.foo.com

Explanation

Here is a quick explanation of each parameter. Note that they are case sensitive.

  • -m (Mirror)
    • Turns on mirror-friendly settings like infinite recursion depth, timestamps, etc.
  • -c (Continue)
    • Resumes a partially-downloaded transfer
  • -p (Page requisites)
    • Downloads any page dependencies like images, style sheets, etc.
  • -k (Convert)
    • After completing retrieval of all files…
      • converts all absolute links to other downloaded files into relative links
      • converts all relative links to any files that weren’t downloaded into absolute, external links
      • in a nutshell: makes your website archive work locally
  • --user-agent=""
    • Websites sometimes use robots.txt to block certain agents, such as web crawlers (e.g. GoogleBot) and Wget. This tells Wget to send a blank user-agent, preventing identification. You could alternatively use a web browser’s user-agent string so the requests look like they come from a browser (see the example after this list), but it probably doesn’t matter.
  • -e robots=off
    • Sometimes you’ll run into a site with a robots.txt that blocks everything. In these cases, this setting will tell Wget to ignore it. Like the user-agent, I usually leave this on for the sake of convenience.
  • --wait 1
    • Tells Wget to wait 1 second between requests, which makes the download a bit less taxing on the server.
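As mentioned above, you can masquerade as a regular browser by supplying a browser-like user-agent instead of a blank one. A rough sketch of the same command follows; the user-agent string is only an example (substitute whatever your browser actually sends), and www.foo.com is still a placeholder domain.

wget -mpck --user-agent="Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0" -e robots=off --wait 1 www.foo.com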

Wget script

You may find it convenient to write a small script that calls Wget with your preferred parameters, saving you from having to write or memorise them every time. You could probably use just about anything, but I went with Perl. This script takes one or more URLs (passed as a single quoted string) and an optional wait time in seconds, and executes the same Wget command as above. If no wait is given, it defaults to 1 second.

#!/usr/bin/perl
# Usage: wget.pl "<URL(s)>" [wait]
use strict;
use warnings;

# Require at least one URL; pass several as a single quoted argument.
die("Please provide one or more URLs.\n") unless ($ARGV[0]);

# Seconds to wait between requests; defaults to 1.
my $waitTime = $ARGV[1] || 1;

# Run the same archive command as above.
system("wget -mpck --user-agent=\"\" -e robots=off --wait $waitTime $ARGV[0]");
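For example, to archive two sites (placeholder URLs) with a two-second pause between requests:

perl wget.pl "www.foo.com www.bar.com" 2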

Further information
