There are many reasons why an organization might want to archive a website. For example, it might be a public sector or financial services entity that’s legally obligated to keep accurate records of all website data. Or the organization could be aiming to better protect itself against false claims and intellectual property theft of website content. Or perhaps a completely new website is being launched and the old one has to be archived to ensure the long-term preservation of what amounts to an important historical document for the organization.
Regardless of the reason behind it, however, the question still remains: how do you actually capture and preserve a website—not merely a specific webpage but an entire website with a multitude of pages?
There are several ways to archive a website. A single webpage can simply be saved to your hard drive, free online archive tools such as HTTrack and the Wayback Machine can be used, or you can depend on a CMS backup. But the best way to capture a site is to use an automated archiving solution that captures every change.
The Wayback Machine is great because it not only allows you to type in just about any website and see what it looked like years ago (revisiting the pre-2000s website of a company like Apple can be very entertaining), but it also lets you archive websites yourself.
Unfortunately, this isn’t automated archiving. Although it lets you archive a website online free of charge, you have to manually save individual pages, which can be a slow and laborious process. Moreover, if you’re looking for a complete record, you’d need to archive a page again every time it changes in order to accurately capture and preserve changes and deletions.
The Wayback Machine allows for easy archiving of webpages.
By contrast, HTTrack aims to make it easier to archive a complete website. Using this software, it’s possible to download a website to your computer with the click of a button. HTTrack can even update an existing mirrored site and resume interrupted downloads, making it relatively simple to keep an archived copy of the latest version of a website.
On the downside, HTTrack is unlikely to provide you with a complete website archive whose pages look exactly like the online version. Why? Modern websites are incredibly complex, and accurately archiving all that data isn’t easy. When it comes to things like images, embedded videos, JavaScript/Ajax frameworks, web form flows, and password-protected pages, a piece of free software like HTTrack is unlikely to capture everything perfectly. Content gaps and missing images are extremely likely.
Also, even though HTTrack theoretically allows you to download an entire site, the process isn’t fully automated; you’d still need to manually download the site every time you want to create a new archive.
HTTrack lets you download complete websites to your computer.
Many modern content management systems (CMSs) offer some form of backup to help ensure that crucial data isn’t lost. And as mentioned earlier, some organizations assume that they can depend on this backup as a website archive. Although this is true to a certain extent, the average CMS backup has significant limitations.
An automated website archiving service like Pagefreezer allows organizations to keep a complete record of website content. We use technology similar to that of search engines like Google to crawl a site at regular intervals and capture all changes and deletions. Through our user-friendly dashboard, customers can then view chronological versions of any given page and instantly see what’s changed: deletions are highlighted in red and additions are shown in green.
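Conceptually, this capture-and-compare cycle can be sketched in a few lines of Python. The snippet below is purely illustrative of the general technique using the standard difflib module; it is not Pagefreezer’s actual technology, and the URL and fetcher are placeholders.

```python
# Illustrative sketch of snapshot-based change detection -- not an
# actual archiving product's implementation.
import difflib
from datetime import datetime, timezone

def snapshot(url: str, fetch) -> dict:
    """Capture one version of a page. `fetch` is any callable that
    returns the page's HTML (e.g. a wrapper around an HTTP client)."""
    return {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "html": fetch(url),
    }

def diff_versions(old: dict, new: dict) -> list[str]:
    """Compare two captures of the same page. Lines starting with '-'
    are deletions (shown in red in a UI); '+' marks additions (green)."""
    return [
        line
        for line in difflib.unified_diff(
            old["html"].splitlines(),
            new["html"].splitlines(),
            lineterm="",
        )
        if line.startswith(("-", "+"))
        and not line.startswith(("---", "+++"))
    ]

# Demo with a stubbed fetcher instead of live HTTP (URL is a placeholder):
pages = iter(["<h1>Sale ends Friday</h1>", "<h1>Sale extended!</h1>"])
fetch = lambda url: next(pages)
v1 = snapshot("https://example.com/promo", fetch)
v2 = snapshot("https://example.com/promo", fetch)
print(diff_versions(v1, v2))
# ['-<h1>Sale ends Friday</h1>', '+<h1>Sale extended!</h1>']
```

Run on a schedule (say, hourly), each crawl produces a new timestamped snapshot, and comparing consecutive snapshots is what makes deletions and additions visible over time.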
Want to learn more? See how Pagefreezer is archiving 150,000 webpages to meet the needs of a leading global tech company’s legal and marketing teams.