There are many reasons why an organization might want to archive a website. For example, it might be a public sector or financial services entity that’s legally obligated to keep accurate records of all website data. Or the organization could be aiming to better protect itself against false claims and intellectual property theft of website content. Or perhaps a completely new website is being launched and the old one has to be archived to ensure the long-term preservation of what amounts to an important historical document for the organization.
Regardless of the reason behind it, however, the question still remains: how do you actually capture and preserve a website—not merely a specific webpage but an entire website with a multitude of pages?
How Do You Archive a Website?
There are several ways to archive a website. You can save a single webpage to your hard drive, use free archiving tools such as HTTrack and the Wayback Machine, or depend on a CMS backup. But the best way to capture a site is to use an automated archiving solution that captures every change.
The Wayback Machine is great because it not only allows you to type in just about any website and see what it looked like years ago (revisiting the pre-2000s website of a company like Apple can be very entertaining), but also lets you archive websites yourself.
Unfortunately, this isn’t automated archiving. Although it lets you archive a website online free of charge, you have to manually save individual pages, so it can be a slow and laborious process. Moreover, if you’re looking for a complete record you’d need to archive a webpage again every time a change is made to the page in order to accurately capture and preserve changes and deletions.
The Wayback Machine allows for easy archiving of webpages.
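Because each page has to be submitted separately, archiving even a small site this way means one request per URL. The sketch below assumes the Wayback Machine's public "Save Page Now" address pattern (https://web.archive.org/save/ followed by the page URL). The function name and example URLs are illustrative only, and the code builds the request addresses without sending anything.

```python
SAVE_PREFIX = "https://web.archive.org/save/"  # assumed Save Page Now URL pattern


def save_page_requests(page_urls):
    """Return one Save Page Now request URL per unique page, in input order."""
    seen = set()
    requests = []
    for url in page_urls:
        if url in seen:
            continue  # skip duplicates so each page is submitted only once
        seen.add(url)
        requests.append(SAVE_PREFIX + url)
    return requests


# Hypothetical pages, used only for illustration.
pages = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/",  # duplicate, ignored
]
for request_url in save_page_requests(pages):
    print(request_url)
```

Every page on the site needs its own entry in the list, and the whole list has to be resubmitted after each change, which is why this approach scales poorly.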
By contrast, HTTrack aims to make it easier to archive a complete website. Using this software, it’s possible to download a website to your computer with the click of a button. HTTrack can even update an existing mirrored site and resume interrupted downloads, so keeping an archived copy of the latest version of a website is relatively simple.
However, even though it theoretically allows you to download an entire site, this isn’t a completely automated process; you’d still need to manually download the site every time you want to create a new archive.
HTTrack lets you download complete websites to your computer.
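For readers who prefer the command line to the graphical interface, the HTTrack workflow described above can be sketched as two commands: one for the first full download and one for refreshing an existing mirror. This is a minimal sketch, assuming the httrack command-line program is installed and using its documented -O (output folder) and --update options. The helper names, site URL, and folder are hypothetical.

```python
import subprocess


def mirror_command(url, output_dir):
    """Arguments for a first full download of `url` into `output_dir`."""
    return ["httrack", url, "-O", output_dir]


def update_command():
    """Arguments for refreshing an existing mirror (run inside its folder)."""
    return ["httrack", "--update"]


def run_httrack(args, cwd=None):
    """Run HTTrack; raises CalledProcessError if it exits with an error."""
    subprocess.run(args, cwd=cwd, check=True)


if __name__ == "__main__":
    # Hypothetical site and folder, used only for illustration.
    # Nothing is downloaded here; the commands are only printed.
    print(" ".join(mirror_command("https://example.com/", "./example-mirror")))
    print(" ".join(update_command()))
```

Even when scripted like this, each run only records the site as it looked at that moment, so any change made and reverted between runs is never captured.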
Why a CMS Backup Is Not an Archive
Many modern content management systems (CMSs) offer some form of backup to help ensure that crucial data isn’t lost. As mentioned earlier, some organizations assume that they can depend on this backup as a website archive. Although this is true to a certain extent, the average CMS backup has significant limitations.
A traditional CMS backup lacks the following:
- Full-Text Search: Much of the value of a true website archive lies in being able to find a specific page, post, or even phrase among thousands of archived pages. For instance, if the content from a specific page published years ago is needed for litigation, full-text search becomes immensely useful in finding that specific data. A CMS backup will not provide this sort of search, however.
- Digital Signatures: Speaking of litigation, in order for digital evidence to be defensible in court, a digital signature is required to authenticate it and prove that it hasn’t been tampered with. Data taken from a CMS backup will not have this digital signature.
- Easy Access to Archives: In order for a website archive to truly be useful, departments like HR, Legal, and Marketing should be able to access this data fairly easily. If it takes too much time and effort, teams are far less likely to actually make use of it in their day-to-day work. Gaining access to data hiding in a CMS backup can often be tricky.
- Live Replay: The more closely an archive resembles the look and feel of the original platform, the easier it is to navigate and find what you need. A legal team faced with a CMS backup often spends hours locating the record it’s looking for, simply because the backup isn’t set up to facilitate search.
- Metadata: As with digital signatures, having access to the metadata associated with any website record is crucial when it comes to litigation or regulatory audits. CMS backups do not allow legal teams to easily export a record with all its metadata.
- Compliant Data Storage: For regulated industries with specific recordkeeping rules—such as the public sector and financial services—a CMS backup does not meet requirements. In other words, should an organization’s website records be audited, simply supplying data from a CMS backup would not be good enough.
- Accessibility: In order for a website archive to be truly useful, teams should be able to gain quick and easy access to it. This is rarely the case with a CMS backup; gaining access can take hours and require the involvement of IT. Hence, it is typically more of a solution for the IT department than it is for Legal or Compliance.
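To make the digital signature point in the list above more concrete, the sketch below shows one way a captured page could be fingerprinted so that later tampering is detectable. It is an illustration only, not a description of how any particular archiving product works. It uses an HMAC (a keyed hash from Python's standard library) as a stand-in for a true public-key digital signature with a trusted timestamp. The function names, the record fields, and the key are all assumptions.

```python
import hashlib
import hmac


def sign_capture(content, captured_at, key):
    """Fingerprint a captured page and bind it to its capture time."""
    digest = hashlib.sha256(content).hexdigest()
    message = f"{captured_at}|{digest}".encode("utf-8")
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {"captured_at": captured_at, "sha256": digest, "hmac": tag}


def verify_capture(content, record, key):
    """Return True only if the content still matches the stored record."""
    expected = sign_capture(content, record["captured_at"], key)
    return hmac.compare_digest(expected["hmac"], record["hmac"])


# Hypothetical key and page content, used only for illustration.
secret_key = b"replace-with-a-real-secret"
original = b"<html><body>Terms of service, version 1</body></html>"
record = sign_capture(original, "2024-01-01T00:00:00Z", secret_key)

print(verify_capture(original, record, secret_key))                 # unchanged page
print(verify_capture(original + b"<!-- edited -->", record, secret_key))  # altered page
```

A plain CMS backup stores the page content but no record like this, so there is nothing to check a restored page against.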
Automated Website Capture
An automated website archiving service like Pagefreezer allows organizations to keep a complete record of website content. We use technology similar to that used by search engines like Google to crawl a site at regular intervals and capture all changes and deletions. Through our user-friendly dashboard, customers can then view chronological versions of any given page and instantly see what’s changed: deletions are highlighted in red and additions are shown in green.
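The idea of comparing chronological versions of a page can be illustrated with a small line-by-line comparison. This is a generic sketch built on Python's standard difflib module, not Pagefreezer's implementation. The function name and the sample page text are assumptions made for the example.

```python
import difflib


def diff_captures(old_text, new_text):
    """Return (added, removed) lines between two captures of one page."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    added, removed = [], []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(old_lines[i1:i2])  # would be shown in red
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])    # would be shown in green
    return added, removed


# Hypothetical captures of the same page, taken at two different times.
january = "Welcome\nPrice: $10\nContact us"
february = "Welcome\nPrice: $12\nContact us\nNew FAQ"

added, removed = diff_captures(january, february)
print("Added:", added)
print("Removed:", removed)
```

Running a comparison like this after every crawl is what turns a series of snapshots into a record of exactly what changed and when.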
With automated website capture software, finding what you’re looking for and exporting that data is also much easier than with a CMS backup. Pagefreezer offers advanced search that allows you to quickly find a specific keyword or phrase in an archive, and then export that information (complete with metadata) in PDF or WARC format.
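As a rough illustration of what keyword and phrase search over an archive involves, the sketch below builds a simple inverted index (a map from each word to the pages containing it) and uses it to find pages that contain an exact phrase. It is a generic teaching example rather than Pagefreezer's search engine. The function names and the sample pages are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, pages, phrase):
    """Return sorted URLs of pages containing the exact phrase."""
    words = tokenize(phrase)
    if not words:
        return []
    # Pages must contain every word before the phrase order is checked.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    needle = " " + " ".join(words) + " "
    return sorted(
        url for url in candidates
        if needle in " " + " ".join(tokenize(pages[url])) + " "
    )


# Hypothetical archived pages, used only for illustration.
archive = {
    "/terms-2019": "Our refund policy changed in 2019.",
    "/support": "Refund requests: see our policy page.",
}
index = build_index(archive)
print(search(index, archive, "refund policy"))
print(search(index, archive, "policy"))
```

The index narrows thousands of pages down to a handful of candidates before any page text is scanned, which is what keeps phrase search fast on a large archive.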
Want to learn more? See how Pagefreezer is archiving 150,000 webpages to meet the needs of a leading global tech company’s legal and marketing teams.