How Do You Archive a Website?

There are many reasons why an organization might require a website archive solution. For example, it might be a public sector or financial services entity that’s legally obligated to keep accurate records of all website data. Or the organization could be aiming to better protect itself against false claims and intellectual property theft of website content. Or perhaps a completely new website is being launched and the old one has to be archived to ensure the long-term preservation of what amounts to an important historical document for the organization.

Regardless of the reason behind it, however, the question still remains: how do you actually capture and preserve a website—not merely a specific webpage but an entire website with a multitude of pages?

There are several ways to archive a website. You can save a single webpage to your hard drive, use free tools such as HTTrack and the Wayback Machine, or depend on a CMS backup. But the best way to capture a site is to use an automated archiving solution that captures every change.

The Wayback Machine is great because it not only allows you to type in just about any website and see what it looked like years ago (revisiting the pre-2000s website of a company like Apple can be very entertaining), but also lets you archive webpages yourself.

Unfortunately, this isn’t automated archiving. Although it lets you archive a website online free of charge, you have to manually save individual pages, so it can be a slow and laborious process. Moreover, if you’re looking for a complete record you’d need to archive a webpage again every time a change is made to the page in order to accurately capture and preserve changes and deletions. 

The Wayback Machine allows for easy archiving of webpages. 
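
To make the manual re-archiving problem concrete, here is a minimal Python sketch that works out which pages have changed since they were last saved, and builds the Wayback Machine URLs you would then request. The two endpoints reflect the Internet Archive's publicly documented "Save Page Now" and availability services as we understand them, so check the current documentation before relying on them. The function names and the idea of keeping a dictionary of earlier digests are assumptions made for this illustration.

```python
import hashlib
from urllib.parse import urlencode

# Endpoints as publicly documented by the Internet Archive (verify before use).
SAVE_ENDPOINT = "https://web.archive.org/save/"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def save_url(page_url: str) -> str:
    """URL that asks the Wayback Machine to capture page_url when requested."""
    return SAVE_ENDPOINT + page_url


def availability_url(page_url: str) -> str:
    """URL that looks up the closest existing snapshot of page_url."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def content_digest(content: bytes) -> str:
    """SHA-256 fingerprint used to notice that a page has changed."""
    return hashlib.sha256(content).hexdigest()


def pages_to_resave(current: dict[str, bytes], last_digests: dict[str, str]) -> list[str]:
    """Return the pages that are new or whose content differs from the last save."""
    changed = [
        url
        for url, content in current.items()
        if last_digests.get(url) != content_digest(content)
    ]
    return sorted(changed)
```

Even with a helper like this, someone still has to fetch every page, run the comparison, and request each save, which is exactly the manual effort described above.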

By contrast, HTTrack aims to make it easier to archive a complete website. Using this software, it’s possible to download a website to your computer with the click of a button. HTTrack can even update an existing mirrored site and resume interrupted downloads, so keeping an archived copy of the latest version of a website is relatively simple.

On the downside, HTTrack is unlikely to provide you with a complete website archive with webpages that look exactly like the online version. Why? Well, modern websites are incredibly complex, and accurately archiving all that data isn’t easy. When it comes to things like images, embedded videos, JavaScript/Ajax frameworks, web form flows, and password-protected pages, a piece of free software like HTTrack is unlikely to capture all of this complicated content perfectly. Some content gaps and missing images are extremely likely.

Also, even though it allows you to theoretically download an entire site, this isn’t a completely automated process; you’d still need to manually download the site every time you want to create a new archive.

HTTrack lets you download complete websites to your computer.
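
If you do go the HTTrack route, the repeated manual download can at least be scripted. The Python sketch below builds an HTTrack command line that mirrors a site into a dated folder, so each run produces a separate snapshot. It assumes the `httrack` command-line program is installed and uses its `-O` option to set the output path. The helper names and the folder layout are hypothetical choices for this example, not part of HTTrack.

```python
import posixpath
import subprocess
from datetime import date


def httrack_command(site_url: str, archive_root: str, day: date) -> list[str]:
    """Build an HTTrack command that mirrors site_url into a dated folder."""
    target = posixpath.join(archive_root, day.isoformat())
    # -O sets the directory for the mirror and HTTrack's log files.
    return ["httrack", site_url, "-O", target]


def run_snapshot(site_url: str, archive_root: str) -> None:
    """Run HTTrack for today's date; raises if the program exits with an error."""
    subprocess.run(httrack_command(site_url, archive_root, date.today()), check=True)
```

Calling `run_snapshot("https://www.example.com/", "/archives/example")` from a scheduled job would create one folder per day. It would not fix the content gaps mentioned above, and somebody still has to watch the job and check the results.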

Why a CMS Backup Is Not an Archive 

Many modern content management systems (CMSs) offer some form of backup to help ensure that crucial data isn’t lost. And as mentioned earlier, some organizations assume that they can depend on this backup as a website archive. Although a backup does preserve the site’s content, the average CMS backup has a lot of limitations when used as an archive.

A traditional CMS backup lacks the following:

  • Full-Text Search: Much of the value of a true website archive lies in being able to find a specific page, post, or even phrase among thousands of archived pages. For instance, if the content from a specific page published years ago is needed for litigation, full-text search becomes immensely useful in finding that specific data. A CMS backup will not provide this sort of search, however.
  • Digital Signatures: Speaking of litigation, in order for digital evidence to be defensible in court, a digital signature is required to authenticate it and prove that it hasn’t been tampered with. Data taken from a CMS backup will not have this digital signature. 
  • Easy Access to Archives: In order for a website archive to truly be useful, departments like HR, Legal, and Marketing should be able to access this data fairly easily. If it takes too much time and effort, teams are far less likely to actually make use of it in their day-to-day work. Gaining access to data hiding in a CMS backup can often be tricky. 
  • Live Replay: The more closely an archive resembles the look and feel of the original website, the easier it is to navigate and find what you need. A legal team faced with a CMS backup often finds itself spending hours locating the record it’s looking for, simply because the backup isn’t set up to facilitate search.
  • Metadata: As with digital signatures, having access to the metadata associated with any website record is crucial when it comes to litigation or regulatory audits. And CMS backups do not allow legal teams to easily export a record with all its metadata.
  • Compliant Data Storage: For regulated industries with specific recordkeeping rules—such as the public sector and financial services—a CMS backup does not meet requirements. In other words, should an organization’s website records be audited, simply supplying data from a CMS backup would not be good enough.
  • Accessibility: In order for a website archive to be truly useful, teams should be able to gain quick and easy access to it. This is rarely the case with a CMS backup; gaining access can take hours and require the involvement of IT. Hence, it is typically more of a solution for the IT department than it is for Legal or Compliance.
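
To illustrate the tamper-evidence and metadata points in the list above, here is a simplified Python sketch of an archive record that stores a SHA-256 fingerprint alongside the captured content. This is an assumption-laden illustration: a hash check is only one ingredient of a true digital signature, which would also involve a signing key and typically a trusted timestamp. The `ArchiveRecord` class and its field names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveRecord:
    """One captured page plus the metadata needed to check it later."""
    url: str
    captured_at: str  # ISO 8601 timestamp supplied by the capture process
    content: bytes
    sha256: str       # fingerprint computed at capture time


def make_record(url: str, captured_at: str, content: bytes) -> ArchiveRecord:
    """Create a record and fingerprint its content at capture time."""
    return ArchiveRecord(url, captured_at, content, hashlib.sha256(content).hexdigest())


def is_untampered(record: ArchiveRecord) -> bool:
    """True if the stored content still matches the fingerprint taken at capture."""
    return hashlib.sha256(record.content).hexdigest() == record.sha256
```

A plain CMS backup stores the content but usually nothing like the `captured_at` and `sha256` fields, which is why it is hard to show later that a page has not been altered.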

Automated Website Capture

An automated website archiving service like Pagefreezer allows organizations to keep a complete record of website content. We use technology similar to that used by search engines like Google to crawl a site at regular intervals and capture all changes and deletions. Through our user-friendly dashboard, customers can then view chronological versions of any given page and instantly see what’s changed: deletions are highlighted in red and additions are shown in green.

Pagefreezer offers automated archiving of website content.
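
The idea of comparing two captures of the same page can be sketched in a few lines of Python using the standard `difflib` module. This is not how Pagefreezer's own comparison works. It is only a generic line-by-line illustration, and the function name `page_changes` is made up for this example.

```python
import difflib


def page_changes(old_text: str, new_text: str) -> dict[str, list[str]]:
    """Compare two captures line by line and list what was deleted and added."""
    deleted: list[str] = []
    added: list[str] = []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("- "):    # line exists only in the older capture
            deleted.append(line[2:])
        elif line.startswith("+ "):  # line exists only in the newer capture
            added.append(line[2:])
    return {"deleted": deleted, "added": added}
```

A dashboard could then render the `deleted` lines in red and the `added` lines in green for each pair of consecutive captures.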

With automated website capture software, finding what you’re looking for and exporting that data is also much easier than with a CMS backup. Pagefreezer offers advanced search that allows you to quickly find a specific keyword or phrase in an archive, and then export that information (complete with metadata) in PDF or WARC.
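
To show why searchable archives are so much quicker to work with, here is a toy full-text index in Python. It is a minimal sketch with hypothetical function names, and it matches pages that contain every word in the query rather than an exact phrase. A production search engine does far more, such as ranking, phrase matching, and handling of dates and metadata.

```python
import re
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of archived URLs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the URLs that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Once the index is built, a lookup touches only the entries for the query words instead of opening every archived page, which is the difference between seconds and hours when a legal team needs one specific record.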

Have any questions about website archiving? Or simply looking for more information? Get in touch and we'll email you everything you need.



Peter Callaghan
Peter Callaghan is the Chief Revenue Officer at Pagefreezer. He has a very successful record in the tech industry, bringing significant market share increases and exponential revenue growth to the companies he has served. Peter has a passion for building high-performance sales and marketing teams, developing value-based go-to-market strategies, and creating effective brand strategies.
