
How Do You Archive a Website?

There are many reasons why an organization might want to archive a website. For example, it might be a public sector or financial services entity that’s legally obligated to keep accurate records of all website data. Or the organization could be aiming to better protect itself against false claims and intellectual property theft of website content. Or perhaps a completely new website is being launched and the old one has to be archived to ensure the long-term preservation of what amounts to an important historical document for the organization.


Regardless of the reason behind it, however, the question still remains: how do you actually capture and preserve a website—not merely a specific webpage but an entire website with a multitude of pages?

How Do You Archive a Website?

There are several ways to archive a website. A single webpage can simply be saved to your hard drive, free tools such as HTTrack and the Wayback Machine can be used, or you can depend on a CMS backup. But the best way to capture a site is to use an automated archiving solution that captures every change.
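To make the simplest option concrete, here's a minimal sketch of saving a single page's raw HTML with a short script. It assumes Python and the requests library, and the URL is a placeholder; note that it captures only the HTML itself, which is exactly why this approach falls short of a full archive.

```python
# Minimal sketch: save one page's raw HTML to disk.
# Assumes Python with the `requests` library (pip install requests);
# the URL is a placeholder.
from datetime import datetime, timezone

import requests

url = "https://example.com/"  # page to capture

response = requests.get(url, timeout=30)
response.raise_for_status()  # stop on HTTP errors (4xx/5xx)

# Timestamp the file so repeated captures can be told apart.
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
with open(f"capture-{stamp}.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```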

The Wayback Machine is great because it not only allows you to type in just about any website and see what it looked like years ago (revisiting the pre-2000s website of a company like Apple can be very entertaining), but it also lets you archive webpages yourself.

Unfortunately, this isn't automated archiving. Although it lets you archive a website online free of charge, you have to save individual pages manually, which makes the process slow and laborious. Moreover, if you're looking for a complete record, you'd need to re-archive a page every time it changes in order to accurately capture updates and deletions.

The Wayback Machine makes it easy to archive a single webpage.
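For scripted captures, the Wayback Machine also exposes a public "Save Page Now" endpoint at web.archive.org/save/. Below is a minimal sketch, assuming Python and the requests library; each request still archives only a single page.

```python
# Sketch: ask the Wayback Machine to capture one page via its public
# "Save Page Now" endpoint. Assumes the `requests` library; the target
# URL is a placeholder, and each request archives a single page only.
import requests

page = "https://example.com/about"

resp = requests.get(f"https://web.archive.org/save/{page}", timeout=120)
resp.raise_for_status()

# When the capture succeeds, Content-Location (if present) points at
# the new snapshot, e.g. /web/20240101000000/https://example.com/about
print(resp.headers.get("Content-Location", "capture submitted"))
```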

By contrast, HTTrack aims to make it easier to archive a complete website. Using this software, it's possible to download a website to your computer with the click of a button. HTTrack can even update an existing mirrored site and resume interrupted downloads, making it relatively simple to keep an archived copy of the latest version of a website.

On the downside, HTTrack is unlikely to provide you with a complete website archive whose pages look exactly like the online version. Why? Modern websites are incredibly complex, and accurately archiving all that data isn't easy. When it comes to things like images, embedded videos, JavaScript/AJAX frameworks, web form flows, and password-protected pages, a piece of free software like HTTrack is unlikely to capture all of this complicated content perfectly. Content gaps and missing images are very likely.

Also, even though HTTrack theoretically lets you download an entire site, the process isn't fully automated; you'd still need to re-download the site manually every time you want to create a new archive.

HTTrack lets you download complete websites to your computer.
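HTTrack does ship with a command-line client, so mirroring can at least be scripted. The sketch below drives it from Python's subprocess module; it assumes httrack is installed and on your PATH, and the URL, output directory, and scope filter are placeholders.

```python
# Sketch: drive the HTTrack command-line client from Python.
# Assumes `httrack` is installed and on PATH; the URL, output
# directory, and scope filter are placeholders.
import subprocess

subprocess.run(
    [
        "httrack",
        "https://example.com/",  # site to mirror
        "-O", "./site-archive",  # output directory for the mirror
        "+*.example.com/*",      # stay within the site's own domain
        "-v",                    # verbose progress output
    ],
    check=True,  # raise if httrack exits with an error
)

# Re-running later with httrack's --update option refreshes the mirror
# in place, and --continue resumes an interrupted download.
```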

Why a CMS Backup Is Not an Archive 

Many modern content management systems (CMSs) offer some form of backup to help ensure that crucial data isn't lost. And as mentioned earlier, some organizations assume they can depend on this backup as a website archive. While a backup can serve that purpose to a limited extent, the average CMS backup has serious limitations.

A traditional CMS backup lacks the following:

  • Full-Text Search: Much of the value of a true website archive lies in being able to find a specific page, post, or even phrase among thousands of archived pages. For instance, if the content from a specific page published years ago is needed for litigation, full-text search becomes immensely useful in finding that specific data. A CMS backup will not provide this sort of search, however.
  • Digital Signatures: Speaking of litigation, in order for digital evidence to be defensible in court, a digital signature is required to authenticate it and prove that it hasn't been tampered with (the sketch after this list illustrates the underlying idea). Data taken from a CMS backup will not have this digital signature.
  • Easy Access to Archives: In order for a website archive to truly be useful, departments like HR, Legal, and Marketing should be able to access this data fairly easily. If it takes too much time and effort, teams are far less likely to actually make use of it in their day-to-day work. Gaining access to data hiding in a CMS backup can often be tricky. 
  • Live Replay: The closer an archive resembles the look and feel of the original platform, the easier it is to navigate and find what you need. A legal team faced with a CMS backup often finds itself spending hours locating the record it's looking for, simply because the backup isn't set up to facilitate search.
  • Metadata: As with digital signatures, having access to the metadata associated with any website record is crucial when it comes to litigation or regulatory audits. And CMS backups do not allow legal teams to easily export a record with all its metadata.
  • Compliant Data Storage: For regulated industries with specific recordkeeping rules—such as the public sector and financial services—a CMS backup does not meet requirements. In other words, should an organization’s website records be audited, simply supplying data from a CMS backup would not be good enough.
  • Accessibility: In order for a website archive to be truly useful, teams should be able to gain quick and easy access to it. This is rarely the case with a CMS backup; gaining access can take hours and require the involvement of IT. Hence, it is typically more of a solution for the IT department than it is for Legal or Compliance.
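To make the digital-signatures point concrete, the sketch below shows the fingerprinting idea that underpins tamper evidence: a cryptographic hash recorded at capture time will no longer match if the archived file is altered later. It uses only Python's standard library and illustrates the concept, not Pagefreezer's implementation; production systems pair such hashes with signed, trusted timestamps.

```python
# Sketch of the tamper-evidence idea behind digital signatures: record a
# SHA-256 fingerprint at capture time; if the archived file is altered
# later, re-computing the digest will reveal the change. Standard-library
# Python only; real archives pair hashes with signed timestamps.
import hashlib

def fingerprint(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The file name is a placeholder for any archived capture.
print(fingerprint("capture-20240101T000000Z.html"))
```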

Automated Website Capture

An automated website archiving service like Pagefreezer allows organizations to keep a complete record of website content. We use crawler technology similar to that used by search engines like Google to visit a site at regular intervals and capture all changes and deletions. Through our user-friendly dashboard, customers can then view chronological versions of any given page and instantly see what's changed: deletions are highlighted in red and additions are shown in green.
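As an illustration of that change-detection idea (not Pagefreezer's actual engine), here's a minimal sketch using Python's standard difflib module to compare two captures of the same page; the file names are placeholders.

```python
# Sketch: surface the differences between two captures of the same page
# using Python's standard difflib module. File names are placeholders;
# "-" lines were deleted and "+" lines were added, analogous to the
# red/green change view described above.
import difflib

with open("capture-old.html", encoding="utf-8") as f:
    old = f.readlines()
with open("capture-new.html", encoding="utf-8") as f:
    new = f.readlines()

for line in difflib.unified_diff(old, new, fromfile="old", tofile="new"):
    print(line, end="")
```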

With automated website capture software, finding what you're looking for and exporting that data is also much easier than with a CMS backup. Pagefreezer offers advanced search that allows you to quickly find a specific keyword or phrase in an archive, and then export that information (complete with metadata) in PDF or WARC format.
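For context on the WARC side, here's a minimal sketch that writes a fetched page into a WARC file using the open-source warcio library; the URL and file name are placeholders, and this illustrates the file format rather than Pagefreezer's export pipeline.

```python
# Sketch: write a fetched page into a WARC file with the open-source
# `warcio` library (pip install warcio requests). URL and file name are
# placeholders; this shows the format, not Pagefreezer's own pipeline.
from io import BytesIO

import requests
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

url = "https://example.com/"
resp = requests.get(url, timeout=30)

with open("archive.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders(
        "200 OK", list(resp.headers.items()), protocol="HTTP/1.1"
    )
    record = writer.create_warc_record(
        url, "response", payload=BytesIO(resp.content), http_headers=http_headers
    )
    writer.write_record(record)
```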

Want to learn more? See how Pagefreezer is archiving 150,000 webpages to meet the needs of a leading global tech company’s legal and marketing teams.

Download Case Study

 

Peter Callaghan
Peter Callaghan is the Chief Revenue Officer at Pagefreezer. He has a very successful record in the tech industry, bringing significant market share increases and exponential revenue growth to the companies he has served. Peter has a passion for building high-performance sales and marketing teams, developing value-based go-to-market strategies, and creating effective brand strategies.
