The Basics of Website Archiving

Generally speaking, whatever you post on the Internet stays on the Internet. Nothing is ever truly deleted online. Take, for example, all of the GeoCities online communities that were removed by Yahoo! in 2009. While they are no longer live, these archived websites – with their glittering text and late 1990s/early 2000s-era gifs – still remain, alive and kicking, on a popular Internet archiving service.

But what is website archiving? In a nutshell, website archiving is similar to the traditional archiving of documents, only digital. The process is the same: archivists select which information to save, then they store it and preserve it in an archive, which is then made accessible to the public. Web archivists typically oversee archived websites.

Because the Internet is so large, website archiving organizations rely on automated processes to collect websites. Using specially designed software known as crawlers (much like those Google uses to index pages), web archivists harvest websites from the live Internet and preserve them as snapshots of information at a particular point in time. These archived websites can then be navigated as if they were still live. The best-known example is the Internet Archive's Wayback Machine, which had saved around 357 billion web pages at the time of writing.
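To make the idea of point-in-time snapshots concrete, here is a minimal Python sketch of how an archive might keep several captures of the same page and return the one that was current on a given date. The SnapshotStore class and its method names are hypothetical and chosen for illustration; they do not describe how any particular archiving service is built.

```python
import bisect
from datetime import datetime


class SnapshotStore:
    """Keeps every capture of a URL, ordered by capture time (illustrative only)."""

    def __init__(self):
        # url -> list of (captured_at, content) tuples, kept sorted by time
        self._captures = {}

    def save(self, url, captured_at, content):
        """Record one capture of `url` taken at `captured_at`."""
        captures = self._captures.setdefault(url, [])
        bisect.insort(captures, (captured_at, content))

    def as_of(self, url, moment):
        """Return the latest capture taken at or before `moment`, or None."""
        captures = self._captures.get(url, [])
        times = [taken for taken, _ in captures]
        index = bisect.bisect_right(times, moment)
        return captures[index - 1][1] if index else None


store = SnapshotStore()
store.save("http://example.test/", datetime(2005, 1, 1), "<h1>Old home page</h1>")
store.save("http://example.test/", datetime(2009, 1, 1), "<h1>New home page</h1>")
page_in_2007 = store.as_of("http://example.test/", datetime(2007, 6, 1))
```

Asking for the page as it stood in mid-2007 returns the 2005 capture, because that was the most recent snapshot at that moment.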

Types of Web Archiving

There are three main ways of archiving content from the Internet: client-side web archiving, transaction-based web archiving, and server-side web archiving.

Client-Side Web Archiving

Relatively simple and scalable, client-side web archiving is the most popular method of web archiving. It can archive any website that is freely available on the Internet. The crawlers used in client-side web archiving imitate the way users interact with websites. This usually means starting from a seed page, then extracting its links and following them to the site's internal pages. The crawlers fetch a wide range of web material, from documents and text pages to photos, audio, and video files. A crawler stops only once it reaches the boundary of the domain in which it is operating.
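The crawl described above can be sketched in a few lines of Python. This is a simplified illustration, not a production crawler: it handles HTML links only, ignores robots.txt and rate limiting, and treats the seed's host name as the domain boundary. The crawl function, its fetch parameter (any callable that takes a URL and returns the page's HTML, or None on failure), and the example.test addresses are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl from seed_url that never leaves the seed's host.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns a dict mapping each captured URL to its HTML.
    """
    domain = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    snapshot = {}

    while queue and len(snapshot) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        snapshot[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            # The domain boundary: only queue links on the seed's host.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return snapshot


# A tiny in-memory "website" stands in for real HTTP requests.
site = {
    "http://example.test/": '<a href="/about">About</a> <a href="http://other.test/x">Elsewhere</a>',
    "http://example.test/about": '<a href="/">Home</a> <a href="contact#top">Contact</a>',
    "http://example.test/contact": "<p>No links here.</p>",
}
captured = crawl("http://example.test/", site.get)
```

Here the crawler captures the three pages on example.test and never requests the link to other.test, which lies outside the seed's domain. In a real deployment, fetch would issue HTTP requests and the snapshot would be written to durable storage.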

Transaction-Based Web Archiving

Transaction-based web archiving operates on the server side. It requires access to the web server hosting the content, and therefore the collaboration and agreement of the server’s owner. In this approach, only content that has been viewed at least once is archived; content that has never been viewed is not. With this method, it is possible to record exactly what data was seen and when.
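One way to picture this is a small piece of software that sits in front of the web application and logs every response it actually serves. The Python sketch below does this as WSGI middleware. The TransactionArchiver class and the fields it records are hypothetical, and a real system would write to durable, tamper-evident storage rather than an in-memory list.

```python
import hashlib
from datetime import datetime, timezone


class TransactionArchiver:
    """WSGI middleware that records each response the wrapped app serves."""

    def __init__(self, app, store):
        self.app = app
        self.store = store  # assumption: any object with an append() method

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # Remember the status line, then pass everything through unchanged.
            captured["status"] = status
            return start_response(status, headers, exc_info)

        # Buffer the full body so it can be archived exactly as it was sent.
        body = b"".join(self.app(environ, recording_start_response))
        self.store.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "method": environ.get("REQUEST_METHOD", "GET"),
            "path": environ.get("PATH_INFO", "/"),
            "status": captured.get("status"),
            "sha256": hashlib.sha256(body).hexdigest(),
            "body": body,
        })
        return [body]


def demo_app(environ, start_response):
    """A stand-in web application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello ", b"world"]


log = []
archived_app = TransactionArchiver(demo_app, log)
response = archived_app(
    {"REQUEST_METHOD": "GET", "PATH_INFO": "/page"},
    lambda status, headers, exc_info=None: None,
)
```

Because a record is created only when a request is served, pages that nobody visits never appear in the log, which matches the behavior described above.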

Server-Side Web Archiving

This method of web archiving forgoes the HTTP interface and goes directly to the server. In essence, server-side web archiving copies files straight from the server. As with transaction-based web archiving, it requires the collaboration and consent of the server’s owner. The main difficulty with this method is translating the copied scripts, database files, and templates into a usable archived website that can be easily navigated. The main benefit, however, is that it copies and archives parts of the site that are inaccessible to client-side crawlers.
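At its simplest, a server-side capture is a file copy. The Python sketch below copies a site's directory into a new timestamped folder. The snapshot_site function and the directory layout are assumptions for illustration, and the sketch deliberately leaves out the hard part described above: databases are not dumped, and the copied scripts and templates are not rendered into browsable pages.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot_site(source_root, archive_root):
    """Copy the whole site directory into a new timestamped snapshot folder.

    Returns the path of the snapshot. Raises FileExistsError if a snapshot
    with the same timestamp (to the second) already exists.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(archive_root) / stamp
    # copytree copies files as they sit on disk: raw scripts and templates,
    # not the HTML a visitor would have seen.
    shutil.copytree(source_root, target)
    return target
```

A call such as snapshot_site("/var/www/site", "/srv/archive") (both paths hypothetical) would produce a folder holding the site's source files as they stood at that moment, including files that no public link points to.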

Thanks to website archiving, important and even historical data from the Internet can be saved and preserved for future generations. For companies, however, archiving web content is often a legal requirement. Financial services firms, for example, are required by law to keep detailed records of all content that appeared on company websites, just as they need to archive all other forms of customer communication. Protecting against false claims is another common reason for archiving pages.

If you have a business and are looking for ways to archive your company’s website, whether for regulatory compliance or liability protection, you can turn to PageFreezer Software Inc. We offer innovative website archiving tools and electronic records management solutions capable of capturing complex content generated on the client side by JavaScript and AJAX frameworks, as well as password-protected sites. Contact us today to schedule a demo or request a quote.
