Generally speaking, whatever you post on the Internet stays on the Internet. Nothing is ever truly deleted online. Take, for example, all of the GeoCities online communities that were removed by Yahoo! in 2009. While they are no longer live, these archived websites – with their glittering text and late 1990s/early 2000s-era gifs – still remain, alive and kicking, on a popular Internet archiving service.
But what is website archiving? In a nutshell, website archiving is similar to the traditional archiving of documents, only digital. The process is the same: archivists select which information to save, then store and preserve it in an archive that is made accessible to the public. This work is typically overseen by web archivists.
Since the Internet is so huge, website archiving organizations use automated processes to collect websites. Using specially designed software known as crawlers (much like those Google uses to index pages), web archivists harvest websites from the live Internet and preserve them as snapshots of information at a particular point in time. These archived websites can then be navigated as if they were still live. The best-known example of this is undoubtedly the Wayback Machine, which has saved around 357 billion web pages over time.
Types of Web Archiving
There are three main ways of archiving content from the Internet: client-side web archiving, transaction-based web archiving, and server-side web archiving.
Client-Side Web Archiving
Relatively simple and scalable, client-side web archiving is the most popular method of web archiving. This method can archive any website that is freely available on the Internet. The crawlers in client-side web archiving imitate the way that users interact with websites. This usually means starting from a seed page, then following links to internal pages and fetching them in turn. The crawlers collect an array of web material – from documents and text pages to photos, audio, and video files. A crawl stops only once it reaches the boundary of the domain in which it is operating.
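The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any real archiving tool; all function and variable names here are assumptions made for the example.

```python
# Minimal sketch of a client-side crawl: start from a seed URL, follow
# links to internal pages, and stop at the domain boundary.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for every link found in an HTML page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def same_domain(url, seed_url):
    """Domain-boundary check: is this URL on the same host as the seed?"""
    return urlparse(url).netloc == urlparse(seed_url).netloc


def crawl(seed_url, max_pages=50):
    """Breadth-first crawl from the seed, keeping one snapshot per page."""
    queue, seen, snapshots = [seed_url], set(), {}
    while queue and len(snapshots) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable page; skip it
        snapshots[url] = html  # the archived copy at this point in time
        for link in extract_links(html, url):
            if same_domain(link, seed_url):  # stay inside the domain
                queue.append(link)
    return snapshots
```

Real crawlers add much more – politeness delays, robots.txt handling, deduplication – but the core idea is exactly this queue of same-domain links.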
Transaction-Based Web Archiving
Transaction-based web archiving operates on the server side: it requires access to the web server hosting the content, and therefore the collaboration and agreement of the server’s owner. In this approach, content that has never been viewed is not archived; only web content that was viewed, even just once, makes it into the archive. With this method, it is possible to record exactly what data was seen and when.
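One way to picture this is as a recording layer wrapped around the web application itself. The sketch below uses Python's WSGI convention purely as an illustration; the class name and the record format are assumptions, not part of any standard archiving product.

```python
# Sketch of transaction-based archiving as WSGI middleware: every
# response actually served to a visitor is recorded with a timestamp,
# so the archive holds exactly what was seen and when.
import time


class TransactionArchiver:
    def __init__(self, app):
        self.app = app     # the site's real WSGI application
        self.records = []  # one (timestamp, path, body) tuple per transaction

    def __call__(self, environ, start_response):
        chunks = []
        for chunk in self.app(environ, start_response):
            chunks.append(chunk)
            yield chunk  # pass each chunk through to the visitor unchanged
        # Only content that was actually served reaches this point, which
        # is why never-viewed pages are absent from the archive.
        self.records.append(
            (time.time(), environ.get("PATH_INFO", ""), b"".join(chunks))
        )
```

The key property falls out of the design: pages nobody requests never pass through the middleware, so they are never recorded.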
Server-Side Web Archiving
This method of web archiving forgoes the HTTP interface and goes directly to the server: server-side web archiving copies files straight from the server’s filesystem. As with transaction-based web archiving, it requires the collaboration and consent of the server’s owner. The challenge with this method is translating the copied scripts, database files, and templates into a usable archived website that can be easily navigated. Its main benefit, however, is that it captures parts of the site that are inaccessible to client-side crawlers.
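At its simplest, the copying step amounts to snapshotting the site's files and database dump into a dated folder. The sketch below assumes the archivist already has filesystem access; the paths and function name are illustrative.

```python
# Sketch of server-side archiving: copy the document root (scripts,
# templates, uploads) and a database dump into a dated snapshot folder.
import shutil
from datetime import date
from pathlib import Path


def archive_server(doc_root, db_dump, archive_base):
    """Copy the site's files and database dump into a dated snapshot."""
    snapshot = Path(archive_base) / date.today().isoformat()
    # Raw scripts and templates – content a client-side crawler never sees.
    shutil.copytree(doc_root, snapshot / "site")
    shutil.copy2(db_dump, snapshot / "database.sql")
    return snapshot
```

Note that this captures the raw materials only; turning them back into a browsable site is exactly the open problem the section describes.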
Thanks to website archiving, important, even historical, data from the Internet can be saved and preserved for future generations. For companies, however, archiving web content is often a legal requirement. Financial services firms, for example, are required by law to keep detailed records of all content that appeared on their websites, just as they must archive all other forms of customer communication. Protecting against false claims is another common reason for archiving pages.