Knowledge lost

When the World Wide Web began its transformation of information and society in the 1990s, one area of great concern was the preservation of digital content. The ephemerality of the medium meant that information, and arguably knowledge, was created and lost at an alarming rate. This problem struck a chord with me, and I drafted this piece and posted it on an early website around 1999. The themes it presents are as relevant today as they were back then, so I have republished it here in its original form.

At one point, someone said that the average life of a web page was 45 days. With content turnover every month-and-a-half, did anyone bother to keep what was written for future generations? There are a few exceptions (the Internet Archive comes to mind), but by and large the answer is, and has been, simply “no.”

The rapid-fire philosophy of the decade railed against stale pages and dismissed static content as bad content, and with dot-com turnover, few “new” things lasted more than a year. Even in higher education, where things seem to go unchanged for years, if not decades, numerous electronic journals appeared but few survived. When the faculty member or student group lost interest, the disk died, the grant ran out, or the system was replaced, the content disappeared with no record beyond a few broken links. Libraries, long considered the archivists of academe, had no way of capturing the intellectual capital that was slipping away. The task was simply too large, too costly, and too far-flung to deal with.

Of course, not all content was worth saving, though arguably some of it was. But how does one define value when the volume of content is so large and, in many cases, boundless? What exactly constitutes a page of content in the context of the Web? Are links to pages part of the content? Should the linked pages be collected? What about the different media types that are embedded within the code of the page? Should they be collected as well? What if they are dynamic and generated on the fly? Can one legally do anything at all?

The problem is immensely complex on both the legal and technical levels. So instead of saving, we tossed away. The problem was simply too difficult to tackle, and most people didn’t care. We lost a decade of history, and if we don’t do something soon, we’ll lose a decade more.

From my perspective, tradition has dictated that determining what to keep for later generations typically falls into one of two categories: keep everything, or select materials based on tightly defined criteria. In a material-based world, keeping everything is relatively easy. Take the Library of Congress, for example. Nearly everything that is published in the U.S. makes its way into the Library’s collection. Although the save-everything model seems easy from a selection standpoint, both storage and retrieval become huge problems. What good is a complete collection if no one can find anything? And if one can find what one is looking for, how does one get the material out of the collection and into the hands of the individual?

Because virtually no one can afford to store everything (including the Library of Congress), smaller specialized collections evolve. Libraries may specialize in regional or academic topical areas, or focus on serving a particular community. Bibliographers and archivists, whether corporate, government, or academic, select materials to save based on narrowly defined areas of interest such as legal documents, financial records, or institutional materials. The problem here is determining what will be important in the future. Because one cannot afford to save everything, selection becomes absolutely critical.

Both of these models have succeeded thus far because they are based on tangible assets — books, records, maps, photographs, etc. — that are published at a fixed point in time. But what about the Web? Storage is a practically unfathomable problem, and unlike the physical world, the Web is constantly changing. Compounding these issues is the fact that a single piece of content may actually contain materials from several different sources located anywhere on the globe. In essence, a Web page can be considered nothing more than a framework for collecting and organizing knowledge. Articles (such as this one) may be encapsulated within someone else’s document not as a citation, but as the complete work. Adding to the seemingly hopeless problem is the fact that anyone can become a publisher in this medium, so there are few institutional entities to turn to for content. Content is published by anyone, from anywhere, at any time. As a result, the non-linear, stateless, hypertext fabric of the Web erodes the basic notion of collection development and archiving.

Over the years, there has been considerable research into how to “collect” the Web, and I am certain that we’re on the brink of coming up with the first layer of a solution. What is interesting to me is the rise of peer-to-peer networking. When one moves away from single-point publishing and toward a model where the network itself ensures replication, access, and redundancy, content may be able to survive failures in technology, funding, or interest. As illegal music sharing has demonstrated countless times, publishing content into the Web via peer-to-peer, rather than on the Web as a static page, ensures longevity: once the content is in the network, it is practically impossible to remove.
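To make the replication idea concrete, here is a minimal sketch, written today rather than in 1999, of content being addressed by its hash and copied to several peers. The peer and function names are invented for illustration; this is not any real protocol.

```python
# A toy, in-memory illustration of peer-to-peer persistence: content is keyed
# by its SHA-256 digest and replicated to several peers, so losing any one
# host (a dead disk, an expired grant) does not lose the content itself.
import hashlib
import random


class Peer:
    """A peer holding content keyed by its content address."""

    def __init__(self, name: str):
        self.name = name
        self.store: dict[str, bytes] = {}


def publish(content: bytes, peers: list[Peer], replicas: int = 3) -> str:
    """Copy content to several peers and return its content address."""
    address = hashlib.sha256(content).hexdigest()
    for peer in random.sample(peers, k=min(replicas, len(peers))):
        peer.store[address] = content
    return address


def fetch(address: str, peers: list[Peer]) -> bytes | None:
    """Retrieve content from whichever peer still holds a copy."""
    for peer in peers:
        if address in peer.store:
            return peer.store[address]
    return None


peers = [Peer(f"peer-{i}") for i in range(5)]
address = publish(b"An early web essay worth keeping.", peers)

peers.pop(0)  # one host disappears; the replicas survive elsewhere
assert fetch(address, peers) is not None
```

The point is not the code but the property it demonstrates: once enough copies exist in the network, no single failure of technology, funding, or interest can erase the work.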

So let’s assume for the moment that the storage breakthrough I am predicting comes in some variation of peer-to-peer file sharing, where we no longer worry about collecting content because material published into the Web will exist forever, replicated on servers located throughout the world. If content will exist forever within the very fabric of the Web, then what does one archive? The traditional notion of archiving has been collecting content and storing it in a central place for retrieval at a later date. If the Web itself is the storage medium, then how does archiving change?

In this utopian view of the future, the role of the archivist becomes one of the most critical functions within the digital society. Smart search engines, context-based indexing, and page harvesting will continue to have a role, but the ability to quickly find content will depend on metadata. An archivist will become a jack-of-all-trades bibliographer who will define archives not by the materials on hand, but by the specialized collections of metadata and content references that can be used to intelligently retrieve related materials from the digital space. In essence, the original Yahoo! model of indexing the Web may be the role of the archivist in the future. What will make the archivist different from a bibliographer will be his/her ability and responsibility to ensure that “archived” content is seeded into the Web, and that it persists over time through replication across the network.
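As a rough, present-day sketch of such an archive of metadata and content references, the record below uses field names invented for illustration, loosely in the spirit of library metadata rather than any actual standard; the archive holds descriptions and pointers, not the materials themselves.

```python
# A hypothetical catalogue entry: the archive keeps curated descriptions and
# references to content addresses rather than storing the materials directly.
from dataclasses import dataclass, field


@dataclass
class ArchiveRecord:
    title: str
    creator: str
    subjects: list[str]            # curated subject terms chosen by the archivist
    content_address: str           # where the seeded content lives in the network
    related: list[str] = field(default_factory=list)


def find_by_subject(records: list[ArchiveRecord], subject: str) -> list[ArchiveRecord]:
    """Retrieve records whose curated subject terms match a query."""
    return [r for r in records if subject in r.subjects]


catalogue = [
    ArchiveRecord(
        title="An early essay on digital preservation",
        creator="unknown",
        subjects=["digital preservation", "web archiving"],
        content_address="sha256:0f3a",  # placeholder content address
    ),
]

print(find_by_subject(catalogue, "web archiving"))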

Are there any indicators that this new role is emerging? I would have to argue that the early days of Yahoo! provided a glimpse into this future. More recently, blogs seem to be shaping this role. Many blog authors organize Web content around particular themes, not unlike bibliographers. The blog author who references and glosses Web content is, in some sense, an archivist: if the original content disappears, the author’s own commentary and organizing principles may be used to reconstruct elements of the original material. Because content of “value” may be referenced several times by many different blogs, the question of the original content’s longevity begins to diminish.

The storage question becomes an interesting one in this present-day context. Because only references are stored within blogs, if the original content disappears, one can reconstruct the original material only from excerpts and discussion. From an archival standpoint, this is not enough. The original material must be maintained for future generations. Enter Google. By now, most people have discovered that if one searches for content on Google and the original material cannot be found, Google’s cached version may still exist, allowing one to access the content even though the original has been destroyed or lost. In essence, when content is indexed, it is published into the Web and into its content fabric for a set period of time. If the cache could be stored over time and distributed throughout the Web, then content would arguably last forever.
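A minimal sketch of that fallback idea, using invented names and a toy in-memory store; it is not a description of how Google’s crawler or cache actually works.

```python
# Indexing a page also caches a dated snapshot; retrieval prefers the live
# page and falls back to the snapshot when the original has disappeared.
from datetime import datetime, timezone

live_pages: dict[str, str] = {}              # url -> current content
cache: dict[str, tuple[str, datetime]] = {}  # url -> (snapshot, when cached)


def index(url: str, content: str) -> None:
    """Record the live page and keep a dated snapshot of it."""
    live_pages[url] = content
    cache[url] = (content, datetime.now(timezone.utc))


def retrieve(url: str) -> str | None:
    """Return the live page, or the cached snapshot if the page is gone."""
    if url in live_pages:
        return live_pages[url]
    if url in cache:
        snapshot, cached_at = cache[url]
        return f"[cached {cached_at:%Y-%m-%d}] {snapshot}"
    return None


index("http://example.edu/ejournal/issue1", "A student-run electronic journal.")
del live_pages["http://example.edu/ejournal/issue1"]   # the server is retired

print(retrieve("http://example.edu/ejournal/issue1"))  # the snapshot survives
```

The open question the paragraph raises is exactly the missing piece here: this cache lives in one place, and only by distributing it across the network over time would a convenience become an archive.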

But what about intellectual property and copyright? That’s a whole different matter of seemingly unending complexity. In a nutshell, the very notion of these two legal areas will need to change, as there are no political boundaries when content is published into the Web. Content is not free, but I believe we’re going to have to come to grips with the idea of content as commodity.