Federal Government Information on the Web: Here Today … Where Tomorrow?

The Center for Research Libraries (CRL) devoted a recent two-day program to the vulnerability of digital government information. One of the highlights was a discussion of the special risks facing “born-digital” information published online by the federal government, led by Jim Jacobs, Data Services Librarian Emeritus of the University of California at San Diego.

His presentation, “Government Records and Information: Real Risks and Potential Losses,” came on the second day of CRL’s April 24-25 conference, “Leviathan: Libraries and Government Information in the Age of Big Data,” held at the University of Chicago’s Gleacher Center.

“No one knows … how much has been created or where it all is”

In an accompanying paper provided to attendees before the conference, Jacobs wrote of the challenges of even identifying the scope of the preservation problem. While a standard method of counting born-digital documents eludes us, Jacobs wrote, “we can certainly conclude that the production of born-digital government information is very, very much greater than the earlier production of printed government information.

“One might reasonably estimate that there are more born-digital government information items produced in a single year than all the two or three million non-digital government information items accumulated in the (Federal Depository Library Program) over 200 years” (emphases in original).

Libraries and other memory institutions have stepped forward with projects like “end of term” web harvesting and regular web crawls by the Internet Archive, but it’s impossible to assess the reach of these efforts without knowing the boundaries of the problem. “The simple fact is that no one knows how much born-digital US Federal government information has been created or where it all is,” Jacobs wrote.

Born Digital Documents & the FDLP

The Federal Depository Library Program (FDLP) preserved millions of print documents under a framework that provided clear responsibilities for creation (federal agencies), distribution (the Government Printing Office), and preservation (FDLP libraries).

But in the born-digital realm, agencies have more easily executed end-runs around GPO to effect a monopoly on their own distribution, and libraries have relied too much on government agencies to preserve their own born-digital information.

Digital Preservation Risks

That’s risky, Jacobs said in his presentation, for three reasons: it leaves preservation at the mercy of uncertain agency budgets, it places preservation in organizations whose missions usually don’t include it, and political actors in positions of responsibility may even have a stake in not preserving the information.

He pointed to the most recent “link rot” finding published by the Chesapeake Digital Preservation Group that 51 percent of the dot-gov URLs selected in their earliest survey in 2007-08 broke over the ensuing six years.

Jacobs outlined three models for preserving born-digital government information and gave examples of each: government working alone (the NARA model), government working with non-government partners (GPO/LOCKSS-USDOCS), and non-government entities working without government cooperation (Internet Archive). The ideal outcome is for government to cooperate with memory institutions, he said.

He also spoke of three models of document selection: broad web harvesting like that done by the Internet Archive; targeted selection, either narrowly focused like some Archive-It projects or title-driven like the Chesapeake group; and “digital deposit,” whereby agencies create preservable files and deposit them with memory institutions. We probably need a mix of all these strategies, Jacobs said.

‘Every library should participate in digital preservation’

In setting a framework for how we should proceed, Jacobs stressed that preservation and access should not be treated separately; they go together. Preservation should focus on different community needs. We should mix a provenance approach (which agency’s information should we preserve?) with a user services approach (what do our users need?) to build “unique collections for unique communities.”

Finally, Jacobs stressed that technology need not keep any library from participating in cooperative digital preservation efforts: not every library needs to build large data collections. Libraries can contribute in other ways, like metadata creation and item selection. “Every library should participate in digital preservation,” he said. The outcomes will add value to our libraries and provide important collections and services to our users.

For more information, here’s a link to more background on his presentation. Video and slides from all the sessions are on the conference website.

An earlier version of this article was posted on Kevin’s blog, GovDocsGuy.