Content with Digitization

PUBLISHED: November 7, 2007

It’s been a year since the University of Virginia inked a deal with Google to join the Google Books Library Project. The search engine giant, not content to merely index everything on the internet, is working with a dozen libraries to scan in every page of every book in their collections. This massive undertaking has just recently borne fruit here at UVa, with the first of UVa’s books going online in the past few weeks.

Anthony Grafton considers the merits of this and related undertakings in “Future Reading: Digitization and its Discontents” in the current New Yorker. Though Grafton generally speaks well of this global Library of Alexandria, he’s wary of substituting electrons for ink. He expresses some familiar concerns: it’s impossible to scan ephemera like the scent of the pages, optical character recognition is imperfect, this is another outlet for the West’s cultural imperialism, etc.

The author’s worries aren’t without merit. (We’ve experienced some common character recognition troubles in our own efforts to make VQR’s archives available online. I suspect that Patricia Rowe Willrich didn’t actually write that Wallace Stegner is “in his 8o’s,” though I also suspect that the digit/letter transposition represents no great logical puzzle for our readers.) But one of his more serious concerns, the fragmentation of archives across thousands of unrelated Internet repositories, is perhaps the easiest to address. He writes:

The supposed universal library, then, will be not a seamless mass of books, easily linked and studied together, but a patchwork of interfaces and databases…. Soon, the present will become overwhelmingly accessible, but a great deal of older material may never coalesce into a single database. Neither Google nor anyone else will fuse the proprietary databases of early books and the local systems created by individual archives into one accessible store of information. Though the distant past will be more available, in a technical sense, than ever before, once it is captured and preserved as a vast, disjointed mosaic it may recede ever more rapidly from our collective attention.

If that were true, it would be unfortunate, not because it would be a step backwards (it’s far easier to hop from website to website than from library to library), but because it would be a failure to embrace the full potential of the medium. Fortunately, in these days of “Web 2.0” (bingo!), there’s no great danger of that. The Online Computer Library Center’s WorldCat, for instance, exists precisely to pool its member libraries’ collection data and syndicate it to third parties, including Google.

And there are the burgeoning microformat standards, which allow metadata to be embedded within content for automatic parsing by user agents and search engines alike. The XFN standard for social networks, the hCard standard for address data, and the hCalendar standard for event data are all based on open standards and embeddable as semantic XHTML. From geotagged photos to embedded Creative Commons licenses, the 450 million pieces of microformatted data on the web represent an enormous amount of information ripe for the aggregating. VQR has been using microformats whenever possible, whether explicit (embedding Creative Commons license data) or implicit (adhering to the definition list standard in our “Business of the Book” transcript), and our web presence is all the richer for it.
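To make the idea of "automatic parsing" concrete, here is a minimal sketch of how a user agent might pull hCard properties out of XHTML. It uses only Python's standard library. The sample markup, the names in it, the `HCardParser` class, and the `extract_hcards` helper are all invented for illustration, and only three hCard properties (`fn`, `org`, `locality`) are handled; a real aggregator would cover far more of the specification.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the person and library are invented.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Library</span>,
  <span class="locality">Charlottesville</span>
</div>
"""

# Assumption: only a small subset of hCard properties is recognized here.
HCARD_PROPERTIES = {"fn", "org", "locality"}


class HCardParser(HTMLParser):
    """Collects one dict per element whose class list contains "vcard".

    Assumes well-formed XHTML: every start tag has a matching end tag
    (or is self-closed), so nesting depth can be tracked by counting.
    """

    def __init__(self):
        super().__init__()
        self.cards = []
        self._depth = 0   # 0 means we are outside any vcard
        self._stack = []  # property name (or None) per open child element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "vcard" in classes:
                self.cards.append({})
                self._depth = 1
            return
        self._depth += 1
        prop = next((c for c in classes if c in HCARD_PROPERTIES), None)
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if self._depth == 0:
            return
        self._depth -= 1
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the nearest enclosing recognized property.
        prop = next((p for p in reversed(self._stack) if p), None)
        text = data.strip()
        if prop and text:
            card = self.cards[-1]
            card[prop] = (card.get(prop, "") + " " + text).strip()


def extract_hcards(markup):
    """Return a list of dicts, one per hCard found in the markup."""
    parser = HCardParser()
    parser.feed(markup)
    parser.close()
    return parser.cards


if __name__ == "__main__":
    for card in extract_hcards(SAMPLE):
        print(card)
```

The point of the sketch is that nothing about the page needs to change for human readers: the class names ride along in ordinary markup, and any program that knows the convention can harvest them.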

Google has been a leader in opening up their own application programming interface (API) to allow their data and services to be invoked from any web page, precisely the sort of thing that would allow their digital book collection to join the microformat web. Dan Cohen recently made his own pitch for a Google Books API, while Alexis Turner has found tantalizing evidence that Google is already sharing their book data with OCLC’s WorldCat. If Google isn’t already in the process of becoming a part of a seamless global electronic library, it’s something they could do with minimal effort.

Any institution that really wants to share their digital book collection should find no obstacle to doing so, whether by participating in a WorldCat-type program or simply tagging each item with microformatted metadata. The global digital library will organize itself.

Waldo Jaquith

Waldo Jaquith is a graduate of the University of Virginia and worked for the Virginia Quarterly Review as web editor. He was a News Challenge Fellow with the John S. and James L. Knight Foundation.
