Google's gigantic library project
SPARC Open Access Newsletter, issue #81
January 2, 2005
by Peter Suber
Just as we were digesting the impact of Google Scholar (announced November 18) we had to start digesting Google's new and much larger project to digitize at least 15 million print books for free full-text searching and, in some cases, free full-text reading (announced December 14). 

Five major research libraries have agreed to loan Google books for the gigantic project:  Harvard, Stanford, University of Michigan, Oxford, and the New York Public Library (NYPL).  Google says that no more libraries are on its list at the moment, but it's always willing to hear from libraries with special collections that Google might crawl.

Some of the scanned books will be under copyright and some will not.  When copyrighted books come up in a search, Google will display a full citation and up to three passages of text containing the searchstring.  It will also link to nearby libraries where the book can be borrowed and to Amazon for users who would rather buy a copy.  For public-domain books, Google will display passages of text containing the searchstring and a link to the full-text book for reading.  When you reach the readable full-text, you'll find that Google does not allow downloading or printing.  Moreover, early reports suggest that these readable books will be image files, not text files, and hence not searchable outside the Google index unless you do your own OCR.  (Google is unlikely to offer full-text public-domain books in a more convenient form, since that would make them available for indexing in rival search engines.) 

To get all this content into its index, Google will digitize the volumes at its own expense.  At roughly $10 per volume, 15 million books will cost it $150 million.  The deal is non-exclusive, so that any other company with that kind of money could digitize the same books.  Yahoo and Microsoft may be considering it; the Internet Archive is already doing something similar (more below).  Google will earn money on the deal at least by bringing in new users, which will translate into greater ad revenue.  It may eventually place ads in its digital copies of the scanned books, but hasn't yet decided whether to do so.  Google will share ad revenue from copyrighted books with publishers.  But it will not, apparently, share revenue with participating libraries.  Google has applied for a patent on a method for providing "subscription-like access" to copyrighted content, which hints at another business model for covering its costs.

At least at first, books will rarely come up near the top of a hit list, if only because very few other sites will link to them.  Google hasn't yet announced a separate interface or relevancy algorithm for searching books, but it may have to develop at least one of them in order to attract enough book-searching traffic to repay its investment.  (It already has a special syntax; throw the word "book" into a search, and the hits from scanned books will be segregated for separate viewing.)

The five participating libraries will get free copies of the bits scanned from their books.  All of them plan to offer enhanced access to their own patrons, for example, printing and downloading of public-domain texts, and integration into the library catalogue.  A few news reports suggest that some of the libraries might provide the general public with OA to the full-texts.  But so far none of the participating libraries has explicitly said that it would do so.  I'm still unsure whether the Google contract even permits it.

Michigan is letting Google scan all 7 million of its books, excluding only some rare books that might be damaged by the scanning process.  The other libraries are only letting Google scan subsets of their collections and will open the gate further if they are happy with the experiment.  Oxford and NYPL are offering only public-domain books; Stanford is offering 2 million of its 8 million volumes; and Harvard is offering only 40,000 of its 15 million volumes.

Scanning Michigan's 7 million books will take about six years.  If that seems like a long time, consider that the Michigan collection occupies about 132 shelf-miles of books.  If Google ends up scanning the entirety of the Harvard or Stanford collection, let alone both, the job will take even longer.  Books will appear in the Google index roughly as they are scanned; you won't have to wait years to see the effect on your research.
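
As a rough check on these figures, here is a minimal Python sketch of the arithmetic.  The volume counts, the six-year estimate, and the $10 per volume cost come from the reports above.  The 365-day scanning year and the constant scanning rate are my own simplifying assumptions, and the function names are purely illustrative.

```python
# Figures quoted in this article.
MICHIGAN_VOLUMES = 7_000_000
MICHIGAN_SCAN_YEARS = 6
PROJECT_VOLUMES = 15_000_000
COST_PER_VOLUME_USD = 10

# Assumption (not from Google): scanning runs every day of the year.
DAYS_PER_YEAR = 365


def volumes_per_day(volumes, years, days_per_year=DAYS_PER_YEAR):
    """Average daily scanning rate implied by a volume count and a duration."""
    return volumes / (years * days_per_year)


def years_at_rate(volumes, volumes_per_year):
    """Years needed to scan `volumes` if the yearly rate stays constant."""
    return volumes / volumes_per_year


def total_cost_usd(volumes, cost_per_volume=COST_PER_VOLUME_USD):
    """Digitization cost at a flat per-volume price."""
    return volumes * cost_per_volume


if __name__ == "__main__":
    daily = volumes_per_day(MICHIGAN_VOLUMES, MICHIGAN_SCAN_YEARS)
    yearly = MICHIGAN_VOLUMES / MICHIGAN_SCAN_YEARS
    print(f"Implied Michigan rate: about {daily:,.0f} volumes per day")
    print(f"15 million volumes at that rate: about "
          f"{years_at_rate(PROJECT_VOLUMES, yearly):.1f} years")
    print(f"Cost of 15 million volumes: ${total_cost_usd(PROJECT_VOLUMES):,}")
```

On these assumptions the Michigan estimate implies roughly 3,200 volumes scanned per day, and a single operation working at that pace would need nearly 13 years to reach 15 million volumes.  Google may of course scan at several libraries in parallel, so treat the second number as an illustration of scale, not a prediction.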

This is the project that has been known in some circles as Project Ocean, ever since John Markoff used that term in the New York Times on February 1, 2004.  But Google is no longer using that name and, strangely, given the project's magnitude, Google hasn't given it a new name either.  It will simply be a part of Google Print --the largest part and the part extending the program from publishers to libraries.  The project is not yet integrated with Google Scholar, though integration would enhance both projects. 

The library project is breathtaking in its scope and cost, and revolutionary in its implications.  It's significant for half a dozen reasons.  I'm sure other reasons will soon be apparent to everyone.

* It will hugely expand the universe of free online books for reading and expand it even further for searching.  Even if the project were limited to Michigan's 7 million books, it would far exceed what most libraries conceive to be a core collection.  We don't know what it will do to teaching and research, let alone pleasure reading and autodidacticism.  But we can be sure that removing access barriers to collections of this magnitude and utility will change basic practices.  Because of its scale, this is a quantitative change that will bring qualitative changes in its wake.

* While a handful of governments and corporations had the money and --I contend-- the interest to undertake this project, none had stepped up to the plate.  Google was willing to spend big to make this happen, and it was willing before anyone else.  If there are financial risks, copyright thickets, and logistical problems, and there undoubtedly are, Google had the courage and vision to see that the risks were worth taking and the problems worth solving.  (This doesn't detract from earlier digitization projects from others, some of them very large; none is this large.)

* The project will give Google an unmatched critical mass of important texts for scholarly research.  That will attract researchers.  That will in turn increase the importance to researchers of having their content indexed by Google, through Google Scholar, CrossRef Search, or routine crawling.  There are two ways to make content more visible:  index it in the right tools, and draw more eyeballs to the tools that already index it.  Google has long since learned the secret of doing both at once, and this project will be a huge leap forward on both fronts.

* Now or soon, if you make your work OA, then Google will find it, crawl it, and add it to its index.  Hence, the eyeball-attracting critical mass it is developing also operates as an incentive for authors and publishers to provide OA to their work.

* This project makes copyrighted and revenue-producing books freely accessible to some degree online (at least for searching, and for reading relevant extracts) without antagonizing publishers.  If free online searching and sampling increase net sales for some kinds of books --already proved for many kinds of books-- then this project will bring this fact home to many more publishers. 

* It's now more important than ever to protect and expand the public domain.  Projects like this show vividly what is pirated from the public when the public domain is shrunk by retroactive extensions of the term of copyright.

Google library project home page
http://print.google.com/googleprint/library.html

Google press release on the library project, December 14, 2004
http://www.google.com/press/pressrel/print_library.html

Google Print FAQ, which now covers the library project
http://print.google.com/googleprint/about.html

Press releases from the five participating libraries:
--Harvard University Library
http://hul.harvard.edu/publications/041213news.html
--New York Public Library
http://www.nypl.org/press/google.cfm
--Oxford University Library
http://www.admin.ox.ac.uk/po/041214a.shtml
--Stanford University Libraries
http://www.stanford.edu/dept/news/pr/2004/pr-google-011205.html
--University of Michigan Library
http://www.umich.edu/news/index.html?Releases/2004/Dec04/library/index

Harvard University Library's FAQ on the project
http://hul.harvard.edu/publications/041213faq.html
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110304624758583096

The Google library project stimulated an orgy of press stories.  Here's a selection of the better accounts and comments.

Barbara Quint, Google's Library Project: Questions, Questions, Questions, Information Today, December 27, 2004.
http://www.infotoday.com/newsbreaks/nb041227-2.shtml
http://www.earlham.edu/~peters/fos/2004_12_26_fosblogarchive.html#a110424842501121205

John Blossom, Open Stacks: Pondering the Value of Copyrighted Content in a World of Online Archives, Commentary, December 20, 2004.
http://www.shore.com/commentary/newsanal/items/2004/20041220copyright.html
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110372695571456552

Carolyn Said, Revolutionary chapter: Google's ambitious book-scanning plan seen as key shift in paper-based culture, San Francisco Chronicle, December 20, 2004.
http://www.sfgate.com/cgi-bin/article.cgi?file=/chronicle/archive/2004/12/20/BUGROAD6QT1.DTL
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110372577690045956

Michael Gorman, Google and God's Mind, Los Angeles Times, December 17, 2004.
http://www.latimes.com/news/printedition/opinion/la-oe-nugorman17dec17,1,2263077.story
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110365375034144708

Also see this reply to Gorman:  Kevin Drum, Google and the Human Spirit, Washington Monthly, December 17, 2004.
http://www.washingtonmonthly.com/archives/individual/2004_12/005344.php
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110365375034144708

The Electronic Library, an unsigned editorial in the New York Times, December 21, 2004.
http://www.nytimes.com/2004/12/21/opinion/21tue2.html
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110364933349229470

Barbara Quint, Google and Research Libraries Launch Massive Digitization Project, Information Today, December 20, 2004.
http://www.infotoday.com/newsbreaks/nb041220-2.shtml
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110357458643994979

Rory Litwin, On Google's Monetization of Libraries, Library Juice, December 17, 2004.
http://libr.org/juice/issues/vol7/LJ_7.26.html#3
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110329796475841366

NPR has run two broadcasts on the project:   (1) "All Things Considered" on December 14 included a Michele Norris interview with Carol Brey-Casiano, president of the American Library Association, on the Google library project, and (2) "Talk of the Nation" on December 15 focused on the Google library plan and featured guests Michael Keller, head librarian at Stanford, and Brewster Kahle, founder of the Internet Archive.
http://www.npr.org/templates/story/story.php?storyId=4229570
http://www.npr.org/templates/story/story.php?storyId=4227895
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110320637373090865

Peter Grier and Amanda Paulson, Google plans giant online library stack, Christian Science Monitor, December 15, 2004.
http://www.csmonitor.com/2004/1215/p01s02-ussc.html
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110314974272144117

Janice McCallum, Google Scholar Flunking Relationships 101?, Commentary (the Shore Communications blog), December 14, 2004.
http://shore.com/commentary/weblogs/2004_12_01_m_archive.html#110303978628673016
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110312594436064426

Anon., Google to Digitize Library Book Holdings, Outsell Now, December 14, 2004.
http://now.outsellinc.com/now/2004/12/google_to_digit.html
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110312537294765004

Mike Wendland, U-M's entire library to be put on Google, Detroit Free Press, December 14, 2004.
http://www.freep.com/money/tech/mwend14e_20041214.htm
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110305562562452560

Gary Price, Google Partners with Oxford, Harvard & Others to Digitize Libraries, SearchDay, December 14, 2004.
http://searchenginewatch.com/searchday/article.php/3447411
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110303937095799369

Stephen M. Marks, Google To Scan Library Books, Harvard Crimson, December 14, 2004.
http://www.thecrimson.com/today/article505061.html
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110303750711027829

Anon., Harvard Libraries and Google announce pilot digitization project with potential benefits to scholars worldwide, Harvard University Gazette, December 14, 2004.
http://www.news.harvard.edu/gazette/daily/2004/12/13-google.html
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110303593453933607

Scott Carlson and Jeffrey Young, Google Will Digitize and Search Millions of Books From 5 Leading Research Libraries, Chronicle of Higher Education, December 14, 2004.
http://chronicle.com/free/2004/12/2004121401n.htm
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110303389823309586

John Markoff and Edward Wyatt, Google Is Adding Major Libraries to Its Database, New York Times, December 14, 2004.
http://www.nytimes.com/2004/12/14/technology/14google.html
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110303158480921408

* Postscript.  One day before the Google announcement, Brewster Kahle's Internet Archive (IA) announced a very similar project.  But just as the press began to pay attention, Google stole the spotlight and most journalists never returned to the IA story.  (The IA announcement is dated December 15 but was released on December 13.)  That's a shame, because the IA project is more progressive and revolutionary than the Google project.

The IA project will digitize more than one million books from a dozen libraries in five countries.  It's open to any library that would like to participate.  It's already begun and already has 27,000 books online with another 50,000 to come in the first quarter of 2005.  But above all, IA will offer full open access to the public-domain books in the collection.  Like Google, IA will pay the costs of digitization itself, and it will include copyrighted books alongside public-domain books.  IA will offer searching of its digital texts, even if not Google-quality searching.  However, it will open its files to crawling, including Google crawling, so that we will have the best of both worlds.

We should be careful when comparing the magnitude of the two projects.  IA is digitizing fewer books, although a million books would have been a major news story in any other news week.  But Google isn't providing full open access to any of its books.  Even when Google provides free online full-text reading, it will disable printing and downloading.  From the perspective of open access, therefore, the IA scale is much larger than Google's.

Internet Archive
http://www.archive.org/

IA Open-Access Text Archive
http://www.archive.org/texts/

IA press release on the Open-Access Text Archive
http://www.archive.org/iathreads/post-view.php?id=25361
http://www.earlham.edu/~peters/fos/2004_12_12_fosblogarchive.html#a110312395712458432

Mark Chillingworth, Internet Archive to build alternative to Google, Information World Review, December 21, 2004.
http://www.iwr.co.uk/IWR/1160176
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110363728526238561

Guy Dixon, The Race to Digitize the Print Universe, Globe and Mail, December 15, 2004.  On several large-scale Canadian digitization projects, including the IA project with the University of Toronto.
http://www.theglobeandmail.com/servlet/story/RTGAM.20041215.wxgoogle15/BNStory/Entertainment/
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110355933873227142

Anon., Internet Archive, Libraries Collaborate on Open-Access Text Archives, Library Journal, December 27, 2004.
http://www.libraryjournal.com/article/CA490132?display=NewsNews&industry=News&industryid=1986&verticalid=151
http://www.earlham.edu/~peters/fos/2004_12_19_fosblogarchive.html#a110389729676530915


----------

Read this issue online
http://dash.harvard.edu/bitstream/handle/1/3997163/suber_news81.html

SOAN is published and sponsored by the Scholarly Publishing and Academic Resources Coalition (SPARC).
http://www.arl.org/sparc/

Additional support is provided by Data Conversion Laboratory (DCL), experts in converting research documents to XML.
http://www.dclab.com/public_access.asp


==========

This is the SPARC Open Access Newsletter (ISSN 1546-7821), written by Peter Suber and published by SPARC.  The views I express in this newsletter are my own and do not necessarily reflect those of SPARC or other sponsors.

To unsubscribe, send any message (from the subscribed address) to <SPARC-OANews-off@arl.org>.

Please feel free to forward any issue of the newsletter to interested colleagues.  If you are reading a forwarded copy, see the instructions for subscribing at either of the next two sites below.

SPARC home page for the Open Access Newsletter and Open Access Forum
http://www.arl.org/sparc/publications/soan

Peter Suber's page of related information, including the newsletter editorial position
http://www.earlham.edu/~peters/fos/index.htm

Newsletter, archived back issues
http://www.earlham.edu/~peters/fos/newsletter/archive.htm

Forum, archived postings
https://mx2.arl.org/Lists/SOA-Forum/List.html

Conferences Related to the Open Access Movement
http://www.earlham.edu/~peters/fos/conf.htm

Timeline of the Open Access Movement
http://www.earlham.edu/~peters/fos/timeline.htm

Open Access Overview
http://www.earlham.edu/~peters/fos/overview.htm

Open Access News blog
http://www.earlham.edu/~peters/fos/fosblog.html

Peter Suber
http://www.earlham.edu/~peters
peter.suber@earlham.edu

SOAN is licensed under a Creative Commons Attribution 3.0 United States License.
http://creativecommons.org/licenses/by/3.0/us/

