The case for OAI in the age of Google
SPARC Open Access Newsletter, issue #73
May 3, 2004
by Peter Suber
Why don't more faculty deposit their eprints in open-access, OAI-compliant (OA-OAI) archives?  This is a mystery.  Two explanations we can rule out right away are opposition to open access and opposition to OAI metadata sharing.  These never come up when faculty are asked about their archiving inertia, which only deepens the mystery.

When asked about archiving inertia, some faculty say that putting an eprint on a personal web site is just as good as putting it in an OAI-compliant archive.  Google will find an eprint on a personal web site and make it visible to those who might need it for their research.  Let's look at this one more closely.  *Is* Google just as good?  How strong is the case for OAI archiving in the age of Google?

For this purpose, let me use the name "Google" to represent not only Google itself but any Google rival or future iteration of Google that improves on Google's famously effective relevancy algorithm and wide scope.  In short, let "Google" be our name for the state of the art in indexing by mainstream search engines.

So, is Google good enough?  If not, why not? 

(1) If we only care about open access itself, then it's true that putting an eprint on a personal or institutional web site is good enough.  It's open access. 

Against this, the Bethesda and Berlin definitions of "open access" require deposit in a certain kind of repository.  But I've argued (for example in SOAN for 8/4/03) that this is a mistake.  It confuses OA itself with one vehicle for delivering OA or one enhancement to literature that is already OA through a different vehicle.  The BOAI was more accurate in making deposit or archiving one of the means to OA, not part of the concept or definition of OA.  If we're talking about OA itself, then an eprint on an author's web site can be OA.

How should we define "open access"?  (SOAN for 8/4/03)

To forestall objections, let me add that I've also argued (for example in SOAN for 3/2/04) that OA journals should not rest with making their articles OA but should also deposit them in OAI-compliant archives.  Archiving is desirable both for creating OA and for enhancing literature that is already OA.  But that doesn't make it part of the definition of OA.

Top 10 priorities for the OAI community (SOAN for 3/2/04)
(See priority #3.)

(2) The OA-OAI proponent might concede that eprints on personal web sites can be OA.  "But OA-OAI archiving enhances visibility more than Google indexing does."

The Google reply:  This may have been true once, but it's less true or untrue today.  There are two reasons why:  Google is very good and getting better, and many more people turn to Google before they turn to OAI search tools.  The second reason is peculiar.  It means that Google's popularity gives it one kind of visibility-increasing advantage even over superior search tools --if there could be superior search tools under our stipulations.  Even if OAI tools would do a better job than Google if they were as popular as Google, Google's surpassing popularity gives it a self-nourishing advantage.  (BTW, this same self-nourishing advantage should help the real Google withstand the coming onslaught from an improved Yahoo and an imminent Microsoft entry.)

(3) This actually answers another argument that might be made for OAI archiving, but let's make the argument explicit anyway.  "Scholars doing serious scholarly research look in specialized scholarly tools and resources before they look in Google." 

The Google reply:  again, this might have been true once, and perhaps it ought to be true now, but either it's becoming untrue or it's already untrue.

Working researchers certainly do use Google even if they also use specialized scholarly tools.  Moreover, unfortunately, in the period before scholars used Google for serious research, they weren't using OAI tools instead.  Google and OAI tools are both rising in usage.

Two years ago (April 2002) a study by DK Associates showed that professional analytic and organic chemists turned first to ChemWeb and second to Google.  It's impressive that Google occupied such an exalted position among such serious researchers that early in its evolution.  The same study showed that chemists in management and development positions used Google first and ChemWeb second.

Since then, Google has improved its algorithms, its index size, and its popularity --and Elsevier has decided to discontinue ChemWeb.  I haven't seen a more recent study, but I wouldn't be surprised if Google was #1 among working chemists today. 

In his February 2004 keynote at the NFAIS annual meeting (p. 8), John Regazzi reported, "In a survey for this lecture, librarians and scientists were asked to name the top scientific and medical search resources that they use or are aware of.  The difference is startling.  Librarians named Science Direct, ISI Web of Science, and Medline, while scientists named Google, Yahoo, and PubMed (librarians also named PubMed)."  Regazzi is Elsevier's Managing Director of Market Development.

(Thanks to Carol Tenopir for citing the Regazzi lecture in her article, "Is Google the Competition?" Library Journal, April 1, 2004, and to Randy Reichert for citing it in an STLQ blog posting, April 22, 2004.)

(4) "Archiving will give an eprint a permanent or persistent URL."  Compared to eprints on personal web sites, eprints in OAI archives rarely move.  When scholars change institutions or retire, they usually change web sites, which breaks the links that point to their work, e.g. in search engines, bibliographies, footnotes, and other indices around the world.  This is a reason to favor archives over personal web sites.

Google reply:  True, but Google has a large and useful cache that greatly mitigates the damage of link rot. 

Archiving will give an eprint other kinds of longevity, not just URL longevity.  Those who maintain OAI-compliant repositories take steps to assure long-term access and preservation.  Maintainers of personal web pages rarely take these steps, and regular back-ups are not enough.  This is also true.  In fact, it's the main reason why the Bethesda and Berlin statements wanted to make archiving an essential part of the provision of OA.  (Again, I agree that taking steps toward preservation is valuable, even critical; I only object to making preservation part of the definition of OA rather than a valuable, even critical, enhancement of content that might already be OA.)

(5) "OAI-compliant searching tools refresh their indices faster than Google." 

Google reply:  But this is not quite true.  Google refreshes its index for different kinds of content at different rates, and assigns a slow rate to most scholarly pages.  But eprints at sites that Google already rates highly are refreshed at a much faster rate.  On the other side, the refresh rate at OAI-compliant data services is up to the service providers.  At least this means that when we want to refresh the index often, we can do so, and we needn't hire expensive experts in "search engine optimization" in order to scam the Google index in a way that might not work next week.

(6) "OAI tools rest on a standardized metadata schema and therefore support field searching (e.g. on 'author' or 'title')."

Google reply:  True, but the Google syntax does a lot of this and over time will do a lot more.

Here's a variation on the OAI argument:  if users search for articles by their citations, rather than by content-based keywords, then OAI tools will help them more than Google will.  I owe this argument to the EPrints Handbook.

The Google reply:  It's true that OAI tools will provide better visibility to those who search by citations.  But talented Google searchers will prefer to search by content-based keywords, not by citations.  If they do, then they will likely find the same articles by a different route, though they will be combined with all the other articles that also satisfy the keywords.  Insofar as the size of the hit list is a problem, see the next OAI argument.
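The difference between the field searching in argument (6) and plain keyword matching can be sketched in a few lines of Python.  This is only an illustration, not the code of any actual OAI search service.  The two sample records, the "example.org" identifiers, and the helper names parse_records, field_search, and keyword_search are all made up for the example; the XML namespaces are the standard ones for OAI-PMH and unqualified Dublin Core.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written fragment shaped like a ListRecords response.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Black holes and event horizons</dc:title>
          <dc:creator>Anderson, John</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Anderson localization in disordered media</dc:title>
          <dc:creator>Agnew, Spiro</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Turn each <record> into a dict mapping field name to a list of values."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        fields = {"identifier": [identifier]}
        for name in ("title", "creator"):
            fields[name] = [el.text or "" for el in rec.iterfind(f".//dc:{name}", NS)]
        records.append(fields)
    return records


def field_search(records, field, term):
    """Match the term only inside one named metadata field (case-insensitive)."""
    term = term.lower()
    return [r for r in records if any(term in v.lower() for v in r.get(field, []))]


def keyword_search(records, term):
    """Match the term anywhere in the record, as a plain keyword index would."""
    term = term.lower()
    return [r for r in records
            if any(term in v.lower() for values in r.values() for v in values)]


if __name__ == "__main__":
    records = parse_records(SAMPLE_RESPONSE)
    by_author = field_search(records, "creator", "anderson")
    anywhere = keyword_search(records, "anderson")
    print("creator field:", [r["identifier"][0] for r in by_author])
    print("any field:    ", [r["identifier"][0] for r in anywhere])
```

In this toy data, the field search on "creator" returns only the paper written by Anderson, while the keyword search also returns the paper that merely has "Anderson" in its title.  That is the gap the Google reply says keyword-savvy searchers can usually work around.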

(7)  "OAI archiving reduces information overload."  When you search across OAI-compliant archives for research literature, you find only research literature.  But when you search in Google, you get commodities with the same names, popular literature on scientific topics, scientific name-dropping, crackpot hallucinations, and much more that you definitely don't want. 

Google reply:  This is true, but it overlooks the Google relevancy algorithm.  In Google, you may get more hits than you could ever scan, and many of them will be worse than useless, but Google's PageRank algorithm does a pretty good job of putting the ones you want near the top.  Just as it doesn't matter how deep the ocean is, as long as you can swim, it doesn't matter how many hits your search returns, as long as the ones you want float to the top.  Moreover, skillful users know how to tweak their search strings to narrow the results and improve their relevancy.  Finally, remember, the Google algorithm (in fact and ex hypothesi) is improving all the time.  We don't have to say that the Google algorithm is perfect, merely that a good algorithm can neutralize much of the advantage of a smaller or more focused index and that this one is good and getting better.

Judge for yourself.  Here are some terms from different academic fields and their Google hit tallies as of April 18, 2004.  Run some of them and see whether any non-academic sites make it near the top of the list.  Then tweak the search to refine the list.  "Poincare conjecture" (2.8 thousand), "third-wave feminism" (3.9 thousand), "proto Indo-European" (12 thousand), "categorical imperative" (25.4 thousand), "valence electron" (27.2 thousand), "battle of Hastings" (39.7 thousand), "collateral estoppel" (46.9 thousand), "obsessive compulsive" (304 thousand), "black hole" (2.7 million), "inflation" (4.9 million), and "protein" (26.8 million).  The general terms toward the end of the list get the most hits.  But it's easy to conjoin them with other terms in order to reduce the hit list and improve relevancy.  For example, try "black hole" plus "event horizon" (41.8 thousand), "inflation" plus "junk bond" (6.6 thousand), or "protein" plus "chirality" (47.8 thousand).

On the other side, any improvements that come to the Google algorithm could also in principle come to the OAI search tools.  That would give the OAI tools a twofold strategy for reducing information overload --intelligent sorting and smaller or more focused indices.  But even then, Google could claim a twofold strategy for finding what you want --the same intelligent sorting but yoked to a larger and more wide-ranging index.  In short, the same small OAI indices that some cite as an advantage in reducing overload can always be seen as limitations on the search for what you want.

Here's a variation on the same OAI argument:  "If you're searching for an unusual author name or term in Google, you'll probably find what you want.  But if you're searching for a common term or name, then OAI searches will probably shorten your search."

I owe this argument to a participant in David Prosser's workshop on filling OAI archives at the CERN OAI meeting in February --a participant whose name unfortunately I do not know.  If you are searching for "John Anderson", "piano", or "chess", Google will be less useful than if you are searching for "Spiro Agnew", "sackbut" or "43-man squamish". 

(8) I'm out of OAI virtues that might surpass Google virtues.  Are there any Google virtues that might surpass OAI virtues?  Here's one:  a gigantic index.  But as we just saw, this advantage competes with the advantage of the smaller and more focused OAI indices.  For some searches, a wide scope (plus a good relevancy algorithm) is more useful than a manageable hit list (plus a good inclusion policy), while for other searches the reverse is true.

Another place where Google has the advantage is full-text indexing.  So far, OAI tools only search metadata.  The very welcome OAI reply is that full-text indexing is coming.  For one approach to it, see the work on the OA-X protocol.

* Sub-total.  For every OAI virtue, there is some Google counterpart.  This doesn't mean that the Google counterparts are superior or even equivalent.  That will depend on variables such as your search skill, your search goal, and the year (remember, what we're calling Google is always improving). 

I know you want me to choose between them but I'm not going to do it.  If their merits really depend on your needs and circumstances, however, then this is already a kind of victory for Google, at least insofar as it means that putting an eprint on your personal web site won't *always* be worse, or won't be *much* worse, than depositing it in an OA-OAI archive.  (If you're sorry that I'm not choosing between them, then here's a clue to my personal position:  It was very difficult to bring myself to write out the previous sentence.)

Note how we have confirmed the wisdom of a general practice within the OA movement.  If we provide OA to our eprints, then services to index and preserve them will come along after the fact.  Depositing eprints in OAI-compliant archives makes those eprints fodder for all future OAI-compliant data services.  Depositing eprints on a personal web site makes them fodder for all future iterations and rivals of Google.  We don't have to wait for these services to emerge, or to reach a certain level of adequacy, before we provide OA to our eprints.  On the contrary, we should provide OA to our work right now and let evolving data services compete to improve upon the visibility and longevity of our work for the rest of time.

OA-OAI archiving has all the virtues we've always seen in it, especially for metadata sharing and interoperability.  Google's strengths don't subtract from that, and won't even as they continue to grow.

Notice that some of the virtues of OA-OAI archiving --such as persistent URLs and preservation-- have little to do with searching and visibility and nothing to do with metadata harvesting or interoperability.  But this is certainly not a flaw in OA-OAI archiving.  On the contrary; it means that it has virtues beyond providing OA and content visibility.

But this is key:  OA-OAI archiving and Google indexing are completely compatible.  We can do both, and we should.  That's the main reason why I'm not going to choose between them. 

If the OA-OAI archive where you might want to deposit your eprint is in the deep web, then the Google crawler would not normally find it.  To get the benefit of Google indexing on top of OAI indexing, you'd have to deposit the eprint in an archive and put another copy somewhere on the surface web, such as your personal web site. 

An exciting series of new developments is greatly improving on the compatibility between OAI archiving and Google indexing.  Basically, they give you the benefit of both with only a single deposit. 

For example, Yahoo and OAIster have struck a deal by which OAIster feeds Yahoo the rich metadata it harvests from its large set of OAI-compliant archives.  The content in those archives is still searchable with OAI tools, such as OAIster itself, but is now also searchable through Yahoo.  Yahoo gains some new content and new metadata for some old content.  It is also spared the need to crawl an increasingly large and useful corpus of literature.  OAIster gains a new layer of Yahoo visibility for some of its content, better Yahoo indexing for all its content, and more frequent and guaranteed refreshment within the Yahoo index.

Last month, OCLC launched a program to harvest DSpace repositories (which are OAI-compliant) and make their metadata available in a non-OAI format for re-harvesting by non-OAI services like Google.  The OCLC tool will make these DSpace repositories indexable by nearly any search engine, but Google is apparently the first to take advantage of it.  A test project will use the OCLC tool to help Google index the contents of 17 DSpace repositories from universities around the world.
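To make the idea of such a bridge concrete, here is a minimal Python sketch of re-exposing harvested metadata in a form that an ordinary web crawler can read.  It is not the OCLC tool, and the format that tool emits is not described here; plain HTML pages with "DC."-prefixed meta tags are an assumption chosen for illustration, as are the function name record_to_html, the record layout, and the example.org URL.

```python
from html import escape


def record_to_html(record):
    """Render one harvested metadata record as a small, crawlable HTML page.

    `record` is a hypothetical dict with the keys "title", "creators" (a list
    of names), and "url" (a link to the full text in the repository).
    """
    # escape() also escapes quotes by default, so values are safe in attributes.
    title = escape(record["title"])
    creators = ", ".join(escape(name) for name in record["creators"])
    url = escape(record["url"])
    return (
        "<html><head>"
        f"<title>{title}</title>"
        f'<meta name="DC.title" content="{title}">'
        f'<meta name="DC.creator" content="{creators}">'
        "</head><body>"
        f"<h1>{title}</h1>"
        f"<p>{creators}</p>"
        f'<p><a href="{url}">Full text</a></p>'
        "</body></html>"
    )


if __name__ == "__main__":
    # Made-up record standing in for one item harvested from a repository.
    sample = {
        "title": 'Protein <chirality> & "folding"',
        "creators": ["Anderson, John"],
        "url": "http://example.org/eprints/1",
    }
    print(record_to_html(sample))
```

A crawler that knows nothing about OAI can index a page like this and follow its link to the full text, which is the single-deposit benefit described above.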

Ironically, these new Google-OAI bridges do not make the two forms of visibility-enhancement more equivalent.  On the contrary, they tilt the advantage toward OAI archiving.  These bridge tools give some content the benefit of both worlds, but the only content to get this dual benefit is on deposit in OAI-compliant archives.  Content nowhere but a personal web site only gets the Google half of the benefit.

* On the Yahoo-OAIster collaboration

OAIster press release (March 10, 2004)

* On the OCLC-DSpace-Google collaboration

The OCLC press release (April 9, 2004)

OCLC project page

The OCLC tool is not the first to serve this general purpose.  DP9 is an open-source tool from the Old Dominion University Digital Library Group that lets general search engines like Google and Yahoo index OAI-compliant archives.  It's at least three years old.

Jeffrey R. Young, Google Teams Up with 17 Colleges to Test Searches of Scholarly Materials, Chronicle of Higher Education Daily Update, April 9, 2004.

Donald MacLeod, Google launches research archive project, The Guardian, April 13, 2004.

Sharon Cantor, Google plans scholarly search tool, Daily Pennsylvanian, April 21, 2004.

Alix Cody, Google, colleges team up to provide research tools, The Dartmouth Online, April 27, 2004.

* Postscript.  Just last week, CrossRef and Google announced a collaboration that deserves mention here even if the fact that it does not make use of OAI-compliant archives puts it in a different category.  The collaboration lets Google index full-text peer-reviewed research articles from nine participating publishers, such as the American Physical Society, Blackwell, the Institute of Physics, Nature Publishing Group, Oxford, and Wiley.  The resulting searches will be free of charge for users, and cover both current and back issues.

The CrossRef-Google press release (April 28, 2004).

EContent has an unsigned note on the CrossRef-Google deal in its April 30 issue.

Just a week earlier we saw the announcement of Amazon's very useful A9 search engine, which integrates Google searches of the web with Amazon Search Inside the Book, combining them in one clean interface.  Just as the imminent, unannounced Microsoft search engine will integrate web searching with the domain Microsoft controls --your hard drive-- A9 integrates web searching with the domain it controls --the growing number of full-text books from participating publishers.

* PPS.  From my "Predictions for 2004" (SOAN for 2/2/04):

"Large, for-profit, non-academic search engines like Google, Yahoo, and the new Microsoft contender will realize that OA is in their interest and join the alliance fighting for it.  They might even join the ranks of those funding it.  OA will give them a larger and more useful body of content to index for searching.  That means it will bring in more traffic and enable them to sell more advertising.  The only obstacle:  none will want to go first, for all the new OA content they fund will immediately be indexable by the others."

As I read the news, both halves of this prediction are coming true.  On the one hand, mainstream, non-academic search engines are looking for ways to index scholarly content.  Starting with OAI-compliant archives and CrossRef-compliant publications is a natural, both because the data are well-structured and because one negotiation can open the door to many sources.  On the other hand, search engines want some quasi-exclusive access to this new content or else their rivals will have equal access to the benefit of their investment.  So while Google can eventually strike a similar deal with OAIster, and can already index the surface-web content in OAIster, Yahoo is the only search engine to benefit from the direct OAIster feed of metadata.  Similarly, Google will have some kind of exclusive access to the CrossRef content --just as it has with the 1,000,000+ documents in IEEE Xplore and just as it plans to have for the public-domain books from the Stanford library it will pay to digitize through Project Ocean.  The only trend in the other direction is the OCLC tool, which will make DSpace metadata equally available to all indexing services.  OCLC deserves our thanks for this openness, Google deserves congratulations for seizing the opportunity first, and the other search engines should jump on the bandwagon ASAP.


