The many-copy problem and the many-copy solution
SPARC Open Access Newsletter, issue #69
January 2, 2004
by Peter Suber

As soon as we provide open access to an article, we should expect copies to proliferate around the world.  The archive or journal where the article first appeared will make back-ups and may have mirror sites.  The Internet Archive and Wayback Machine will make and store copies.  Google and many other search engines will put copies in their cache.  Readers who find the article especially important for their teaching or research might post copies to their own web sites.  Others will circulate copies as email attachments.  Readers will have offline copies on their hard drives, produced by their browsers, page-change alert programs, locally searchable databases, or other applications.  Many users will make and keep printouts. 

Insofar as this proliferation causes trouble, let me call it the many-copy problem.  Insofar as it solves problems, let me call it the many-copy solution.  Open access undoubtedly triggers both.

In the spirit of giving good news first, here's a sketch of the many-copy solution.

* The proliferation of copies shows that copying is physically or technically possible.  In the case of open-access literature, copying is also legally permissible.  When it is both, then licensing agreements and software to enforce them haven't locked up the content and made it uncopyable.  The proliferation of permissible copies shows that the technical and legal freedom to make and distribute copies is intact, which is a key part of the free exchange of information.

* The proliferation of copies is insurance against disaster.  If one copy is deleted or corrupted, the other copies will likely survive.  LOCKSS (Lots of Copies Keeps Stuff Safe), a P2P network of self-correcting archive mirrors, turns this fact into a deliberate preservation strategy.


* The proliferation of copies is a hedge against censorship, not just deletion and corruption.  When the Bush administration started pruning government-controlled web sites, removing valid science that might help terrorists, and valid science that might support abortion-choice advocates, it was serenely unaware that copies of the same files existed elsewhere on the net.  It doesn't matter whether the censor is trying to save lives or distort science.  You can only remove the copies you control and the copies you know about.  Open access increases the odds that these aren't the only copies online, let alone offline in printouts and hard drives.

Bush administration deletions of information that might help terrorists

Bush administration deletions of information that might help abortion-rights advocates

Bush administration deletions of information that might help critics of its Iraq policy

* The proliferation of copies increases the chances not only that a copy will survive disaster, uncensored, but also that *open-access* copies will survive.  This is one reason why BioMed Central and the Public Library of Science deposit copies of all their published articles in PubMed Central.  The PMC copies help assure that the articles will remain OA even if the original journals die, are bought out, or change their access policies.

* A journal might refuse to publish an article unless the author removes the preprint from an open-access archive.  This is not a copyright issue, since the author was the copyright holder at the time the article was deposited in the archive.  But the journal can refuse any submission, and this power gives it leverage over authors who want to be published.  Authors can try to negotiate to keep the preprint accessible, which will work at a growing number of journals.  But even if negotiations fail, authors can simply comply with the journal's demand.  Thanks to the many-copy solution, other open-access copies are likely to exist elsewhere.  Authors, like censors, can only remove the copies they control and the copies they know about, and journals cannot expect them to do any more than that.

* The proliferation of copies increases the likelihood that at least one of them will be indexed by a search engine in your standard toolkit.  Some online journals have terrible search engines for their content.  Some archives are not OAI-compliant and cannot benefit from cross-archive OAI search engines.  But most open-access copies in the surface (as opposed to "deep") web will be crawled by Google and other major search engines.  Some will be indexed by Scirus, OAI-specific engines, and other specialized academic search engines.  There is no single index that marks the "finish line" for content trying to become visible and discoverable.  But every new copy increases the number of pathways between readers and copies, and increases the odds that a random reader will discover a copy by entering relevant terms into his or her favorite search engine, no matter how provincial or peculiar he, she, or it may be.

* Finally, the proliferation of copies speeds access and thereby supports the basic function of open access, which is to accelerate research.  If all copies of an article had to be served from a central location, with no caching or storage on local machines, no printing, and no forwarding, then literature might be nearly as difficult to reach and share as it was in the era of print.  In this sense, open access doesn't one-sidedly cause the proliferation of copies; the relationship is reciprocal.  Open access triggers copying by permitting it, while copying improves access by multiplying access points and cutting delays.

What about the many-copy *problem*?  How can the proliferation of copies cause trouble?

* Copies interfere with the measurement of traffic and usage.  A given archive or journal might measure usage very well.  But if there is an unknown number of copies elsewhere on the net, and an unknown percentage of readers are using those other copies, then the local measurements will be inaccurate to an unknown degree.  We might know that all verified counts are undercounts, but we won't know by how much.

If we had perfect indices of the entire net or perfect spybots in every browser, then the proliferation of copies would be compatible with perfect measurement of traffic and usage.  Perfect indices are very desirable, and perfect spybots very undesirable.  But we're very far from both, and there are good reasons to think that the desirable method of achieving this goal will always be out of reach, even if (big if) we continually approach completeness as an asymptote.

Or, perhaps a perfect index of the net is not even desirable.  It would only solve the measurement problem if it counted all copies in use.  But then it would have to count even offline copies on hard drives, threatening the private exchange and storage of information.  At some point, improving our usage metrics will violate privacy and protecting privacy will thwart usage metrics.

What if open-access articles carried code to report back to a scientometric counting station now and then?  (This is possible; Microsoft already does something similar to see whether copies of licensed MS apps are in use in more than one location at once.)  Copying the file would also copy the code with it, at least in the absence of a fairly sophisticated hack.  However, even if the code only reported anonymized traffic and usage data, many users would worry that it would report more, invade privacy, and compromise anonymous inquiry.  Open-source code would help allay fears, but would it help enough?  Either way, we're likely to see closed-source versions of this code become common.  It's too useful to Jack Valenti and John Ashcroft.

Note that the proliferation of copies only hinders metrics that count downloads, search hits, and other forms of usage.  It does not affect the count of citations or impact measurements based on citation counts.  Of course, automated citation counts might fail too, e.g. because an article citing my work is offline or invisible to the counter.  But if so, the fault does not lie with the many-copy problem.

* The proliferation of copies harms what we could call *dynamic* works --works that are periodically revised or updated.  Even if each update carries a revision date, and all copies carry the revision date, a reader will not know whether there is a more recent copy elsewhere. 

When I maintained a list of links to sites in philosophy, I dated every revision of the file.  But I was frustrated when other philosophers used copies, rather than links, to share it with students or colleagues.  They would invariably fail to keep their copies up-to-date.  The result was that readers who consulted their copies rather than my original would think that I was slow to update the file --or slower than I really was.

If the dynamic work is an article or book, then readers of out-of-date copies will think the author is guilty of errors or omissions that have been corrected in newer versions.

If the dynamic work has legal implications, like a web site privacy policy, then out-of-date copies will mislead users about their rights.

While I consent to open access for all my online writings, I do try to control the copying of my dynamic works.  When I find out-of-date copies on the web, I ask the host to bring them up-to-date or take them down.  I consent to mirrors of my dynamic works only when I am confident that the mirror will remain in synch with the original.

* The proliferation of copies makes it more difficult to know when the version of an article you're reading is the same version approved by a journal's peer-review process.  The text might give no indication.  It might say that it was approved, but it might be an altered copy of a version that was truly approved, or a fraudulent copy that was never approved. 

For better or worse, we're refining our rules of thumb for deciding when to trust online content.  One rule might be, "If the copy doesn't say where it was refereed, then assume it was never refereed."  (It might be an honest preprint or it might be a fraud.)  This particular rule may err too far on the side of skepticism, for we all know peer-reviewed papers on author web sites that show no sign of their approval by a peer-review process.  The question is how the many-copy problem interferes with our attempt to make the rule more discriminating and less crude.

One way to deal with the authentication problem is for the journal that conducted peer review to host its own copy of the approved article.  If you distrust the copy you're reading, then visit the source and read an authenticated copy, or run file-comparison software across the authenticated and questionable copies.  Of course most readers cannot use this method unless the journal version is open-access.
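The file-comparison step can be sketched in a few lines.  This is only an illustration, not a tool any journal is known to provide: the function names are hypothetical, and it assumes you already have both the questionable copy and the journal's authenticated copy as local files.

```python
import hashlib
from pathlib import Path


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def matches_authenticated_copy(questionable: Path, authenticated: Path) -> bool:
    """True only if both files have the same SHA-256 digest.

    Hypothetical helper: both paths are assumed to be local files, for
    example a copy found on the web and a copy downloaded from the
    journal's own site.
    """
    return sha256_digest(questionable.read_bytes()) == sha256_digest(
        authenticated.read_bytes()
    )
```

A comparison like this is deliberately strict.  A mismatch shows only that the bytes differ, which also happens when an honest copy has been reformatted or converted to another file type, so a mismatch is a reason to read the journal's copy rather than proof of fraud.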

Another approach is encryption, used in _Surfaces_, an early peer-reviewed, open-access journal edited by Jean-Claude Guédon.  With encryption, receiving an authenticated copy of an article is as easy as receiving an authenticated signature or credit card payment.  (I still don't understand why this powerful idea from 1991 has not been more widely imitated or criticized.  Is it because there's little urgency to solve the authentication problem itself?)

Surfaces (stopped publishing in 1999)

How Surfaces used encryption

Another approach is to let articles carry their metadata with them.  Metadata fields could indicate not only authorship and date, but whether the article was refereed and where.  Embedded metadata would not be as secure as encryption, but more convenient for the reader and less likely than "self-reporting code" to aggravate the suspicions of suspicious users.  (It might be hacked by would-be plagiarists, but not by authentication-seeking readers.)  On the other hand, if you find yourself asking whether to trust the metadata that accompany an article, then they haven't solved the authentication problem.
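To make the idea concrete, here is a minimal sketch of an article that carries its metadata as leading "Key: value" lines.  The layout and the field names (Author, Date, Refereed-By) are assumptions made up for this illustration, not an existing standard.

```python
from typing import Dict, Optional


def read_embedded_metadata(text: str) -> Dict[str, str]:
    """Parse leading 'Key: value' lines, stopping at the first blank line."""
    metadata: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the metadata header
        key, sep, value = line.partition(":")
        if not sep:
            break  # not a header line; treat the rest as the article body
        metadata[key.strip().lower()] = value.strip()
    return metadata


def claimed_referee(text: str) -> Optional[str]:
    """Return the venue the copy claims refereed it, or None if it makes no claim."""
    return read_embedded_metadata(text).get("refereed-by") or None
```

Note that `claimed_referee` returns only what the copy says about itself.  Anyone who can edit the file can edit the field, which is why embedded metadata are more convenient than encryption but not as secure.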

* Postscript:  I can't resist pointing out that just this month, Emerald introduced digital signatures in order to authenticate transactions with scholarly journals.  But it has a much more retrograde purpose in mind than the open-access distribution of authenticated copies of refereed articles.  Emerald will use the system to let authors transfer copyright through an online form.  Quoting John Peters, Emerald's editorial and author relations director:  "For the first time Authors will be able to assign copyright online resulting in a more efficient article submission process, much shorter time to publication, and of course, [somehow, despite the access fee] the widest possible dissemination of their work."



SOAN is published and sponsored by the Scholarly Publishing and Academic Resources Coalition (SPARC).

Additional support is provided by Data Conversion Laboratory (DCL), experts in converting research documents to XML.


This is the SPARC Open Access Newsletter (ISSN 1546-7821), written by Peter Suber and published by SPARC.  The views I express in this newsletter are my own and do not necessarily reflect those of SPARC or other sponsors.


Please feel free to forward any issue of the newsletter to interested colleagues.

Peter Suber

SOAN is licensed under a Creative Commons Attribution 3.0 United States License.
