What software do we need?
SPARC Open Access Newsletter, issue #86
June 2, 2005
by Peter Suber
What software would advance the cause of open access?  How can programmers help?  Here's an informal list.  If you have other ideas, please send them to the SPARC Open Access Forum.
http://www.arl.org/sparc/soa/index.html#forum

* Help develop the many open-source projects that already exist to support open access, such as EPrints and DSpace for archiving, and Open Journal Systems and DPubS for journal management.  I don't know of a general list of open-source software that supports OA.  (Making such a list would be another contribution.)  But here are two lists to get you started:

The BOAI Guide to Institutional Repository Software (limited to open-source packages)
http://www.soros.org/openaccess/software/

The SPARC list of journal and archiving software (not limited to open-source packages)
http://www.arl.org/sparc/resources/pubres.html

* Automate the metadata annotation of scholarly journal articles.  The more of this we can automate, the simpler and shorter the process of self-archiving becomes.  If we could automate it completely, then OA archiving would become so easy that we could move toward bulk archiving of eprints.  Until then, it would help for software to take the first stab at metadata annotation and let the author or another human being correct or finish the record.

The good news is that the state of the art in research is further along than existing software.  That is, there's room for the software to catch up.  See Jane Greenberg et al., Final Report for the AMeGA (Automatic Metadata Generation Applications) Project, submitted to the Library of Congress, February 17, 2005.  "The main finding...is that there is a *disconnect* between experimental research and application development.  It seems that metadata generation applications could be vastly improved by integrating experimental research findings."
http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf
http://www.earlham.edu/~peters/fos/2005_03_27_fosblogarchive.html#a111219331954819324
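
To make the "first stab" idea concrete, here is a minimal Python sketch of a metadata drafter.  The heuristics --title on the first line, authors on the second, abstract flagged by the word "Abstract"-- are my own illustrative assumptions, nowhere near the state of the art the AMeGA report surveys; the point is only the workflow, in which software drafts the record and a human corrects or finishes it.

    import re

    def draft_metadata(text):
        """Guess Dublin Core-style fields from the opening lines of an article."""
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        record = {"title": lines[0] if lines else ""}

        # Assumed heuristic: the line after the title lists the authors,
        # separated by commas or "and".
        if len(lines) > 1:
            record["creator"] = [name.strip()
                                 for name in re.split(r",|\band\b", lines[1])
                                 if name.strip()]

        # Assumed heuristic: a line starting "Abstract" begins the description.
        for i, line in enumerate(lines):
            if line.lower().startswith("abstract"):
                record["description"] = " ".join(lines[i:i + 3])
                break

        return record  # a draft for a human to correct or finish

    if __name__ == "__main__":
        sample = "Gene Expression in Yeast\nA. Smith and B. Jones\n\nAbstract. We study..."
        print(draft_metadata(sample))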

An important, related job is to automate the classification (tagging) of articles by discipline.  This will make it much easier for institutional or multi-disciplinary repositories to support browsing by field, not just searching by field-specific keywords.
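
Discipline tagging is a standard text-classification problem.  Here is a minimal sketch using a naive Bayes classifier from the open-source scikit-learn library; the three-item training set is a toy stand-in for the human-tagged records a real repository would train on.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training set: (abstract, discipline) pairs.
    abstracts = [
        "protein folding and gene expression in yeast",
        "black hole thermodynamics and hawking radiation",
        "judicial review and constitutional interpretation",
    ]
    disciplines = ["biology", "physics", "law"]

    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(abstracts, disciplines)

    # Tag a new, unlabeled eprint so the repository can support browsing
    # by field; a human can always override the guess.
    print(classifier.predict(["quantum entanglement in curved spacetime"])[0])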

* Today most journals permit postprint archiving, but most do not let authors use the published PDF for this purpose.  However, some journals --like the New England Journal of Medicine and the California Law Review-- do not let authors use anything else.  Journal anxiety about the version-control problem is likely to increase the number of journals in the second category.  For these journals, we need a tool to download a specific PDF from a publisher web site and deposit it in a designated archive.  The author should only have to enter two URLs, one for the article and one for the archive, and run the program from an IP address with access to the article.  Of course, the URLs could be entered by a secretary, student worker, librarian, or OA activist rather than the author.

If we already have the metadata annotators by this time, then this tool could call on them to annotate the articles as they are deposited.  Otherwise it could fire off an email telling the author how to follow up by adding the metadata.
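
Here is a minimal sketch of such a two-URL tool.  It assumes the archive accepts deposits by plain HTTP POST with a form field named "file"; both are illustrative assumptions, since deposit interfaces vary from repository to repository.

    import sys
    import requests

    def fetch_and_deposit(article_url, archive_url):
        # Download the publisher's PDF (run this from an IP address
        # with access to the article).
        response = requests.get(article_url)
        response.raise_for_status()

        # Deposit the PDF in the designated archive.
        deposit = requests.post(
            archive_url,
            files={"file": ("article.pdf", response.content, "application/pdf")})
        deposit.raise_for_status()

    if __name__ == "__main__":
        # Usage: python deposit.py <article-url> <archive-deposit-url>
        fetch_and_deposit(sys.argv[1], sys.argv[2])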

* Scrape the text from a PDF file, preferably including the pagination.  If we had an easy, no-cost way to extract plain text from a PDF, we could move the text to any other file format we wanted or add layers of intelligent tagging.  When journals give authors permission to post the final version of the published text but not to use the publisher's PDF, authors have a laborious and error-prone job in front of them --one that lends itself to automation.  Even when authors have permission to use the publisher's PDF, they may want to move the text to a file format friendlier to crawling and indexing software.
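
As a sketch of what this looks like with open-source tools, the pypdf library can pull plain text out of many PDFs page by page, so pagination can be preserved as markers in the output.  Extraction quality varies with how the PDF was produced, so a real tool would add cleanup passes for hyphenation, columns, and running headers.

    from pypdf import PdfReader

    def pdf_to_text(path):
        reader = PdfReader(path)
        chunks = []
        for number, page in enumerate(reader.pages, start=1):
            # Mark each page boundary so page-number citations survive.
            chunks.append(f"[page {number}]")
            chunks.append(page.extract_text() or "")
        return "\n".join(chunks)

    if __name__ == "__main__":
        print(pdf_to_text("article.pdf"))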

* Scrub executable scripts from PDFs.  For the reasons why, see my essay from last month's newsletter on Trojan Horse eprints.
http://dash.harvard.edu/bitstream/handle/1/3997158/suber_news85.html#trojanhorse
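
A first pass at scrubbing can be sketched with pypdf by copying the pages into a new file and dropping the catalog entries that trigger scripts.  Where the dangerous hooks live is my assumption for illustration, and this is not a security guarantee: /Names also holds benign entries, page-level actions would need the same treatment, and _root_object is a pypdf internal.

    from pypdf import PdfReader, PdfWriter

    def scrub(in_path, out_path):
        reader = PdfReader(in_path)
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)

        # Drop document-level script hooks from the PDF catalog.
        # (_root_object is pypdf's handle on the catalog dictionary.)
        root = writer._root_object
        for key in ("/OpenAction", "/AA", "/Names"):
            if key in root:
                del root[key]

        with open(out_path, "wb") as f:
            writer.write(f)

    if __name__ == "__main__":
        scrub("eprint.pdf", "eprint-scrubbed.pdf")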

* Many programs today can summarize digital texts, reducing long news stories or journal articles to a couple of accurate, succinct paragraphs.  What we need are tools to connect these programs to the research process.  For example, imagine selecting an option on a good search engine and getting back a short summary along with each URL.  Imagine bookmarking 100 relevant-looking articles, clicking on a summarizing tool, and getting back a summary of each article and a report on where their conclusions conflict and which ones draw upon which others.

Article summarizing software
http://dash.harvard.edu/bitstream/handle/1/4314310/suber_news3-18-02.html
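
The simplest extractive approach can be sketched in a few lines: score each sentence by the frequency of its words across the whole article and return the top few in their original order.  Real summarizers, including those surveyed in the piece above, are far more sophisticated; this only shows the shape of a routine a search engine could call once per result.

    import re
    from collections import Counter

    def summarize(text, sentences=3):
        # Split on sentence-ending punctuation and count word frequencies.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"[a-z']+", text.lower()))

        def score(sentence):
            tokens = re.findall(r"[a-z']+", sentence.lower())
            return sum(freq[t] for t in tokens) / (len(tokens) or 1)

        # Keep the highest-scoring sentences, in their original order.
        top = sorted(range(len(parts)), key=lambda i: score(parts[i]),
                     reverse=True)[:sentences]
        return " ".join(parts[i] for i in sorted(top))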

François Schiettecatte once planned to add a text-summarizing feature to My.OAI, his excellent cross-archive search engine for OAI-compliant repositories.  But unfortunately My.OAI is now defunct.
http://www.myoai.com/
(The link is dead.  I include it only for reference or in case the program revives in the future.)

The Columbia Natural Language Processing group once planned to integrate searching, text summarizing, and consistency-checking into one tool for medical researchers.  But it looks like the project is defunct; at least the web site hasn't been updated since 2002.  See PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video And Language Resource).
http://persival.cs.columbia.edu/

* Mine facts and assertions from free-form text and deposit them in growing scientific databases for further querying, processing, and interconnection.  There's a lot of literature, and a lot of work, on this problem.  For a good recent survey of the issues and benefits, see Dietrich Rebholz-Schuhmann, Harald Kirsch, and Francisco Couto, Facts from Text --Is Text Mining Ready to Deliver?  PLoS Biology, February 15, 2005.
http://biology.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pbio.0030065
http://www.earlham.edu/~peters/fos/2005_03_06_fosblogarchive.html#a111056339859943872
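
To show the shape of the pipeline, here is a minimal sketch that mines one toy relation ("X inhibits Y") from free text with a regular expression and deposits the hits in a SQLite database for later querying.  The single pattern is a placeholder; the serious natural-language methods are the ones surveyed in the article above.

    import re
    import sqlite3

    # Toy pattern: two capitalized terms joined by "inhibits".
    PATTERN = re.compile(r"(\b[A-Z]\w+\b)\s+inhibits\s+(\b[A-Z]\w+\b)")

    def mine(texts, db_path="facts.db"):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS facts "
                    "(subject TEXT, relation TEXT, object TEXT)")
        for text in texts:
            for subject, obj in PATTERN.findall(text):
                con.execute("INSERT INTO facts VALUES (?, 'inhibits', ?)",
                            (subject, obj))
        con.commit()
        return con

    if __name__ == "__main__":
        con = mine(["Aspirin inhibits Cyclooxygenase in most tissues."])
        print(con.execute("SELECT * FROM facts").fetchall())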

* You get the idea.  We need tools that make it easier to get literature online for OA, and then tools to make OA literature more useful than it already is.  Of course, in every case, open source is preferred.

We have OA strategies that depend on persuading authors, persuading universities, persuading libraries, persuading funders, and persuading governments.  There is certainly room for an effective strategy that depends on persuading programmers.  The software strategy is simply to make spectacular tools that are limited to, or optimized for, OA literature.  If there are cool tools waiting to enhance any literature that becomes OA, then they operate as so many more incentives to make literature OA.

Today, many authors make their work OA precisely to make it visible to Google.  One day soon, text-mining software should exert an even stronger force on serious researchers.  Text-mining tools will work best on OA literature and will vastly extend our ability to find what we need, no matter how it is expressed.

I stand by this assessment from 2002:

I...expect that software to help readers find relevant literature will become more and more sophisticated over time, roughly matching the advances in artificial intelligence. Readers frustrated by information overload will come to rely on these sophisticated tools. Works of scholarship invisible to these new-generation searching, recommendation, and evaluation tools will be invisible to researchers....As we move further into an era in which serious research is mediated by sophisticated software, commercial publishers will have to put their works into the public Internet in order to make them visible to serious researchers. In this sense, the true promise of [open access] is not that scientific and scholarly texts will be free and online for reading, copying, printing, and so on, but that they will be available as free online data for software that acts as the antennae, prosthetic eyeballs, research assistants, and personal librarians of all serious researchers.
http://www.earlham.edu/~peters/fos/morrison.htm

Programmers, start your engines!


----------

Read this issue online
http://dash.harvard.edu/bitstream/handle/1/3967550/suber_news86.html

SOAN is published and sponsored by the Scholarly Publishing and Academic Resources Coalition (SPARC).
http://www.arl.org/sparc/

Additional support is provided by Data Conversion Laboratory (DCL), experts in converting research documents to XML.
http://www.dclab.com/public_access.asp

This is the SPARC Open Access Newsletter (ISSN 1546-7821), written by Peter Suber and published by SPARC.  The views I express in this newsletter are my own and do not necessarily reflect those of SPARC or other sponsors.

To unsubscribe, send any message (from the subscribed address) to <SPARC-OANews-off@arl.org>.

Please feel free to forward any issue of the newsletter to interested colleagues.  If you are reading a forwarded copy, you can subscribe by sending any message to <SPARC-OANews-feed@arl.org>.

SPARC home page for the Open Access Newsletter and Open Access Forum
http://www.arl.org/sparc/publications/soan

Peter Suber's page of related information, including the newsletter editorial position
http://www.earlham.edu/~peters/fos/index.htm

Newsletter, archived back issues
http://www.earlham.edu/~peters/fos/newsletter/archive.htm

Forum, archived postings
https://mx2.arl.org/Lists/SOA-Forum/List.html

Timeline of the Open Access Movement
http://www.earlham.edu/~peters/fos/timeline.htm

Open Access Overview
http://www.earlham.edu/~peters/fos/overview.htm

Open Access News blog
http://www.earlham.edu/~peters/fos/fosblog.html

Peter Suber
http://www.earlham.edu/~peters
peter.suber@earlham.edu

SOAN is licensed under a Creative Commons Attribution 3.0 United States License.
http://creativecommons.org/licenses/by/3.0/us/

