How to facilitate Google crawling
Notes for open-access repository maintainers
Google and I have put together a set of tips to help configure open-access scholarly repositories for full-text Google crawling. Please help by sending the URL of this document to the people who maintain the repository for your institution or discipline.
- Make sure that robots.txt does not block Googlebot (Google's crawler) from individual articles, article lists or links that lead from the homepage to the article lists.
- To allow Googlebot access to the entire repository, you can add the following to the robots.txt file on your web site:
  User-agent: Googlebot
  Disallow:
- Make sure all articles can be reached by following HTML links from the homepage. Common reasons why this may not be possible:
- The only way to access articles is via a search interface.
- A browse interface using HTML links is the best way to make sure Google's crawlers can find all the articles on your site. It also helps human users, who can see the full range of content in your repository and make serendipitous discoveries. Use a text browser such as Lynx to examine your site, because Google's crawlers will likely see your site much as Lynx would.
Note: there is no need to limit yourself to static URLs to build the browse interface. Google's crawlers happily crawl dynamic URLs.
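As a rough complement to the Lynx check, the short Python sketch below lists the plain HTML links on a page, which is approximately what a link-following crawler has to work with. It is only an illustration: the sample page, the host repository.example.org, and the names LinkCollector and extract_links are hypothetical, and a real crawler does far more than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of plain <a href="..."> links on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href> links found in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical browse page: two followable links and one search form.
SAMPLE_PAGE = """
<html><body>
  <a href="/browse/2004/">2004</a>
  <a href="view.php?id=17">Article 17</a>
  <form action="/search"><input name="q"></form>
</body></html>
"""

if __name__ == "__main__":
    base = "http://repository.example.org/browse/"
    for link in extract_links(SAMPLE_PAGE, base):
        print(link)
```

The search form contributes nothing to the output, which is the point of the tip: articles reachable only through a search box leave no link to follow. The dynamic URL (view.php?id=17) is collected like any other link.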
- Browse interfaces should be built as a bushy tree with links to actual articles at the leaves. This will allow Google's crawlers to find links to all the articles quickly. Some repositories use browse interfaces built as a list, each item in the list providing links to a small number of articles. To discover links to all articles, crawlers need to follow a long chain of "next" links. Given a bounded time to take a snapshot of the web, crawlers may not be able to traverse the entire list and may miss links to articles later in the list.
- Some repositories break larger documents into pieces and provide separate links to all the pieces. It is hard for a crawler to recompose the entire document by finding and ordering the pieces. When it is necessary to provide separate pieces, it would be ideal to provide an additional link to the entire document as a whole. Having access to the entire document helps Google Scholar analyze citations.
- Some repositories require cookies. Google's crawlers don't accept cookies. It would be best to not require cookies for Googlebot.
- Some repositories add sessionids to track usage. It is best to allow crawlers to access your sites without sessionids or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of crawlers is entirely different. Using these techniques may result in incomplete indexing of your site, as crawlers may not be able to eliminate URLs that look different but actually point to the same page.
- If not all content on your repository is open access, for example if the full text of some or all articles is available only to users at your institution, please contact firstname.lastname@example.org to see how Google can work with you to index your repository.
- For more guidelines, see http://www.google.com/webmasters/guidelines.html.
For more info on Google's crawlers, see http://www.google.com/webmasters/faq.html.
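The robots.txt tip at the top of the list can be checked before Googlebot ever visits. The sketch below uses Python's standard urllib.robotparser module to ask whether a given set of rules would let Googlebot fetch a URL. The helper name googlebot_allowed, the sample rules, and the host repository.example.org are assumptions made for illustration, and the standard-library parser may not match Googlebot's own interpretation in every edge case.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_lines, urls, agent="Googlebot"):
    """Map each URL to True/False: may `agent` fetch it under these rules?"""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines,
    # so the rules can be tested without any network access.
    parser.parse(robots_lines)
    return {url: parser.can_fetch(agent, url) for url in urls}


# An empty Disallow line blocks nothing for the named crawler.
OPEN_RULES = [
    "User-agent: Googlebot",
    "Disallow:",
]

# Hypothetical rules that hide the browse pages from every crawler.
BLOCKING_RULES = [
    "User-agent: *",
    "Disallow: /browse/",
]

if __name__ == "__main__":
    article_list = "http://repository.example.org/browse/2004/"
    print(googlebot_allowed(OPEN_RULES, [article_list]))
    print(googlebot_allowed(BLOCKING_RULES, [article_list]))
```

To test a live site instead of a list of lines, RobotFileParser also offers set_url() and read(), which download the robots.txt file before can_fetch() is called. It is worth checking the homepage, at least one article list, and at least one individual article.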
First put online January 27, 2005.
Open Access Project Director, Public Knowledge
Research Professor of Philosophy, Earlham College
Senior Researcher, SPARC