If you are reading this document, then your website has likely been touched by our web crawler. The information below explains what a crawler is, how you can deal with crawlers, and how to contact us.

Some people are surprised when a crawler visits their site regularly, downloading pages. Many groups operate crawlers on the web, and a code of conduct exists to ensure that crawlers and websites can cooperate to achieve their respective goals. As responsible professionals, we are eager to ensure that webmasters are not inconvenienced by our crawling activities, and we only wish to use publicly available data. We therefore abide by the Robots Exclusion Standard (see http://www.robotstxt.org/wc/exclusion.html), and, more importantly, we subscribe to the notion of being good citizens in our use of the Internet. We will do our best to make sure that nobody is inconvenienced by our crawling activities.

------------------------------------------------------------------
What is a Crawler?
------------------------------------------------------------------

A crawler (which may also be called a robot, spider, or bot) is a program that automatically traverses the Web's hypertext structure by retrieving a document and then recursively retrieving all documents that it references. For more information on crawlers and the standards of crawling which we follow, you can visit the Web Robots FAQ (http://www.robotstxt.org/wc/robots.html).

------------------------------------------------------------------
How Do I Prevent My Website or Parts of My Website From Being Crawled by Your Crawler?
------------------------------------------------------------------

Our crawler's activities may create a brief burst of moderate activity on a single server. However, if you would prefer that our crawler, or any other, bypass part or all of your website, or if you are concerned that your site is being heavily loaded by our crawler, then the simplest remedy is to create a robots.txt file on your server. Any well-behaved crawler should access this file before downloading anything from your server(s). The file must reside in the top level of your server, and it allows you to control which parts of your server may be visited, and which crawlers are allowed to visit your site(s). Note that if your robots.txt file is malformed, a crawler may not recognize your intention. We obey the Robots Exclusion Standard, originally constructed in 1994 and updated in 1996. You can review the standard at the Robotstxt website (http://www.robotstxt.org/wc/exclusion.html).

------------------------------------------------------------------
How do I make a Robots.txt File?
------------------------------------------------------------------

If you are wondering what a robots.txt file looks like, here is a simple one that asks all robots to stay away from /temp/documents and its subdirectories:

    # Sample robots.txt file 1
    User-agent: *
    Disallow: /temp/documents/

The first line is a comment; a comment can be placed anywhere in a robots.txt file as long as it is preceded by a pound symbol (#). The second line designates the robots to which the access policies apply, with a "*" meaning all robots. The third line disallows access to the specified directory and to any directories below it in the hierarchy. You can include multiple Disallow statements to prohibit access to two or more directories.

You may want certain robots to access areas that are disallowed to other robots. The following robots.txt file allows unrestricted site access to a robot named CRAWLER but prohibits others from accessing either /temp/documents or /under_construction:

    # Sample robots.txt file 2
    User-agent: *
    Disallow: /temp/documents/
    Disallow: /under_construction/

    User-agent: CRAWLER
    Disallow:

If you want to forbid all crawlers from crawling your site altogether, then create a robots.txt file with the following lines:

    # Sample robots.txt file 3
    User-agent: *
    Disallow: /

Upon seeing this, crawlers which abide by the robots standard, as we do, will immediately disconnect and go find another server.

Any of the above sample robots.txt files must be placed in the top level of your server under the file name "robots.txt". Be sure to verify that the URL http://your.server.name/robots.txt will retrieve your newly created file.

If you want to forbid only our crawler from going through your site, then create a robots.txt file that contains the following lines:

    User-agent: wfarc
    Disallow: /

Again, place this file in the top level of your server under the file name "robots.txt", and verify that the URL http://your.server.name/robots.txt will retrieve your newly created file.
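If you would like to double-check how a standards-compliant crawler will interpret your robots.txt file, the short Python sketch below may help. It uses the standard library's urllib.robotparser module, which implements the same Robots Exclusion Standard described above. The host name and page URLs are placeholders for your own server; "wfarc" is our crawler's user-agent name.

    # Sketch: check which URLs a given crawler may fetch, per your robots.txt.
    # Replace your.server.name and the test URLs with your own.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("http://your.server.name/robots.txt")
    parser.read()  # download and parse the robots.txt file

    # can_fetch() returns True if the named user-agent may fetch the URL.
    print(parser.can_fetch("wfarc", "http://your.server.name/index.html"))
    print(parser.can_fetch("wfarc", "http://your.server.name/temp/documents/a.html"))

Run against sample robots.txt file 1 above, this should print True for the first URL and False for the second, since /temp/documents/ and everything below it is disallowed to all robots.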
------------------------------------------------------------------
How You Can Help Us Quickly Respond To You:
------------------------------------------------------------------

You can provide us with a few pieces of information so that we can rapidly identify the source of any problem or issue involving our crawler's interaction with your website. In your email to us, please include the following:

* An outline of your problem or issue
* The IP address of the server which our crawler touched
* The time and date of the problem or issue
* Your name as the contact person, plus an email address and/or phone number
* Entries from your server log(s) which show the problem, or the URLs that triggered it, would also be helpful

------------------------------------------------------------------
How To Contact Us:
------------------------------------------------------------------

If you have created a robots.txt file on your server and still have questions for us, then please contact us via email, including the information outlined above, using the email address