If you're reading this, chances are you've seen our robot visiting
your site while looking through your server logs. When we crawl to
populate our index, we advertise the "User-agent" string "NutchOrg".
If you see the agent "Nutch" or "NutchCVS", that's probably a
developer testing a new version of our robot, or someone running their
own instance.
We are open-source developers, trying to build something useful for
the world to use. It comes naturally to us to want to be good
netizens. If you notice our bot misbehaving, please drop us a line at
agent@nutch.org and we will investigate the
problem.
Our bot does retrieve and parse robots.txt files, and it looks for
robots META tags in HTML. These are the standard mechanisms for
webmasters to tell web robots which portions of a site a robot is
welcome to access.
We're an open source project, so please
understand that a misbehaving bot appearing with our Agent string may
not have been run by us. Our code is out there for anyone to tinker
with. However, whether or not we ran the bot, we'd appreciate hearing
about any bad behavior, so please let us know about it! If possible,
please include the name of the domain and some representative log
entries. We can be reached at
agent@nutch.org.
Our bot follows the robots.txt exclusion standard, which is
described at
http://www.robotstxt.org/wc/exclusion.html#robotstxt. Depending on
the configuration, our robot may obey different rules. To make it
simple to send our bot away, we'll always obey rules for "Nutch".
Here are the different cases.
- When we're running to populate our index, we'll advertise the
agent "NutchOrg", and obey rules for "NutchOrg" if they exist, or
"Nutch", or "*".
- When anyone is running an unmodified CVS version of our bot
(including when we're running our bot to test it) it will advertise
"NutchCVS", and obey rules for "NutchCVS" if they exist, or "Nutch",
or "*".
- Release versions of our bot will advertise "Nutch", and obey
rules for "Nutch" or "*".
To ban all bots from your site, place the following in your
robots.txt file:
User-agent: *
Disallow: /
To ban Nutch bots from your site unless they're building the
Nutch.Org demo index, place the following in your robots.txt file:
User-agent: Nutch
Disallow: /

User-agent: NutchOrg
Disallow:
To ban all Nutch bots from your site:
User-agent: Nutch
Disallow: /
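The lookup order described in the cases above can be sketched in a few lines of Python. This is only an illustration of the precedence rules, not the code Nutch itself runs: the names `AGENT_PRECEDENCE`, `parse_robots_txt`, and `is_allowed` are hypothetical, and the parser assumes a simplified robots.txt that contains only "User-agent" and "Disallow" lines, with paths matched by prefix.

```python
# Lookup order per advertised agent, copied from the three cases above.
AGENT_PRECEDENCE = {
    "NutchOrg": ["NutchOrg", "Nutch", "*"],
    "NutchCVS": ["NutchCVS", "Nutch", "*"],
    "Nutch": ["Nutch", "*"],
}


def parse_robots_txt(text):
    """Return {lowercased agent name: [disallowed path prefixes]}."""
    groups = {}
    current = []       # agent names of the record being read
    seen_rule = False  # True once the record has a Disallow line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            if not raw.strip():              # a truly blank line ends a record
                current, seen_rule = [], False
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:                    # new record without a blank line
                current, seen_rule = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow" and current:
            seen_rule = True
            if value:                        # an empty Disallow permits everything
                for agent in current:
                    groups[agent].append(value)
    return groups


def is_allowed(advertised_agent, robots_txt, path):
    """Apply the first rule set found in the agent's lookup order."""
    groups = parse_robots_txt(robots_txt)
    for name in AGENT_PRECEDENCE[advertised_agent]:
        rules = groups.get(name.lower())
        if rules is not None:
            return not any(path.startswith(prefix) for prefix in rules)
    return True  # no applicable record: nothing is disallowed


# The "demo index only" robots.txt from the example above.
DEMO_ONLY = "User-agent: Nutch\nDisallow: /\n\nUser-agent: NutchOrg\nDisallow:\n"

print(is_allowed("NutchOrg", DEMO_ONLY, "/index.html"))  # True
print(is_allowed("NutchCVS", DEMO_ONLY, "/index.html"))  # False
print(is_allowed("Nutch", DEMO_ONLY, "/index.html"))     # False
```

Note that the most specific matching record wins outright: a bot advertising "NutchOrg" that finds a "NutchOrg" record never consults the "Nutch" record, which is why the demo-index example admits it while turning other Nutch bots away.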
If you do not have permission to edit the
/robots.txt file on your server, you can still tell robots not to
index your pages or follow your links. The standard mechanism for
this is the robots META tag, as described at
http://www.robotstxt.org/wc/meta-user.html.
To tell Nutch, and other robots, not to index your page or follow
your links, insert this META tag into the HEAD section of your HTML
document:
<meta name="robots" content="noindex,nofollow">
Of course, you can control the "index" and "follow" directives
independently. The keywords "all" or "none" are also allowed,
meaning "index,follow" or "noindex,nofollow", respectively. Some
examples are:
<meta name="robots" content="all">
<meta name="robots" content="index,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="none">
If there are no robots META tags, or if an action is not
specifically prohibited (e.g. neither "nofollow" nor "none" appears),
Nutch will assume it is allowed to index the page or follow its links.
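As a rough sketch of how these directives combine, the following Python reads the robots META tags from a page and reports whether indexing and link-following remain allowed. It is an illustration under assumptions, not Nutch's own parser: the class `RobotsMetaParser` and the function `robots_meta` are hypothetical names, and the sketch relies only on Python's standard `html.parser` module. Both actions start out allowed and are switched off only by an explicit "noindex", "nofollow", or "none".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects index/follow permissions from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.index = True   # allowed unless explicitly prohibited
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for token in (attr.get("content") or "").lower().split(","):
            token = token.strip()
            if token in ("noindex", "none"):
                self.index = False
            if token in ("nofollow", "none"):
                self.follow = False


def robots_meta(html):
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


print(robots_meta('<meta name="robots" content="index,nofollow">'))  # (True, False)
print(robots_meta("<html><head></head></html>"))                     # (True, True)
```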