Block some high-frequency spider bots (#2816)
This introduces rules to the robots.txt to block some spiders:

- Spiders that exist only for LLM training or for the enrichment of the
operating company, and that offer no value to genuine searchers, are
blocked from the whole site.

- Crawlers that hit the package docs under /p frequently, but without
obvious gains for genuine searchers, are blocked from those pages.

Based on our tests for ocaml/infrastructure#161, these bots behave as
worst-case users: they hit our most expensive pages frequently, in
patterns that bypass the cache (newly added by @mtelvers).
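
As a quick sanity check (not part of the commit itself), here is a minimal sketch using Python's standard-library urllib.robotparser against a trimmed-down copy of the new rules; the /p/dune URL is only an illustrative package page.

```python
# Minimal sketch: feed a trimmed-down copy of the new rules to Python's
# standard-library robots.txt parser and check which bots may fetch what.
from urllib import robotparser

RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: Amazonbot
Disallow: /p/

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# GPTBot is blocked from the whole site.
print(parser.can_fetch("GPTBot", "https://ocaml.org/"))            # False
# Amazonbot may still crawl the homepage, but not the package docs.
print(parser.can_fetch("Amazonbot", "https://ocaml.org/"))         # True
print(parser.can_fetch("Amazonbot", "https://ocaml.org/p/dune/"))  # False
# Crawlers without a dedicated rule fall through to the catch-all.
print(parser.can_fetch("Googlebot", "https://ocaml.org/p/dune/"))  # True
```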
shonfeder authored Nov 16, 2024
1 parent e0b1fb1 commit 74e753e
Showing 1 changed file with 37 additions and 1 deletion.
38 changes: 37 additions & 1 deletion asset/robots.txt
@@ -1,2 +1,38 @@
# "The Meta-ExternalAgent crawler crawls the web for use cases such as training
# AI models or improving products by indexing content directly"
# See https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
User-agent: meta-externalagent
Disallow: /

# https://platform.openai.com/docs/bots
User-agent: GPTBot
Disallow: /

# An SEO consultancy
# https://www.semrush.com/bot/
User-agent: SemrushBot
Disallow: /

# We don't want these robots crawling our expensive documentation pages, as they
# hit those pages with high frequency

User-agent: Amazonbot
Disallow: /p/

User-agent: Bingbot
Disallow: /p/

# A Chinese search engine
# https://www-sogou-com.translate.goog/docs/help/webmasters.htm?_x_tr_sch=http&_x_tr_sl=la&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp#07
User-agent: Sogou web spider
Disallow: /p/

# A Korean search engine
# https://naver.me/spd
User-agent: Yeti
Disallow: /p/

# Everything else is OK

User-agent: *
Allow: /
