Block some high-frequency spider bots (#2816)
This introduces rules to the robots.txt to block some spiders:

- Spiders that exist only for LLM training or for the enrichment of the
operating company, and that offer no value to genuine searchers, are
blocked from the whole site.

- Crawlers that hit the package docs under /p frequently, but without
obvious gains for genuine searchers, are blocked from those pages.

Based on our tests for ocaml/infrastructure#161, these bots behave as
worst-case users: they hit our most expensive pages frequently, in
patterns that bypass the cache (newly added by @mtelvers).
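
As a quick sanity check (not part of the commit itself), here is a minimal sketch using Python's standard-library urllib.robotparser against a trimmed-down copy of the new rules; the /p/dune URL is only an illustrative package page.

```python
# Minimal sketch: feed a trimmed-down copy of the new rules to Python's
# standard-library robots.txt parser and check which bots may fetch what.
from urllib import robotparser

RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: Amazonbot
Disallow: /p/

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# GPTBot is blocked from the whole site.
print(parser.can_fetch("GPTBot", "https://ocaml.org/"))            # False
# Amazonbot may still crawl the homepage, but not the package docs.
print(parser.can_fetch("Amazonbot", "https://ocaml.org/"))         # True
print(parser.can_fetch("Amazonbot", "https://ocaml.org/p/dune/"))  # False
# Crawlers without a dedicated rule fall through to the catch-all.
print(parser.can_fetch("Googlebot", "https://ocaml.org/p/dune/"))  # True
```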
shonfeder authored Nov 16, 2024
1 parent e0b1fb1 commit 74e753e
Showing 1 changed file with 37 additions and 1 deletion.
38 changes: 37 additions & 1 deletion asset/robots.txt
@@ -1,2 +1,38 @@
# "The Meta-ExternalAgent crawler crawls the web for use cases such as training
# AI models or improving products by indexing content directly"
# See https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
User-agent: meta-externalagent
Disallow: /

# https://platform.openai.com/docs/bots
User-agent: GPTBot
Disallow: /

# An SEO consultancy
# https://www.semrush.com/bot/
User-agent: SemrushBot
Disallow: /

# We don't want these robots crawling our expensive documentation pages, as they
# hit those pages with high frequency

User-agent: Amazonbot
Disallow: /p/

User-agent: Bingbot
Disallow: /p/

# A Chinese search engine
# https://www-sogou-com.translate.goog/docs/help/webmasters.htm?_x_tr_sch=http&_x_tr_sl=la&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp#07
User-agent: Sogou web spider
Disallow: /p/

# A Korean search engine
# https://naver.me/spd
User-agent: Yeti
Disallow: /p/

# Everything else is OK

User-agent: *
Allow: /
