sirocyl

eramdam
@eramdam

This is very good advice, BUT (and this isn't a dig against OP, just a heads-up) a lot of AI companies will either straight up ignore robots.txt or fake the user-agent of their crawler to bypass any blocking you do on the server side.

This isn't the only source but is one I could easily find about this specific issue:

I wish there was a silver bullet for that stuff but alas.
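
To make the point above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the sample robots.txt, and the URL are all made up for illustration. It only shows what a well-behaved crawler would conclude; nothing forces a crawler to run this check at all, and one that sends a browser-like user-agent falls through to the permissive rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

def allowed(user_agent, url="https://example.org/posts/1"):
    """Return what robots.txt says about this user-agent fetching url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

# A crawler that identifies itself honestly is told to stay out...
print(allowed("ExampleAIBot"))
# ...but the same crawler claiming to be a browser is not matched by the rule.
print(allowed("Mozilla/5.0 (X11; Linux x86_64)"))
```

The same weakness applies to server-side rules keyed on the user-agent header, since the client chooses what to send.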


catball
@catball

here's a repo that tracks a bunch of separate blocklists, some of which include bots and scrapers. https://github.com/firehol/blocklist-ipsets#list-of-ipsets-included
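
a rough sketch of how a list like that could be checked in your own code, assuming the file is plain text with one IP or CIDR range per line and `#` comments (that format is an assumption, check the list you actually download). the sample entries below are documentation-only ranges, not real blocklist data, and `load_networks` / `is_blocked` are hypothetical helper names:

```python
import ipaddress

# Hypothetical excerpt; these are reserved documentation ranges.
SAMPLE_BLOCKLIST = """\
# example blocklist excerpt
192.0.2.0/24
198.51.100.17
"""

def load_networks(text):
    """Parse one IP or CIDR per line, skipping blanks and # comments."""
    networks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            # A bare IP becomes a single-address network (/32 for IPv4).
            networks.append(ipaddress.ip_network(line, strict=False))
    return networks

def is_blocked(ip, networks):
    """True if ip falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

networks = load_networks(SAMPLE_BLOCKLIST)
print(is_blocked("192.0.2.55", networks))
print(is_blocked("198.51.100.18", networks))
```

a linear scan like this is fine for small lists; for big ones you'd more likely load them into the firewall instead of checking per request.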

likewise, rate limiting and IP banning can be configured either with your favorite http server (usually via a plugin), or with a tool like fail2ban, which watches your http server's logs and can ban clients that request too much too fast, for variable amounts of time
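
the "too much too fast, then banned for a while" idea can be sketched in a few lines. this is not how fail2ban or any particular server plugin is implemented, just an in-memory illustration; the class name `RateLimiter` and the thresholds are made up:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter that temporarily bans noisy clients."""

    def __init__(self, max_requests=5, window=10.0, ban_seconds=60.0):
        self.max_requests = max_requests  # allowed hits per window
        self.window = window              # window length in seconds
        self.ban_seconds = ban_seconds    # how long a ban lasts
        self.hits = defaultdict(deque)    # ip -> recent request times
        self.banned_until = {}            # ip -> time the ban expires

    def allow(self, ip, now=None):
        """Record a request from ip and return whether to serve it."""
        now = time.monotonic() if now is None else now
        if ip in self.banned_until and now < self.banned_until[ip]:
            return False
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            # Over the limit: start a ban and reset the history.
            self.banned_until[ip] = now + self.ban_seconds
            recent.clear()
            return False
        return True

limiter = RateLimiter(max_requests=3, window=10.0, ban_seconds=60.0)
for t in (0.0, 1.0, 2.0, 3.0):
    print(t, limiter.allow("203.0.113.7", now=t))
```

passing `now` explicitly is only there to make the behaviour easy to try out; in real use you'd let it default to the clock.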

(of course, all this only really applies if you're like self hosting or using a cloud vps, and requires a lot more technical effort)

