posts from @actuallyalys tagged #python

actuallyalys
@actuallyalys

if you're not aware, a robots.txt file is something bots are supposed to check before crawling your site. for example, you can provide a rule for Googlebot, and then Google isn't supposed to index your website. this is especially important now that ai companies are vacuuming up as much data as possible.
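
for example, a minimal robots.txt that asks OpenAI's crawler and Google's AI-training token to stay away looks something like this (just a sketch with two real crawler tokens, not a complete list):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /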

one of the best lists i've found is Block the Bots that Feed “AI” Models by Scraping Your Website

caveats

you may have already seen the limitation of this approach. a robots.txt is just a machine-parseable list of rules, and just like a list of rules posted on the door of an establishment, it isn't enforceable by itself.

ai companies specifically are particularly problematic:

  1. ai companies have a poor track record of respecting authors' and artists' consent.
  2. ai companies have a poor track record about being transparent, including about their training data.
  3. ai companies already have access to large datasets like common crawl.
  4. tech companies in general have no qualms about selling data.

alternatives

the main alternative approach is to detect the user agent and deny it access. the user agent is a string a program sends to identify itself to the websites it visits. for example, as you're reading this, your browser provided a user agent to Cohost.org so the website knows what program you're using to access it. this still relies on bots being honest, but getting around it requires them to actively lie. the link i posted has a lot more details on how to actually do this.
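
just to illustrate the idea (the bot names and the choice of doing this in python rather than in your web server's config are my own made-up example, not something from the linked post), here's roughly what the check looks like as a wsgi middleware:

# hypothetical sketch: return 403 Forbidden when the user agent matches a known AI bot
BLOCKED_AGENTS = ("GPTBot", "CCBot")


def block_ai_bots(app):
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot.lower() in user_agent.lower() for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)

    return middleware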

cloudflare, and probably other similar services, has a feature that will block bots. i think what cloudflare does is use a bunch of different signals to guess whether you're a bot, and if it thinks you probably are one, it makes you pass a captcha. i'm not a huge fan of cloudflare, so i don't really recommend this approach.


actuallyalys
@actuallyalys

made a version of my site with nonsense content to redirect AI bots to, inspired by JEFFPARDY. i vaguely remember reading about someone proposing or doing something similar, but i can't remember the source. (returning a 403 as proposed in Block the Bots that Feed “AI” Models by Scraping Your Website is probably the "right"/RFC-compliant solution.)

now i just need to figure out how to redirect nginx to this version if someone accesses with the appropriate user agent.
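
something like this might work (untested sketch: the paths and the user agent list are placeholders, and the map block goes in nginx's http context), swapping the document root when the user agent matches:

map $http_user_agent $site_root {
    default      /var/www/site;
    "~*GPTBot"   /var/www/jeffsite;
    "~*CCBot"    /var/www/jeffsite;
}

server {
    listen 80;
    server_name example.com;
    root $site_root;
}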

the core is this function that uses Beautiful Soup to replace all the words with JEFF:


import re
import typing

from bs4 import BeautifulSoup, NavigableString


def jeffpardize(html: typing.TextIO) -> str:
    """Replace every word in the page's body with "JEFF"."""
    soup = BeautifulSoup(html, "lxml")
    # find_all with a text pattern (and no tag name) returns the matching text nodes
    for string in soup.body.find_all(text=re.compile(r"\w+")):
        new_string = re.sub(r"\w+", "JEFF", string.string)
        string.replace_with(NavigableString(new_string))

    return str(soup)
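
and roughly how you'd use it (the file names are placeholders):

with open("index.html", encoding="utf-8") as f:
    jeffed = jeffpardize(f)

with open("jeff/index.html", "w", encoding="utf-8") as f:
    f.write(jeffed)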

i generally like Beautiful Soup, but i haven't used it to modify HTML before, and honestly that part of the API doesn't feel particularly good to use, imo.

i'm tempted to release this as a nicely packaged utility, but i already have to do that for grommetik.