hayley
@hayley
rngDolphins
@rngDolphins asked:

How does @starwars-characters work? Are you using Puppeteer.js or raw cURLs or something more sinister?

it's the latter

the starwars bot is written in python. beyond the cohost api integration i made with the requests library, it's a very half-baked custom wikia wikitext parser specifically designed to find the portrait image (easy) and the first paragraph of text on a character's page (extremely fucking hard for some reason)

im going to begin this explanation with a warning: do not try to give me suggestions for how to improve this. if you can think of a better way, so can i! keep it to your damn self i made this like three years ago

the data

the list of characters is scraped by walking a bunch of different character-like categories and going through every page in them to compile all of the page links to character-like things (via the api https://starwars.fandom.com/api.php?format=json&action=query&list=categorymembers&cmtitle={category}). sometimes the categories will have like, genres of character instead of individual characters in them, which is why my bot will sometimes post shit like "Arc Troopers are a blah blah blah" rather than a specific character. im too lazy to fix this and i think it's funny
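
for the curious, the scrape is roughly shaped like this. this is a sketch rather than the actual script, and the category names here are stand-ins:

import json
import requests

API = "https://starwars.fandom.com/api.php"
categories = ["Category:Individuals", "Category:Sith"]  # stand-ins for the real character-ish categories

characters = []
for category in categories:
    params = {
        "format": "json",
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
    }
    while True:
        res = requests.get(API, params=params).json()
        for member in res["query"]["categorymembers"]:
            title = member["title"]
            characters.append([title, "/wiki/" + title.replace(" ", "_")])
        # the api hands back a continuation token while there are more pages in the category
        if "continue" not in res:
            break
        params["cmcontinue"] = res["continue"]["cmcontinue"]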

that list gets randomized and shoved into a json file that is called characters_updated.json. it's called that because there was also a characters.json but im too lazy to update it. i just looked now and there is also a file called characters_updated_old.json in there too and im not sure why. good job hayley

because my json file is a "point in time" list of the characters, sometimes character pages get moved and dont exist anymore, or new characters get added. i have to manually rerun the scrape and splice new characters into the list occasionally to account for this (the splice dedupes against the existing list, so no one gets double posted).
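
the splice itself is nothing fancy. something in this spirit (a sketch, not the real code; "scraped" stands in for the fresh list from rerunning the scrape, and i'm assuming new names just get tacked onto the end so the checkpoint index stays valid):

import json
import random

scraped = []  # stand-in: the fresh [name, link] list from rerunning the scrape above

with open("characters_updated.json") as f:
    existing = json.load(f)

known_names = {name for name, _ in existing}

# keep only genuinely new characters, shuffle them, and append them to the end
new_entries = [entry for entry in scraped if entry[0] not in known_names]
random.shuffle(new_entries)
existing.extend(new_entries)

with open("characters_updated.json", "w") as f:
    json.dump(existing, f, indent=2)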

a nice bonus from this is i just have a list of Every Star Wars Character on hand at all times. i made a discord bot command that just randomly gives you a star wars character with this too. it's great

the file looks like this

[
  [
    "Jek Nkik",
    "/wiki/Jek_Nkik"
  ],
  [
    "Broll",
    "/wiki/Broll"
  ],
  [
    "Ardo Banch",
    "/wiki/Ardo_Banch"
  ],
  [
    "Semler Tevez",
    "/wiki/Semler_Tevez"
  ],
  [
    "Noop Yeldarb",
    "/wiki/Noop_Yeldarb"
  ],
...and so on

the bot

the bot itself runs on a cronjob once an hour at 50 minutes past the hour. i chose that number because you should never schedule your bots to run at :00 on the hour. everyone does that. get a little creative next time. every time it runs it reads from a checkpoint.txt file that tells it how far into the json list of characters it is. when it finishes posting the character it increments the number in the checkpoint.txt file to point at the next character's index. it skips any character name that begins with "Unidentified ..." because holy hell there are a lot of those. fun fact we are mere hours away from passing the 20,000 checkpoint
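
the checkpoint dance, sketched out (the cron line and file handling here are approximate, not copied from the real bot):

# cron side, more or less: 50 * * * * python3 /path/to/bot.py
import json

with open("characters_updated.json") as f:
    characters = json.load(f)

with open("checkpoint.txt") as f:
    index = int(f.read().strip())

# walk forward past the "Unidentified ..." entries, there are a truly absurd number of them
while characters[index][0].startswith("Unidentified"):
    index += 1

name, link = characters[index]
# ... fetch the bio, grab the portrait, make the post ...

# point the checkpoint at the next character for the next run
with open("checkpoint.txt", "w") as f:
    f.write(str(index + 1))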

on each run, it pulls the character's name from the json file, then uses the api to retrieve the wikitext for the biography (and beautifulsoup to clean out the <ref> tags).

from requests import get
import json
import re
from bs4 import BeautifulSoup

def get_bio_from_api(name):
    # pull the raw wikitext for the page straight from the fandom api
    content = get(f"https://starwars.fandom.com/api.php?action=parse&format=json&prop=wikitext&page={name}").content
    res = json.loads(content)

    if res.get("error") is not None:
        raise Exception(f"{res['error']}")

    text = res["parse"]["wikitext"]["*"]

    # redirect pages just point at the real page, so follow them and try again
    if "#REDIRECT" in text:
        match = re.match(r"#REDIRECT ?\[\[([\w\s/]+)\]\]", text)
        if not match:
            raise Exception(f"Could not figure out the redirect for {text}")
        return get_bio_from_api(match.group(1))

    # strip out <ref>...</ref> citation tags; that's all beautifulsoup is doing here
    soup = BeautifulSoup(text, features="html.parser")
    for ref in soup.find_all("ref"):
        ref.extract()
    text = soup.text
    return name, text

then i have a parser that strips out all of the [Wiki Text|Formatting Stuff] from the text. im not gonna post the whole thing because it's ugly so just look at this single line and you can imagine the rest:

total = total.replace("'''", "").replace("''", "").strip().split("\n\n")[0].replace("\n", " ").split("==")[0]

then i tokenize the text into a list of sentences with nltk, and keep the longest run of leading sentences that comes in at or under 280 characters total. i technically dont need to do this anymore but having the character limit makes the comedy even better in my opinion. i might change this in the future
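
the trimming is about this much code (a sketch of the idea, not the bot's exact function):

from nltk.tokenize import sent_tokenize

def trim_to_280(text):
    sentences = sent_tokenize(text)
    kept = []
    total = 0
    for sentence in sentences:
        # +1 for the space that will join this sentence onto the previous ones
        extra = len(sentence) if not kept else len(sentence) + 1
        if total + extra > 280:
            break
        kept.append(sentence)
        total += extra
    return " ".join(kept)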

finally, my parsing is so consistently bad (it's better now that i use the api tho) that i have a sanity check where it looks for the name of the character in the first 30 characters of the biography. if it doesn't find it, it assumes it found the wrong paragraph and sends a discord message to me (with disnake) instead of sending the post. i then look at the text and manually tell it whether to post that text or skip it. it doesn't work very well for "This individual ..." style posts and im too lazy to fix it so i just reply to its error dms every once in a while and tell it it's doing a good job. here's what that looks like. you'll notice it just sends me the raw error and a bunch of logs instead of anything useful. once again that is because i am lazy
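
the check itself is a one-liner, the rest is plumbing. roughly (a sketch; the actual posting and disnake dm code isn't shown here):

def looks_sane(name, bio):
    # if the character's name isn't somewhere in the first 30 characters of the
    # bio, the parser probably grabbed the wrong paragraph
    return name in bio[:30]

print(looks_sane("Jek Nkik", "Jek Nkik was a Jawa engineer..."))  # True, safe to post
print(looks_sane("Jek Nkik", "This individual was a Jawa..."))    # False, so it dms me instead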

oh also at some point during all that i look for a .pi-image-thumbnail class in the wiki page and download the attached <img> to use for the post image if it exists
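
that part is roughly this (a sketch; the selector and url handling are approximate, not lifted from the real bot):

import requests
from bs4 import BeautifulSoup

def get_portrait(link):
    # fetch the rendered wiki page and look for the infobox thumbnail
    html = requests.get("https://starwars.fandom.com" + link).content
    soup = BeautifulSoup(html, features="html.parser")
    img = soup.select_one("img.pi-image-thumbnail")
    if img is None:
        return None
    # download the image bytes to attach to the post
    return requests.get(img["src"]).content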

PHEW i think that's it. wow i typed a lot. i think i built this badly and 2020 hayley was an incompetent fool. i have no plans on changing it because i am tired enough of programming for my day job that i dont like it bleeding into my spare time anymore

