Chameleon (sometimes penguin or dog) furry artist and shitposter. Like airplanes, chemicals, theme parks, music, Yu-Gi-Oh, baking, and baths but isn't good at any of those. Super busy at work all the time.

More active on FA or Fediverse

Posts may contain strong language or mature references. VIOWER EXCRETION ADVISD.

Icon by https://www.furaffinity.net/user/klaora
Header image by https://www.furaffinity.net/user/charmersshelter/


0xabad1dea
@0xabad1dea

This is a graph of Discord’s algorithmically inferred gender (extracted from “request your data” json; axes are probability and days) for a user whose display name is “Tiffany”, whose bio is “she/her”, whose pfp is a drawing of a girl and whose profile theme color is pink.

Algorithmically inferred gender is worse than useless. Presumably the issue is that she talks about programming, and all the deliberate “I am explicitly telling you I am a girl” signaling in the world can’t convince a computer. I sometimes watch a livecoding streamer whose YouTube stats claim his audience is 99% male even though you can see fem-coded chat participants regularly. Algorithms like this are deleting the women.


fluffy
@fluffy

Here's my Discord gender graph. Fuck you, Discord.

Also, it's not super easy to extract this data, but here's what I did:

  1. Did a full data export (including messages)
  2. Waited a few days for it to arrive
  3. Noticed that my activity/analytics/events-2024-00000-of-00001.json file was about 3GB which is really difficult for most JSON tools to process
  4. Ran the following command to filter out just the rows with the gender information:
    jq 'select(.prob_male)' activity/analytics/events-2024-00000-of-00001.json > gender.json
    
  5. Found out that jq doesn't produce parseable JSON after all this, so I opened the file in a text editor and replaced } with }, and then added a [ and ] to the beginning and end, respectively
  6. Ran this bit of Python to get a CSV file:
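    import csv
    import json

    # Load the array built in step 5 and write one CSV row per event
    with open('gender.json') as infile:
        data = json.load(infile)
    with open('gender.csv', 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        for item in data:
            # keep only the first 10 characters of day_pt; writerow takes a single list
            writer.writerow([item['day_pt'][:10], item['prob_male'], item['prob_female'], item['prob_non_binary_gender_expansive']])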
    

and then I had something I could import into any given spreadsheet (in this case I used Apple Numbers).
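
For anyone who would rather skip the text-editor step, here is a rough sketch that reads the export one line at a time and writes the CSV directly, so the whole 3GB never has to sit in memory. It assumes the file holds one JSON object per line, which is a guess about the export layout rather than anything documented, and `events_to_csv` and `FIELDS` are names made up for this example.

```python
import csv
import json

# Probability columns, in the same order as the spreadsheet columns above.
FIELDS = ("prob_male", "prob_female", "prob_non_binary_gender_expansive")


def events_to_csv(lines, out):
    """Write one CSV row per event that has a prob_male value.

    `lines` is any iterable of strings holding one JSON object each
    (assumed layout); `out` is a writable text file.
    Returns the number of rows written.
    """
    writer = csv.writer(out)
    rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        # Roughly mirrors jq's select(.prob_male): skip events without it.
        if event.get("prob_male") is None:
            continue
        writer.writerow([event["day_pt"][:10]] + [event.get(f) for f in FIELDS])
        rows += 1
    return rows


# Possible usage (an open file is already an iterable of lines):
# with open("activity/analytics/events-2024-00000-of-00001.json") as src, \
#         open("gender.csv", "w", newline="") as dst:
#     events_to_csv(src, dst)
```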

Anyway, as always, "the algorithm" is a form of bias laundering. Who knows why Discord decided I'm probably male! Fuck you, Discord!


fluffy
@fluffy

Update: The entire procedure above can be condensed into a single jq command:

jq -r 'select(.prob_male) | [ .day_pt[0:10], .prob_male, .prob_female, .prob_non_binary_gender_expansive ] | @csv' activity/analytics/events-2024-00000-of-00001.json > gender.csv

Thanks @artemis!


Miff
@Miff

AMAD (assigned male at discord)



in reply to @0xabad1dea's post:

it may not be present in your discord data export to begin with (whether because you revoked consent to use personal data at some point or another, more inscrutable reason) but if you have it, you can find it by searching the json for "predicted_"
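
If the file is too big for an editor's search box, a plain substring scan is one way to do that search. This is only a sketch: it assumes the export has one JSON object per line, and `lines_mentioning` is a made-up helper name.

```python
def lines_mentioning(lines, needle="predicted_"):
    """Yield (line_number, line) for every line that contains `needle`."""
    for number, line in enumerate(lines, start=1):
        if needle in line:
            yield number, line.rstrip("\n")


# Possible usage (the path is an example; yours may differ):
# with open("activity/analytics/events-2024-00000-of-00001.json") as f:
#     for number, line in lines_mentioning(f):
#         print(number, line[:200])
```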

Platform: Most of our users are currently men

Algorithm: That means for any given user the odds of them being a man are pretty high! My job is so easy!

Algorithm: 99% of users are >50% likely male.

Platform: All of our users are men and we should only appeal and advertise to them. Thanks algorithm!
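
The joke above, as a toy sketch: an "algorithm" that always guesses the majority label can look accurate while labeling every single user the same way. The user list is invented for illustration, and nothing here is meant to describe how any real platform's model works.

```python
def majority_guess_stats(actual_labels, majority="male"):
    """Guess `majority` for everyone; return (accuracy, share guessed as majority)."""
    guesses = [majority] * len(actual_labels)
    correct = sum(g == a for g, a in zip(guesses, actual_labels))
    accuracy = correct / len(actual_labels)
    predicted_share = guesses.count(majority) / len(guesses)
    return accuracy, predicted_share


# Made-up population: 6 men, 3 women, 1 nonbinary user.
users = ["male"] * 6 + ["female"] * 3 + ["nonbinary"]
accuracy, predicted_share = majority_guess_stats(users)
# The guesser is right for the 6 men only, yet reports everyone as male.
```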

ive always always always wondered about this esp when youtubers are like "my audience is predominantly [such and such]" and im like "HOW DO YOU KNOW. YOUTUBE TOLD YOU THAT BUT HOW DO THEY KNOW"

this is honestly fascinating not only because of the obvious bizarre-ness of it but the fact that this is the best argument I have ever seen for gender being a social construct. Bravo!

I finally got my data dump from Discord but the analytics file is about 2.5GB of JSON and I'm having difficulty finding any tools that can actually parse it meaningfully. Do you have any suggestions about how to extract the predicted_gender and associated date information into something that can actually be handled reasonably easily?

jq is capable of streaming the output successfully but I have no idea what the schema is or what query path I should use to pre-digest the data.
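
One way to get a feel for an unknown schema is to tally the top-level keys across the first chunk of events. A minimal sketch, assuming the file is one JSON object per line (unconfirmed); `key_counts` and its `limit` parameter are hypothetical names for this example.

```python
import json
from collections import Counter
from itertools import islice


def key_counts(lines, limit=10000):
    """Count how often each top-level key appears in the first `limit` lines."""
    counts = Counter()
    for line in islice(lines, limit):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if isinstance(event, dict):
            counts.update(event.keys())
    return counts


# Possible usage:
# with open("activity/analytics/events-2024-00000-of-00001.json") as f:
#     for key, count in key_counts(f).most_common():
#         print(count, key)
```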

in reply to @fluffy's post:

RE jq, in case it's helpful in your future:

jq -s can do that conversion to a real array for you

My guess is the input from discord was not a json array, but a sequence of json objects separated by newlines. That's pretty common for event streams actually. In this case jq processes each of these independently, running the filter on each object as the input, producing a corresponding object as an output, and separating the output with newlines just the way the input was.

but you wanted a json array as output.

jq will do that conversion for you with -s. this makes it read all the input objects, slam them into an array together, and then run it through your filter, so the output will also be an array. in other words, it does the equivalent of what you did with adding commas and a [ and ]

but because it does it to the input not the output, you either need to change your filter slightly to account for taking an array now, or you can just use 2 jq commands (which is what i usually do):

# the second `jq -s` is converting newline-separated not-array to
# a real comma-separated array
jq 'select(.prob_male)' activity/analytics/events-2024-00000-of-00001.json | jq -s . > gender.json

This is generally useful for me to have in my pocket whenever i have a newline separation and i need an array. And it can go the other way around too, which I use sometimes when i have an array, and i need to shove it into something that wants it newline-separated:

# converts comma-separated array to newline-separated not-array
jq '.[]'
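
If it helps to see the two directions outside of jq, here is roughly the same idea in Python. It's a simplified sketch: it assumes exactly one JSON value per line (jq itself is more lenient about whitespace between values), and `slurp` and `unslurp` are names made up for this example.

```python
import json


def slurp(text):
    """Newline-separated JSON values -> one Python list (like `jq -s`)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unslurp(items):
    """A list -> newline-separated JSON values (like `jq -c '.[]'`)."""
    return "\n".join(json.dumps(item) for item in items)
```

Round-tripping with `slurp(unslurp(items))` should hand back the original list for plain JSON data.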

Ah good, that's super helpful to know! I suspect you're right and Discord's JSON output was just an unstructured event stream, and jq is confusing and overwhelming. Thanks for the -s tip.

Oh, also, I tried using jq's string interpolation function to generate the CSV directly but I couldn't figure it out. Any idea how to use that so I could have elided the Python step?

yeah jq is definitely confusing and overwhelming. ive gotten deep into it because of work stuff but it was (and still is) quite odd to me.

jq actually has a built-in @csv filter you can use to generate CSV.

it was a little confusing to me to learn how to use. specifically, it wants one array, and it wants that array to have the values of a single row of a CSV. So in this case it makes sense not to use -s, because we can turn each event into a CSV row. but you do want to use -r, probably, which makes it print the raw CSV output from this. Otherwise it will be quoted/string-escaped to be a valid JSON-string instead, which isn't what you want. So to generate your CSV, this is what you would do, probably:

jq -r 'select(.prob_male) | [ .item[:10], .prob_male, .prob_female, .prob_non_binary_gender_expansive ] | @csv' activity/analytics/events-2024-00000-of-00001.json > gender.csv

I can't test this for sure because i don't have any data myself.
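
To illustrate what -r changes, here is the same distinction sketched in Python: a CSV row is just a string, and without -r that string gets printed as a JSON string, extra quotes and backslashes included. The row values are made up, and `csv_row` is a hypothetical helper; `QUOTE_NONNUMERIC` is used to imitate the way @csv quotes strings but leaves numbers bare.

```python
import csv
import io
import json


def csv_row(values):
    """Format one list of values as a single CSV line (no trailing newline)."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerow(values)
    return buf.getvalue().rstrip("\r\n")


row = csv_row(["2024-01-02", 0.9, 0.05, 0.05])  # made-up sample values
raw = row                  # comparable to jq -r output: "2024-01-02",0.9,0.05,0.05
quoted = json.dumps(row)   # comparable to output without -r: "\"2024-01-02\",0.9,0.05,0.05"
```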

That was close, the date field is day_pt though, not item (the association there was obviously lost in the jq-python example transition which is totally understandable). So this actually did the trick:

jq -r 'select(.prob_male) | [ .day_pt[0:10], .prob_male, .prob_female, .prob_non_binary_gender_expansive ] | @csv' activity/analytics/events-2024-00000-of-00001.json > gender.csv

Thanks!