

vogon
@vogon

great research with some cool methodology that confirms what everyone should've already suspected: if you ask someone to generate text on amazon mechanical turk, there's an almost 50% chance that they're just going to ask a large language model to do it

haven't really read much of the machine learning literature so I don't know how widespread this is any more, but also I suspect we're going to get a lot of follow-ups where people discover that their research using mturk as a proxy for human performance can't be replicated if you actually pay people to do the task


ireneista
@ireneista

we've been deeply concerned about the way in which generative language models contaminate a lot of the data sets, such as wikipedia and web scrapes more broadly, that everyone uses as ground truth for all manner of research, which is going to make research a lot harder in general

this paper identifies an aspect of that problem we hadn't even thought of, namely that even studies which think they're paying people to do things are often getting ML output instead


garak
@garak

Pre-2022 text datasets are now the information equivalent of low-background steel.


hellojed
@hellojed

predicting that libraries and physical books are going to become very relevant again



in reply to @vogon's post:

I haven't worked much with mturk, but I believe they let you either use a standard form input for the HIT result or provide your own little embedded web app, and I suspect they did the latter for this
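(for anyone curious what the "embedded web app" route looks like: a rough sketch using the boto3 MTurk client, where the task is served from your own page via an ExternalQuestion; the URL, title, and reward here are made-up placeholders, not from the paper)

```python
# Sketch: posting a HIT whose task UI is an external web page
# (MTurk's "ExternalQuestion"), rather than a built-in form.
# The URL, title, and reward below are hypothetical placeholders.

EXTERNAL_QUESTION = """\
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/my-task</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

def build_hit_request(question_xml: str) -> dict:
    """Assemble keyword arguments for the MTurk create_hit call."""
    return {
        "Title": "Summarize a short abstract",    # hypothetical task
        "Description": "Write a one-paragraph summary in your own words.",
        "Reward": "0.50",                         # USD, passed as a string
        "MaxAssignments": 1,
        "AssignmentDurationInSeconds": 600,
        "LifetimeInSeconds": 86400,
        "Question": question_xml,                 # the embedded-page question
    }

# With AWS credentials configured, this would be submitted via:
#   import boto3
#   client = boto3.client("mturk", region_name="us-east-1")
#   client.create_hit(**build_hit_request(EXTERNAL_QUESTION))
```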

in reply to @garak's post:

we backed up a lot of the stuff we had space for. then we decided we need something like a hundred tebibytes more to back up the stuff we actually want to keep, so that's been blocked on various life constraints, because this stuff is rather expensive. :/