i just spent 3 days annotating 150/600 1800s newspapers to train an NER model over. i figured since ml tooling has come a long fkin way since i first learned it in university mANY years ago now (eek) that getting a baseline model trained in something like spaCy would be real fkin easy right?
fUCK NO.
so im still in the weeds rn as i type this so this is gonna be incoherent as hell but the long and short of it is this: we annotate in brat. me and my supervisors have tried MULTIPLE annotation platforms at this point and there's always somETHING that means one person in my supervision team or me cant get on with it. the only one that has felt the least like pulling teeth is brat.
but brat is weird.
and brat is weird because it just lets you highlight anything as an entity??? no token bounds at all??? despite it having tokenisation rules under the hood, you can still just start an entity mid-token and end it in the middle of another.
this pisses off spaCy.
a lot.
because brat and spaCy have different ideas of what a token is.
this is without mentioning btw that i just spent FOURTEEN HOURS debugging why i couldnt convert my brat standoff into CoNLL or map it to spaCy Doc representations, and GUESS WHAT: turns out, in front of every line in all 600 files there was a single space.
idk how she got there.
but brat ignored it and wrote all its annotation offsets as tho it wasnt there, but spaCy and brat's own anntoconll.py sure as hell saw the spaces.
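(for anyone who hits the same thing: the eventual fix is mercifully dumb. strip one space off the front of every line and the .ann offsets line up again, since brat wrote them as if the space didnt exist. a rough sketch, assuming exactly one stray space per line and utf-8 files, function name is mine:)

```python
from pathlib import Path

def fix_leading_spaces(txt_path: str) -> None:
    """Drop the single stray space at the start of each line.

    brat computed its .ann offsets as if the space wasn't there,
    so once it's gone the standoff offsets line up again.
    (sketch only: assumes exactly one space per line, utf-8 files)
    """
    path = Path(txt_path)
    lines = path.read_text(encoding="utf-8").split("\n")
    fixed = "\n".join(ln[1:] if ln.startswith(" ") else ln for ln in lines)
    path.write_text(fixed, encoding="utf-8")
```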
BUT ANYWAY SO
i now have a pile of 150 documents, all tagged nice, should be trivial right? i load the txt files into spaCy with the usual nlp(...), parse the associated .ann file, grab the label and offsets, and then its just doc.char_span(start, end, label=label) right?
right?
FUCK NO
because spaCy tokenised the file differently to how brat did (or rather, didnt do) it
so some of these Spans are now mid-token to spaCy, which it fucking HATES.
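(for context, the .ann half of that loop is the easy bit. brat standoff puts each textbound annotation on a tab-separated line like `T1<TAB>PERSON 0 13<TAB>Mr. Wm. Smith`, so pulling out (start, end, label) triples looks roughly like this — a sketch, function name is mine, and it punts on discontinuous spans:)

```python
from pathlib import Path

def read_ann(ann_path: str) -> list[tuple[int, int, str]]:
    """Pull (start, end, label) triples out of a brat .ann file.

    Only handles textbound annotations (ids starting with 'T');
    skips discontinuous spans (offsets joined with ';') for brevity.
    """
    ents = []
    for line in Path(ann_path).read_text(encoding="utf-8").splitlines():
        if not line.startswith("T"):
            continue  # relations, events, notes: not needed here
        _tid, meta, _surface = line.split("\t", 2)
        label, offsets = meta.split(" ", 1)
        if ";" in offsets:
            continue  # discontinuous annotation, punt on it
        start, end = offsets.split()
        ents.append((int(start), int(end), label))
    return ents
```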
but like anyone else who does this might now be thinking
"but for hWHAT reason are you annotating anything that could become mid-token???"
names.
british names in the 1800s.
because these motherfuckers would just do anything to their fucking names.
which part of the string "Mr. Wm. Smith, Esq." is the name? ALL OF IT. INCLUDING ALL THAT PUNCTUATION.
"Dr. J. E. Brown, M.P."? all one name baybee. all that punctuation throwing off the tokeniser.
so i basically have the fruits of a painstakingly long annotation process that is still actually only 25% complete in one format, that i cannot correctly bring into a different format that i can train an NER model with, because the two tools are incompatible with each other due to differing views on what is and isnt a token, which can be broken by the BRITISH.
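(post-hoc note for anyone googling their way into this exact hole: newer spaCy versions let you pass alignment_mode="expand" to Doc.char_span, which snaps misaligned offsets outward to token boundaries instead of returning None. here's a toy model of the mismatch, with a fake whitespace tokenizer standing in for spaCy's real one; all the names here are mine, not spaCy's internals:)

```python
def tokenize(text):
    """Fake whitespace tokenizer standing in for spaCy's real one."""
    spans, cursor = [], 0
    for tok in text.split():
        start = text.index(tok, cursor)
        spans.append((tok, start, start + len(tok)))
        cursor = start + len(tok)
    return spans

def char_span(tokens, start, end, alignment_mode="strict"):
    """Toy version of Doc.char_span: 'strict' returns None for
    mid-token offsets, 'expand' snaps out to whole tokens."""
    covered = [t for t in tokens if t[2] > start and t[1] < end]
    if not covered:
        return None
    if alignment_mode == "strict" and (
        covered[0][1] != start or covered[-1][2] != end
    ):
        return None  # the span cuts through a token: refuse it
    return [t[0] for t in covered]

toks = tokenize("Mr. Wm. Smith, Esq. attended.")
# the annotation "Mr. Wm. Smith" (chars 0-13) ends inside the
# token "Smith,": strict fails, expand swallows the comma
char_span(toks, 0, 13)            # None
char_span(toks, 0, 13, "expand")  # ['Mr.', 'Wm.', 'Smith,']
```

the real call is doc.char_span(start, end, label=label, alignment_mode="expand"); there's also "contract" if youd rather shrink inward to whole tokens than swallow the stray punctuation.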
but hey at least i could pay Explosion AI, makers of spaCy, $390 to use Prodigy right? the annotation platform that Just Works™ with spaCy? that'd solve everything wouldn't it.
