jaycat
@jaycat

EDIT: Thank you everyone who commented! I was hoping I was doing something wrong, but it looks like it's an issue with how the Library of Congress set this up. I had been hoping to get a txt file of the text of the book that I could then convert into the format I need for my project, but it looks like I'll be going the slightly more tedious route of using the PDFs. No further help needed!

hi! I'm sure there's someone on here who can help me figure this out. I'm trying to access the text file for a book in the public domain through the Library of Congress - this is the page - and when I click on the option to download as a text file I get this XML file that I can't actually read, and it says at the top "This XML file does not appear to have any style information associated with it. The document tree is shown below". Despite being on this website, I know very little about code so this is a brick wall for me. Any suggestions?


bethposting
@bethposting

it looks like the actual content is being stored as .DJVU files, which are kinda like PDFs but meant specifically for scanned documents



in reply to @jaycat's post:

should be noted that djvu is not the same as djvu xml; the latter appears to be some obscure format used by some of the djvu tools and i couldn't find anything that parses it out of the box except for djvuxmlparser which can only do one thing (insert the data from the file into a djvu file).

yeah, my first thought was to try opening it with djvu readers. haven't encountered it as a file type in ages, but it seems to mostly just be OCR metadata

idk what exactly you're looking for that the PDF isn't an option, i checked briefly and there doesn't seem to be much that will read those specific XML files out of the box. it seems to be a slightly obscure format for representing data that is normally part of a djvu file.

something that is very straightforward however, is taking the PDF file, and running pdftotext on it. i did that and the output seems usable, although far from perfect (i suspect they did optical character recognition and so the table of contents looks like a mess when converted to plain text, for example).
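for anyone who wants to script the pdftotext step, here's a minimal sketch in Python. it assumes the `pdftotext` command-line tool (from poppler-utils or Xpdf) is installed and on your PATH. the function names and the `book.pdf` file name are made up for illustration; only the `-enc` and `-layout` flags come from pdftotext itself.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical page layout; without it,
        # pdftotext emits text in reading order, which usually suits prose.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, txt_path=None, keep_layout=False):
    """Run pdftotext on a PDF and return the path of the text file it wrote."""
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils or Xpdf")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(
        build_pdftotext_command(pdf_path, txt_path, keep_layout), check=True
    )
    return txt_path


# Example (assumes a file named book.pdf in the current directory):
# pdf_to_text("book.pdf")  # writes book.txt next to the PDF
```

the output is only as good as the OCR layer inside the PDF, so things like the table of contents will still need cleaning up by hand.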

i can send the result to you somehow if you like? but it would help to know what you're planning to do with this

that's a great workaround, thank you! I had hoped to be able to get the text of the book similar to the .txt files available on Project Gutenberg, but pdftotext will get me what I need (with manual adjustment I was going to need to do anyways). I'm going to be doing some audiobook readalongs with the text visible, and I am planning on using the Library of Congress as a source since their licensing is very direct and clear.

the djvu xml file honestly looks like the stuff that's probably in the PDF as well (text and bounding boxes that indicate where it is on the page), they probably were generated by the same OCR software

DJVU is a PDF-like file format, yes, but this file doesn't appear to actually be in that format; it looks like a DJVU file was converted to XML as the "text" format, at the same time it was converted to a PDF (and the PDF download on that page works fine, even though the download takes forever). you can see the original DJVU file referenced in the XML source: file://localhost//var/tmp/autoclean/derive/blackbeautyautob00sewe_0//blackbeautyautob00sewe_0.djvu

the thing is, it is not clear at all why this was converted to XML like this. there's huge sections that contain literally no information, and the sections that do have the text of the document contain zero formatting information, because XML only contains the unformatted content, so when you convert the XML to text it all just runs together:

196 CHAPTER XXIX COCKNEYS T HEN there is the steam-engine style of driving; these drivers were mostly peo- ple from towns, who never had a horse of their own, and generally traveled by rail. They always seemed to think that a horse was something like a steam engine, only smaller. At any rate, they think that if only they pay for it, a horse is bound to go just as far, and just as fast, and with just as heavy a load as they please. And be the roads heavy and muddy, or dry and good ; be they stony or smooth, uphill or downhill, it is all the same — on, on, or, one must go at the same pace, with no relief, and no consideration. These people never think of getting out to walk up a steep hill. Oh, no, they have paid to ride, and ride they will! The horse? Oh, he’s used to it! What were horses made for if not to drag people 197 BLACK BEAUTY uphill? Walk! A good joke, indeed!

what this needs is a separate file that provides formatting instructions, and that doesn't appear to be available? normally there would be an XSL file available to format the raw text, but for some reason that isn't here, so there's no easy way to "recover" that formatting information, even if you plug the XML into an ebook reader or something; it would have to make a best guess and even Calibre, the gold standard for this, has no idea what to do

well, there is formatting information in the sense that the XML file contains bounding boxes for all the words (i.e. where they are on the page) but yeah, there's nothing that automatically makes it into nicely flowing text in there. that would basically need to be done by hand.

though the text is probably going to be a reasonable start, so it shouldn't be a huge slog, i suppose
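to show what "a reasonable start" could look like, here's a rough sketch that pulls the words out of a DjVu XML file and regroups them into paragraphs. it assumes the file nests `PARAGRAPH` > `LINE` > `WORD` elements, which is how these OCR sidecar files are commonly laid out, but i haven't checked it against this specific download. the sample document, its coordinates, and the function names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample in the assumed DjVu XML shape; coords are invented.
SAMPLE = """<DjVuXML>
<BODY>
<OBJECT data="file://localhost/example.djvu" type="image/x.djvu">
<HIDDENTEXT><PAGECOLUMN><REGION>
<PARAGRAPH>
<LINE>
<WORD coords="100,200,300,150">CHAPTER</WORD>
<WORD coords="320,200,420,150">XXIX</WORD>
</LINE>
</PARAGRAPH>
<PARAGRAPH>
<LINE>
<WORD coords="100,300,200,250">These</WORD>
<WORD coords="220,300,300,250">peo-</WORD>
</LINE>
<LINE>
<WORD coords="100,360,160,310">ple</WORD>
<WORD coords="180,360,260,310">never</WORD>
<WORD coords="280,360,360,310">walk.</WORD>
</LINE>
</PARAGRAPH>
</REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT>
</BODY>
</DjVuXML>"""


def join_lines(lines):
    """Join the lines of one paragraph, gluing words split by a line-end hyphen."""
    text = ""
    for line in lines:
        if not text:
            text = line
        elif text.endswith("-"):
            # Naive: this also merges genuine compounds broken at a line end.
            text = text[:-1] + line
        else:
            text = text + " " + line
    return text


def djvu_xml_to_text(xml_source):
    """Return plain text with one blank line between PARAGRAPH elements."""
    root = ET.fromstring(xml_source)
    paragraphs = []
    for para in root.iter("PARAGRAPH"):
        lines = []
        for line in para.iter("LINE"):
            words = [w.text.strip() for w in line.iter("WORD")
                     if w.text and w.text.strip()]
            if words:
                lines.append(" ".join(words))
        if lines:
            paragraphs.append(join_lines(lines))
    return "\n\n".join(paragraphs)


print(djvu_xml_to_text(SAMPLE))
```

this ignores the bounding boxes entirely and trusts the paragraph grouping the OCR software already made, so page numbers and running headers would still come through as their own "paragraphs" and need removing by hand.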

well yeah but 1) the asker did not specify what they want to do with this so no use in prefiguring that

and 2) they're auto generated and they don't seem to do a great job with the ToC. like i said it needs cleaning up but it's probably a good ways there.

honestly though i'd start with running the PDF through pdftotext, which i've already done and the output is easier to work with than some obscure XML sidecar format if you just want plain text

coming back to this a couple days later and wanted to thank you for this explanation and for doing the heavy lifting! I'm glad to know that this is a dead-end and that I should pursue other methods of getting this text accurately off the page.