
Emory University Request: Adding Book Djvu / OCR to our IIIF Service


We have a partnership with Emory University in digitizing their books.

Rebecca of Emory University writes:
"It looks like you aren’t yet including OCR text as annotations on the [IIIF] page images – is that something that is on your roadmap? Or is that the kind of thing that would be useful to add as a github issue? And are there are any plans to make the IIIF content more discoverable, or does that need to wait until the service comes out of beta? Also, any plans or hopes to implement other IIIF services?"

The answer to her question is:
That's correct, we don't currently offer companion OCR'd text via our IIIF service. Technically, all our infrastructure is set up to support this use case. We have OCR'd transcripts available in XML for each text item. We also have a djvu file which maps OCR'd words to the regions where they appear inside each book page image. All of this data is programmatically and publicly accessible. It's just not yet (a) in IIIF format, (b) easily queryable on a "per page" basis, or (c) linked up to our IIIF API.
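To make point (a) concrete, here is a minimal sketch of what one OCR'd word could look like as an annotation, assuming the IIIF Presentation 2.x / Open Annotation shape. The function name `word_to_annotation` and the canvas URI are hypothetical, and this is not what our service emits today.

```python
import json


def word_to_annotation(canvas_uri, text, x, y, w, h):
    """Wrap one OCR'd word as an Open Annotation style dict.

    Assumption: the region is given in pixels as x, y (top-left corner),
    width and height, relative to the full page image.
    """
    return {
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
            "@type": "cnt:ContentAsText",
            "format": "text/plain",
            "chars": text,
        },
        # The target is the page canvas plus a media fragment for the region.
        "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
    }


if __name__ == "__main__":
    # Hypothetical canvas URI, for illustration only.
    canvas = "https://example.org/iiif/some-book/canvas/p1"
    print(json.dumps(word_to_annotation(canvas, "Virginia", 100, 180, 160, 40), indent=2))
```

A per-page annotation list would then just be a list of these dicts for every word on the page, which is what (b) and (c) would need to serve.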

Next Step:
I am having Emory contact @jcg to manage expectations and decide to what degree we want to support our IIIF service. Implementing the changes above would require ~4 hours of work (<1 day). The bigger, strategic implication is that all our books would (essentially) become compatible with search inside via IIIF and the Mirador reader.

Feature Technical Details:
Coincidentally, I worked on something similar this weekend:
https://api.archivelab.org/search/books/virginiawildlife02unse – this is an extremely inefficient endpoint which reads in our XML djvu file for a text and returns, in JSON format, its OCR'd text plus the locations/regions of that text within the page images.
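As a rough illustration of the XML-to-JSON step, the sketch below parses a small hand-written sample in the general shape of a DjVu XML file and groups words by page. It is not the code behind the endpoint above. The sample data is invented, and the `coords` order (left, bottom, right, top, with any extra values ignored) is an assumption that should be checked against real files.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample: one OBJECT element per page, WORD elements carry coords.
SAMPLE_DJVU_XML = """<DjVuXML>
  <BODY>
    <OBJECT data="page_0001.djvu" width="1000" height="1500">
      <HIDDENTEXT>
        <LINE>
          <WORD coords="100,220,260,180">Virginia</WORD>
          <WORD coords="280,220,420,180">Wildlife</WORD>
        </LINE>
      </HIDDENTEXT>
    </OBJECT>
    <OBJECT data="page_0002.djvu" width="1000" height="1500">
      <HIDDENTEXT>
        <LINE>
          <WORD coords="90,320,200,280,318">Contents</WORD>
        </LINE>
      </HIDDENTEXT>
    </OBJECT>
  </BODY>
</DjVuXML>"""


def parse_coords(coords):
    """Turn a coords attribute into an x/y/w/h region.

    Assumption: values are left,bottom,right,top in pixels; a trailing
    fifth value (if present) is ignored.
    """
    left, bottom, right, top = (int(v) for v in coords.split(",")[:4])
    return {"x": left, "y": top, "w": right - left, "h": bottom - top}


def words_by_page(xml_text):
    """Return a list of pages, each a list of {text, x, y, w, h} dicts."""
    root = ET.fromstring(xml_text)
    pages = []
    for obj in root.iter("OBJECT"):  # one OBJECT per page in this sample
        words = []
        for word in obj.iter("WORD"):
            region = parse_coords(word.get("coords"))
            region["text"] = (word.text or "").strip()
            words.append(region)
        pages.append(words)
    return pages


if __name__ == "__main__":
    print(json.dumps(words_by_page(SAMPLE_DJVU_XML), indent=2))
```

Because the result is already grouped by page, a per-page lookup is just an index into the returned list.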

Technically, this conversion would only need to be done once per book, since the JSON version can be cached on the IIIF service. Alternatively, we could fetch the OCR + region information only for pages as we need them. Or we could add a derive process that produces a version of the djvu files in a format which is easier for our IIIF service to use. The first option seems easiest.
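A minimal sketch of the first option (convert once per book, then serve pages from the cache) might look like the following. The class name `BookOcrCache` and the stand-in loader are hypothetical; a real loader would fetch and parse the book's djvu XML, and the cache would likely live on disk rather than in memory.

```python
from typing import Callable, Dict, List

Page = List[dict]  # one dict per OCR'd word on a page


class BookOcrCache:
    """Keep the parsed OCR for each book so the expensive load runs once."""

    def __init__(self, load_book: Callable[[str], List[Page]]):
        self._load_book = load_book
        self._books: Dict[str, List[Page]] = {}
        self.loads = 0  # number of expensive loads, useful for checking

    def page(self, book_id: str, page_index: int) -> Page:
        if book_id not in self._books:
            # First request for this book: do the expensive conversion.
            self._books[book_id] = self._load_book(book_id)
            self.loads += 1
        return self._books[book_id][page_index]


def fake_loader(book_id: str) -> List[Page]:
    """Stand-in for fetching and parsing a book's djvu XML (invented data)."""
    return [
        [{"text": "Virginia", "x": 100, "y": 180, "w": 160, "h": 40}],
        [{"text": "Wildlife", "x": 280, "y": 180, "w": 140, "h": 40}],
    ]


if __name__ == "__main__":
    cache = BookOcrCache(fake_loader)
    print(cache.page("virginiawildlife02unse", 0))
    print(cache.page("virginiawildlife02unse", 1))
    print("loads:", cache.loads)
```

The per-page alternative would keep the same `page(book_id, page_index)` interface but swap the loader for one that reads a single page at a time.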



John Gonzalez


Mek Karpeles