Transcription workflow notes

So, it’s been a while since I’ve written a blog post, but I’ve not been inactive. And since I have the day off today, I thought I’d catch you up. Over the next couple of days, I’ll be putting up two chapters from the 1946 Parish Practice in Universalist Churches as text; I’ve previously posted it as a scanned PDF.

I want to discuss my workflow. I can do the odd report myself, but I'd like to see more Universalist and other documents transcribed, and typographic errors discovered and corrected. I shouldn't be the bottleneck.

In the past — going back twenty years or so — I would photocopy a book, carefully crop the pages into single columns, rephotocopy those onto letter-size paper, and take them to a central computer center, where they would be processed by optical character recognition (OCR). I'd get a file back and then edit it. Later, I would use a flatbed scanner and OCR software at home, but some documents still required editing the images down to a single column. These processes were very time-consuming. Sometimes transcribing by keyboard was more efficient!

Image capture and OCR software have improved markedly. Today, instead of scanning, I take a picture with my phone, and use a graphical front-end to powerful OCR software to process the text. It’s not always clean — a second snap and process is sometimes necessary — but the improvement over twenty years ago is striking.

In particular, on my Ubuntu Linux (14.04 LTS) machine, I use YAGF, "Yet Another Graphical Front-end for cuneiform and tesseract OCR engines," with the tesseract engine.
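If you'd rather script this step than click through a front-end, here is a minimal sketch in Python of the same idea, assuming tesseract is installed along with the Pillow and pytesseract libraries; the file names are placeholders, not the actual pages.

    # Minimal sketch: run the tesseract engine over a phone photo of a
    # book page and save the recognized text for hand correction.
    # Assumes tesseract-ocr plus the pytesseract and Pillow packages.
    from PIL import Image
    import pytesseract

    def ocr_page(image_path, lang="eng"):
        """Return tesseract's best guess at the text on one page image."""
        image = Image.open(image_path)
        # Converting to grayscale often helps with phone photos of pages.
        image = image.convert("L")
        return pytesseract.image_to_string(image, lang=lang)

    if __name__ == "__main__":
        text = ocr_page("parish-practice-page.jpg")  # placeholder file name
        with open("parish-practice-page.txt", "w") as out:
            out.write(text)

The output still needs proofreading, and as noted above a second photo and another pass is sometimes the quickest fix, but it beats retyping the page.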
