Book Reversions – Hardcopy to Word Version

My wife, Margaret Watson, recently got reversions of 10 older Harlequin/Silhouette books.  This means she now has the right to publish them herself.  Unfortunately, by the time she was notified, digital copies of these books were no longer available because Silhouette had already removed them from online sites.

She asked Harlequin to send her digital versions, but they wanted $500 per book.  So we decided to create digital versions from the hardcopy books.

In 2011, we used Blue Leaf Book Scanning to digitize a hardcopy book for about $25.  You mail them the book; they cut off the spine, scan it, and digitize it using OCR (Optical Character Recognition).  Then you download the Word file from their site.

To digitize these 10 Silhouette books, we used 1DollarScan.com, which costs $6 per book.  I was initially disappointed by the OCR accuracy, so I paid an additional $6 for their high quality OCR option.  This feature, which they call HQT (High Quality Touchup), produced very good results.  Before applying the OCR engine, HQT slightly rotated any page that was tilted, to improve character recognition.

I downloaded the books from their website in PDF format.  The files contained both the scanned pages and the OCR results (i.e., the text).

I first used copy/paste to put the text into Word, but the paragraph breaks did not come across.  The OCR engine does not treat blank lines or indentation as meaningful.

I searched online and found a free utility called UniPDF, which is a PDF to Word converter.  It was easy to use.  From each PDF, it produced a Word document with the proper paragraph breaks.

Silhouette indicated scene changes in our books by inserting a blank line between paragraphs.  The OCR doesn’t pick this up, so I had to review the hardcopy and manually add scene changes as a line with “***”.

The OCR was very accurate, but not perfect.  Since I had extensive work experience writing Microsoft Excel macros, I wrote Word macros (VBA) to help format the book.

One macro removed all the page headings.  Where each heading was found, it inserted a Word comment with the old page number.  This allowed me to more easily cross-reference the Word document to the physical book during my review.  At the end of the review, all comments were removed with a single command.
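The same idea can be sketched outside Word.  The Python below is only an illustration, not the VBA macro described above.  It assumes each page heading is a line by itself holding a page number next to a running title or author name; the function name, the titles passed in, and that heading layout are all hypothetical.  Instead of Word comments, it returns a list of page marks that can be used for cross-referencing.

```python
import re

def strip_page_headings(lines, running_titles):
    """Remove running page headings and remember where each page began.

    Returns (body_lines, page_marks).  Each page mark is a pair
    (index_into_body_lines, page_number), meaning that the page with that
    number started just before body_lines[index].
    """
    # Assumption: a heading is "<number> <title>" or "<title> <number>"
    # alone on a line.  Real headings may look different.
    titles = "|".join(re.escape(t) for t in running_titles)
    pattern = re.compile(
        rf"^\s*(?:(\d+)\s+(?:{titles})|(?:{titles})\s+(\d+))\s*$",
        re.IGNORECASE,
    )
    body, marks = [], []
    for line in lines:
        m = pattern.match(line)
        if m:
            page = int(m.group(1) or m.group(2))
            marks.append((len(body), page))  # note the page, drop the line
        else:
            body.append(line)
    return body, marks
```

Calling strip_page_headings(lines, ["Sample Title", "A. Author"]) on a list of text lines returns the cleaned lines plus the page marks.  Once the review is finished, the marks can simply be discarded, much like deleting all the comments at once.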

Other macros were written to initialize styles, replace double paragraph breaks, and format chapter headings.  The macros were written just once, stored in Normal.dot, then used to prepare all the books.

Other formatting issues were handled in a less automated fashion, to avoid inadvertently making improper changes.  As I worked through the initial books, I created a log in a spreadsheet tracking what issues should be addressed, and in what order.  I continually refined the log and used it as I began formatting each new book.

For example, the log included how to deal with contractions (e.g., didn’t, could’ve).  Contractions were an issue because the OCR generally inserted a space after each single quote.  My log listed all the contractions (’s, ’t, ’d, ’ve, ’ll, ’re) that I needed to address.  Word’s repeated find/replace feature worked well here.
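As an illustration of the contraction fix, here is a minimal Python sketch.  It is not what I ran in Word; the function name and the regular expression are assumptions made for this example.  It closes the gap only when the apostrophe follows a letter and is followed by one of the endings from the list above.

```python
import re

# Endings taken from the list above: 's, 't, 'd, 've, 'll, 're.
# Matches a word character, a straight or curly apostrophe, stray
# whitespace, and then one of those endings as a complete word.
_CONTRACTION_GAP = re.compile(r"(\w)(['\u2019])\s+(s|t|d|ve|ll|re)\b")

def fix_contractions(text):
    """Remove the stray space that OCR left after an apostrophe."""
    return _CONTRACTION_GAP.sub(r"\1\2\3", text)
```

A pattern like this can still misfire, for example on a plural possessive followed by a one-letter word, which is one reason to step through the replacements rather than apply them blindly.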

Other log issues included ellipses, end-of-line dashes, em dashes, I’s interpreted as 1’s, double spaces, double single quotes instead of single double quotes, end-of-paragraph issues, end-of-sentence issues, and proper double spacing after periods, among others.
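Two of these items lend themselves to a small sketch.  The Python below is illustrative only and rests on assumptions: that two single quotes in a row always stand for one double quote, and that a digit 1 directly before ’m, ’ll, ’d, or ’ve is really the pronoun I.  The function name is made up for the example.

```python
import re

def fix_quotes_and_ones(text):
    """Repair doubled single quotes and a 1 misread for I in contractions."""
    # Assumption: a pair of single quotes is one misread double quote.
    text = text.replace("\u2018\u2018", "\u201c")  # two opening quotes
    text = text.replace("\u2019\u2019", "\u201d")  # two closing quotes
    text = text.replace("''", '"')                 # two straight quotes
    # Assumption: a standalone 1 before 'm, 'll, 'd, or 've is the pronoun I.
    text = re.sub(r"\b1(?=['\u2019](?:m|ll|d|ve)\b)", "I", text)
    return text
```

Other cases of 1 for I (such as a lone "1" at the start of a sentence) are too ambiguous for a rule like this and are better checked by eye.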

And there were some pure OCR issues.  For example, “corner” was often interpreted as “comer”.  And “barn” sometimes came out as “bam”.  Once I had my checklist, I used it to review and fix each book.
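Because words like these can occasionally be correct as scanned, it seems safer to flag them for review than to replace them automatically.  The Python sketch below shows one way to do that; the word list holds only the two examples above, and the function name and context width are hypothetical.

```python
import re

# Misreadings mentioned above, mapped to the word that was probably intended.
SUSPECTS = {"comer": "corner", "bam": "barn"}

def flag_suspects(text, suspects=SUSPECTS, context=20):
    """List suspect words with a suggested fix and some surrounding text.

    Returns a list of (found_word, suggestion, snippet) tuples.
    """
    words = "|".join(re.escape(w) for w in suspects)
    pattern = re.compile(rf"\b({words})\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        hits.append((m.group(1), suspects[m.group(1).lower()], text[start:end]))
    return hits
```

Each hit shows the word as found, the likely intended word, and a snippet of context, so every case can be checked against the hardcopy before changing it.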

I was pleased to see that italics were properly interpreted during the scanning and OCR process.

My initial review took about 4 hours per book.  Then I’d do a complete edit, reading the entire book, which took about two days.  I’m not a particularly fast reader, and I usually found errors that were already in the original hardcopy book.

After my review, I turned each book over to my wife, who spent 1½ days reviewing.  So in total, we spent about 4 days of effort, plus $12 for scanning, to convert each hardcopy book into a digital version.  This excludes the final formatting, front and back matter, and conversion needed before uploading to the digital platforms (e.g., Amazon).

We expect five of these books to be up for sale by mid-October 2016.  Look for the Cameron Utah series.