OCR scanned PDF files – a must in the researcher’s toolkit!!
Sometimes when downloading sources from databases, particularly older articles and documents, I find that they have simply been scanned and uploaded. Because the PDF file has been created from a series of images of the pages, the ability to highlight or search through the document is reduced or lost altogether.
OCR (Optical Character Recognition) software can convert these ‘pictures’ of documents into PDF files whose text can be searched and highlighted.
Adobe Acrobat Pro has been awesome, but it’s expensive. There are other tools such as ABBYY FineReader, and a number of free upload-and-convert-to-text-or-Word-doc style sites. Evernote and OneNote can read PDFs, so if you need to search for terms they can locate them within the PDF – however, they do not embed the recognised text in the document itself, so your ability to highlight text is still limited. For me, Acrobat Pro has been my go-to text recognition software for PDFs, largely because it leaves the PDF intact and I can then import the PDF into Sente. It’s a key step in my process.
Why is this so important?
Personally, I have three main reasons:
Yes, Evernote or similar have the capability to read the text in PDF documents; however, when I want to search for a particular phrase in a document I often use the in-built CMD + <space> shortcut on my Mac. Ensuring that text has been scanned and OCR’d (is that a term? it is now!) is key to ensuring that it can be found on my computer and highlighted in Preview or Acrobat Reader.
A key feature of reading documents in GoodReader, EndNote, Mendeley, Sente or a suitable equivalent is being able to highlight and quote text, either taking quotes direct from the source, or making notes and interpreting what I’m reading in the context of my topic/research. Accurate OCR speeds this process up, because I don’t have to type quotes in by hand or keep referring to the original document when making notes. This matters most when reading on a tablet, where screen space is limited and you end up flipping back and forth a lot.
- Portability, productivity and procrastination
Without the previous two points, portability of reading and note-taking is restricted, especially when talking about a workflow on a tablet. On my iPad in Sente I can highlight and quote text – to make comments on this I am taken into a separate screen. If I don’t have the ability to take the quote directly from the text, then I am continually flipping back and forth… this is not impossible, but it does hinder the flow of thoughts and ideas and makes me, a procrastinator, put off the reading. If there is one thing that I’ve learnt during this process, it is to remove any barrier that gives me an excuse to procrastinate – OCR in PDF files is key to this.
As I mentioned above, Adobe Acrobat Pro has been awesome – I was lucky and got it at a discount – but it IS expensive (approximately SG$615/US$449 last time I checked). It has now been migrated to a cloud-based platform, Adobe Acrobat DC (DC for Document Cloud), which goes for about SG$20 per month – much more manageable (but still a touch pricey).
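For a rough sense of how those two prices compare, here is a small Python sketch of the break-even point. It assumes the SG$615 and SG$20 figures above, assumes prices stay the same, and ignores discounts and upgrades; the function name is just mine, for illustration.

```python
import math


def break_even_months(one_time_price: float, monthly_price: float) -> int:
    """Whole months after which a subscription has cost at least
    as much as buying the software outright."""
    if monthly_price <= 0:
        raise ValueError("monthly_price must be positive")
    return math.ceil(one_time_price / monthly_price)


# Approximate figures quoted in this post, both in SG$.
months = break_even_months(615, 20)
print(f"Subscription overtakes the one-off price after about {months} months")
```

In other words, on these numbers the subscription only becomes the more expensive option after roughly two and a half years of continuous use.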
Again, because portability is a key aspect for me, particularly working from a library where I might need to import a few pages from a textbook, I decided to check out tablet options – and for me this means iPad. I’ve tested a few, and PDFpen Scan+ with OCR (iTunes or website) has come out the winner. It was US$6.99 when I purchased it, and it allows me to both import documents and take photos for OCR.
It doesn’t matter which tools you use; I’m just providing details on the ones that have worked for me. I do not receive income from any of the tools I’ve linked to in this post. I do recommend that you find suitable OCR software that allows you to convert the text in your PDF documents, especially if you are a researcher who is looking to go through large volumes of documents and needs to be able to scan, quote or use the information effectively in your flow.
Update: 5.Sep.2015: I’ve been using both a fair bit recently, and I have to say that the OCR in Adobe Acrobat Pro is superior to that of the PDFpen Scan+ product. I still use both, but I check any text I quote from the PDFpen documents more thoroughly, as there are some errors with poor-quality scans. Hopefully this will improve over time.