Ginae B. McDonald

OCR scanned PDF files – a must in the researcher’s toolkit!!

Sometimes when downloading sources from databases, particularly older articles and documents, I find that they have been scanned and uploaded. Because the PDF file has been created from a series of images of documents, the ability to highlight or search through the document is reduced or removed.

OCR (Optical Character Recognition) software can convert these ‘pictures’ of documents into PDF files that then allow the text to be highlighted.

Adobe Acrobat Pro has been awesome, but it’s expensive. There are other tools such as ABBYY FineReader, and a number of free upload-and-convert-to-text-or-Word-doc style sites. Evernote and OneNote have the capability to read PDFs, so if you need to search for terms they can locate those terms within the PDF – however, they do not perform OCR on the document and your ability to highlight text is still limited. For me, Acrobat Pro has been my go-to text recognition software for PDFs, largely because it leaves the PDF intact and I can then import the PDF into Sente. It’s a key step in my process.

Why is this so important?

Personally, I have three main reasons: […]


I have made reference before to James Hayton’s videos, where I personally discovered “Why?” as a driver for my research journey and participation from potential candidates.

Yesterday, while on the road, I attended a webinar entitled ‘Understanding Academic Literature’ (thank goodness for the internet). One of the things that resonated strongly with me was the idea of a simple classification system for material that can be applied as you read. It’s a simple A/B/C/D system, and while it wasn’t the only takeaway from James’ presentation, it was one that I could implement immediately – especially on a four-hour plane journey home.

Classifying literature, by James Hayton

I’ve modified the classification slightly to suit my own needs, as anyone should, and it has resulted in the following:

  • A – Central
  • B – Supportive and informational, or relevant to methodology
  • C – Might be useful – KIV
  • D – Not relevant

“KIV” is a term that I hadn’t heard until I moved to Singapore, but it’s an acronym for Keep In View. Basically something that you’re not quite willing to discard, but may need to refer to later. For me it’s usually something that’s filtering around in the back of my brain that won’t quite let me discard the article. I have a personal rule that there shouldn’t be more than 10 papers with this label. I’m setting a monthly reminder to review these so that this doesn’t become an article graveyard.
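For anyone who tracks their reading in a script rather than in a reference manager, the A/B/C/D scheme and the ten-paper KIV rule above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Library` class, its method names and the paper titles are hypothetical, and none of this reflects how Sente stores or exposes its tags.

```python
from dataclasses import dataclass, field

# Labels follow the modified A/B/C/D scheme listed above.
LABELS = {
    "A": "Central",
    "B": "Supportive and informational, or relevant to methodology",
    "C": "Might be useful (KIV)",
    "D": "Not relevant",
}

# Personal rule from the post: no more than 10 papers labelled C (KIV).
KIV_LIMIT = 10


@dataclass
class Library:
    """Hypothetical in-memory stand-in for a tagged reference library."""

    tags: dict = field(default_factory=dict)  # paper title -> label

    def classify(self, title: str, label: str) -> None:
        """Assign (or reassign) one of the A/B/C/D labels to a paper."""
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        self.tags[title] = label

    def with_label(self, label: str) -> list:
        """Return the titles carrying the given label, sorted by title."""
        return sorted(t for t, l in self.tags.items() if l == label)

    def kiv_over_limit(self) -> bool:
        """True when the KIV pile has grown past the personal limit."""
        return len(self.with_label("C")) > KIV_LIMIT


# Example usage with made-up titles.
library = Library()
library.classify("Culture and logistics decisions", "A")
library.classify("A methods primer", "B")
library.classify("Possibly relevant MNC study", "C")

for title in library.with_label("C"):
    print("Review at the monthly reminder:", title)
```

Running `kiv_over_limit()` at the monthly review gives a quick signal that the C pile needs pruning, either by promoting papers to A or B or by relabelling them D.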

James discussed using a “print and keep in folder” method. That doesn’t work for me due to the portability of my work and life in general. Sente knocks it out of the park for me on this one – I’m using its tagging feature to keep track of this classification system. I’ve written before about why Sente works for me, and I recommend it to anyone who needs a portable, syncing database for literature management.

Other things that got me thinking from James’ webinar:

  • Identification of seminal works/authors as a key requirement for literature review – this is probably going to be key for me as I pull a number of research areas together – culture, logistics, decision making, MNCs and international business. I need to look at citations and influence when considering some of these areas. This doesn’t mean I must include them, but people may expect me to understand that they exist, and to be able to justify why I haven’t included them.
  • Identification of peers/competitors – I am not sure I am comfortable with the term ‘competitor’ (although that’s probably my inexperience speaking), but I understand the concept of looking at people who are doing research in similar areas. I probably need to drill into this more than I have previously, as I haven’t given it much thought in this context.
  • Answer three main questions when evaluating each piece of literature: What was done in the article? How was it done? Why is it significant (especially in the context of my own research)?
  • Often my expertise is lacking, especially at the beginning of each phase (reading, data collection, analysis, writing), and it’s okay to refer to teaching textbooks and Wikipedia to learn in the areas where I lack skills or knowledge. Just don’t reference them!

One of the recurring messages in James’ presentations is that PhD candidates should understand that they begin the journey with little experience, and build it up over the course of their PhD (to ‘more experienced’, not expert!). I find that this is not a linear progression, but something that is more iterative as I enter different phases of my PhD. I would say my journey is better represented by the grey line below.

My journey is more like the dotted grey line, with expertise and skills being built upon.

While I have more than 500 articles in my Sente database, and notes written on all of them, now that I’m in data collection I find myself revisiting the literature and needing to start classifying and sorting with more clarity. This webinar was timely in this regard, and thanks to James for making it available at a reasonable fee.