PDF word statistics in one “looong” line

The paper that I am writing requires some background on how digital imaging has been related to phenology. For this I have decided to carry out a systematic literature review (SLR). I am at the point in the SLR process where I have to “Define or elicit search string”. I chose to do this by counting the most frequently repeated words in my initial list of papers (called the “Quasi Gold Standard” in Pablo’s paper).

Linux Journal got me started with this interesting article by Dave Taylor. The article got most of the work done, but it lacked the part where I convert all my PDF files into text. So here is the revised version of Dave’s really cool one-liner:

find . -name '*.pdf' -exec pdftotext '{}' - \; | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn > output.txt

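I just added the initial part, where `find` locates every PDF under the current directory and runs `pdftotext` on it, writing the text to standard output. The rest of the pipeline puts one word on each line, lowercases it, strips punctuation, drops anything that is not purely alphabetic, and counts the repetitions, most frequent first. Hope this is useful :)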
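
If you want to try the counting stage without any PDFs at hand, here is a minimal sketch of it wrapped in a shell function. The name `word_freq` is hypothetical (it is not in Dave's article), and it differs slightly from the one-liner: it splits on any whitespace instead of only spaces, and it drops the empty lines that would otherwise be counted as a "word".

```shell
#!/bin/sh
# word_freq: hypothetical helper, reads text on stdin and prints
# "count word" lines, most frequent word first.
word_freq() {
    tr -s '[:space:]' '\n' |       # one word per line (any whitespace splits)
    tr '[:upper:]' '[:lower:]' |   # fold everything to lowercase
    tr -d '[:punct:]' |            # strip punctuation characters
    grep -v '[^a-z]' |             # drop lines containing anything but a-z
    grep . |                       # drop empty lines
    sort | uniq -c | sort -rn      # count duplicates, highest count first
}

# Small demo on a literal sentence instead of pdftotext output.
printf 'The cat saw the other cat. The end\n' | word_freq
```

To use it on real papers, the `find ... -exec pdftotext '{}' - \;` part of the one-liner can be piped into `word_freq` instead of the `printf` demo.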

About joelgranados

I'm fascinated with how technology and science impact our reality and am drawn to leverage them in order to increase the potential of human activity.