PDF word statistics in one “looong” line

The paper I am writing needs some background on how digital imaging has been related to phenology. To get it, I have decided to carry out an SLR (systematic literature review). I am at the point in the SLR process where I have to “Define or elicit search string”. I chose to do this by counting the most frequently repeated words in my initial list of papers (called the “Quasi Gold Standard” in Pablo’s paper).

Linux Journal got me started with this interesting article from Dave Taylor. The article got most of the work done, but it lacked the step that converts all my PDF files into text. So here is the revised version of Dave’s really cool one-liner:

find . -name '*.pdf' -exec pdftotext '{}' - \; | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn > output.txt

I just added the initial part, where `find` locates every PDF file under the current directory and runs `pdftotext` on it, writing the text to standard output. Hope this is useful :)
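To see what each stage of the pipeline contributes, here is a minimal sketch of the text-processing half, written one stage per line and run on a sample sentence instead of PDF output. The function name `word_counts` is hypothetical, and the extra `grep -v '^$'` stage is my own assumption: it drops the empty lines that can be left behind once punctuation is stripped, which the one-liner would otherwise count as a "word".

```shell
#!/bin/sh
# Sketch only: word_counts is a hypothetical helper, not part of the original article.
# It reads text on standard input and prints "count word" lines, most frequent first.
word_counts() {
    tr ' ' '\n' |                   # split on spaces: one word per line
    tr '[:upper:]' '[:lower:]' |    # fold case so "The" and "the" are the same word
    tr -d '[:punct:]' |             # strip punctuation characters
    grep -v '[^a-z]' |              # drop lines containing anything other than a-z
    grep -v '^$' |                  # assumption: also drop empty lines
    sort |                          # group identical words together
    uniq -c |                       # count each group
    sort -rn                        # highest counts first
}

# Try it on a sample sentence instead of pdftotext output.
printf 'The cat saw the dog. The dog ran.\n' | word_counts
```

On the sample sentence the first line of output reports `the` with a count of 3. To use it on real papers, the `find ... -exec pdftotext '{}' - \;` part of the one-liner can be piped into `word_counts` in place of the `printf`.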

About joelgranados

I'm fascinated with how technology and science impact our reality and am drawn to leverage them in order to increase the potential of human activity.
