R: Efficient data loading

I’m working with the HOG_01 directory from this dataset. It is organized so that each data point is a 1568-line text file, and the data points are separated into 43 types. Unzip the data and you will see what I mean.
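A quick way to check this layout from the shell is to count the files and the values per file. The following is only a sketch on made-up data: the directory names, file names and the 4-value files are hypothetical stand-ins (the real files hold 1568 values), and it assumes one subdirectory per type.

```shell
#!/bin/sh
# Build a tiny stand-in for the dataset: 2 types, 3 files, 4 values per file.
# All names and numbers here are invented for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/00000" "$demo/00001"
printf '0.1\n0.2\n0.3\n0.4' > "$demo/00000/a.txt"   # no trailing newline
printf '0.5\n0.6\n0.7\n0.8' > "$demo/00000/b.txt"
printf '0.9\n1.0\n1.1\n1.2' > "$demo/00001/c.txt"

# Number of data points = number of .txt files.
n_files=$(find "$demo" -type f -name '*.txt' | wc -l | tr -d ' ')

# Values per file. awk counts a final line that lacks a newline,
# whereas wc -l would report one line fewer.
n_values=$(awk 'END { print NR }' "$demo/00000/a.txt")

echo "files=$n_files values_per_file=$n_values"
rm -rf "$demo"
```

On the real data, pointing find at the HOG_01 directory should give the two numbers that end up as the matrix dimensions.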

After reading a post on my machine learning class blog, I came up with a script that runs in 1m8s. It writes its output with R’s ‘save’ command. Here is the code:

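#!/bin/sh
# Pass the data directory (e.g. HOG_01) as the first argument.
# One-byte separator file: the .txt files lack a trailing newline, so
# without it the last value of one file fuses with the first of the next.
printf '\n' > space
# Concatenate every file, then squeeze all line breaks into single spaces.
find "$1" -type f -name '*.txt' -exec cat {} space \; | tr -s "\r\n" " " > matrix.out
rm -f space
# scan() returns the values file by file, so fill the matrix by row:
# each of the 26640 rows then holds the 1568 values of one file.
R -e "HOG_01 = matrix(scan('matrix.out'), ncol = 1568, byrow = TRUE); \
      save(HOG_01, file = 'matrix.RData');"
rm -f matrix.out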

The space file is a hack that I used to put a separator at the end of each data point (because the text files did not end with a line break character). The created file is very compact and takes a couple of seconds to load into R. Note that you must use load to read the data back from the file:
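To see why the separator matters, here is a small sketch with two made-up files (the names a.txt and b.txt and their contents are hypothetical). It compares the concatenation with and without the separator file:

```shell
#!/bin/sh
# Two invented files whose last line has no trailing newline.
work=$(mktemp -d)
printf '1\n2' > "$work/a.txt"
printf '3\n4' > "$work/b.txt"
printf '\n'   > "$work/space"     # the one-byte separator file

# Without the separator, the values 2 and 3 fuse into "23".
joined=$(cat "$work/a.txt" "$work/b.txt" | tr -s '\r\n' ' ')

# With the separator after each file, every value stays distinct.
separated=$(cat "$work/a.txt" "$work/space" "$work/b.txt" "$work/space" | tr -s '\r\n' ' ')

echo "without separator: $joined"
echo "with separator:    $separated"
rm -rf "$work"
```

The first line prints 1 23 4 and the second prints 1 2 3 4, so without the separator scan would read one value too few per file boundary and the matrix would no longer have the expected size.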

load("matrix.RData");

About joelgranados

I'm fascinated with how technology and science impact our reality and am drawn to leverage them in order to increase the potential of human activity.
