R: Efficient data loading

I’m working with the HOG_01 directory from this dataset.  It is organized so that each data point is a 1568-line text file, and the data points are separated into 43 types.  Unzip the data and you will see what I mean.

After reading a post on my machine learning class blog, I came up with a script that runs in 1m8s.  It outputs a file created with R’s ‘save’ command.  Here is the code:

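# Create a helper file holding a single space, appended after every data file
printf ' \n' > space
# Concatenate every .txt file under $1 and turn all line breaks into spaces
find "$1" -type f -name '*.txt' -exec cat {} space \; | tr -s "\r\n" " " > matrix.out
rm -f space
# Each file is one row, so fill the matrix by row (R fills by column by default).
# The output name HOG_01.RData is a placeholder; pick your own.
R -e "HOG_01 <- matrix(scan('matrix.out'), nrow = 26640, ncol = 1568, byrow = TRUE); \
      save(HOG_01, file = 'HOG_01.RData')"
rm -f matrix.out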

The ‘space’ file is a hack that puts a space separator at the end of each data file (because the text files do not end with a line break character).  The created file is very compact and takes a couple of seconds to load into R.  Note that you must use R’s ‘load’ command to read the data back from the file.
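
To see why the separator matters, here is a small sketch on toy data.  The file names and values are made up for illustration; only the cat and tr steps mirror the script above.

```shell
#!/bin/sh
# Toy demonstration: files without a trailing newline get fused by cat.
# All file names and values here are hypothetical.
demo_dir=$(mktemp -d)

# Two "data files" whose last line has no trailing line break
printf '0.1\n0.2' > "$demo_dir/a.txt"
printf '0.3\n0.4' > "$demo_dir/b.txt"

# Without a separator, the last value of a.txt runs into the first of b.txt
joined=$(cat "$demo_dir/a.txt" "$demo_dir/b.txt" | tr -s '\r\n' ' ')

# With a one-space helper file after each data file, the values stay apart
printf ' \n' > "$demo_dir/space"
separated=$(cat "$demo_dir/a.txt" "$demo_dir/space" \
                "$demo_dir/b.txt" "$demo_dir/space" | tr -s '\r\n' ' ')

echo "without separator: $joined"
echo "with separator:    $separated"

rm -rf "$demo_dir"
```

In the first case 0.2 and 0.3 end up glued together as 0.20.3, so scan would read three values instead of four.  With the helper file each value stays separate.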



