R: Efficient data loading

I’m working with the HOG_01 directory from this dataset.  The data is organized so that each data point is a 1568-line text file, and the data points are separated into 43 types.  Unzip the data and you will see what I mean.
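Before hard-coding any dimensions it can help to confirm that layout. The following is only a sketch: `count_layout` is a made-up helper name, the toy files stand in for the real data, and it assumes every value is a whitespace-free number so that `wc -w` counts values. It counts per file because the files may lack a final line break, in which case `wc -l` would report one line too few.

```shell
#!/bin/sh
# count_layout DIR (hypothetical helper): group the .txt files under DIR
# by how many whitespace-separated values each one holds.
count_layout() {
    # One wc call per file is slow on large directories but keeps the
    # output free of "total" lines.
    find "$1" -type f -name '*.txt' -exec wc -w {} \; |
        awk '{ seen[$1]++ }
             END { for (n in seen) print seen[n], "files with", n, "values" }'
}

# Toy directory standing in for HOG_01: two files, three values each,
# written without a trailing line break.
demo=$(mktemp -d)
printf '0.1\n0.2\n0.3' > "$demo/a.txt"
printf '0.4\n0.5\n0.6' > "$demo/b.txt"

summary=$(count_layout "$demo")
echo "$summary"
rm -rf "$demo"
```

Run against the real data as `count_layout HOG_01`. If the layout matches the description above, it should print a single line, with every file reporting 1568 values.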

After reading a post on my machine learning class blog, I came up with a script that runs in 1m8s.  It outputs a file created with R’s ‘save’ command.  Here is the code:

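# Usage: sh script.sh <data-directory>
# 'space' holds a single space; it is appended after every data file.
printf ' ' > space
# Quote the directory and the glob so the shell does not expand them early.
find "$1" -type f -name '*.txt' -exec cat {} space \; | tr -s "\r\n" " " > matrix.out
rm -f space
# scan() returns the values file by file, so fill the matrix row by row
# (R fills column by column by default): one row per data point.
R -e "HOG_01 = matrix(scan('matrix.out'), nrow=26640, ncol=1568, byrow=TRUE); \
      save(HOG_01,file='matrix.RData');"
rm -f matrix.out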

The space file is a hack that puts a space separator at the end of each data file (the text files do not end with a line break character).  The created file is very compact and takes a couple of seconds to load into R.  Note that you must use load to read the data back from the file:

load("matrix.RData");
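To see why the separator file matters, here is a toy reproduction. It is a sketch, not the original data: the file names and values are made up, and only the `cat` plus `tr` part of the pipeline is exercised.

```shell
#!/bin/sh
# Toy reproduction of the separator hack; file names are made up.
tmp=$(mktemp -d)
printf '1\n2' > "$tmp/a.txt"   # no trailing line break, like the data files
printf '3\n4' > "$tmp/b.txt"
printf ' '    > "$tmp/space"   # the separator file

# Without the separator, the last value of a.txt fuses with the first of b.txt.
fused=$(cat "$tmp/a.txt" "$tmp/b.txt" | tr -s '\r\n' ' ')

# With the separator after each file, every value stays distinct.
separated=$(cat "$tmp/a.txt" "$tmp/space" "$tmp/b.txt" "$tmp/space" | tr -s '\r\n' ' ')

echo "fused:     $fused"
echo "separated: $separated"
rm -rf "$tmp"
```

The first line shows `1 23 4` (three values, one of them wrong), while the second shows `1 2 3 4`, which is what scan needs to read the right number of values.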

About joelgranados

I'm fascinated with how technology and science impact our reality and am drawn to leverage them in order to increase the potential of human activity.
This entry was posted in commands, PhD, R.
