I’m working with the HOG_01 directory from this dataset. The data is organized so that each data point is a 1568-line text file, and the data points are separated into 43 types. Unzip the data and you will see what I am talking about.
After reading a post on my machine learning class blog, I came up with a script that runs in 1m8s. The script outputs a file created with R’s ‘save’ command. Here is the code:
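# Separator file: a single line break, appended after every data file.
printf '\n' > space
# Concatenate all data files, turning every run of line breaks into one space.
find "$1" -type f -name '*.txt' -exec cat {} space \; | tr -s "\r\n" " " > matrix.out
rm -f space
# Each file's 1568 values are contiguous in matrix.out, so fill the matrix row by row.
R -e "HOG_01 = scan('matrix.out'); \
HOG_01 <- matrix(HOG_01, nrow=26640, ncol=1568, byrow=TRUE); \
save(HOG_01,file='matrix.RData');"
rm -f matrix.out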
The space file is a hack that I used to put a separator at the end of each data file (because the text files do not end with a line break character). Without it, the last value of one file would run into the first value of the next. The created file is very compact and takes a couple of seconds to load into R. Note that you must use load to read the data back, since the file was written with save.
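To see why the separator matters, here is a minimal sketch that you can run without the dataset. The two tiny files (a.txt, b.txt) and the sep file are hypothetical stand-ins for the real data files and the space file; like the real files, they do not end with a line break.

```shell
#!/bin/sh
# Hypothetical stand-ins for two data files without a trailing line break.
tmp=$(mktemp -d)
printf '1\n2' > "$tmp/a.txt"
printf '3\n4' > "$tmp/b.txt"
# Stand-in for the separator file: a single line break.
printf '\n' > "$tmp/sep"

# Plain concatenation: the "2" ending a.txt fuses with the "3" starting b.txt.
without_sep=$(cat "$tmp/a.txt" "$tmp/b.txt" | tr -s '\r\n' ' ')

# Appending the separator after each file keeps the values apart.
with_sep=$(for f in "$tmp/a.txt" "$tmp/b.txt"; do
  cat "$f" "$tmp/sep"
done | tr -s '\r\n' ' ')

echo "without separator: $without_sep"
echo "with separator:    $with_sep"
rm -rf "$tmp"
```

Without the separator the four values collapse into three tokens (1, 23, 4), which is exactly the kind of silent corruption that would make scan return fewer values than expected.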
load("matrix.RData");