The R script run_analysis.R cleans and tidies data collected from smartphones for human activity recognition.
The script takes a relative path to the root directory of the data (e.g. UCI HAR Dataset) as a function argument. The data set is split into training and test subsets, each comprising a subject id, an activity label, and a 561-element feature vector.
In the first step, the train and test data files are read in from their respective folders under the root directory. For each of the train and test sets, the files are combined column-wise into a single data frame.
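The read-and-combine step can be sketched roughly as follows. This is a minimal illustration, not the code in run_analysis.R: the helper name read_split is hypothetical, and the tiny stand-in files (two rows, three features instead of 561) exist only so the sketch runs without the real download. The file names follow the layout of the UCI HAR Dataset folders.

```r
# Hypothetical helper (not taken from run_analysis.R): read one split
# ("train" or "test") under `root` and column-bind the subject ids,
# the activity labels, and the feature vectors.
read_split <- function(root, split) {
  dir <- file.path(root, split)
  subject  <- read.table(file.path(dir, paste0("subject_", split, ".txt")),
                         col.names = "SubjectId")
  activity <- read.table(file.path(dir, paste0("y_", split, ".txt")),
                         col.names = "ActivityLabel")
  features <- read.table(file.path(dir, paste0("X_", split, ".txt")))
  cbind(subject, activity, features)
}

# Tiny stand-in data set in a temporary directory (an assumption made
# purely for illustration).
root <- file.path(tempdir(), "toy_har")
dir.create(file.path(root, "train"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("1", "2"), file.path(root, "train", "subject_train.txt"))
writeLines(c("5", "3"), file.path(root, "train", "y_train.txt"))
writeLines(c("0.1 0.2 0.3", "0.4 0.5 0.6"),
           file.path(root, "train", "X_train.txt"))

train <- read_split(root, "train")
print(train)
```

With the real data, the same helper would be called once with "train" and once with "test".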
Next, I read the feature definitions (from features.txt) into a data frame and create a logical vector, using R's grepl function, that selects the features containing either 'mean' or 'std' in their descriptive string. I then subset both the train and test data frames by this logical vector.
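As a rough illustration of the selection step, the sketch below applies grepl to a handful of feature names in the style of features.txt and uses the result to subset a toy data frame. The variable names (features, keep, x, x_pruned) and the toy values are assumptions made for the example.

```r
# A few feature names in the style of features.txt (index, name).
features <- data.frame(
  index = 1:5,
  name  = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-mad()-X",
            "fBodyAcc-meanFreq()-X", "tBodyAcc-max()-X"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the name contains 'mean' or 'std'.
keep <- grepl("mean|std", features$name)

# Toy feature data: one column per feature, in the same order.
x <- as.data.frame(matrix(seq_len(10), nrow = 2))
names(x) <- features$name

# Keep only the selected feature columns.
x_pruned <- x[, keep, drop = FALSE]
print(names(x_pruned))
```

Note that the plain 'mean|std' pattern also keeps the meanFreq() column in this toy example.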
Then, I combine the pruned train and test data frames row-wise into a single data frame. The original column names are cleaned up further by removing the opening and closing parentheses (i.e. ()). Subsequently, the numerical activity labels are replaced with their descriptive names from activity_labels.txt.
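The combine, rename, and relabel steps might look roughly like this on toy data. The use of gsub and match here is an assumption about one reasonable way to do it, not a copy of the script, and the lookup table stands in for the contents of activity_labels.txt.

```r
# Toy pruned data frames standing in for the real train and test sets.
train <- data.frame(SubjectId = c(1, 1), ActivityLabel = c(1, 2),
                    `tBodyAcc-mean()-X` = c(0.1, 0.2), check.names = FALSE)
test  <- data.frame(SubjectId = c(2, 2), ActivityLabel = c(2, 1),
                    `tBodyAcc-mean()-X` = c(0.3, 0.4), check.names = FALSE)

# Row-wise combination into a single data frame.
combined <- rbind(train, test)

# Remove the literal "()" from the column names.
names(combined) <- gsub("()", "", names(combined), fixed = TRUE)

# Lookup table in the style of activity_labels.txt (id, label).
activity_labels <- data.frame(id = c(1, 2),
                              label = c("WALKING", "WALKING_UPSTAIRS"),
                              stringsAsFactors = FALSE)

# Replace numeric activity ids with their descriptive labels.
combined$ActivityLabel <-
  activity_labels$label[match(combined$ActivityLabel, activity_labels$id)]
print(combined)
```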
Finally, I split the data frame by subject id and, within each subject, by activity label, and compute the mean of each remaining feature vector element. The resulting tidy data set is written out to a text file.
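The per-subject, per-activity averaging can be sketched as below. For the sake of a self-contained example this uses base R's aggregate rather than plyr, so it is a stand-in for the technique and not the script's actual code; the toy values and the output file name are assumptions.

```r
# Toy combined data: two subjects, two activities, one feature column.
combined <- data.frame(
  SubjectId = c(1, 1, 1, 2, 2),
  ActivityLabel = c("WALKING", "WALKING", "SITTING", "WALKING", "WALKING"),
  `tBodyAcc-mean-X` = c(0.2, 0.4, 0.1, 0.5, 0.7),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Mean of every feature column for each (subject, activity) pair.
# The order of the `by` list puts SubjectId first, then ActivityLabel.
tidy <- aggregate(combined[, -(1:2), drop = FALSE],
                  by = list(SubjectId = combined$SubjectId,
                            ActivityLabel = combined$ActivityLabel),
                  FUN = mean)
tidy <- tidy[order(tidy$SubjectId, tidy$ActivityLabel), ]

# Write the result to a text file (temporary path, for illustration only).
out_file <- file.path(tempdir(), "tidy_example.txt")
write.table(tidy, out_file, row.names = FALSE)
print(tidy)
```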
In extracting the subset of features that pertain to mean or standard deviation measurements, I uniformly searched for the 'mean|std' pattern in the feature name. This could certainly be improved with a more precise regular expression that also filters out instances of 'meanFreq' and 'stdFreq' from the feature vector.
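One possible refinement, offered only as a sketch, is to require the literal "()" right after 'mean' or 'std'. The pattern below is my assumption of what such a filter could look like; it is not part of the script.

```r
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                   "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")

# Pattern used in the script: matches any name containing 'mean' or 'std'.
loose <- grepl("mean|std", feature_names)

# Stricter alternative: '-mean()' or '-std()' only, which excludes meanFreq().
strict <- grepl("-(mean|std)\\(\\)", feature_names)

print(data.frame(feature_names, loose, strict))
```

The two patterns differ only on the meanFreq() entry in this sample.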
I saw little merit in calling download.file() on the original data zip file from within the script, since I handled the zip archive manually on my desktop.
I initially considered removing the hyphens (-) from column names to better comply with data naming conventions. I ended up leaving them in, since they improve the clarity of names that are already composed of a handful of abbreviated words. For a similar reason, I found lowercasing the column names less compelling.
The tidy data has SubjectId as its first column, followed by ActivityLabel. I found this more intuitive than the swapped order originally generated by the plyr aggregation function.
The original data was obtained from: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
- Set the working directory to the parent folder of the data set folder.
- source("run_analysis.R")
- library(plyr)
- run_analysis("UCI HAR Dataset")
- The script produces the tidy data set file "UCI HAR Tidy Dataset.txt" in the root directory.