GetCleanDataProject

The R script run_analysis.R is the main program that cleans and tidies data collected from smartphones for human activity recognition.

How does the script process the data?

The script takes a relative root directory path that points to the data (e.g. UCI HAR Dataset) as a function argument. The data set is split into training and test subsets, each comprising a subject id, an activity label, and a 561-element feature vector.

In the first step, the train and test data files are read in from their respective folders under the root directory. For each of train and test, the subject ids, activity labels, and feature vectors are column-combined into one data frame.
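As a minimal sketch of this step, the snippet below reads one subset and column-combines its parts. The helper name read_split and the column names are hypothetical. The file names (subject_train.txt, y_train.txt, X_train.txt) assume the standard layout of the UCI HAR Dataset and are not taken from run_analysis.R. A tiny stand-in data set is written to a temporary folder so the sketch runs without the real data.

```r
# Hypothetical helper: read one subset ("train" or "test") and column-combine
# subject ids, activity labels, and feature vectors into one data frame.
read_split <- function(root, split) {
  dir <- file.path(root, split)
  subject  <- read.table(file.path(dir, paste0("subject_", split, ".txt")),
                         col.names = "SubjectId")
  activity <- read.table(file.path(dir, paste0("y_", split, ".txt")),
                         col.names = "ActivityLabel")
  features <- read.table(file.path(dir, paste0("X_", split, ".txt")))
  cbind(subject, activity, features)
}

# Toy stand-in data (3 features instead of 561), written to a temp folder.
root <- file.path(tempdir(), "toy_har")
dir.create(file.path(root, "train"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("1", "1", "2"), file.path(root, "train", "subject_train.txt"))
writeLines(c("5", "4", "5"), file.path(root, "train", "y_train.txt"))
writeLines(c("0.1 0.2 0.3", "0.4 0.5 0.6", "0.7 0.8 0.9"),
           file.path(root, "train", "X_train.txt"))

train <- read_split(root, "train")
dim(train)  # rows = observations, columns = 2 id columns + features
```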

Next, I read the feature definitions (from features.txt) into a data frame and build a logical vector with R's grepl function, selecting the features whose descriptive string contains 'mean' or 'std'. I then subset both the train and test data frames by that logical vector.
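The selection can be sketched as follows. The data frames here are toy stand-ins with a handful of feature names, and the variable names are illustrative rather than the ones used in run_analysis.R.

```r
# Toy stand-in for features.txt: an index column and a name column.
features <- data.frame(
  index = 1:5,
  name  = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-mad()-X",
            "fBodyAcc-meanFreq()-X", "tBodyAcc-max()-X"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the feature name contains 'mean' or 'std'.
keep <- grepl("mean|std", features$name)

# Toy measurements, one column per feature.
measurements <- as.data.frame(matrix(seq_len(10), nrow = 2))
names(measurements) <- features$name

# Keep only the selected feature columns.
selected <- measurements[, keep, drop = FALSE]
names(selected)
```

Note that this plain pattern also keeps the 'meanFreq' column.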

Then I combine the pruned train and test data frames row-wise into a single data frame. The original column names are cleaned up by removing the paired open and close parentheses (i.e. ()). After that, the numerical activity labels are replaced with the descriptive names read from activity_labels.txt.
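A small sketch of these three operations on toy data is shown below. The column names SubjectId and ActivityLabel follow the tidy output described in this README, while the lookup via match() is an assumption about one reasonable way to do the substitution, not necessarily what the script does.

```r
# Toy pruned train and test data frames with identical columns.
train <- data.frame(SubjectId = c(1, 2), ActivityLabel = c(1, 2),
                    `tBodyAcc-mean()-X` = c(0.1, 0.2), check.names = FALSE)
test  <- data.frame(SubjectId = 3, ActivityLabel = 1,
                    `tBodyAcc-mean()-X` = 0.3, check.names = FALSE)

# Row-wise combination into a single data frame.
combined <- rbind(train, test)

# Remove the literal "()" from column names; fixed = TRUE avoids regex escaping.
names(combined) <- gsub("()", "", names(combined), fixed = TRUE)

# Toy stand-in for activity_labels.txt: numeric id and descriptive name.
labels <- data.frame(id = 1:2, name = c("WALKING", "WALKING_UPSTAIRS"),
                     stringsAsFactors = FALSE)

# Replace each numeric label with its descriptive name.
combined$ActivityLabel <- labels$name[match(combined$ActivityLabel, labels$id)]
combined
```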

Finally, I split the data frame by subject id and then by activity label, and compute the mean of each retained feature for every subject and activity pair. The resulting tidy data set is written out to a text file.
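The grouping and averaging can be illustrated with base R's aggregate() on toy data. This is a stand-in under my own assumptions and may differ from the exact call in run_analysis.R. The output file name tidy_example.txt is hypothetical.

```r
# Toy combined data: two id columns followed by one feature column.
combined <- data.frame(
  SubjectId     = c(1, 1, 1, 2),
  ActivityLabel = c("WALKING", "WALKING", "SITTING", "WALKING"),
  `tBodyAcc-mean-X` = c(0.2, 0.4, 0.1, 0.5),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Mean of every feature column for each subject and activity pair.
tidy <- aggregate(combined[, -(1:2), drop = FALSE],
                  by = list(SubjectId = combined$SubjectId,
                            ActivityLabel = combined$ActivityLabel),
                  FUN = mean)

# Sort for readability, then write the result as a plain text table.
tidy <- tidy[order(tidy$SubjectId, tidy$ActivityLabel), ]
out_file <- file.path(tempdir(), "tidy_example.txt")
write.table(tidy, out_file, row.names = FALSE)
tidy
```

Listing SubjectId first in the by argument puts it ahead of ActivityLabel in the result.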

What assumptions have been made?

In extracting the subset of features that pertain to mean or standard deviation measurements, I uniformly searched for the 'mean|std' pattern in the feature name. This could certainly be improved with a more refined regular expression that also filters out instances of 'meanFreq' and 'stdFreq' from the feature vector.
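One possible refinement, shown only as a sketch and not what the script currently does, is to require the literal "()" right after 'mean' or 'std'. The pattern and the toy feature names below are my own illustration.

```r
# Toy feature names, including one 'meanFreq' entry.
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                   "fBodyAcc-meanFreq()-X", "tBodyAccMag-std()")

# Current pattern: matches any name containing 'mean' or 'std'.
loose <- grepl("mean|std", feature_names)

# Stricter pattern: '-mean()' or '-std()' only, which excludes 'meanFreq()'.
strict <- grepl("-(mean|std)\\(\\)", feature_names)

data.frame(feature_names, loose, strict)
```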

Why were things done a certain way?

I saw little merit in calling download.file() on the original data zip file from within the script, since I had already saved the zip manually on my desktop.

I considered initially removing the hyphens (-) from column names to better comply with data naming conventions. I ended up leaving them in to improve the clarity of a name that is already composed of a handful of abbreviated words. For a similar reason, I found lowering the case of column names less compelling.

The tidy data has SubjectId as its first column, followed by ActivityLabel. I found this order more intuitive than the swapped order originally generated by the plyr aggregate function.

Where was the original data obtained?

The original data was obtained from here: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip.

How to run the script?

Set the working directory to the parent folder of the data set folder.
source("run_analysis.R")
library(plyr)
run_analysis("UCI HAR Dataset")
The script produces the tidy data set file "UCI HAR Tidy Dataset.txt" in the root directory.
