class: center, middle, inverse, title-slide # Manipulating data in R ### Susy Echeverría-Londoño and Drew Kerkhoff --- class: inverse, center, middle # Data format --- # Entering data in a spreadsheet program Common formats: - `*.txt` - `*.csv(comma separated text)` --- # From Data carpentry 1. Download the data [here](https://ndownloader.figshare.com/files/2252083) (Ernest et al., 2017. FigShare) -- 2. Open up in a spreadsheet program - Two surveys from different field assistants (2013, 2014) -- 3. Discuss with the person next to you what is wrong with this spreadsheet -- 4. Write down the steps you would need to clean up the data .footnote[ Bahlai and Pawlik, 2017 ] --- # Common Spreadsheet Errors * Using multiple tables -- * Using multiple tabs -- * Not filling zeros (blanks, NA, na, N/A) -- * Formatting: highlighting cells -- * Formatting: "pretty cells" -- * Comments or more than one information in a cell -- * Problematic field names * Avoid spaces "variable name" * Use always underscores "variable_name" .footnote[ Bahlai and Pawlik, 2017 ] --- # Variables names ![](./figs/Var_names_examples.png) .footnote[ Bahlai and Pawlik, 2017 ] --- # Exercise: Structuring data in spreadsheets 1. Put all your variables in columns (e.g., "weight" or "temperature") 2. Put each observation in its own row 3. Don't combine multiple pieces of information in one cell. * bad: Species-Sex * better: Species, Sex 4. Leave the raw data raw - don't change it! 5. Keep track of the steps you took in your clean up or analysis. 6. Export the cleaned data to a text-based format like csv (comma-separated values) or txt format. --- class: inverse, center, middle # Importing and exploring data --- # Importing data in R ```r mydata <- read.csv("/directoryname/filename.csv") mydata <- read.table("/directoryname/filename.csv", sep = ",", header = TRUE) mydata <- read.csv(file.choose()) ## No recommended! ``` --- # Importing data in R ```r # data ---- mydata <- read.csv("Exercise/data/Survey_data_clean.csv") str(mydata) ``` ``` ## 'data.frame': 37 obs. of 8 variables: ## $ Species : Factor w/ 4 levels "DM","DM ","DO",..: 2 1 1 1 1 1 1 1 1 1 ... ## $ Plot : int 2 7 3 1 3 7 4 4 7 7 ... ## $ Sex : Factor w/ 5 levels " M","F","F ",..: 2 4 4 5 4 4 2 2 1 2 ... ## $ Weight : int NA 33 NA NA 40 48 29 46 36 35 ... ## $ Year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## $ Month : int 7 7 7 7 7 7 7 7 7 7 ... ## $ Day : int 16 16 16 16 18 18 18 18 18 18 ... ## $ Weight_Notes: Factor w/ 3 levels "","scale not calibrated",..: 1 1 1 1 1 1 1 1 1 1 ... ``` ```r levels(mydata$Sex) ``` ``` ## [1] " M" "F" "F " "M" "M " ``` --- # Exploring your data ```r # data ---- mydata <- read.csv("Exercise/data/Survey_data_clean.csv", stringsAsFactors = FALSE, strip.white = TRUE, na.strings = c("NA","")) str(mydata) ``` ``` ## 'data.frame': 37 obs. of 8 variables: ## $ Species : chr "DM" "DM" "DM" "DM" ... ## $ Plot : int 2 7 3 1 3 7 4 4 7 7 ... ## $ Sex : chr "F" "M" "M" "M" ... ## $ Weight : int NA 33 NA NA 40 48 29 46 36 35 ... ## $ Year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## $ Month : int 7 7 7 7 7 7 7 7 7 7 ... ## $ Day : int 16 16 16 16 18 18 18 18 18 18 ... ## $ Weight_Notes: chr NA NA NA NA ... ``` ```r unique(mydata$Sex) ``` ``` ## [1] "F" "M" ``` --- # Exploring your data ```r # libs ---- library(tidyverse) # data ---- mydata <- read_csv("Exercise/data/Survey_data_clean.csv") mydata ``` ``` ## # A tibble: 37 x 8 ## Species Plot Sex Weight Year Month Day Weight_Notes ## <chr> <int> <chr> <int> <int> <int> <int> <chr> ## 1 DM 2 F NA 2013 7 16 <NA> ## 2 DM 7 M 33 2013 7 16 <NA> ## 3 DM 3 M NA 2013 7 16 <NA> ## 4 DM 1 M NA 2013 7 16 <NA> ## 5 DM 3 M 40 2013 7 18 <NA> ## 6 DM 7 M 48 2013 7 18 <NA> ## 7 DM 4 F 29 2013 7 18 <NA> ## 8 DM 4 F 46 2013 7 18 <NA> ## 9 DM 7 M 36 2013 7 18 <NA> ## 10 DM 7 F 35 2013 7 18 <NA> ## # ... with 27 more rows ``` --- # Exploring your data ```r class(mydata$Sex) ``` ``` ## [1] "character" ``` ```r levels(mydata$Sex) ``` ``` ## NULL ``` ```r unique(mydata$Sex) ``` ``` ## [1] "F" "M" ``` --- # Exploring your data ```r mydata$Sex<-as.factor(mydata$Sex) class(mydata$Sex) ``` ``` ## [1] "factor" ``` ```r levels(mydata$Sex) ``` ``` ## [1] "F" "M" ``` --- class: inverse, center, middle # Manipulating data --- # Exercise 1. Download the code handout from [here](https://globalecologybiogeography.github.io/Ecoinformatics/code/Week3_CodeHandout.R) 2. Read data ```r # Download data download.file("url", "./data/portal_data_joined.csv") # Read data mydata<-read_csv("./data/portal_data_joined.csv") mydata<-read.csv("./data/portal_data_joined.csv") ``` --- # Inspecting your data ```r # Dimensions dim(mydata) ncol(mydata) nrow(mydata) # Content head(mydata) tail(mydata) # Variables and observation names names(mydata) rownames(mydata) # Summary str(mydata) summary(mydata) ``` --- # Factors in R * Represent categorical data * Ordered vs unordered (Statistics and plots) * Stored as integers and not characters <img src="http://r4ds.had.co.nz/visualize_files/figure-html/unnamed-chunk-30-1.png" style="width: 70%"/> --- # Factors in R ```r habitat<-c("Crops","Crops","Urban","Forest","Forest") unique(habitat) ``` ``` ## [1] "Crops" "Urban" "Forest" ``` ```r habitat<-as.factor(habitat) levels(habitat) ``` ``` ## [1] "Crops" "Forest" "Urban" ``` --- # Factors in R ```r habitat<-relevel(habitat, ref = "Forest") levels(habitat) ``` ``` ## [1] "Forest" "Crops" "Urban" ``` ```r habitat<-factor(habitat,levels=c("Forest","Urban","Crops")) levels(habitat) ``` ``` ## [1] "Forest" "Urban" "Crops" ``` --- # Selecting columns ```r mycols<-mydata[,c("plot_id","genus")] mycols<-mydata[,c(5,10)] # tidyverse mycols<-select(mydata, plot_id, genus) ``` --- # Filtering rows Rows based on a specific criteria ```r ref<-which(mydata$year==2000) myrows<-mydata[ref,] # tidyverse myrows<-filter(mydata, year==2000) ``` --- # Create new cols or variables ```r mydata$weight_kg<-mydata$weight/1000 mydata["weight_kg"]<-mydata["weight"]/1000 # tidyverse mutate(mydata, weight_kg = weight / 1000, weight_kg2 = weight_kg * 2) ``` --- # Summarise dataframes ```r # Mean weight mean(mydata$weight) #tidyverse summarize(mydata,mean_weight = mean(weight, na.rm = TRUE)) summarize(mydata, mean_weight = mean(weight, na.rm = TRUE), duration = max(year) - min(year), N_Genus=length(unique(genus))) ``` --- # Summarise dataframes by groups ```r # Mean weight per sex tapply(mydata$weight,mydata$sex,mean,na.rm=TRUE) #tidyverse ob1<-group_by(mydata, sex) summarise(ob1,mean_weight = mean(weight, na.rm = TRUE)) summarise( group_by(mydata, sex), mean_weight = mean(weight, na.rm = TRUE)) ``` --- # Number records for each category ```r # Number of records per sex table(mydata$sex) #tidyverse ob1<-group_by(mydata, sex) tally(ob1) ``` --- # Export data ```r newdata<-filter(mydata, year>=2000 & !is.na(weight)) write.csv(newdata, file="./outputs/mynewdata.csv") ``` --- # Pipes in R ```r #1. Remove rows with NAs in weight #2. Summarise the mean and min weight per sex obj1<-mydata[!is.na(mydata$weight),] obj2<-tapply(obj1$weight, list(obj1$genus,obj1$sex), mean) ``` --- # Pipes in R ```r #1. Remove rows with NAs in weight #2. Summarise the mean and min weight per genus per sex obj1<-filter(mydata, !is.na(weight)) obj2<-group_by(obj1, genus,sex) summarise(obj2, mean_weight = mean(weight), min_weight = min(weight)) ``` --- # Pipes in R ```r #1. Remove rows with NAs in weight #2. Summarise the mean and min weight per sex # Really bad summarize(group_by(filter(mydata,!is.na(weight)),genus,sex),mean_weight = mean(weight),min_weight = min(weight)) # Better obj1<- mydata %>% filter(!is.na(weight)) %>% group_by(genus,sex) %>% summarize(mean_weight = mean(weight), min_weight = min(weight)) ```