Manipulating data in R

# Manipulating data in R
### Susy Echeverría-Londoño and Drew Kerkhoff

---

# Data format

---

# Entering data in a spreadsheet program

Common formats: 
- `*.txt`
- `*.csv(comma separated text)`

---
# From Data carpentry

1. Download the data [here](https://ndownloader.figshare.com/files/2252083) (Ernest et al., 2017. FigShare)

2. Open up in a spreadsheet program
  - Two surveys from different field assistants (2013, 2014)

3. Discuss with the person next to you what is wrong with this spreadsheet

4. Write down the steps you would need to clean up the data

---
# Common Spreadsheet Errors

* Using multiple tables

* Using multiple tabs

--
* Not filling zeros (blanks, NA, na, N/A)

--
* Formatting: highlighting cells

--
* Formatting: "pretty cells"

--
* Comments or more than one information in a cell

--
* Problematic field names
  * Avoid spaces "variable name"
  * Use always underscores "variable_name"

---
# Variables names

![](./figs/Var_names_examples.png)

---
# Exercise: Structuring data in spreadsheets

1. Put all your variables in columns (e.g., "weight" or "temperature")

2. Put each observation in its own row

3. Don't combine multiple pieces of information in one cell. 
  * bad: Species-Sex
  * better: Species, Sex

4. Leave the raw data raw - don't change it!

5. Keep track of the steps you took in your clean up or analysis.

6. Export the cleaned data to a text-based format like csv (comma-separated values) or txt format.

---

# Importing and exploring data

---

# Importing data in R

```r
mydata <- read.csv("/directoryname/filename.csv")

mydata <- read.table("/directoryname/filename.csv",
                     sep = ",",
                     header = TRUE)

mydata <- read.csv(file.choose()) ## No recommended!      
```

---

# Importing data in R

```r
# data ----
mydata <- read.csv("Exercise/data/Survey_data_clean.csv")

str(mydata)
```

```
## 'data.frame':	37 obs. of  8 variables:
##  $ Species     : Factor w/ 4 levels "DM","DM ","DO",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ Plot        : int  2 7 3 1 3 7 4 4 7 7 ...
##  $ Sex         : Factor w/ 5 levels " M","F","F ",..: 2 4 4 5 4 4 2 2 1 2 ...
##  $ Weight      : int  NA 33 NA NA 40 48 29 46 36 35 ...
##  $ Year        : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ Month       : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Day         : int  16 16 16 16 18 18 18 18 18 18 ...
##  $ Weight_Notes: Factor w/ 3 levels "","scale not calibrated",..: 1 1 1 1 1 1 1 1 1 1 ...
```

```r
levels(mydata$Sex)
```

```
## [1] " M" "F"  "F " "M"  "M "
```

---

# Exploring your data

```r
# data ----
mydata <- read.csv("Exercise/data/Survey_data_clean.csv", 
                   stringsAsFactors = FALSE,
                   strip.white = TRUE,
                   na.strings = c("NA",""))
str(mydata)
```

```
## 'data.frame':	37 obs. of  8 variables:
##  $ Species     : chr  "DM" "DM" "DM" "DM" ...
##  $ Plot        : int  2 7 3 1 3 7 4 4 7 7 ...
##  $ Sex         : chr  "F" "M" "M" "M" ...
##  $ Weight      : int  NA 33 NA NA 40 48 29 46 36 35 ...
##  $ Year        : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ Month       : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Day         : int  16 16 16 16 18 18 18 18 18 18 ...
##  $ Weight_Notes: chr  NA NA NA NA ...
```

```r
unique(mydata$Sex)
```

```
## [1] "F" "M"
```

---
# Exploring your data

```r
# libs ----
library(tidyverse)

# data ----
mydata <- read_csv("Exercise/data/Survey_data_clean.csv")

mydata
```

```
## # A tibble: 37 x 8
##    Species  Plot   Sex Weight  Year Month   Day Weight_Notes
##      <chr> <int> <chr>  <int> <int> <int> <int>        <chr>
##  1      DM     2     F     NA  2013     7    16         <NA>
##  2      DM     7     M     33  2013     7    16         <NA>
##  3      DM     3     M     NA  2013     7    16         <NA>
##  4      DM     1     M     NA  2013     7    16         <NA>
##  5      DM     3     M     40  2013     7    18         <NA>
##  6      DM     7     M     48  2013     7    18         <NA>
##  7      DM     4     F     29  2013     7    18         <NA>
##  8      DM     4     F     46  2013     7    18         <NA>
##  9      DM     7     M     36  2013     7    18         <NA>
## 10      DM     7     F     35  2013     7    18         <NA>
## # ... with 27 more rows
```

---

# Exploring your data

```r
class(mydata$Sex)
```

```
## [1] "character"
```

```r
levels(mydata$Sex)
```

```
## NULL
```

```r
unique(mydata$Sex)
```

```
## [1] "F" "M"
```

---

# Exploring your data

```r
mydata$Sex<-as.factor(mydata$Sex)

class(mydata$Sex)
```

```
## [1] "factor"
```

```r
levels(mydata$Sex)
```

```
## [1] "F" "M"
```

---

# Manipulating data

---

# Exercise

1. Download the code handout from [here](https://globalecologybiogeography.github.io/Ecoinformatics/code/Week3_CodeHandout.R)

2. Read data

```r
# Download data

download.file("url", "./data/portal_data_joined.csv")

# Read data

mydata<-read_csv("./data/portal_data_joined.csv")

mydata<-read.csv("./data/portal_data_joined.csv")
```
---

# Inspecting your data

```r
# Dimensions
dim(mydata)
ncol(mydata)
nrow(mydata)

# Content
head(mydata)
tail(mydata)

# Variables and observation names
names(mydata)
rownames(mydata)

# Summary
str(mydata)
summary(mydata)
```

---

# Factors in R

* Represent categorical data

* Ordered vs unordered (Statistics and plots)

* Stored as integers and not characters

---

# Factors in R

```r
habitat<-c("Crops","Crops","Urban","Forest","Forest")
unique(habitat)
```

```
## [1] "Crops"  "Urban"  "Forest"
```

```r
habitat<-as.factor(habitat)
levels(habitat)
```

```
## [1] "Crops"  "Forest" "Urban"
```

---

# Factors in R

```r
habitat<-relevel(habitat, ref = "Forest")
levels(habitat)
```

```
## [1] "Forest" "Crops"  "Urban"
```

```r
habitat<-factor(habitat,levels=c("Forest","Urban","Crops"))
levels(habitat)
```

```
## [1] "Forest" "Urban"  "Crops"
```

---

# Selecting columns

```r
mycols<-mydata[,c("plot_id","genus")]

mycols<-mydata[,c(5,10)]

# tidyverse
mycols<-select(mydata, plot_id, genus)
```

---

# Filtering rows

Rows based on a specific criteria

```r
ref<-which(mydata$year==2000)
myrows<-mydata[ref,]

# tidyverse
myrows<-filter(mydata, year==2000)
```

---

# Create new cols or variables

```r
mydata$weight_kg<-mydata$weight/1000

mydata["weight_kg"]<-mydata["weight"]/1000

# tidyverse
mutate(mydata, 
       weight_kg = weight / 1000,
       weight_kg2 = weight_kg * 2)
```
---

# Summarise dataframes

```r
# Mean weight
mean(mydata$weight)

#tidyverse
summarize(mydata,mean_weight = mean(weight, na.rm = TRUE))

summarize(mydata,
          mean_weight = mean(weight, na.rm = TRUE),
          duration = max(year) - min(year),
          N_Genus=length(unique(genus)))
```

---

# Summarise dataframes by groups

```r
# Mean weight per sex 
tapply(mydata$weight,mydata$sex,mean,na.rm=TRUE)

#tidyverse
ob1<-group_by(mydata, sex)

summarise(ob1,mean_weight = mean(weight, na.rm = TRUE))

summarise(
  group_by(mydata, sex),
  mean_weight = mean(weight, na.rm = TRUE))
```

---

# Number records for each category

```r
# Number of records per sex 
table(mydata$sex)

#tidyverse
ob1<-group_by(mydata, sex)

tally(ob1)
```

---

# Export data

```r
newdata<-filter(mydata, 
                year>=2000 & !is.na(weight))

write.csv(newdata, file="./outputs/mynewdata.csv")
```

---

# Pipes in R

```r
#1. Remove rows with NAs in weight
#2. Summarise the mean and min weight per sex

obj1<-mydata[!is.na(mydata$weight),]

obj2<-tapply(obj1$weight,
             list(obj1$genus,obj1$sex),
             mean)
```

---

# Pipes in R

```r
#1. Remove rows with NAs in weight
#2. Summarise the mean and min weight per genus per sex

obj1<-filter(mydata, !is.na(weight))

obj2<-group_by(obj1, genus,sex)

summarise(obj2, 
          mean_weight = mean(weight),
          min_weight = min(weight))
```

---

# Pipes in R

```r
#1. Remove rows with NAs in weight
#2. Summarise the mean and min weight per sex

# Really bad
summarize(group_by(filter(mydata,!is.na(weight)),genus,sex),mean_weight = mean(weight),min_weight = min(weight))

# Better
obj1<-
  mydata %>%
  filter(!is.na(weight)) %>%
  group_by(genus,sex) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight))
```