Aggregating Data

<br>
<br>
.right-panel[

# Aggregating Data
## Dr. Mine Dogucu
]

---
class: inverse middle

---

```r
lapd %>% 
  count(employment_type)
```

```
## # A tibble: 3 x 2
##   employment_type     n
##   <fct>           <int>
## 1 Full Time       14664
## 2 Part Time         132
## 3 Per Event          28
```

---

```r
lapd %>% 
  count(employment_type) %>% 
  mutate(prop = n/sum(n))
```

```
## # A tibble: 3 x 3
##   employment_type     n    prop
##   <fct>           <int>   <dbl>
## 1 Full Time       14664 0.989  
## 2 Part Time         132 0.00890
## 3 Per Event          28 0.00189
```
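
---

As a quick check on the `count()` plus `mutate()` pattern above, here is a minimal sketch on a small made-up tibble. The name `staff` and its values are hypothetical stand-ins for `lapd`, not the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for `lapd` (values are made up)
staff <- tibble(
  employment_type = c("Full Time", "Full Time", "Full Time",
                      "Part Time", "Per Event")
)

props <- staff %>% 
  count(employment_type) %>%   # one row per employment type, with count n
  mutate(prop = n / sum(n))    # share of all rows in each type

props

# Each prop is n divided by the total, so the proportions should add up to 1
sum(props$prop)
```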

---

## Mean

```r
summarize(lapd, 
          mean(base_pay))
```

```
## # A tibble: 1 x 1
##   `mean(base_pay)`
##              <dbl>
## 1           85149.
```

```r
mean(lapd$base_pay)
```

```
## [1] 85149.05
```


---

## Median

```r
summarize(lapd, 
          median(base_pay))
```

```
## # A tibble: 1 x 1
##   `median(base_pay)`
##                <dbl>
## 1             97601.
```

```r
median(lapd$base_pay)
```

```
## [1] 97600.66
```


---

## Quantiles

```r
summarize(lapd, quantile(base_pay, c(0.25, 0.50, 0.75)))
```

```
## # A tibble: 3 x 1
##   `quantile(base_pay, c(0.25, 0.5, 0.75))`
##                                      <dbl>
## 1                                   67266.
## 2                                   97601.
## 3                                  109368.
```
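
---

In dplyr 1.1.0 and later, a `summarize()` call that returns more than one row, like the quantile example, triggers a deprecation warning, and `reframe()` is the suggested replacement. The sketch below assumes dplyr 1.1.0 or newer and uses a made-up tibble named `staff` (hypothetical, not the `lapd` data). The `prob` column is added here only to label each row.

```r
library(dplyr)

# Hypothetical toy data standing in for `lapd` (values are made up)
staff <- tibble(base_pay = c(4000, 15000, 70000, 90000, 110000))

probs <- c(0.25, 0.50, 0.75)

# reframe() is meant for summaries that return several rows
# (assumes dplyr >= 1.1.0)
pay_quantiles <- staff %>% 
  reframe(prob = probs,
          base_pay_quantile = quantile(base_pay, probs))

pay_quantiles
```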

---

```r
summarize(lapd,
          mean(base_pay),
          median(base_pay))
```

```
## # A tibble: 1 x 2
##   `mean(base_pay)` `median(base_pay)`
##              <dbl>              <dbl>
## 1           85149.             97601.
```

Note how the variable names in this table are not easy to read.

---

```r
summarize(lapd,
          mean_base_pay = mean(base_pay),
          med_base_pay = median(base_pay))
```

```
## # A tibble: 1 x 2
##   mean_base_pay med_base_pay
##           <dbl>        <dbl>
## 1        85149.       97601.
```

What if we wanted to extract `mean_base_pay` from this table?

---

```r
summarize(lapd,
          mean_base_pay = mean(base_pay),
          med_base_pay = median(base_pay)) %>% 
  select(mean_base_pay) %>% 
  slice(1) %>% 
  pull()
```

```
## [1] 85149.05
```
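
---

A shorter route to the same value, sketched on made-up data: `pull()` accepts a column name, so the `select()` and `slice()` steps can be skipped when the summary has a single row. The tibble `staff` is a hypothetical stand-in for `lapd`.

```r
library(dplyr)

# Hypothetical toy data standing in for `lapd` (values are made up)
staff <- tibble(base_pay = c(4000, 15000, 70000, 90000, 110000))

mean_pay <- staff %>% 
  summarize(mean_base_pay = mean(base_pay),
            med_base_pay = median(base_pay)) %>% 
  pull(mean_base_pay)   # returns the column as a plain numeric vector

mean_pay
```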

---

`group_by()`

---

Q. What is the median salary for each employment type?

---

```r
lapd %>% 
  group_by(employment_type)
```

```
## # A tibble: 14,824 x 4
## # Groups:   employment_type [3]
##    job_class_title                  employment_type base_pay base_pay_level     
##    <fct>                            <fct>              <dbl> <chr>              
##  1 Police Detective II              Full Time        119322. Greater than Median
##  2 Police Sergeant I                Full Time        113271. Greater than Median
##  3 Police Lieutenant II             Full Time        148116  Greater than Median
##  4 Police Service Representative II Full Time         78677. Greater than Median
##  5 Police Officer III               Full Time        109374. Greater than Median
##  6 Police Officer II                Full Time         95002. Greater than Median
##  7 Police Officer II                Full Time         95379. Greater than Median
##  8 Police Officer II                Full Time         95388. Greater than Median
##  9 Equipment Mechanic               Full Time         80496  Greater than Median
## 10 Detention Officer                Full Time         69640  Greater than Median
## # … with 14,814 more rows
```

---

```r
lapd %>% 
  group_by(employment_type) %>% 
  summarize(med_base_pay = median(base_pay))
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

```
## # A tibble: 3 x 2
##   employment_type med_base_pay
##   <fct>                  <dbl>
## 1 Full Time             97996.
## 2 Part Time             14474.
## 3 Per Event              4275
```
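
---

The message printed by `summarise()` points to the `.groups` argument. As a minimal sketch on made-up data, setting `.groups = "drop"` states the intent explicitly and should silence the message. The tibble `staff` is a hypothetical stand-in for `lapd`.

```r
library(dplyr)

# Hypothetical toy data standing in for `lapd` (values are made up)
staff <- tibble(
  employment_type = c("Full Time", "Full Time", "Full Time",
                      "Part Time", "Per Event"),
  base_pay = c(70000, 90000, 110000, 15000, 4000)
)

by_type <- staff %>% 
  group_by(employment_type) %>% 
  summarize(med_base_pay = median(base_pay),
            .groups = "drop")   # return an ungrouped tibble

by_type

# Check that no grouping is left on the result
is_grouped_df(by_type)
```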

---

We can also remind ourselves how many staff members there were in each group.

```r
lapd %>% 
  group_by(employment_type) %>% 
  summarize(med_base_pay = median(base_pay),
            n = n())
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

```
## # A tibble: 3 x 3
##   employment_type med_base_pay     n
##   <fct>                  <dbl> <int>
## 1 Full Time             97996. 14664
## 2 Part Time             14474.   132
## 3 Per Event              4275     28
```
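
---

The `n` column here carries the same information as the earlier `count(employment_type)` table. The sketch below compares the two approaches on made-up data. The tibble `staff` is a hypothetical stand-in for `lapd`.

```r
library(dplyr)

# Hypothetical toy data standing in for `lapd` (values are made up)
staff <- tibble(
  employment_type = c("Full Time", "Full Time", "Full Time",
                      "Part Time", "Per Event"),
  base_pay = c(70000, 90000, 110000, 15000, 4000)
)

# Approach 1: count() does the grouping and counting in one step
counts_a <- staff %>% 
  count(employment_type)

# Approach 2: group_by() followed by summarize() with n()
counts_b <- staff %>% 
  group_by(employment_type) %>% 
  summarize(n = n(), .groups = "drop")

# Compare the counts produced by the two approaches
identical(counts_a$n, counts_b$n)
```

`count()` is convenient when only the group sizes are needed. `group_by()` with `summarize()` is the better fit when other statistics, such as the median, belong in the same table.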