class: title-slide

<br>
<br>
.right-panel[

# Aggregating Data

## Dr. Mine Dogucu

]

---

class: middle

.pull-left[

## Data

Observations

]

.pull-left[

## Aggregate Data

Summaries of observations

]

---

class: inverse middle

.font75[Aggregating Categorical Data]

---

class: middle

```r
lapd %>%
  count(employment_type)
```

```
## # A tibble: 3 x 2
##   employment_type     n
##   <fct>           <int>
## 1 Full Time       14664
## 2 Part Time         132
## 3 Per Event          28
```

---

```r
lapd %>%
  count(employment_type) %>%
  mutate(prop = n/sum(n))
```

```
## # A tibble: 3 x 3
##   employment_type     n    prop
##   <fct>           <int>   <dbl>
## 1 Full Time       14664 0.989
## 2 Part Time         132 0.00890
## 3 Per Event          28 0.00189
```

---

class: inverse middle

.font75[Aggregating Numerical Data]

---

## Mean

.pull-left[

```r
summarize(lapd,
          mean(base_pay))
```

```
## # A tibble: 1 x 1
##   `mean(base_pay)`
##              <dbl>
## 1           85149.
```

]

--

.pull-right[

```r
mean(lapd$base_pay)
```

```
## [1] 85149.05
```

]

---

## Median

.pull-left[

```r
summarize(lapd,
          median(base_pay))
```

```
## # A tibble: 1 x 1
##   `median(base_pay)`
##                <dbl>
## 1             97601.
```

]

--

.pull-right[

```r
median(lapd$base_pay)
```

```
## [1] 97600.66
```

]

---

## Quantiles

```r
summarize(lapd,
          quantile(base_pay, c(0.25, 0.50, 0.75)))
```

```
## # A tibble: 3 x 1
##   `quantile(base_pay, c(0.25, 0.5, 0.75))`
##                                      <dbl>
## 1                                   67266.
## 2                                   97601.
## 3                                  109368.
```

---

```r
summarize(lapd,
          mean(base_pay),
          median(base_pay))
```

```
## # A tibble: 1 x 2
##   `mean(base_pay)` `median(base_pay)`
##              <dbl>              <dbl>
## 1           85149.             97601.
```

Note that the variable names in this table are not easy to read.

---

```r
summarize(lapd,
          mean_base_pay = mean(base_pay),
          med_base_pay = median(base_pay))
```

```
## # A tibble: 1 x 2
##   mean_base_pay med_base_pay
##           <dbl>        <dbl>
## 1        85149.       97601.
```

What if we wanted to extract `mean_base_pay` from this table?

---

```r
summarize(lapd,
          mean_base_pay = mean(base_pay),
          med_base_pay = median(base_pay)) %>%
  select(mean_base_pay) %>%
  slice(1) %>%
  pull()
```

```
## [1] 85149.05
```

---

class: inverse middle

.font75[Aggregating Data by Groups]

---

`group_by()`

<img src="img/data-wrangle.003.jpeg" width="80%" style="display: block; margin: auto;" />

---

Q. What is the median salary for each employment type?

---

```r
lapd %>%
  group_by(employment_type)
```

```
## # A tibble: 14,824 x 4
## # Groups:   employment_type [3]
##    job_class_title                  employment_type base_pay base_pay_level
##    <fct>                            <fct>              <dbl> <chr>
##  1 Police Detective II              Full Time        119322. Greater than Median
##  2 Police Sergeant I                Full Time        113271. Greater than Median
##  3 Police Lieutenant II             Full Time        148116  Greater than Median
##  4 Police Service Representative II Full Time         78677. Greater than Median
##  5 Police Officer III               Full Time        109374. Greater than Median
##  6 Police Officer II                Full Time         95002. Greater than Median
##  7 Police Officer II                Full Time         95379. Greater than Median
##  8 Police Officer II                Full Time         95388. Greater than Median
##  9 Equipment Mechanic               Full Time         80496  Greater than Median
## 10 Detention Officer                Full Time         69640  Greater than Median
## # … with 14,814 more rows
```

---

```r
lapd %>%
  group_by(employment_type) %>%
  summarize(med_base_pay = median(base_pay))
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

```
## # A tibble: 3 x 2
##   employment_type med_base_pay
##   <fct>                  <dbl>
## 1 Full Time             97996.
## 2 Part Time             14474.
## 3 Per Event              4275
```

---

We can also remind ourselves how many staff members there were in each group.

```r
lapd %>%
  group_by(employment_type) %>%
  summarize(med_base_pay = median(base_pay),
            n = n())
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

```
## # A tibble: 3 x 3
##   employment_type med_base_pay     n
##   <fct>                  <dbl> <int>
## 1 Full Time             97996. 14664
## 2 Part Time             14474.   132
## 3 Per Event              4275     28
```
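
---

The `summarise()` message in the output above mentions a `.groups` argument. Below is a minimal sketch of how it might be used, assuming dplyr 1.0.0 or later. The `staff` tibble and its values are made up for illustration and only stand in for `lapd`.

```r
library(dplyr)

# Hypothetical toy data standing in for `lapd`; the values are invented.
staff <- tibble(
  employment_type = c("Full Time", "Full Time", "Full Time",
                      "Part Time", "Part Time"),
  base_pay = c(90000, 100000, 110000, 12000, 16000)
)

# Group, then summarize. `.groups = "drop"` asks for an ungrouped result
# explicitly, so the "ungrouping output" message is not printed.
pay_summary <- staff %>%
  group_by(employment_type) %>%
  summarize(med_base_pay = median(base_pay),
            n = n(),
            .groups = "drop")

pay_summary
```

The result has one row per employment type, just like the grouped summaries of `lapd`.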