1.1 Let’s look again at the gapminder dataset and create a new column, life_level, that contains five categories (“very high”, “high”, “moderate”, “low” and “very low”) based on life expectancy in 1997. Assign categories according to the table below:
| Criteria | life_level |
|---|---|
| less than 23 | very low |
| 23 to less than 48 | low |
| 48 to less than 59 | moderate |
| 59 to less than 70 | high |
| 70 or more | very high |
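The function case_when() is a tidier way to vectorise multiple if_else() statements. Conditions are evaluated in order, and each value gets the result of the first condition it satisfies. You can read more about this function in the dplyr documentation.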
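Before running it on gapminder, it may help to see case_when() on a toy vector. This is only a sketch: the values in life_exp are made up so that some of them land exactly on a cut-off.

```r
library(dplyr)

# Hypothetical life expectancy values (not taken from gapminder)
life_exp <- c(20, 45, 59, 70, 82)

# Conditions are checked top to bottom; the first one that is TRUE wins,
# and the final TRUE acts as the "everything else" branch
life_level <- case_when(
  life_exp < 23 ~ "very low",
  life_exp < 48 ~ "low",
  life_exp < 59 ~ "moderate",
  life_exp < 70 ~ "high",
  TRUE ~ "very high"
)
life_level
# "very low" "low" "high" "very high" "very high"
```

Note that 59 falls into “high” and 70 into “very high”, because the comparisons use a strict less-than.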
# kind of like a switch/case statement in other languages
gapminder %>%
  filter(year == 1997) %>%
  mutate(life_level = case_when(lifeExp < 23 ~ "very low",
                                lifeExp < 48 ~ "low",
                                lifeExp < 59 ~ "moderate",
                                lifeExp < 70 ~ "high",
                                TRUE ~ "very high")) %>%
  ggplot() + geom_boxplot(aes(x = life_level, y = gdpPercap)) +
  labs(y = "GDP per capita, $", x = "Life expectancy level, years") +
  theme_bw()
Do you notice anything odd/wrong about the graph?
We can make a few observations:
It seems that none of the countries had a “very low” life expectancy in 1997.
However, since it was an option in our analysis, it should be included in our plot. Right?
Notice also how the levels on the x-axis are placed in the “wrong” order: alphabetical rather than from “very low” to “very high”.
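Both observations can be reproduced without a plot. The sketch below uses a made-up character vector of categories; when it is turned into a factor without explicit levels, the levels come out sorted alphabetically and only the values that actually occur are present.

```r
# Hypothetical categories, as case_when() would return them (plain character)
life_level <- c("high", "very high", "low", "moderate", "high")

# factor() without a levels argument sorts the unique values
f <- factor(life_level)
levels(f)
# "high" "low" "moderate" "very high"

# "very low" never occurs in the data, so it is not a level at all
"very low" %in% levels(f)
# FALSE
```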
1.2 You can correct these issues by explicitly setting the levels parameter in the call to factor(). Use drop = FALSE in scale_x_discrete() to tell the plot not to drop unused levels.
# can use ylab and xlab instead of labs as well
gapminder %>%
  filter(year == 1997) %>%
  mutate(life_level = factor(case_when(lifeExp < 23 ~ "very low",
                                       lifeExp < 48 ~ "low",
                                       lifeExp < 59 ~ "moderate",
                                       lifeExp < 70 ~ "high",
                                       TRUE ~ "very high"),
                             levels = c("very low", "low", "moderate", "high", "very high"))) %>%
  ggplot() + geom_boxplot(aes(x = life_level, y = gdpPercap)) +
  labs(y = "GDP per capita, $", x = "Life expectancy level, years") +
  scale_x_discrete(drop = FALSE) +
  theme_bw()
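The effect of the levels argument can also be checked on a small, made-up vector. In this sketch the factor keeps all five levels in the order given, including “very low”, which has a count of zero.

```r
# The full set of categories, in the order we want them displayed
all_levels <- c("very low", "low", "moderate", "high", "very high")

# Hypothetical data in which "very low" never occurs
f <- factor(c("high", "very high", "low", "moderate", "high"),
            levels = all_levels)

levels(f)   # same order as all_levels
nlevels(f)  # 5, even though only 4 categories occur
table(f)    # "very low" is listed with a count of 0
```

The unused level is kept in the factor itself; scale_x_discrete(drop = FALSE) is what stops ggplot2 from hiding it on the axis.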
In Activity 1, we created our own factors, so now let’s explore which categorical variables we have in the gapminder dataset.
gapminder$continent (activity 2.1)
Use functions such as str(), levels(), nlevels() and class() to answer the following questions:
What is the class of continent (a factor or character)?
class(gapminder$continent)
## [1] "factor"
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
nlevels(gapminder$continent)
## [1] 5
str(gapminder$continent)
## Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
# first few continents are Asia-- that's why str() yields 3,3,3
# because Asia is 3rd in levels()
datatable(gapminder)
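The integers that str() prints for a factor are positions in levels(). A minimal sketch with a made-up factor (not the real gapminder column) shows how the two fit together:

```r
# Hypothetical factor; levels are sorted alphabetically by default
continent <- factor(c("Asia", "Asia", "Europe", "Africa"))

levels(continent)
# "Africa" "Asia" "Europe"

# The underlying integer codes index into levels()
codes <- as.integer(continent)
codes
# 2 2 3 1

# Looking the codes up in levels() recovers the original labels
labels_back <- levels(continent)[codes]
labels_back
# "Asia" "Asia" "Europe" "Africa"
```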
gapminder$country (activity 2.2)
Let’s explore what else we can do with factors.
Answer the following questions:
How many levels are there in country?
Filter the gapminder dataset by 5 countries of your choice. How many levels are in your filtered dataset?
nlevels(gapminder$country)
## [1] 142
# use %in% for 'membership checks', like "in" in python
countries <- c("Canada", "United States", "Mexico", "China", "Japan")
gap <- gapminder %>%
  filter(country %in% countries)
# Factor levels preserved?
nlevels(gap$country)
## [1] 142
What if we want to get rid of some levels that are “unused” - how do we do that?
The function droplevels() operates on all the factors in a data frame or on a single factor. The function forcats::fct_drop() operates on a factor.
h_gap_dropped <- gap %>%
  droplevels()
# nlevels() (like fct_drop()) takes a factor, NOT a tibble, so extract the column with $
nlevels(h_gap_dropped$country)
## [1] 5
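The same behaviour can be seen on a single factor. This sketch uses a made-up three-level factor and assumes the forcats package is installed; subsetting keeps all levels, while droplevels() and fct_drop() both remove the unused one.

```r
library(forcats)

# Hypothetical factor with three levels
f <- factor(c("a", "b", "c"))

# Subsetting removes the value "c" but keeps it as a level
f_sub <- f[f != "c"]
nlevels(f_sub)              # 3

# Base R and forcats give the same result on a single factor
nlevels(droplevels(f_sub))  # 2
nlevels(fct_drop(f_sub))    # 2
```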
Let’s say we wanted to re-order the levels of a factor using a new metric - say, count().
We should first produce a frequency table as a tibble using dplyr::count():
gapminder %>%
  count(continent)
## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
The table is nice, but it would be better to visualize the data. Factors are most useful/helpful when plotting data. So let’s first plot this:
# ylab is actually the x-axis after the coord_flip()
# xlab is actually the y-axis after the coord_flip()
gapminder %>%
  ggplot() +
  geom_bar(aes(continent)) +
  coord_flip() +
  theme_bw() +
  ylab("Number of entries") +
  xlab("Continent")
Think about how levels are normally ordered. By default, R sorts the levels alphabetically when a factor is created from character data. However, it is usually preferable to order the levels according to some principle, such as frequency:
fct_infreq() might be useful: it orders the levels by how often they occur, most frequent first.
fct_rev() will sort them in the opposite order.
For instance:
gapminder %>%
  ggplot() +
  geom_bar(aes(fct_rev(fct_infreq(continent)))) +
  coord_flip() +
  theme_bw() +
  ylab("Number of entries") +
  xlab("Continent")
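To see what fct_infreq() and fct_rev() do to the levels themselves, here is a small sketch on a made-up factor (the values are purely illustrative):

```r
library(forcats)

# Hypothetical factor: "c" occurs 3 times, "a" twice, "b" once
f <- factor(c("b", "a", "c", "c", "c", "a"))

levels(f)                       # alphabetical: "a" "b" "c"

by_freq <- fct_infreq(f)
levels(by_freq)                 # most frequent first: "c" "a" "b"

by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)             # reversed: "b" "a" "c"
```

With coord_flip(), the first level is drawn at the bottom, which is why the reversed order puts the most frequent category at the top of the chart.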
Section 9.6 of Jenny Bryan’s notes has some helpful examples.
If we order gapminder countries by life expectancy, we can visualize the results using fct_reorder().
# default summarizing function is median() NOT mean(); here max is used instead
# ordered by life expectancy, but this is not evident because the bars measure the number of entries
gapminder %>%
  ggplot() +
  geom_bar(aes(fct_reorder(continent, lifeExp, max))) +
  coord_flip() +
  theme_bw() +
  xlab("Continent") +
  ylab("Number of entries")
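Because the bar chart hides the ordering variable, it can be easier to inspect the levels directly. The sketch below uses a made-up data frame to show how the summary function passed to fct_reorder() changes the result.

```r
library(forcats)

# Hypothetical data: group "a" has one large outlier
df <- data.frame(
  g = factor(c("a", "a", "a", "b", "b", "b")),
  x = c(1, 2, 30, 5, 6, 7)
)

# Default summary is the median: a = 2, b = 6
by_median <- levels(fct_reorder(df$g, df$x))
by_median   # "a" "b"

# Using the mean instead: a = 11, b = 6
by_mean <- levels(fct_reorder(df$g, df$x, .fun = mean))
by_mean     # "b" "a"

# Largest maximum first: a = 30, b = 7
by_max_desc <- levels(fct_reorder(df$g, df$x, .fun = max, .desc = TRUE))
by_max_desc # "a" "b"
```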
Use fct_reorder2() when you have a line chart of a quantitative x against another quantitative y and your factor provides the color.
## order by life expectancy
ggplot(h_gap_dropped, aes(x = year, y = lifeExp,
                          color = fct_reorder2(country, year, lifeExp))) +
  geom_line() +
  labs(color = "country")
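With its defaults, fct_reorder2() orders the levels by the y value found at the largest x, in descending order, so the legend lines up with the right-hand ends of the lines. A minimal sketch on made-up data (the country names and values are hypothetical):

```r
library(forcats)

# Hypothetical data: three "countries" observed in two years
df <- data.frame(
  country = factor(rep(c("A", "B", "C"), each = 2)),
  year    = rep(c(2000, 2010), times = 3),
  value   = c(5, 1, 2, 9, 3, 4)
)

# Values in the last year (2010): A = 1, B = 9, C = 4
reordered <- fct_reorder2(df$country, df$year, df$value)
levels(reordered)
# "B" "C" "A"
```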
You can also change the order of the levels by hand with fct_relevel(). This might be useful if you are preparing a report on, say, the state of affairs in Africa.
# several level names can also be given at once, e.g. fct_relevel(continent, "Oceania", "Asia")
gapminder %>%
  ggplot() +
  geom_bar(aes(fct_relevel(continent, "Oceania", after = 2))) +
  coord_flip() +
  theme_bw() +
  xlab("Continent") + ylab("Number of entries")
More details on reordering factor levels by hand can be found here: https://forcats.tidyverse.org/reference/fct_relevel.html
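The after argument is easiest to understand on a small factor. In this sketch (a made-up four-level factor) the named level is removed from its current position and re-inserted after the given number of remaining levels.

```r
library(forcats)

# Hypothetical factor with levels "a" "b" "c" "d"
f <- factor(c("a", "b", "c", "d"))

# Default after = 0 moves the level to the front
to_front <- levels(fct_relevel(f, "d"))
to_front   # "d" "a" "b" "c"

# after = 2 places it after the second of the remaining levels
to_third <- levels(fct_relevel(f, "d", after = 2))
to_third   # "a" "b" "d" "c"

# after = Inf moves it to the end
to_end <- levels(fct_relevel(f, "a", after = Inf))
to_end     # "b" "c" "d" "a"
```

By the same logic, moving "Oceania" with after = 2 in the continent factor should give the order Africa, Americas, Oceania, Asia, Europe.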
Sometimes you want to specify what the levels of a factor should be called. For instance, if you had levels called “blk” and “brwn”, you would rather they be called “Black” and “Brown”. This is called recoding. Let’s recode Oceania and the Americas in the graph above as the abbreviations OCN and AME respectively, using the function fct_recode().
# similar to rename(): the syntax is "new name" = "old name"
gapminder %>%
  ggplot() +
  geom_bar(aes(fct_recode(continent, "OCN" = "Oceania", "AME" = "Americas"))) +
  coord_flip() +
  theme_bw() +
  xlab("Continent") + ylab("Number of entries")
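The “blk”/“brwn” example from the text can be sketched the same way. The factor below is made up for illustration; fct_recode() renames the levels but keeps their order and the underlying data.

```r
library(forcats)

# Hypothetical factor with abbreviated level names
colour <- factor(c("blk", "brwn", "blk"))

# New name on the left, old name on the right
recoded <- fct_recode(colour, "Black" = "blk", "Brown" = "brwn")

levels(recoded)
# "Black" "Brown"
as.character(recoded)
# "Black" "Brown" "Black"
```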