Hands-on Exercise 1

Author

Ong Chae Hui

1. Getting Started

1.1. Install and launch R packages

The code chunk below uses p_load() of pacman package to check if tidyverse packages are installed in the computer. If they are, then they will be launched into R.

Code

pacman::p_load(tidyverse)

1.2. Importing the data

Code

exam_data <- read_csv("data/Exam_data.csv")

2. R Graphics VS ggplot

2.1. Plotting Graphics using R Graphics

Code

hist(exam_data$MATHS)

2.2. Plotting Graphics using ggplot

Code

ggplot(data=exam_data, aes(x = MATHS)) +
  geom_histogram(bins=10,
                 boundary=100,
                 color="black",
                 fill="grey") +
  ggtitle("Distribution of Maths scores")

2.2.1. Essential Grammatical Elements in ggplot2: data, showing empty canvas

Code

ggplot(data=exam_data)

2.2.2. Essential Grammatical Elements in ggplot2: Aesthetic mappings, showing x-axis and y-axis

Code

ggplot(data=exam_data,
       aes(x = MATHS))

2.3. Essential Grammatical Elements in ggplot2: geom

Geometric objects are the actual marks we put on a plot. Examples include: - geom_point for drawing individual points (e.g., a scatter plot) - geom_line for drawing lines (e.g., for a line charts) - geom_smooth for drawing smoothed lines (e.g., for simple trends or approximations) - geom_bar for drawing bars (e.g., for bar charts) - geom_histogram for drawing binned values (e.g. a histogram) - geom_polygon for drawing arbitrary shapes geom_map for drawing polygons in the shape of a map! (You can access the data to use for these maps by using the map_data() function).

2.3.1. Essential Grammatical Elements in ggplot2: geom_bar, showing bar charts

Code

ggplot(data=exam_data,
       aes(x = RACE)) +
  geom_bar()

2.3.2. Essential Grammatical Elements in ggplot2: geom_dotplot

Note that the y-scale is not very useful and is very misleading

Code

ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_dotplot(dotsize = 0.5)

Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

2.3.3. Essential Grammatical Elements in ggplot2: geom_dotplot, without y-scale

Code

ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_dotplot(binwidth = 2.5,
               dotsize = 0.5) + 
  scale_y_continuous(NULL, breaks = NULL)

2.3.4. Essential Grammatical Elements in ggplot2: geom_histogram

Default bin is 30

Code

ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.3.5. Essential Grammatical Elements in ggplot2: geom_dotplot, changing the defaults

Code

ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

2.3.6. Modifying a geometric object by changing aes()

Can also be used to colour, fill and alpha of the geometric

Code

ggplot(data=exam_data,
       aes(x = MATHS,
           fill = GENDER)) +
  geom_histogram(bins=20,
                 color="grey30")

2.3.7. Geometric Objects: geom-density()

geom-density() computes and plots kernel density estimate, which is a smoothed version of the histogram.

It is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.

The code below plots the distribution of Maths scores in a kernel density estimate plot.

Code

ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_density()

Using colour or fill arguments of aes()

Code

ggplot(data=exam_data,
       aes(x = MATHS,
           color = GENDER)) +
  geom_density()

2.3.8. Geometric Objects: geom_boxplot

geom_boxplot() displays continuous value list. It visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.

The code chunk below plots boxplots by using geom_boxplot().

Code

ggplot(data=exam_data,
       aes(x = MATHS,
           colour = GENDER)) +
  geom_boxplot()

Notches are used in box plots to help visually assess whether the medians of distributions differ. If the notches do not overlap, this is evidence that the medians are different.

Code

ggplot(data=exam_data,
       aes(y = MATHS,
           x= GENDER)) + 
  geom_boxplot(notch = TRUE)

2.3.9. Geometric Objects: geom_violin()

geom_violin is designed for creating violin plot. Violin plots are a way of comparing multiple data distributions. With ordinary density curves, it is difficult to compare more than just a few distributions because the lines visually interfere with each other. With a violin plot, it’s easier to compare several distributions since they’re placed side by side.

Code

ggplot(data=exam_data,
       aes(y = MATHS,
           x = GENDER)) +
  geom_violin()

2.3.10. Geometric Objects: geom_point()

geom_point() is especially useful for creating scatterplot.

Code

ggplot(data=exam_data,
       aes(x = MATHS,
           y = ENGLISH)) +
  geom_point()

2.3.11. geom objects can be combined

Code

ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot() +                    
  geom_point(position="jitter", 
             size = 0.5)

2.4. Essential Grammatical Elements in ggplot2: stat

The Statistics functions statistically transform data, usually as some form of summary. For example:

frequency of values of a variable (bar graph)
- a mean
- a confidence limit
There are two ways to use these functions:
- add a stat_() function and override the default geom, or
- add a geom_() function and override the default stat.

2.4.1. Working with stat()

The boxplots below are incomplete because the positions of the means were not shown.

Code

ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot()

2.4.2. Working with stat - the stat_summary() method

Code

ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot() + 
  stat_summary(geom = "point",
               fun.y="mean",
               colour="red",
               size=4)

Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
ℹ Please use the `fun` argument instead.

2.4.3. Working with stat - the geom() method

Code

ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot() + 
  geom_point(stat="summary",
               fun.y="mean",
               colour="red",
               size=4)

Warning in geom_point(stat = "summary", fun.y = "mean", colour = "red", :
Ignoring unknown parameters: `fun.y`

No summary function supplied, defaulting to `mean_se()`

2.4.4. Adding a best fit curve on a scatterplot

The scatterplot below shows the relationship of Maths and English grades of pupils. The interpretability of this graph can be improved by adding a best fit curve.

Before adding the best fit curve

Code

ggplot(data=exam_data,
       aes(x = MATHS,
           y = ENGLISH)) +
  geom_point()

After adding the best fit curve

Note that the default method used is loess.

Code

ggplot(data=exam_data,
       aes(x = MATHS,
           y = ENGLISH)) +
  geom_point() + 
  geom_smooth(size=0.5)

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The default smoothing method can be overridden as shown below

Code

ggplot(data=exam_data,
       aes(x = MATHS,
           y = ENGLISH)) +
  geom_point() + 
  geom_smooth(method=lm,
              size=0.5)

`geom_smooth()` using formula = 'y ~ x'

2.5. Essential Grammatical Elements in ggplot2: Facets

Facetting generates small multiples (sometimes also called trellis plot), each displaying a different subset of the data. They are an alternative to aesthetics for displaying additional discrete variables. ggplot2 supports two types of factes, namely: facet_grid() and facet_wrap.

2.5.1. Working with facet_wrap()

facet_wrap wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid because most displays are roughly rectangular.

Code

ggplot(data=exam_data,
       aes(x = MATHS)) + 
  geom_histogram(bins=20) +
    facet_wrap(~ CLASS)

2.5.2. facet_grid()

facet_grid() forms a matrix of panels defined by row and column facetting variables. It is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

Code

ggplot(data=exam_data,
       aes(x = MATHS)) + 
  geom_histogram(bins=20) +
    facet_grid(~ CLASS)

2.6. Essential Grammatical Elements in ggplot2: Coordinates

The Coordinates functions map the position of objects onto the plane of the plot. There are a number of different possible coordinate systems to use, they are:

coord_cartesian(): the default cartesian coordinate systems, where you specify x and y values (e.g. allows you to zoom in or out).
coord_flip(): a cartesian system with the x and y flipped.
coord_fixed(): a cartesian system with a “fixed” aspect ratio (e.g. 1.78 for a “widescreen” plot).
coord_quickmap(): a coordinate system that approximates a good aspect ratio for maps.

2.6.1. Working with Coordinate

By the default, the bar chart of ggplot2 is in vertical form.

Code

ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar()

Flipping the chart by using coord_flip().

Code

ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar() + 
  coord_flip()

2.6.2. Changing the y-axis and x-axis range

The scatterplot is slightly misleading because the y-axis and x-axis range are not equal.

Code

ggplot(data=exam_data, 
       aes(x= MATHS, y=ENGLISH)) +
  geom_point() +
  geom_smooth(method=lm, size=0.5)

`geom_smooth()` using formula = 'y ~ x'

The code chunk below fixed both the y-axis and x-axis range from 0-100.

Code

ggplot(data=exam_data, 
       aes(x= MATHS, y=ENGLISH)) +
  geom_point() +
  geom_smooth(method=lm, 
              size=0.5) +  
  coord_cartesian(xlim=c(0,100),
                  ylim=c(0,100))

`geom_smooth()` using formula = 'y ~ x'

2.7. Essential Grammatical Elements in ggplot2: themes

Themes control elements of the graph not related to the data. For example: - background colour - size of fonts - gridlines - colour of labels

Built-in themes include: - theme_gray() (default) - theme_bw() - theme_classic()

A list of theme can be found at this link. Each theme element can be conceived of as either a line (e.g. x-axis), a rectangle (e.g. graph background), or text (e.g. axis title).

2.7.1. Working with theme

Code

ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar() +
  coord_flip() +
  theme_gray()

A horizontal bar chart plotted using theme_classic().

Code

ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar() +
  coord_flip() +
  theme_classic()

A horizontal bar chart plotted using theme_minimal().

Code

ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()