Hands-on Exercise 1

Author

Ong Chae Hui

1. Getting Started

1.1. Install and launch R packages

The code chunk below uses p_load() from the pacman package to check whether the tidyverse packages are installed on the computer. If they are, they are loaded into R; if not, p_load() installs them first and then loads them.

Code
pacman::p_load(tidyverse)
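
As a rough illustration of what p_load() does for a single package, the sketch below uses only base R. load_or_install() is a hypothetical helper written for this example, not part of pacman, and it assumes a CRAN mirror is configured whenever an installation is actually needed.

```r
# Hypothetical helper (not part of pacman): load a package,
# installing it first if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # only runs when the package is absent
  }
  library(pkg, character.only = TRUE)
  invisible(pkg %in% loadedNamespaces())
}

# "stats" ships with R, so nothing is installed here.
load_or_install("stats")
```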

1.2. Importing the data

Code
exam_data <- read_csv("data/Exam_data.csv")

2. R Graphics VS ggplot

2.1. Plotting Graphics using R Graphics

Code
hist(exam_data$MATHS)

2.2. Plotting Graphics using ggplot

Code
ggplot(data=exam_data, aes(x = MATHS)) +
  geom_histogram(bins=10,
                 boundary=100,
                 color="black",
                 fill="grey") +
  ggtitle("Distribution of Maths scores")

2.2.1. Essential Grammatical Elements in ggplot2: data, showing empty canvas

Code
ggplot(data=exam_data)

2.2.2. Essential Grammatical Elements in ggplot2: Aesthetic mappings, showing x-axis and y-axis

Code
ggplot(data=exam_data,
       aes(x = MATHS))

2.3. Essential Grammatical Elements in ggplot2: geom

Geometric objects are the actual marks we put on a plot. Examples include:

  • geom_point for drawing individual points (e.g., a scatter plot)
  • geom_line for drawing lines (e.g., a line chart)
  • geom_smooth for drawing smoothed lines (e.g., for simple trends or approximations)
  • geom_bar for drawing bars (e.g., a bar chart)
  • geom_histogram for drawing binned values (e.g., a histogram)
  • geom_polygon for drawing arbitrary shapes
  • geom_map for drawing polygons in the shape of a map (the data for these maps can be accessed with the map_data() function)

2.3.1. Essential Grammatical Elements in ggplot2: geom_bar, showing bar charts

Code
ggplot(data=exam_data,
       aes(x = RACE)) +
  geom_bar()

2.3.2. Essential Grammatical Elements in ggplot2: geom_dotplot

Note that the y-axis scale of a dot plot is not meaningful and can be misleading.

Code
ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_dotplot(dotsize = 0.5)
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

2.3.3. Essential Grammatical Elements in ggplot2: geom_dotplot, without y-scale

Code
ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_dotplot(binwidth = 2.5,
               dotsize = 0.5) + 
  scale_y_continuous(NULL, breaks = NULL)

2.3.4. Essential Grammatical Elements in ggplot2: geom_histogram

By default, geom_histogram() uses 30 bins.

Code
ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
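
To see what the bins argument changes, the binned data behind a histogram can be inspected with layer_data(). The sketch below uses exam_demo, a made-up data frame standing in for exam_data (an assumption for illustration only; the scores are random).

```r
library(ggplot2)

# Made-up scores standing in for exam_data$MATHS (illustration only).
set.seed(1)
exam_demo <- data.frame(MATHS = round(runif(200, min = 0, max = 100)))

p_bins <- ggplot(exam_demo, aes(x = MATHS)) +
  geom_histogram(bins = 10)

# layer_data() returns the data frame computed by stat_bin():
# one row per bin, with the bin count in the `count` column.
hist_bins <- layer_data(p_bins)
nrow(hist_bins)        # number of bins
sum(hist_bins$count)   # every observation falls in exactly one bin
```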

2.3.5. Essential Grammatical Elements in ggplot2: geom_histogram, changing the defaults

Code
ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

2.3.6. Modifying a geometric object by changing aes()

aes() can also be used to map a variable to the colour, fill and alpha of the geometric object.

Code
ggplot(data=exam_data,
       aes(x = MATHS,
           fill = GENDER)) +
  geom_histogram(bins=20,
                 color="grey30")
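
The difference between mapping an aesthetic inside aes() and setting it as a constant can be sketched as follows. exam_demo is made-up data standing in for exam_data, and the colour choices are arbitrary.

```r
library(ggplot2)

# Made-up data standing in for exam_data (illustration only).
set.seed(2)
exam_demo <- data.frame(
  MATHS  = round(runif(100, 0, 100)),
  GENDER = rep(c("Female", "Male"), each = 50)
)

# Mapped: fill varies with GENDER, and a legend is drawn.
p_mapped <- ggplot(exam_demo, aes(x = MATHS, fill = GENDER)) +
  geom_histogram(bins = 20, colour = "grey30")

# Set: fill is a constant, so every bar has the same colour.
p_set <- ggplot(exam_demo, aes(x = MATHS)) +
  geom_histogram(bins = 20, colour = "grey30", fill = "lightblue")

length(unique(layer_data(p_mapped)$fill))  # one fill colour per gender
unique(layer_data(p_set)$fill)             # the single constant colour
```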

2.3.7. Geometric Objects: geom_density()

geom_density() computes and plots a kernel density estimate, which is a smoothed version of the histogram.

It is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.

The code below plots the distribution of Maths scores in a kernel density estimate plot.

Code
ggplot(data=exam_data,
       aes(x = MATHS)) +
  geom_density()

Mapping a variable to the colour or fill argument of aes() plots one density curve per group.

Code
ggplot(data=exam_data,
       aes(x = MATHS,
           color = GENDER)) +
  geom_density()

2.3.8. Geometric Objects: geom_boxplot

geom_boxplot() displays the distribution of a continuous variable. It visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.

The code chunk below plots boxplots by using geom_boxplot().

Code
ggplot(data=exam_data,
       aes(x = MATHS,
           colour = GENDER)) +
  geom_boxplot()
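
The five summary statistics can be read back from a boxplot layer with layer_data(). The sketch below uses made-up scores (exam_demo is an assumption standing in for exam_data) and compares the hinges and median with the quartiles of the raw values.

```r
library(ggplot2)

# Made-up scores standing in for exam_data$MATHS (illustration only).
set.seed(3)
exam_demo <- data.frame(MATHS = round(rnorm(100, mean = 60, sd = 15)))

p_box <- ggplot(exam_demo, aes(y = MATHS)) +
  geom_boxplot()

# Columns computed by stat_boxplot(): whisker ends, hinges and median.
box_stats <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]

# The hinges and the median match the quartiles of the raw values.
quartiles <- quantile(exam_demo$MATHS, probs = c(0.25, 0.5, 0.75))
box_stats
quartiles
```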

Notches are used in box plots to help visually assess whether the medians of distributions differ. If the notches do not overlap, this is evidence that the medians are different.

Code
ggplot(data=exam_data,
       aes(y = MATHS,
           x= GENDER)) + 
  geom_boxplot(notch = TRUE)
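
The ggplot2 documentation describes the notch as extending 1.58 * IQR / sqrt(n) on either side of the median, which gives a roughly 95% confidence interval for comparing medians. A minimal sketch that recomputes the notch limits by hand, using made-up scores in place of exam_data:

```r
library(ggplot2)

# Made-up scores standing in for exam_data$MATHS (illustration only).
set.seed(4)
exam_demo <- data.frame(MATHS = round(rnorm(100, mean = 60, sd = 15)))

p_notch <- ggplot(exam_demo, aes(y = MATHS)) +
  geom_boxplot(notch = TRUE)
notch <- layer_data(p_notch)[, c("notchlower", "middle", "notchupper")]

# Recompute the notch limits: median +/- 1.58 * IQR / sqrt(n).
half_width <- 1.58 * IQR(exam_demo$MATHS) / sqrt(nrow(exam_demo))
manual <- median(exam_demo$MATHS) + c(-1, 1) * half_width
notch
manual
```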

2.3.9. Geometric Objects: geom_violin()

geom_violin() is designed for creating violin plots. Violin plots are a way of comparing multiple data distributions. With ordinary density curves, it is difficult to compare more than just a few distributions because the lines visually interfere with each other. With a violin plot, it’s easier to compare several distributions since they’re placed side by side.

Code
ggplot(data=exam_data,
       aes(y = MATHS,
           x = GENDER)) +
  geom_violin()

2.3.10. Geometric Objects: geom_point()

geom_point() is especially useful for creating scatterplots.

Code
ggplot(data=exam_data,
       aes(x = MATHS,
           y = ENGLISH)) +
  geom_point()

2.3.11. geom objects can be combined

Code
ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot() +                    
  geom_point(position="jitter", 
             size = 0.5)

2.4. Essential Grammatical Elements in ggplot2: stat

The statistics functions statistically transform data, usually as some form of summary. For example:

  • frequency of values of a variable (bar graph)
  • a mean
  • a confidence limit

There are two ways to use these functions:

  • add a stat_() function and override the default geom, or
  • add a geom_() function and override the default stat.
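
The two approaches can be checked against each other by comparing the data each layer computes. The sketch below assumes ggplot2 3.3.0 or later (for the fun argument) and uses made-up data in place of exam_data.

```r
library(ggplot2)

# Made-up data standing in for exam_data (illustration only).
set.seed(5)
exam_demo <- data.frame(
  MATHS  = round(runif(100, 0, 100)),
  GENDER = rep(c("Female", "Male"), each = 50)
)
base_plot <- ggplot(exam_demo, aes(x = GENDER, y = MATHS))

# Way 1: start from the stat and choose the geom.
p_stat <- base_plot + stat_summary(geom = "point", fun = "mean")
# Way 2: start from the geom and choose the stat.
p_geom <- base_plot + geom_point(stat = "summary", fun = "mean")

means_stat  <- layer_data(p_stat)$y
means_geom  <- layer_data(p_geom)$y
group_means <- as.vector(tapply(exam_demo$MATHS, exam_demo$GENDER, mean))
```

Both layers should yield the same y values, equal to the per-group means computed directly with tapply().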

2.4.1. Working with stat()

The boxplots below are incomplete because the positions of the means are not shown.

Code
ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot()

2.4.2. Working with stat - the stat_summary() method

Code
ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot() + 
  stat_summary(geom = "point",
               fun="mean",
               colour="red",
               size=4)

2.4.3. Working with stat - the geom() method

Code
ggplot(data=exam_data, 
       aes(y = MATHS, 
           x= GENDER)) +
  geom_boxplot() + 
  geom_point(stat="summary",
             fun="mean",
             colour="red",
             size=4)

2.4.4. Adding a best fit curve on a scatterplot

The scatterplot below shows the relationship of Maths and English grades of pupils. The interpretability of this graph can be improved by adding a best fit curve.

Before adding the best fit curve

Code
ggplot(data=exam_data,
       aes(x = MATHS,
           y = ENGLISH)) +
  geom_point()

After adding the best fit curve

Note that the default method used is loess.

Code
ggplot(data=exam_data,
       aes(x = MATHS,
           y = ENGLISH)) +
  geom_point() + 
  geom_smooth(linewidth=0.5)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The default smoothing method can be overridden as shown below

Code
ggplot(data=exam_data,
       aes(x = MATHS,
           y = ENGLISH)) +
  geom_point() + 
  geom_smooth(method=lm,
              linewidth=0.5)
`geom_smooth()` using formula = 'y ~ x'
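
As a check on what method = "lm" draws, the smoothed line can be compared with predictions from an ordinary lm() fit. This is a sketch on made-up data (exam_demo and its linear relationship are assumptions), and it assumes ggplot2 3.4.0 or later for the linewidth argument.

```r
library(ggplot2)

# Made-up, roughly linear data standing in for exam_data (illustration only).
set.seed(6)
maths <- round(runif(100, 0, 100))
exam_demo <- data.frame(
  MATHS   = maths,
  ENGLISH = 20 + 0.7 * maths + rnorm(100, sd = 8)
)

p_lm <- ggplot(exam_demo, aes(x = MATHS, y = ENGLISH)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, linewidth = 0.5)

# Layer 2 is the smoother: its x and y columns trace the fitted line.
smooth_line <- layer_data(p_lm, 2)

# Fit the same model directly and predict at the same x positions.
fit <- lm(ENGLISH ~ MATHS, data = exam_demo)
manual <- predict(fit, newdata = data.frame(MATHS = smooth_line$x))
```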

2.5. Essential Grammatical Elements in ggplot2: Facets

Faceting generates small multiples (sometimes also called trellis plots), each displaying a different subset of the data. They are an alternative to aesthetics for displaying additional discrete variables. ggplot2 supports two types of facets, namely facet_grid() and facet_wrap().

2.5.1. Working with facet_wrap()

facet_wrap wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid because most displays are roughly rectangular.

Code
ggplot(data=exam_data,
       aes(x = MATHS)) + 
  geom_histogram(bins=20) +
    facet_wrap(~ CLASS)

2.5.2. facet_grid()

facet_grid() forms a matrix of panels defined by row and column facetting variables. It is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

Code
ggplot(data=exam_data,
       aes(x = MATHS)) + 
  geom_histogram(bins=20) +
    facet_grid(~ CLASS)
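
When two discrete variables are available, facet_grid() takes a rows ~ columns formula. The sketch below uses made-up data in which every GENDER and CLASS combination is present; the class labels are hypothetical.

```r
library(ggplot2)

# Made-up data with every GENDER x CLASS combination (illustration only).
exam_demo <- expand.grid(
  GENDER = c("Female", "Male"),
  CLASS  = c("3A", "3B", "3C"),
  rep    = 1:20
)
set.seed(7)
exam_demo$MATHS <- round(runif(nrow(exam_demo), 0, 100))

p_grid <- ggplot(exam_demo, aes(x = MATHS)) +
  geom_histogram(bins = 20) +
  facet_grid(GENDER ~ CLASS)   # rows ~ columns

# Two genders by three classes gives a 2 x 3 matrix of panels.
n_panels <- length(unique(layer_data(p_grid)$PANEL))
n_panels
```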

2.6. Essential Grammatical Elements in ggplot2: Coordinates

The coordinates functions map the position of objects onto the plane of the plot. A number of different coordinate systems are available:

  • coord_cartesian(): the default Cartesian coordinate system, where you specify x and y values (it also allows you to zoom in or out).
  • coord_flip(): a cartesian system with the x and y flipped.
  • coord_fixed(): a cartesian system with a “fixed” aspect ratio (e.g. 1.78 for a “widescreen” plot).
  • coord_quickmap(): a coordinate system that approximates a good aspect ratio for maps.

2.6.1. Working with Coordinate

By default, ggplot2 draws bar charts vertically.

Code
ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar()

Flipping the chart by using coord_flip().

Code
ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar() + 
  coord_flip()

2.6.2. Changing the y-axis and x-axis range

The scatterplot below is slightly misleading because the y-axis and x-axis ranges are not equal.

Code
ggplot(data=exam_data, 
       aes(x= MATHS, y=ENGLISH)) +
  geom_point() +
  geom_smooth(method=lm, linewidth=0.5)
`geom_smooth()` using formula = 'y ~ x'

The code chunk below fixes both the y-axis and x-axis ranges to 0-100.

Code
ggplot(data=exam_data, 
       aes(x= MATHS, y=ENGLISH)) +
  geom_point() +
  geom_smooth(method=lm, 
              linewidth=0.5) +  
  coord_cartesian(xlim=c(0,100),
                  ylim=c(0,100))
`geom_smooth()` using formula = 'y ~ x'
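
coord_cartesian() changes only the visible window, whereas limits set on a scale discard values outside the range before anything is drawn. A minimal sketch of the difference, using made-up data in place of exam_data and an arbitrary 40-60 window:

```r
library(ggplot2)

# Made-up data standing in for exam_data (illustration only).
set.seed(8)
exam_demo <- data.frame(
  MATHS   = round(runif(100, 0, 100)),
  ENGLISH = round(runif(100, 0, 100))
)
base_plot <- ggplot(exam_demo, aes(x = MATHS, y = ENGLISH)) +
  geom_point()

# Zooming: all 100 points are kept; only the visible window changes.
p_zoom <- base_plot + coord_cartesian(xlim = c(40, 60))
# Scale limits: x values outside 40-60 are treated as missing.
p_limit <- base_plot + scale_x_continuous(limits = c(40, 60))

kept_zoom  <- sum(!is.na(layer_data(p_zoom)$x))
kept_limit <- sum(!is.na(layer_data(p_limit)$x))
c(kept_zoom, kept_limit)
```

This matters when a statistic such as geom_smooth() is added: with scale limits the fit would be computed only from the points that remain.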

2.7. Essential Grammatical Elements in ggplot2: themes

Themes control elements of the graph not related to the data. For example:

  • background colour
  • size of fonts
  • gridlines
  • colour of labels

Built-in themes include:

  • theme_gray() (default)
  • theme_bw()
  • theme_classic()

A list of themes can be found in the ggplot2 reference documentation on complete themes. Each theme element can be conceived of as either a line (e.g. x-axis), a rectangle (e.g. graph background), or text (e.g. axis title).
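
Individual elements can be adjusted on top of a built-in theme with theme(), using element_rect(), element_line() and element_text() for the three kinds of element. The sketch below uses made-up categories in place of exam_data$RACE, and the colours and sizes are arbitrary choices.

```r
library(ggplot2)

# Made-up categories standing in for exam_data$RACE (illustration only).
exam_demo <- data.frame(RACE = c("A", "A", "B", "C", "C", "C"))

p_theme <- ggplot(exam_demo, aes(x = RACE)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  theme(
    panel.background   = element_rect(fill = "grey95", colour = NA),  # rectangle
    panel.grid.major.y = element_line(colour = "white"),               # line
    axis.title         = element_text(size = 12, face = "bold")        # text
  )

# Building the plot confirms the theme settings are valid.
grob <- ggplotGrob(p_theme)
```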

2.7.1. Working with theme

Code
ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar() +
  coord_flip() +
  theme_gray()

A horizontal bar chart plotted using theme_classic().

Code
ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar() +
  coord_flip() +
  theme_classic()

A horizontal bar chart plotted using theme_minimal().

Code
ggplot(data=exam_data, 
       aes(x=RACE)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()