[1] 4
[1] 15
[1] 3.333333
[1] 5
ggplot2 and Good Coding Style
Use <- to store a value in a name
Inspect data with head(), str(), summary()
# Using readr (part of the tidyverse)
library(readr)
data <- read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
head(data)
# A tibble: 6 × 5
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
read_csv() is tidyverse-friendly
head() shows the first few rows
str(data)
spc_tbl_ [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ sepal_length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ sepal_width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ petal_length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ petal_width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
- attr(*, "spec")=
.. cols(
.. sepal_length = col_double(),
.. sepal_width = col_double(),
.. petal_length = col_double(),
.. petal_width = col_double(),
.. species = col_character()
.. )
- attr(*, "problems")=<externalptr>
summary(data)
sepal_length sepal_width petal_length petal_width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
species
Length:150
Class :character
Mode :character
✅ Open RStudio and run code reproducibly from a script
✅ Assign variables and perform simple operations
✅ Install and load packages
✅ Import and inspect a dataset
✅ Understand the difference between console vs. script
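ggplot2
Visualisation turns raw data into insights that can be easily understood.
ggplot2, creating basic plots.
Use read.csv() to load a dataset from a URL directly into R.
Here’s an example of how I would load the dataset:
I create an object called url that contains a value (in this case, a Dropbox link).
I create a dataframe called data by running a function called read.csv.
I pass read.csv the object url, which is a value (the Dropbox link).
I view the first few rows with the head command.
Load the dataset using read.csv().
View the first few rows using head().
How many observations are shown by head()?
Note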
The term ‘observation’ refers to a row in a dataset.
Getting a dataset into R correctly is the first step to being able to perform any analysis, and it’s often one of the hardest things to do.
Use read.csv() and write.csv() to load and save CSV datasets from internet/cloud storage.
Use head() to quickly view the first few rows of a dataset.
I will usually refer to Data objects in the Environment as “dataframes” from now on…
You can write dataframes as files (this removes the need to retain them in the project memory).
Note that this writes the file by default to your project directory.
# Create a small dataframe
scores <- data.frame(
  player = c("Alice", "Bob", "Charlie"),
  goals = c(3, 5, 2),
  assists = c(1, 0, 4)
)
# Write to CSV
write.csv(scores, "scores.csv", row.names = FALSE)
# Check it worked (read it back in)
read.csv("scores.csv")
   player goals assists
1 Alice 3 1
2 Bob 5 0
3 Charlie 2 4
R can store multiple dataframes at the same time (unlike SPSS etc.). For this reason, you need to tell it which data to use when running code.
Some functions ask you to state the dataframe (as above).
To refer to a specific variable in a specific dataframe, you need to use the dollar sign ($).
Type sum(HomeGoals) in the console, hit return, and see what happens. Then type sum(match_data$HomeGoals) and see what happens.
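If you want to try the difference without downloading anything, here is a minimal sketch. It assumes a small stand-in dataframe: the name match_data matches the tutorial, but the six rows are only illustrative, not the full dataset. The tryCatch() wrapper is my addition so the script keeps running after the first call fails.

```r
# Stand-in dataframe (assumption: six illustrative rows, not the full dataset)
match_data <- data.frame(
  HomeTeam  = c("Team G", "Team G", "Team C", "Team F", "Team C", "Team B"),
  HomeGoals = c(1, 4, 0, 0, 1, 2)
)

# Without the dataframe name, R searches for a standalone object called
# HomeGoals, does not find one, and raises an error. tryCatch() captures
# the error message instead of stopping the script.
bare_result <- tryCatch(
  sum(HomeGoals),
  error = function(e) conditionMessage(e)
)
print(bare_result)

# With $, R looks for the HomeGoals column inside match_data
total_home_goals <- sum(match_data$HomeGoals)
print(total_home_goals)
```

The first call only fails when no object named HomeGoals exists in your Environment, which is the usual situation in a fresh session.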
ggplot2
ggplot2 provides a variety of plot types to help you visualise different kinds of data.
You don’t have to use ggplot2, but it’s the standard plotting package in R.
The examples below use the mtcars dataset.
Tip
R comes with a selection of datasets ‘built in’, like mtcars. These are really useful if you want to experiment or play around with code.
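As a quick illustration of the tip, the sketch below inspects mtcars with functions already used in this tutorial and then lists some of the other built-in datasets. The variable name available is just a label I chose for the example.

```r
# mtcars ships with base R (in the datasets package), so no import is needed
head(mtcars, 3)   # first three rows
dim(mtcars)       # number of rows and columns
names(mtcars)     # column names such as mpg, cyl, hp and wt

# data() lists the datasets bundled with a package; here we keep their names
available <- data(package = "datasets")$results[, "Item"]
head(available)
```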
ggplot2 is built around three key components:
Data: The dataset being plotted.
Aesthetics (aes): Mapping variables in the dataset to visual properties of the plot (e.g., x and y axes, colours, size).
Geometries (geom_*): The visual elements representing the data (e.g., points, lines, bars).
The power of ggplot2 lies in how you combine data, aesthetics, and geometries to create customised visualisations.
A ggplot2 plot follows a simple structure:
Start with the data (e.g., mtcars).
# Define the data and aesthetics, then add a scatter plot geometry
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs. hp", x = "Horsepower", y = "Miles per Gallon")
Explanation:
ggplot(mtcars, aes(x = hp, y = mpg)): Specifies the data and maps hp to the x-axis and mpg to the y-axis.
geom_point(): Adds scatter plot points.
labs(): Customises title and axis labels.
The ggplot() function sets up the framework.
geoms add the visual elements.
ggplot2 is based on layering elements one at a time.
Basic Structure of Layers:
Base layer: Defined with ggplot(), setting data and aesthetics.
Geom layers: Added using functions like geom_point(), geom_line(), etc.
Additional layers: These might include labels, themes, or custom scales.
Layers are added with +.
Explanation:
geom_point(color = "blue"): Adds blue scatter points.
geom_smooth(method = "lm", se = FALSE): Adds a linear regression line without the confidence interval (se = FALSE).
Each layer is stacked in sequence, giving you control over each visual element.
Layers let you add more elements to your plot, like trend lines or annotations, and adjust each one individually.
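The layering described above can be sketched as follows. This is a minimal example on mtcars that matches the explanation (blue points plus a linear trend line); the title text and the object name layered_plot are my own choices.

```r
library(ggplot2)

layered_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +  # base layer: data + aesthetics
  geom_point(color = "blue") +                          # geom layer: blue scatter points
  geom_smooth(method = "lm", se = FALSE) +              # geom layer: linear trend, no interval
  labs(title = "mpg vs. hp with a linear trend",        # additional layer: labels
       x = "Horsepower", y = "Miles per Gallon")

print(layered_plot)
```

Removing or commenting out any one line after the first changes only that layer, which is a handy way to see what each layer contributes.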
Aesthetics (aes()): You can map not only position (x and y) but also other variables to colour, size, shape, etc.
Geometries (geom_*): You can customise visual properties of the geoms, such as size, colour, shape, and transparency.
Explanation:
color = factor(cyl): Maps cyl to color.
size = wt: Maps wt (weight) to point size.
Customising aesthetics lets us emphasise important patterns and relationships in the data.
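Here is a minimal sketch of those two mappings on mtcars. The transparency value (alpha = 0.7) and the legend titles are assumptions I added for readability; they are not part of the explanation above.

```r
library(ggplot2)

aesthetic_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  # Inside aes(): values come from the data (cyl drives colour, wt drives size)
  # Outside aes(): a fixed setting applied to every point (alpha)
  geom_point(aes(color = factor(cyl), size = wt), alpha = 0.7) +
  labs(color = "Cylinders", size = "Weight",
       x = "Horsepower", y = "Miles per Gallon")

print(aesthetic_plot)
```

Wrapping cyl in factor() tells ggplot2 to treat it as categories, so each cylinder count gets its own distinct colour rather than a shade on a continuous scale.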
Use facet_wrap() or facet_grid() to create small multiples for comparison.
Explanation:
facet_wrap(~cyl): Creates separate plots for each level of the cyl variable (e.g., 4, 6, and 8 cylinders).
Faceting is a great way to compare subsets of data side-by-side without ‘cluttering’ a single plot.
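A faceted version of the mtcars scatter plot might look like this. It is only a sketch; the object name facet_plot and the title are mine.

```r
library(ggplot2)

facet_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl) +   # one panel per value of cyl
  labs(title = "mpg vs. hp by number of cylinders",
       x = "Horsepower", y = "Miles per Gallon")

print(facet_plot)
```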
# Load the football match dataset
url <- "https://www.dropbox.com/scl/fi/wyrihmdl20gsftkhhai79/data_02.csv?rlkey=nh9zu2glcpw36qur81tjnpoip&dl=1"
match_data <- read.csv(url)
head(match_data)
  MatchID HomeTeam AwayTeam HomeGoals AwayGoals HomePossession AwayPossession
1 1 Team G Team A 1 2 44 56
2 2 Team G Team D 4 0 58 42
3 3 Team C Team A 0 5 59 41
4 4 Team F Team A 0 4 53 47
5 5 Team C Team E 1 0 42 58
6 6 Team B Team C 2 1 47 53
HomeShots AwayShots Date
1 14 5 2025-10-04
2 16 11 2025-02-24
3 6 15 2025-11-27
4 14 11 2025-08-26
5 10 17 2025-09-09
6 19 19 2025-12-05
We’ll start by showing the relationship between goals scored and shots taken for the home and away teams.
Explanation:
aes(x = HomeShots, y = HomeGoals): Maps HomeShots to the x-axis and HomeGoals to the y-axis.
geom_point(): Adds the scatter plot points.
color = HomeTeam: Colours the points by the HomeTeam variable, so each team is visually distinguishable.
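A scatter plot matching this explanation could be sketched as below. So that it runs without the download, it uses a stand-in dataframe (match_sample, my name) holding only the six rows printed by head(match_data); with the real data you would pass match_data instead. The point size and labels are assumptions.

```r
library(ggplot2)

# Stand-in for match_data: only the six rows printed by head(match_data)
match_sample <- data.frame(
  HomeTeam  = c("Team G", "Team G", "Team C", "Team F", "Team C", "Team B"),
  HomeShots = c(14, 16, 6, 14, 10, 19),
  HomeGoals = c(1, 4, 0, 0, 1, 2)
)

shots_vs_goals <- ggplot(match_sample,
                         aes(x = HomeShots, y = HomeGoals, color = HomeTeam)) +
  geom_point(size = 3) +
  labs(title = "Home Goals vs. Home Shots",
       x = "Home Team Shots", y = "Home Team Goals")

print(shots_vs_goals)
```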
Next, we’ll create a bar plot to show the total number of goals scored by each team (both home and away).
# Bar plot of total goals scored by each team
library(dplyr)
library(ggplot2)
# Summing goals scored by each team (both home and away).
# Each goal count is paired with the team that scored it: grouping the
# combined goals by HomeTeam alone would credit away goals to the home side.
goals_by_team <- bind_rows(
  match_data %>% select(Team = HomeTeam, Goals = HomeGoals),
  match_data %>% select(Team = AwayTeam, Goals = AwayGoals)
) %>%
  group_by(Team) %>%
  summarise(TotalGoals = sum(Goals))
# Bar plot
ggplot(goals_by_team, aes(x = reorder(Team, TotalGoals), y = TotalGoals, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Goals Scored by Each Team", x = "Team", y = "Total Goals") +
  theme_minimal() +
  coord_flip() # Flip the axes for better readability
Explanation:
We stack the home rows (HomeTeam, HomeGoals) and the away rows (AwayTeam, AwayGoals) with bind_rows() from the dplyr package, so every goal count sits in a single column (Goals) next to the team that scored it (Team).
The bar plot then shows the total number of goals scored by each team, ordered by the total goals.
Let’s explore the distribution of possession percentages (both for home and away teams) with a histogram.
Explanation:
geom_histogram(): Creates a histogram to show the distribution of the HomePossession variable.
The binwidth argument controls the width of each bin in the histogram.
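A histogram along these lines could be sketched as follows. The stand-in dataframe holds only the six HomePossession values printed by head(match_data), and binwidth = 5 (five percentage points per bin) is an assumption; try other values to see how the shape changes.

```r
library(ggplot2)

# Stand-in for match_data: HomePossession from the six rows shown by head()
match_sample <- data.frame(
  HomePossession = c(44, 58, 59, 53, 42, 47)
)

possession_hist <- ggplot(match_sample, aes(x = HomePossession)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Home Possession",
       x = "Home Possession (%)", y = "Number of Matches")

print(possession_hist)
```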
Now, we can use faceting to compare how the relationship between shots and goals varies across different teams.
Explanation:
facet_wrap(~ HomeTeam): This creates a separate plot for each team to compare their performance.
aes(color = HomeTeam): Adds color to the points based on the home team.
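The faceted comparison might be sketched like this, again on a six-row stand-in (match_sample, my name) so that it runs offline. With so few rows each panel holds only one or two points; the full match_data gives a more useful picture.

```r
library(ggplot2)

# Stand-in for match_data: only the six rows printed by head(match_data)
match_sample <- data.frame(
  HomeTeam  = c("Team G", "Team G", "Team C", "Team F", "Team C", "Team B"),
  HomeShots = c(14, 16, 6, 14, 10, 19),
  HomeGoals = c(1, 4, 0, 0, 1, 2)
)

team_facets <- ggplot(match_sample, aes(x = HomeShots, y = HomeGoals)) +
  geom_point(aes(color = HomeTeam)) +
  facet_wrap(~ HomeTeam) +   # one panel per home team
  labs(title = "Goals vs. Shots by Home Team",
       x = "Home Team Shots", y = "Home Team Goals")

print(team_facets)
```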
Finally, let’s add a smooth trend line to the scatter plot to visualize the general trend between shots and goals.
# Scatter plot with trend line
ggplot(match_data, aes(x = HomeShots, y = HomeGoals)) +
  geom_point(aes(color = HomeTeam)) +
  geom_smooth(method = "lm", se = FALSE, color = "red") + # Adding a linear regression line
  labs(title = "Goals vs. Shots with Trend Line", x = "Home Team Shots", y = "Home Team Goals") +
  theme_minimal()
Explanation:
geom_smooth(method = "lm", se = FALSE): Adds a linear regression line to show the overall trend between shots and goals, without the confidence interval (se = FALSE).
In this section, we’ve:
Use snake_case (e.g., total_goals) or camelCase (e.g., totalGoals) based on your preferences.
Avoid single-letter names like x or y unless they are loop variables or mathematical expressions.
Example:
Good practice
Bad practice
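To make the two labels concrete, here is a small sketch that contrasts both styles on the same calculation. The numbers are made up for illustration.

```r
# Good practice: descriptive names say what each value represents
total_goals <- 42
total_matches <- 20
goals_per_match <- total_goals / total_matches

# Bad practice: single letters give the reader nothing to go on
x <- 42
y <- 20
z <- x / y

print(goals_per_match)
print(z)
```

Both versions compute the same number; only the first one tells you what that number means when you come back to the script later.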
R has many built-in functions (e.g., mean(), sum(), plot()), but you can also write your own.
Think of a function as a recipe: you give it ingredients (inputs), it follows a set of steps, and produces a dish (output).
Example:
Good practice: Function does one thing
Bad practice: Multiple tasks in one function
Functions should encapsulate one specific task, making them easier to test and debug.
A function should not handle multiple unrelated tasks; separate them into distinct functions for clarity.
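A minimal sketch of the one-task principle follows. calculate_goals_per_game is the name used elsewhere in this tutorial, but its body here is my assumption; format_goals_summary and report_everything are hypothetical names invented for the contrast.

```r
# Good practice: each function does one thing
calculate_goals_per_game <- function(total_goals, total_matches) {
  total_goals / total_matches
}

format_goals_summary <- function(goals_per_game) {
  paste("Average goals per game:", round(goals_per_game, 2))
}

# Bad practice: one function that calculates, formats and prints
report_everything <- function(total_goals, total_matches) {
  average <- total_goals / total_matches
  print(paste("Average goals per game:", round(average, 2)))
  average
}

# The small functions can be tested separately and then combined
goals_per_game <- calculate_goals_per_game(30, 12)
print(format_goals_summary(goals_per_game))
```

Because the calculation and the formatting are separate, you can reuse the calculation in a plot or a table without also triggering the printed message.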
Example:
Good practice: Explaining why
# Calculate the average goals per match to assess performance over the season
average_goals <- calculate_goals_per_game(total_goals, total_matches)
Bad practice: Over-explaining
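For contrast, an over-explained version might look like the sketch below, where each comment simply restates the line under it. The values are made up, and the division is written out directly so the example runs on its own.

```r
# Set total_goals to 42
total_goals <- 42
# Set total_matches to 20
total_matches <- 20
# Divide total_goals by total_matches and store the result in average_goals
average_goals <- total_goals / total_matches
# Print average_goals
print(average_goals)
```

None of these comments tells the reader anything the code does not already say, so they add length without adding understanding.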
Example:
Consistent formatting
Inconsistent formatting (Avoid this)
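The sketch below shows the same hypothetical function (calculate_win_rate, a name I made up) written both ways. R runs both versions identically; the difference is only in how easy they are to read.

```r
# Consistent formatting: spaces around operators, two-space indentation
calculate_win_rate <- function(wins, matches) {
  win_rate <- wins / matches
  round(win_rate * 100, 1)
}

# Inconsistent formatting (avoid this): mixed spacing, indentation and
# assignment operators
calculate_win_rate_messy<-function(wins,matches){
win_rate=wins/matches
      round( win_rate*100,1 )}

print(calculate_win_rate(7, 10))
print(calculate_win_rate_messy(7, 10))
```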
Now it’s time for you to put these principles into practice by refactoring messy code.
Task:
Example of Messy Code:
What you need to do:
Rename the function calc to something more descriptive (e.g., calculate_total_goals).
Rename the variables d, t, and avg to more meaningful names.
Refactored code
Solution Review: Walk through an example of the refactored code and explain the changes made.
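As one possible walkthrough, here is a sketch of what such a refactor could look like. The messy function is hypothetical: it only reuses the names calc, d, t and avg from the task, and its body is my assumption. calculate_average_goals is also a name I introduced.

```r
# Messy version (hypothetical): unclear names, no spacing, two jobs in one
calc <- function(d, t) {
avg<-sum(d)/t
return(avg)
}

# Refactored version: descriptive names, one task per function
calculate_total_goals <- function(goals_per_match) {
  sum(goals_per_match)
}

calculate_average_goals <- function(total_goals, total_matches) {
  total_goals / total_matches
}

# Both versions give the same answer on some made-up match results
goals_per_match <- c(1, 4, 0)
messy_result <- calc(goals_per_match, 3)
clean_result <- calculate_average_goals(calculate_total_goals(goals_per_match), 3)
print(c(messy_result, clean_result))
```

The changes are: calc became two functions with descriptive names, d and t became goals_per_match and total_matches, the intermediate avg was removed, and spacing and indentation were made consistent.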
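Assign values with <- and reuse them
Install and load packages (install.packages(), library())
Import data (read.csv(), read_csv())
Inspect data with head(), str(), summary()
Export data (write.csv(), write_csv())
Create plots with ggplot2 (scatter, bar, line, histogram, boxplot)
Map variables to aesthetics (aes(x, y, colour, size))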