Grammar (syntax): defined rules for how statements must be written (e.g. SELECT column FROM table WHERE condition).
Vocabulary (keywords): reserved words with specific meaning (SELECT, FROM, WHERE, etc.).
Semantics: each statement has a clear, predictable effect (e.g. filtering rows, updating data, creating a table).
SQL is a declarative language
SQL is declarative, not procedural (R is procedural, as is Python)
That means you describe what you want, not how to get it.
For example:
SELECT name, age FROM students WHERE age > 20;
We don’t tell the database how to find those students. We declare the result we want and the database engine figures out the best way to produce it.
SQL is Domain-Specific
SQL is designed for a specific domain: relational data.
It’s not a general-purpose language like Python or Java; instead, it’s built to:
Query data
Manipulate data
Define database structures
Control access and transactions
What is a SQL database?
A SQL database stores structured data in tables, with rows representing records and columns representing attributes. This is called a relational database.
SQL allows us to use the Structured Query Language (SQL) to extract, filter, join, and summarise data efficiently.
SQL databases are ideal for managing large, relational datasets, where different tables are linked by keys.
SQL - Key Commands
A very basic SQL query would be to select certain columns from a specific table where one or more conditions are met.
In plain English:
“select player_name and player_team from the table ‘player_data’, where player_gametime > 120.”
SQL is often used to create new variables that are based on grouping and aggregating existing data.
In plain English:
“Look at the match_stats table and group the data by team. Then, for each team, calculate the average possession, and show the team name alongside that average.”
SQL code:
SELECT team, AVG(possession) AS avg_possessionFROM match_statsGROUPBY team;
In this SQL query:
We use GROUP BY to summarise by group.
We use common functions: AVG(), COUNT(), SUM() to create summaries.
Filtering Summaries: HAVING
In SQL, the command HAVING allows us to filter.
SQL code:
SELECT team, AVG(possession) AS avg_possessionFROM match_statsGROUPBY teamHAVING avg_possession >55;
team city stadium_capacity
Length:6 Length:6 Min. :14200
Class :character Class :character 1st Qu.:17000
Mode :character Mode :character Median :20250
Mean :30117
3rd Qu.:42625
Max. :60000
Database connections
Before we can run any SQL queries in R using DBI, we need to create a connection to our SQLite database.
SQL engines operate on database files (not in-memory data frames like R), so we must tell R where our database is and how to communicate with it.
We will use the RSQLite package to connect to a database file called sportdata.sqlite.
Once connected, we can write SQL queries to extract and summarise data directly from the tables we’ve loaded into that file.
# Load required packages and connect to the SQLite databaselibrary(DBI)library(RSQLite)con <-dbConnect(RSQLite::SQLite(), "sportdata.sqlite")
If the database file doesn’t already exist, dbConnect() will create it.
If your data is currently in R data frames (e.g., you’ve read the CSVs into match_stats, team_info, and match_results), you’ll need to copy them into the SQLite database before querying.
Here’s how:
# Write R data frames to the connected SQLite database# Assumes match_stats, team_info, and match_results are already loaded in your R environment.dbWriteTable(con, "match_stats", match_stats, overwrite =TRUE)dbWriteTable(con, "team_info", team_info, overwrite =TRUE)dbWriteTable(con, "match_results", match_results, overwrite =TRUE)
This step creates actual tables inside your SQLite database. Once written, you can use SQL queries to access and analyse them just like in any database system.
We can check this:
Code
# Check what tables exist in the databasedbListTables(con)
[1] "match_results" "match_stats" "team_info"
Remember:
SQL engines don’t interact with R data frames directly. They require a connection to a database file (in our case, sportdata.sqlite) where the tables are stored.
Once connected, you can issue SQL commands using dbGetQuery() just as you would inside a dedicated database tool.
Part 1: Exploring a Single Table
Working with a single table is a basic part of SQL.
Here, you’ll practise extracting and summarising data from the match_stats table using basic SQL commands like SELECT, WHERE, and GROUP BY.
These operations are essential for understanding how to retrieve meaningful summaries from raw data: in this case, match-by-match possession statistics across multiple teams.
Overview
Focus:match_stats table Topics:SELECT, FROM, WHERE, ORDER BY, GROUP BY, HAVING
Core Task 1.1: To find the average possession by team:
In SQL, this would look like:
SELECT team, AVG(possession) AS avg_possession FROM match_stats GROUPBY team;
This will:
SELECT team: Returns one row per team.
AVG(possession) AS avg_possession: Computes a new variable which is the average of the possession column for each group, and labels it avg_possession.
FROM match_stats: Specifies the source table.
GROUP BY team: Tells SQL to group the data by team name so that AVG is calculated separately for each team.
We can implement this SQL command in R as:
# DBI implementationlibrary(DBI)con <-dbConnect(RSQLite::SQLite(), "sportdata.sqlite")dbGetQuery(con, "SELECT team, AVG(possession) AS avg_possession FROM match_stats GROUP BY team")
Combine data from two related tables to enrich your analysis.
Focus: match_stats + team_info
Key concept: INNER JOIN
Code
# DBI implementationdbGetQuery(con, "SELECT m.team, t.city, AVG(m.possession) AS avg_possessionFROM match_stats mJOIN team_info t ON m.team = t.teamGROUP BY m.team, t.city")
team city avg_possession
1 Aberdeen Aberdeen 54.27391
2 Celtic Glasgow 52.02188
3 Dundee United Dundee 52.07407
4 Hearts Edinburgh 57.83333
5 Hibs Edinburgh 52.60769
6 Rangers Glasgow 54.74444
Part 3: Multi-Table Summary (streamlined)
Bring together all three tables to produce a simple performance summary.
Focus: match_stats, match_results, team_info Key concepts: JOIN, GROUP BY, WHERE
Code
# DBI implementationdbGetQuery(con, "SELECT m.team,t.city,AVG(m.possession) AS avg_possession,SUM(r.goals_scored) AS total_goalsFROM match_stats mJOIN match_results r ON m.match_id = r.match_idJOIN team_info t ON m.team = t.teamWHERE m.season = 2023GROUP BY m.team, t.city")