dplyrR Training
In this module, we’ll learn how to wrangle (clean, transform, and prepare) tax data using dplyr.
What does “data wrangling” mean?
Data wrangling is the process of transforming raw data into a format that’s ready for analysis.
Think of it like preparing ingredients before cooking:
By the end of this module, you’ll be able to:
dplyr?dplyr is part of the tidyverse - a collection of R packages designed to work together seamlessly.
Advantages:
✅ Readable - verbs match what you want to do
✅ Consistent - similar syntax across operations
✅ Beginner-friendly - clear and logical
✅ Well-documented - tons of resources
Note
We’ll explain what %>% means in just a moment!
Before we dive into data wrangling, we need to understand a crucial concept:
Changes Don’t Stick Unless You Save Them!
When you transform data in R, the changes are NOT automatically saved. You must use the assignment operator <- to store results.
The assignment operator <- stores results in an object:
[1] 211
Create a NEW object when:
The assignment operator <- has a keyboard shortcut:
Tip
Alt + - (minus) (Windows/Linux)
Option + - (minus) (Mac)
This types <- automatically!
%>%The pipe operator (%>%) is the key to writing clear, readable code.
Think of it as “and then”:
How to read it:
tax_data%>%) filter to 2023%>%) select specific columns%>%) sort the resultsTip
Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
Let’s understand pipes with an everyday example:
Making coffee (without pipes):
Making coffee (with pipes):
Let’s remind ourselves what data we’re working with:
Rows: 1,000
Columns: 8
$ firm_id <chr> "FIRM_0096", "FIRM_0081", "FIRM_0024", "FIRM_0091",…
$ declaration_date <date> 2023-01-31, 2022-05-31, 2023-03-31, 2023-08-31, 20…
$ vat_inputs <int> 45262, 9225, 15289, 9347, 47317, 44471, 23806, 4907…
$ vat_outputs <int> 45914, 29661, 10005, 14410, 33378, 45639, 19855, 25…
$ reimbursement_date <date> 2023-12-31, 2021-07-31, 2023-09-30, 2021-11-30, 20…
$ filing_year <int> 2023, 2022, 2023, 2023, 2023, 2022, 2021, 2022, 202…
$ filing_quarter <int> 1, 2, 1, 3, 3, 4, 1, 3, 4, 4, 4, 2, 2, 1, 3, 1, 4, …
$ filing_month <int> 1, 5, 3, 8, 9, 12, 2, 7, 12, 11, 12, 4, 6, 3, 9, 2,…
Note
glimpse() is a dplyr function that gives us a quick overview of our data - the column names, data types, and first few values.
dplyr has six main verbs (functions) for working with data:
| Verb | What it does | Example |
|---|---|---|
filter() |
Choose rows based on conditions | “Show me only 2023 declarations” |
select() |
Choose columns | “I only need firm ID and VAT amount” |
mutate() |
Create or modify columns | “Calculate net VAT” |
arrange() |
Sort rows | “Order by VAT amount, highest first” |
summarize() |
Calculate summaries | “What’s the average VAT payment?” |
group_by() |
Group data for grouped operations | “Average VAT by industry” |
Let’s learn each one with real tax data examples!
filter()filter() lets you select specific rows based on conditions.
Basic example:
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0096 2023-01-31 45262 45914 2023-12-31
2: FIRM_0090 2022-12-31 44471 45639 2022-10-31
3: FIRM_0078 2021-06-30 46685 46336 2022-04-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2023 1 1
2: 2022 4 12
3: 2021 2 6
Important
Notice we use == (two equal signs) for comparison, not = (one equal sign).
== means “is equal to”= is for assigning valuesYou can filter using various comparison operators:
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0096 2023-01-31 45262 45914 2023-12-31
2: FIRM_0090 2022-12-31 44471 45639 2022-10-31
3: FIRM_0078 2021-06-30 46685 46336 2022-04-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2023 1 1
2: 2022 4 12
3: 2021 2 6
Available operators:
== equal to!= not equal to> greater than< less than>= greater than or equal to<= less than or equal toCombine conditions using & (AND) or | (OR):
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0095 2023-10-31 8254 43505 2021-12-31
2: FIRM_0015 2022-02-28 1444 44201 2023-01-31
3: FIRM_0099 2023-01-31 16775 47471 2023-05-31
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2023 4 10
2: 2022 1 2
3: 2023 1 1
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0080 2022-07-31 49071 25070 2022-06-30
2: FIRM_0040 2022-12-31 43741 49806 2021-07-31
3: FIRM_0100 2021-10-31 48270 14916 2021-06-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2022 3 7
2: 2022 4 12
3: 2021 4 10
1. Check if value is in a list: %in%
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0010 2021-02-28 30557 40239 2022-03-31
2: FIRM_0005 2022-09-30 22196 30998 2022-04-30
3: FIRM_0005 2021-04-30 5016 30653 2021-10-31
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2021 1 2
2: 2022 3 9
3: 2021 2 4
Tip
%in% is much easier than writing firm_id == "FIRM_0001" | firm_id == "FIRM_0005" | ...
2. Check if value is within a range: between()
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0081 2022-05-31 9225 29661 2021-07-31
2: FIRM_0014 2023-09-30 47317 33378 2021-12-31
3: FIRM_0080 2022-07-31 49071 25070 2022-06-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2022 2 5
2: 2023 3 9
3: 2022 3 7
select()select() lets you choose which columns to keep.
Select specific columns:
firm_id vat_inputs vat_outputs
<char> <int> <int>
1: FIRM_0096 45262 45914
2: FIRM_0081 9225 29661
3: FIRM_0024 15289 10005
Remove columns with -:
firm_id declaration_date vat_inputs vat_outputs filing_year filing_quarter
<char> <Date> <int> <int> <int> <int>
1: FIRM_0096 2023-01-31 45262 45914 2023 1
2: FIRM_0081 2022-05-31 9225 29661 2022 2
3: FIRM_0024 2023-03-31 15289 10005 2023 1
filing_month
<int>
1: 1
2: 5
3: 3
Drop multiple columns:
firm_id declaration_date vat_inputs vat_outputs filing_quarter
<char> <Date> <int> <int> <int>
1: FIRM_0096 2023-01-31 45262 45914 1
2: FIRM_0081 2022-05-31 9225 29661 2
3: FIRM_0024 2023-03-31 15289 10005 1
filing_month
<int>
1: 1
2: 5
3: 3
mutate()mutate() creates new columns or modifies existing ones.
Basic calculation:
firm_id vat_outputs vat_inputs net_vat
<char> <int> <int> <int>
1: FIRM_0096 45914 45262 652
2: FIRM_0081 29661 9225 20436
3: FIRM_0024 10005 15289 -5284
Note
The original data is NOT changed. mutate() creates a new version with the additional column.
You can create several columns at once:
firm_id net_vat vat_ratio large_taxpayer
<char> <int> <num> <lgcl>
1: FIRM_0096 652 0.9857995 TRUE
2: FIRM_0081 20436 0.3110145 FALSE
3: FIRM_0024 -5284 1.5281359 FALSE
if_else()Create columns based on conditions:
firm_id net_vat is_refund
<char> <int> <lgcl>
1: FIRM_0096 652 FALSE
2: FIRM_0081 20436 FALSE
3: FIRM_0024 -5284 TRUE
4: FIRM_0091 5063 FALSE
5: FIRM_0014 -13939 TRUE
Syntax: if_else(condition, value_if_TRUE, value_if_FALSE)
Tip
You can use text too: if_else(net_vat < 0, "Refund", "Payment")
arrange()arrange() sorts your data by one or more columns.
Ascending order (smallest to largest):
firm_id vat_outputs
<char> <int>
1: FIRM_0066 1014
2: FIRM_0044 1045
3: FIRM_0020 1073
firm_id filing_year vat_outputs
<char> <int> <int>
1: FIRM_0075 2021 49710
2: FIRM_0056 2021 49409
3: FIRM_0060 2021 49309
4: FIRM_0062 2021 49283
5: FIRM_0051 2021 49280
Note
This is useful for ranking: “Who are the top taxpayers each year?”
summarize()summarize() calculates summary statistics and reduces your data to a single row.
Basic example:
total_vat_collected average_vat median_vat num_declarations
1 25883247 25883.25 26212.5 1000
Important
Always use na.rm = TRUE to remove missing values, otherwise you’ll get NA as the result!
| Function | What it calculates | Example |
|---|---|---|
sum() |
Total | Total VAT collected |
mean() |
Average | Average payment |
median() |
Middle value | Median VAT amount |
min() |
Smallest value | Minimum payment |
max() |
Largest value | Maximum payment |
n() |
Count of rows | Number of declarations |
n_distinct() |
Count unique values | Number of unique firms |
group_by()group_by() is powerful - it lets you perform operations separately for each group.
Example: Average VAT by year
# A tibble: 3 × 3
filing_year avg_vat num_declarations
<int> <dbl> <int>
1 2021 26088. 329
2 2022 25410. 332
3 2023 26148. 339
Note
Think of group_by() as creating separate “buckets” - one for each unique value. Then calculations happen within each bucket.
# A tibble: 8 × 4
filing_year filing_quarter avg_vat num_declarations
<int> <int> <dbl> <int>
1 2021 1 25976. 72
2 2021 2 27633. 77
3 2021 3 25752. 95
4 2021 4 25158. 85
5 2022 1 25747. 90
6 2022 2 24686. 78
7 2022 3 27352. 86
8 2022 4 23604. 78
Why .groups = "drop"?
After summarize(), R keeps the grouping structure by default. Adding .groups = "drop" removes this grouping, which:
Always add .groups = "drop" after group_by() + summarize()!
count()count() is a shortcut for group_by() + summarize(n = n()).
Simple frequency table:
filing_year n
<int> <int>
1: 2021 329
2: 2022 332
3: 2023 339
distinct()distinct() keeps only unique rows.
Example: Get unique firms
firm_id
<char>
1: FIRM_0096
2: FIRM_0081
3: FIRM_0024
4: FIRM_0091
5: FIRM_0014
Keep all columns:
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0096 2023-01-31 45262 45914 2023-12-31
2: FIRM_0081 2022-05-31 9225 29661 2021-07-31
3: FIRM_0024 2023-03-31 15289 10005 2023-09-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2023 1 1
2: 2022 2 5
3: 2023 1 3
Tip
.keep_all = TRUE means “keep all the other columns too, not just the ones I specified”.
Let’s combine multiple operations in one pipeline:
# Complete analysis: Top 5 firms by average VAT in 2023
panel_vat %>%
filter(filing_year == 2023) %>% # Only 2023
mutate(net_vat = vat_outputs - vat_inputs) %>% # Calculate net VAT
group_by(firm_id) %>% # Group by firm
summarize(avg_net_vat = mean(net_vat, na.rm = TRUE), .groups = "drop") %>% # Average per firm
arrange(desc(avg_net_vat)) %>% # Sort highest first
head(5) # Top 5# A tibble: 5 × 2
firm_id avg_net_vat
<chr> <dbl>
1 FIRM_0060 40856
2 FIRM_0053 27003
3 FIRM_0012 21663
4 FIRM_0022 21581
5 FIRM_0018 21551
Note
Read this from top to bottom like a recipe - each step builds on the previous one!
You can find the exercise in the folder “Exercises/exercise_01_template.R”
10:00 Your tasks:
Filtering: Filter panel_vat to declarations with vat_outputs greater than 30,000 and save as high_vat. Check how many rows using nrow().
Selecting: From panel_vat, select only firm_id, declaration_date, and vat_outputs. Display the first few rows with head().
Mutating: Create a new column called net_vat (vat_outputs - vat_inputs) and a flag is_refund (TRUE if net_vat < 0). Display the first few rows.
Arranging: Sort panel_cit by tax_paid (highest first) and show the top 10 taxpayers.
Summarizing: Calculate total, average, and median taxable_income from panel_cit.
Grouping: Calculate average vat_outputs by filing_year and count declarations per year using group_by() and summarize().
Challenge: Find the top 3 firms by total net VAT paid in 2022, showing firm_id and total amount.
# Load required packages
library(dplyr)
library(data.table)
library(here)
library(lubridate)
# Load data
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))
panel_cit <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_cit.csv"))
# Convert dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
panel_cit$declaration_date <- as.Date(panel_cit$declaration_date)
# 1. Filtering
high_vat <- panel_vat %>%
filter(vat_outputs > 30000)
nrow(high_vat)
# 2. Selecting
vat_selected <- panel_vat %>%
select(firm_id, declaration_date, vat_outputs)
head(vat_selected)
# 3. Mutating
vat_with_flags <- panel_vat %>%
mutate(
net_vat = vat_outputs - vat_inputs,
is_refund = if_else(net_vat < 0, TRUE, FALSE)
)
head(vat_with_flags)
# 4. Arranging
top_taxpayers <- panel_cit %>%
arrange(desc(tax_paid)) %>%
head(10)
top_taxpayers
# 5. Summarizing
cit_summary <- panel_cit %>%
summarize(
total_income = sum(taxable_income, na.rm = TRUE),
avg_income = mean(taxable_income, na.rm = TRUE),
median_income = median(taxable_income, na.rm = TRUE)
)
cit_summary
# 6. Grouping
vat_by_year <- panel_vat %>%
mutate(filing_year = year(declaration_date)) %>%
group_by(filing_year) %>%
summarize(
avg_vat = mean(vat_outputs, na.rm = TRUE),
num_declarations = n(),
.groups = "drop"
)
vat_by_year
# 7. Challenge
top_firms_2022 <- panel_vat %>%
mutate(filing_year = year(declaration_date)) %>%
filter(filing_year == 2022) %>%
mutate(net_vat = vat_outputs - vat_inputs) %>%
group_by(firm_id) %>%
summarize(total_vat = sum(net_vat, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(total_vat)) %>%
head(3)
top_firms_2022Tax data is inherently time-based:
📆 Filing dates - to check if declarations are late
📆 Tax periods - to calculate quarterly, annual totals
📆 Payment deadlines - to calculate penalties
📆 Time gaps - to identify irregular filing patterns
We need to extract, calculate, and analyze dates properly!
lubridate Packagelubridate makes working with dates much easier!
Common tasks:
Dates are often stored as text. We need to convert them to Date format:
[1] "2023-01-15" "2023-06-20" "2023-12-31"
[1] "2023-01-15" "2023-06-20" "2023-12-31"
[1] "2023-01-15" "2023-06-20" "2023-12-31"
Tip
Use the function that matches your format: ymd() for year-month-day, dmy() for day-month-year, etc.
Pull out specific parts of a date:
firm_id declaration_date filing_year filing_month filing_quarter
<char> <Date> <int> <int> <int>
1: FIRM_0096 2023-01-31 2023 1 1
2: FIRM_0081 2022-05-31 2022 5 2
3: FIRM_0024 2023-03-31 2023 3 1
4: FIRM_0091 2023-08-31 2023 8 3
5: FIRM_0014 2023-09-30 2023 9 3
Note
This is useful for grouping declarations by time period!
Calculate how many days between dates:
Time difference of 10 days
With real data:
Time difference of 1064 days
Note
Date arithmetic in R is straightforward - just subtract one date from another!
[1] "2023-05-30"
You can find the exercise in the folder “Exercises/exercise_02_template.R”
10:00 Your tasks:
Parse dates: Load panel_vat and convert declaration_date to proper Date format using as.Date(). Check with class().
Extract components: Create three new columns:
filing_year using year()filing_quarter using quarter()filing_month using month()Save as panel_vat_dates and display first few rows.
Calculate differences: For each firm, calculate days_since_last (days between consecutive declarations) using arrange(), group_by(), lag(), and ungroup(). Display results.
Filing analysis: Calculate quarter_end, filing_deadline (45 days after quarter end), and days_late. Then summarize: how many declarations were late?
Recent filers: Find firms that filed in the last 180 days from today() using filter() and distinct().
# Load required packages
library(dplyr)
library(lubridate)
library(data.table)
library(here)
# Load data
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))
# 1. Parse dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
class(panel_vat$declaration_date)
# 2. Extract date components
panel_vat_dates <- panel_vat %>%
mutate(
filing_year = year(declaration_date),
filing_quarter = quarter(declaration_date),
filing_month = month(declaration_date)
)
head(panel_vat_dates)
# 3. Calculate days between declarations
panel_vat_gaps <- panel_vat_dates %>%
arrange(firm_id, declaration_date) %>%
group_by(firm_id) %>%
mutate(days_since_last = as.numeric(declaration_date - lag(declaration_date))) %>%
ungroup()
panel_vat_gaps %>%
select(firm_id, declaration_date, days_since_last) %>%
head(10)
# 4. Filing analysis
panel_vat_deadlines <- panel_vat_dates %>%
mutate(
quarter_end = ceiling_date(declaration_date, "quarter") - 1,
filing_deadline = quarter_end + 45,
days_late = as.numeric(declaration_date - filing_deadline)
)
panel_vat_deadlines %>%
summarize(
total_declarations = n(),
late_filers = sum(days_late > 0, na.rm = TRUE),
pct_late = round(100 * late_filers / total_declarations, 2)
)
# 5. Recent filers
recent_filers <- panel_vat_dates %>%
filter(declaration_date >= (today() - 180)) %>%
distinct(firm_id)
nrow(recent_filers)
head(recent_filers)
# Combine all transformations
panel_vat_dates <- panel_vat_deadlines %>%
left_join(
panel_vat_gaps %>% select(firm_id, declaration_date, days_since_last),
by = c("firm_id", "declaration_date")
)The real power of dplyr comes from combining operations in a logical sequence:
# Complete workflow example
analysis_result <- panel_vat %>%
# Step 1: Calculate new variables
mutate(
net_vat = vat_outputs - vat_inputs,
filing_quarter = quarter(declaration_date)
) %>%
# Step 2: Filter to recent data
filter(filing_year == 2023) %>%
# Step 3: Analyze by quarter
group_by(filing_quarter) %>%
summarize(
total_vat = sum(net_vat, na.rm = TRUE),
avg_vat = mean(net_vat, na.rm = TRUE),
num_firms = n_distinct(firm_id),
.groups = "drop"
)
analysis_result# A tibble: 4 × 4
filing_quarter total_vat avg_vat num_firms
<int> <int> <dbl> <int>
1 1 442405 5395. 54
2 2 15185 167. 59
3 3 -104600 -1113. 62
4 4 179021 2486. 49
Sometimes it’s better to break long pipelines into smaller pieces:
Good for complex workflows:
# Step 1: Add derived variables
vat_enhanced <- panel_vat %>%
mutate(
net_vat = vat_outputs - vat_inputs,
filing_quarter = quarter(declaration_date)
)
# Step 2: Analyze
vat_summary <- vat_enhanced %>%
group_by(filing_year, filing_quarter) %>%
summarize(avg_vat = mean(net_vat, na.rm = TRUE), .groups = "drop")
head(vat_summary)# A tibble: 6 × 3
filing_year filing_quarter avg_vat
<int> <int> <dbl>
1 2021 1 -519.
2 2021 2 2447.
3 2021 3 219.
4 2021 4 442.
5 2022 1 100.
6 2022 2 -1321.
Tip
Break into steps when:
Build up your analysis with meaningful calculated columns:
# Create a rich dataset with many derived variables
panel_vat_enhanced <- panel_vat %>%
mutate(
# Financial calculations
net_vat = vat_outputs - vat_inputs,
vat_ratio = vat_inputs / vat_outputs,
# Business logic flags
is_refund = if_else(net_vat < 0, TRUE, FALSE),
large_taxpayer = if_else(vat_outputs > 40000, TRUE, FALSE),
high_ratio = if_else(vat_ratio > 0.9, TRUE, FALSE)
)
# Check what we created
panel_vat_enhanced %>%
select(firm_id, net_vat, is_refund, large_taxpayer, high_ratio) %>%
head(5) firm_id net_vat is_refund large_taxpayer high_ratio
<char> <int> <lgcl> <lgcl> <lgcl>
1: FIRM_0096 652 FALSE TRUE TRUE
2: FIRM_0081 20436 FALSE FALSE FALSE
3: FIRM_0024 -5284 TRUE FALSE TRUE
4: FIRM_0091 5063 FALSE FALSE FALSE
5: FIRM_0014 -13939 TRUE FALSE TRUE
Save your enhanced datasets for use in later modules:
# Save enhanced VAT panel to Clean folder
panel_vat_clean <- panel_vat %>%
mutate(
net_vat = vat_outputs - vat_inputs,
filing_quarter = quarter(declaration_date),
is_refund = net_vat < 0,
large_taxpayer = vat_outputs > 40000
)
fwrite(panel_vat_clean,
here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
# Save enhanced CIT panel to Clean folder
panel_cit_clean <- panel_cit %>%
mutate(
filing_quarter = quarter(declaration_date),
effective_tax_rate = tax_paid / taxable_income
)
fwrite(panel_cit_clean,
here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))Keep your files organized:
r_training_datax/
└── Exercises/
└── data/
├── Raw/ # Original, untouched data
│ ├── firm_characteristics.csv
│ ├── vat_declarations.dta
│ └── cit_declarations.xlsx
├── intermediate/ # After Module 2 cleaning
│ ├── dt_firms.csv
│ ├── panel_vat.csv
│ └── panel_cit.csv
└── Clean/ # After Module 3 transformation
├── panel_vat_clean.csv
└── panel_cit_clean.csv
Tip
Save enhanced versions to Clean folder so you can use them in future modules!
You can find the exercise in the folder “Exercises/exercise_03_template.R”
15:00 Your tasks:
Create fully transformed datasets ready for analysis:
net_vat, filing_year, filing_quarter, filing_monthvat_ratio and days_since_last per firmis_refund, large_taxpayer (>40000), high_ratio (>0.9)data/Clean/panel_vat_clean.csv using fwrite()filing_year, filing_quartereffective_tax_rate (tax_paid / taxable_income)has_adjustments (adjustments != 0)data/Clean/panel_cit_clean.csv using fwrite()clean_names()data/Clean/dt_firms_clean.csv using fwrite()file.exists()names() and glimpse()Remember: These cleaned datasets will be used in Modules 4, 5, and 6!
# Load required packages
library(dplyr)
library(lubridate)
library(data.table)
library(here)
library(janitor)
# Load data from intermediate folder
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))
panel_cit <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_cit.csv"))
dt_firms <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "dt_firms.csv"))
# Convert dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
panel_cit$declaration_date <- as.Date(panel_cit$declaration_date)
# 1. Transform panel_vat
panel_vat_clean <- panel_vat %>%
# Add date components
mutate(
filing_year = year(declaration_date),
filing_quarter = quarter(declaration_date),
filing_month = month(declaration_date)
) %>%
# Calculate financial metrics
mutate(
net_vat = vat_outputs - vat_inputs,
vat_ratio = vat_inputs / vat_outputs
) %>%
# Create business flags
mutate(
is_refund = if_else(net_vat < 0, TRUE, FALSE),
large_taxpayer = if_else(vat_outputs > 40000, TRUE, FALSE),
high_ratio = if_else(vat_ratio > 0.9, TRUE, FALSE, missing = FALSE)
) %>%
# Calculate days between declarations per firm
arrange(firm_id, declaration_date) %>%
group_by(firm_id) %>%
mutate(days_since_last = as.numeric(declaration_date - lag(declaration_date))) %>%
ungroup()
# Save to Clean folder
fwrite(panel_vat_clean,
here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
# 2. Transform panel_cit
panel_cit_clean <- panel_cit %>%
# Add date components
mutate(
filing_year = year(declaration_date),
filing_quarter = quarter(declaration_date)
) %>%
# Calculate financial metrics
mutate(
effective_tax_rate = tax_paid / taxable_income
) %>%
# Create business flags
mutate(
has_adjustments = if_else(adjustments != 0, TRUE, FALSE)
)
# Save to Clean folder
fwrite(panel_cit_clean,
here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))
# 3. Transform dt_firms
dt_firms_clean <- dt_firms %>%
clean_names()
# Save to Clean folder
fwrite(dt_firms_clean,
here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))
# 4. Verify the work
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))
# Load them back
vat_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
cit_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))
firms_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))
# Check columns
names(vat_verify)
names(cit_verify)
names(firms_verify)
# Quick summary
glimpse(vat_verify)
glimpse(cit_verify)
glimpse(firms_verify)Core dplyr verbs:
filter() - select rowsselect() - choose columnsmutate() - create/modify columnsarrange() - sort datasummarize() + group_by() - aggregatecount() - frequenciesdistinct() - remove duplicatesDate operations:
lubridate✅ Always use pipes (%>%) to chain operations clearly
✅ Save your work with <- for variables you’ll use again
✅ Include na.rm = TRUE in calculations
✅ Use meaningful variable names (net_vat, not x1)
✅ Comment your code so others understand it
✅ Save transformed datasets to Clean folder for future modules
✅ Add .groups = "drop" after group_by() + summarize()