Data Wrangling with `dplyr`

R Training

Introduction

Welcome!

In this module, we’ll learn how to wrangle (clean, transform, and prepare) tax data using dplyr.

What does “data wrangling” mean?

Data wrangling is the process of transforming raw data into a format that’s ready for analysis.

Think of it like preparing ingredients before cooking:

Cleaning vegetables 🥕 → Removing bad data
Cutting ingredients 🔪 → Selecting what you need
Mixing ingredients 🍜 → Combining datasets

By the end of this module, you’ll be able to:

Filter, select, and transform tax data
Create new variables and transform existing ones
Work with dates for tax filing analysis
Build complete data processing workflows

Why `dplyr`?

dplyr is part of the tidyverse - a collection of R packages designed to work together seamlessly.

Advantages:

✅ Readable - verbs match what you want to do

✅ Consistent - similar syntax across operations

✅ Beginner-friendly - clear and logical

✅ Well-documented - tons of resources

# English-like code!
tax_data %>%
  filter(filing_year == 2023) %>%
  select(firm_id, vat_outputs) %>%
  arrange(desc(vat_outputs))

# "Take tax_data, THEN
#  filter to 2023, THEN
#  select these columns, THEN
#  arrange by amount"

Note

We’ll explain what %>% means in just a moment!

Storing Your Work: Assignment in R

Before we dive into data wrangling, we need to understand a crucial concept:

Changes Don’t Stick Unless You Save Them!

When you transform data in R, the changes are NOT automatically saved. You must use the assignment operator <- to store results.

# This displays the result but DOESN'T save it
panel_vat %>%
  filter(vat_outputs > 40000)

# Next time you use panel_vat, those changes are GONE!

How Assignment Works

The assignment operator <- stores results in an object:

# Create a new object with filtered data
high_vat <- panel_vat %>%
  filter(vat_outputs > 40000)

# Now we can use high_vat later
nrow(high_vat)  # Check how many rows

[1] 211

You can also overwrite the original:

# This REPLACES panel_vat with the filtered version
panel_vat <- panel_vat %>%
  filter(vat_outputs > 40000)

# Be careful! The original data is now gone.

Warning

Overwriting is permanent! If you’re not sure, create a new object instead.

When to Create New vs. Overwrite

Create a NEW object when:

You want to keep the original data
You’re trying different approaches
You’re exploring or testing

vat_2023 <- panel_vat %>%
  filter(filing_year == 2023)

# Original panel_vat is still there!

Overwrite the SAME object when:

You’re cleaning data step-by-step
You’re adding permanent calculated columns
You’re sure you don’t need the original

# Add year column permanently
panel_vat <- panel_vat %>%
  mutate(filing_year = year(declaration_date))

# Now panel_vat has filing_year column!

Practice: Keyboard Shortcut

The assignment operator <- has a keyboard shortcut:

Tip

Alt + - (minus) (Windows/Linux)

Option + - (minus) (Mac)

This types <- automatically!

Let’s practice the concept:

# Create a variable with the year extracted
panel_vat <- panel_vat %>%
  mutate(filing_year = year(declaration_date))

# Verify it's there
"filing_year" %in% names(panel_vat)

[1] TRUE

The Pipe Operator: `%>%`

The pipe operator (%>%) is the key to writing clear, readable code.

Think of it as “and then”:

# Without pipes - hard to read!
arrange(select(filter(tax_data, filing_year == 2023), firm_id, vat_outputs), desc(vat_outputs))

# With pipes - much clearer!
tax_data %>%
  filter(filing_year == 2023) %>%
  select(firm_id, vat_outputs) %>%
  arrange(desc(vat_outputs))

How to read it:

Start with your data: tax_data
THEN (%>%) filter to 2023
THEN (%>%) select specific columns
THEN (%>%) sort the results

Tip

Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)

Pipe Example: Morning Coffee ☕

Let’s understand pipes with an everyday example:

Making coffee (without pipes):

drink(pour(brew(grind(beans))))
# You have to read from inside-out! Confusing!

Making coffee (with pipes):

beans %>%
  grind() %>%
  brew() %>%
  pour() %>%
  drink()

# Read top to bottom - just like following a recipe!

Same logic for tax data:

panel_vat %>%
  mutate(filing_year = year(declaration_date)) %>%  # Extract year first
  filter(filing_year == 2023) %>%                   # Get 2023 data
  mutate(net_vat = vat_outputs - vat_inputs) %>%    # Calculate net VAT
  arrange(desc(net_vat))                            # Sort by net VAT

Our Tax Datasets

Let’s remind ourselves what data we’re working with:

# Firm characteristics
glimpse(dt_firms)

 chr "/home/runner/work/datax_r/datax_r/data/intermediate/dt_firms.csv"

Our Tax Datasets (continued)

# VAT declarations
glimpse(panel_vat)

Rows: 1,000
Columns: 8
$ firm_id            <chr> "FIRM_0096", "FIRM_0081", "FIRM_0024", "FIRM_0091",…
$ declaration_date   <date> 2023-01-31, 2022-05-31, 2023-03-31, 2023-08-31, 20…
$ vat_inputs         <int> 45262, 9225, 15289, 9347, 47317, 44471, 23806, 4907…
$ vat_outputs        <int> 45914, 29661, 10005, 14410, 33378, 45639, 19855, 25…
$ reimbursement_date <date> 2023-12-31, 2021-07-31, 2023-09-30, 2021-11-30, 20…
$ filing_year        <int> 2023, 2022, 2023, 2023, 2023, 2022, 2021, 2022, 202…
$ filing_quarter     <int> 1, 2, 1, 3, 3, 4, 1, 3, 4, 4, 4, 2, 2, 1, 3, 1, 4, …
$ filing_month       <int> 1, 5, 3, 8, 9, 12, 2, 7, 12, 11, 12, 4, 6, 3, 9, 2,…

Note

glimpse() is a dplyr function that gives us a quick overview of our data - the column names, data types, and first few values.

Core Single-Table Operations

The Six Core Verbs

dplyr has six main verbs (functions) for working with data:

Verb	What it does	Example
`filter()`	Choose rows based on conditions	“Show me only 2023 declarations”
`select()`	Choose columns	“I only need firm ID and VAT amount”
`mutate()`	Create or modify columns	“Calculate net VAT”
`arrange()`	Sort rows	“Order by VAT amount, highest first”
`summarize()`	Calculate summaries	“What’s the average VAT payment?”
`group_by()`	Group data for grouped operations	“Average VAT by industry”

Let’s learn each one with real tax data examples!

Filtering Rows: `filter()`

filter() lets you select specific rows based on conditions.

Basic example:

# Get only declarations with high VAT outputs
panel_vat %>%
  filter(vat_outputs > 40000) %>%
  head(3)  # Show first 3 rows

     firm_id declaration_date vat_inputs vat_outputs reimbursement_date
      <char>           <Date>      <int>       <int>             <Date>
1: FIRM_0096       2023-01-31      45262       45914         2023-12-31
2: FIRM_0090       2022-12-31      44471       45639         2022-10-31
3: FIRM_0078       2021-06-30      46685       46336         2022-04-30
   filing_year filing_quarter filing_month
         <int>          <int>        <int>
1:        2023              1            1
2:        2022              4           12
3:        2021              2            6

Important

Notice we use == (two equal signs) for comparison, not = (one equal sign).

== means “is equal to”
= is for assigning values

Common Filter Conditions

You can filter using various comparison operators:

# Greater than
panel_vat %>%
  filter(vat_outputs > 40000) %>%
  head(3)

     firm_id declaration_date vat_inputs vat_outputs reimbursement_date
      <char>           <Date>      <int>       <int>             <Date>
1: FIRM_0096       2023-01-31      45262       45914         2023-12-31
2: FIRM_0090       2022-12-31      44471       45639         2022-10-31
3: FIRM_0078       2021-06-30      46685       46336         2022-04-30
   filing_year filing_quarter filing_month
         <int>          <int>        <int>
1:        2023              1            1
2:        2022              4           12
3:        2021              2            6

Available operators:

== equal to
!= not equal to
> greater than
< less than
>= greater than or equal to
<= less than or equal to

Multiple Conditions

Combine conditions using & (AND) or | (OR):

# High outputs AND low inputs (suspicious pattern?)
panel_vat %>%
  filter(vat_outputs > 40000 & vat_inputs < 20000) %>%
  head(3)

     firm_id declaration_date vat_inputs vat_outputs reimbursement_date
      <char>           <Date>      <int>       <int>             <Date>
1: FIRM_0095       2023-10-31       8254       43505         2021-12-31
2: FIRM_0015       2022-02-28       1444       44201         2023-01-31
3: FIRM_0099       2023-01-31      16775       47471         2023-05-31
   filing_year filing_quarter filing_month
         <int>          <int>        <int>
1:        2023              4           10
2:        2022              1            2
3:        2023              1            1

# Either very high outputs OR very high inputs
panel_vat %>%
  filter(vat_outputs > 48000 | vat_inputs > 48000) %>%
  head(3)

     firm_id declaration_date vat_inputs vat_outputs reimbursement_date
      <char>           <Date>      <int>       <int>             <Date>
1: FIRM_0080       2022-07-31      49071       25070         2022-06-30
2: FIRM_0040       2022-12-31      43741       49806         2021-07-31
3: FIRM_0100       2021-10-31      48270       14916         2021-06-30
   filing_year filing_quarter filing_month
         <int>          <int>        <int>
1:        2022              3            7
2:        2022              4           12
3:        2021              4           10

Useful Filter Helpers

1. Check if value is in a list: %in%

# Filter to specific firms
firms_of_interest <- c("FIRM_0001", "FIRM_0005", "FIRM_0010")

panel_vat %>%
  filter(firm_id %in% firms_of_interest) %>%
  head(3)

     firm_id declaration_date vat_inputs vat_outputs reimbursement_date
      <char>           <Date>      <int>       <int>             <Date>
1: FIRM_0010       2021-02-28      30557       40239         2022-03-31
2: FIRM_0005       2022-09-30      22196       30998         2022-04-30
3: FIRM_0005       2021-04-30       5016       30653         2021-10-31
   filing_year filing_quarter filing_month
         <int>          <int>        <int>
1:        2021              1            2
2:        2022              3            9
3:        2021              2            4

Tip

%in% is much easier than writing firm_id == "FIRM_0001" | firm_id == "FIRM_0005" | ...

Useful Filter Helpers (continued)

2. Check if value is within a range: between()

# Medium-sized VAT payments (between 20,000 and 40,000)
panel_vat %>%
  filter(between(vat_outputs, 20000, 40000)) %>%
  head(3)

     firm_id declaration_date vat_inputs vat_outputs reimbursement_date
      <char>           <Date>      <int>       <int>             <Date>
1: FIRM_0081       2022-05-31       9225       29661         2021-07-31
2: FIRM_0014       2023-09-30      47317       33378         2021-12-31
3: FIRM_0080       2022-07-31      49071       25070         2022-06-30
   filing_year filing_quarter filing_month
         <int>          <int>        <int>
1:        2022              2            5
2:        2023              3            9
3:        2022              3            7

Selecting Columns: `select()`

select() lets you choose which columns to keep.

Select specific columns:

# Keep only firm ID and VAT amounts
panel_vat %>%
  select(firm_id, vat_inputs, vat_outputs) %>%
  head(3)

     firm_id vat_inputs vat_outputs
      <char>      <int>       <int>
1: FIRM_0096      45262       45914
2: FIRM_0081       9225       29661
3: FIRM_0024      15289       10005

Select a range of columns:

# Select from firm_id to vat_inputs (all columns in between)
panel_vat %>%
  select(firm_id:vat_inputs) %>%
  head(3)

     firm_id declaration_date vat_inputs
      <char>           <Date>      <int>
1: FIRM_0096       2023-01-31      45262
2: FIRM_0081       2022-05-31       9225
3: FIRM_0024       2023-03-31      15289

Dropping Columns

Remove columns with -:

# Keep everything EXCEPT reimbursement_date
panel_vat %>%
  select(-reimbursement_date) %>%
  head(3)

     firm_id declaration_date vat_inputs vat_outputs filing_year filing_quarter
      <char>           <Date>      <int>       <int>       <int>          <int>
1: FIRM_0096       2023-01-31      45262       45914        2023              1
2: FIRM_0081       2022-05-31       9225       29661        2022              2
3: FIRM_0024       2023-03-31      15289       10005        2023              1
   filing_month
          <int>
1:            1
2:            5
3:            3

Drop multiple columns:

# Remove multiple columns
panel_vat %>%
  select(-reimbursement_date, -filing_year) %>%
  head(3)

     firm_id declaration_date vat_inputs vat_outputs filing_quarter
      <char>           <Date>      <int>       <int>          <int>
1: FIRM_0096       2023-01-31      45262       45914              1
2: FIRM_0081       2022-05-31       9225       29661              2
3: FIRM_0024       2023-03-31      15289       10005              1
   filing_month
          <int>
1:            1
2:            5
3:            3

Creating New Columns: `mutate()`

mutate() creates new columns or modifies existing ones.

Basic calculation:

# Calculate net VAT (outputs minus inputs)
panel_vat %>%
  mutate(net_vat = vat_outputs - vat_inputs) %>%
  select(firm_id, vat_outputs, vat_inputs, net_vat) %>%
  head(3)

     firm_id vat_outputs vat_inputs net_vat
      <char>       <int>      <int>   <int>
1: FIRM_0096       45914      45262     652
2: FIRM_0081       29661       9225   20436
3: FIRM_0024       10005      15289   -5284

Note

The original data is NOT changed. mutate() creates a new version with the additional column.

Multiple Calculations

You can create several columns at once:

# Create multiple new columns
panel_vat %>%
  mutate(
    net_vat = vat_outputs - vat_inputs,
    vat_ratio = vat_inputs / vat_outputs,
    large_taxpayer = vat_outputs > 40000
  ) %>%
  select(firm_id, net_vat, vat_ratio, large_taxpayer) %>%
  head(3)

     firm_id net_vat vat_ratio large_taxpayer
      <char>   <int>     <num>         <lgcl>
1: FIRM_0096     652 0.9857995           TRUE
2: FIRM_0081   20436 0.3110145          FALSE
3: FIRM_0024   -5284 1.5281359          FALSE

Conditional Logic: `if_else()`

Create columns based on conditions:

# Flag potential refund cases (negative net VAT)
panel_vat %>%
  mutate(
    net_vat = vat_outputs - vat_inputs,
    is_refund = if_else(net_vat < 0, TRUE, FALSE)
  ) %>%
  select(firm_id, net_vat, is_refund) %>%
  head(5)

     firm_id net_vat is_refund
      <char>   <int>    <lgcl>
1: FIRM_0096     652     FALSE
2: FIRM_0081   20436     FALSE
3: FIRM_0024   -5284      TRUE
4: FIRM_0091    5063     FALSE
5: FIRM_0014  -13939      TRUE

Syntax: if_else(condition, value_if_TRUE, value_if_FALSE)

Tip

You can use text too: if_else(net_vat < 0, "Refund", "Payment")

Sorting Data: `arrange()`

arrange() sorts your data by one or more columns.

Ascending order (smallest to largest):

# Sort by VAT outputs, lowest first
panel_vat %>%
  select(firm_id, vat_outputs) %>%
  arrange(vat_outputs) %>%
  head(3)

     firm_id vat_outputs
      <char>       <int>
1: FIRM_0066        1014
2: FIRM_0044        1045
3: FIRM_0020        1073

Descending order (largest to smallest):

# Sort by VAT outputs, highest first
panel_vat %>%
  select(firm_id, vat_outputs) %>%
  arrange(desc(vat_outputs)) %>%
  head(3)

     firm_id vat_outputs
      <char>       <int>
1: FIRM_0020       49969
2: FIRM_0096       49844
3: FIRM_0040       49806

Sorting by Multiple Columns

# Sort by year, then by VAT outputs (within each year)
panel_vat %>%
  select(firm_id, filing_year, vat_outputs) %>%
  arrange(filing_year, desc(vat_outputs)) %>%
  head(5)

     firm_id filing_year vat_outputs
      <char>       <int>       <int>
1: FIRM_0075        2021       49710
2: FIRM_0056        2021       49409
3: FIRM_0060        2021       49309
4: FIRM_0062        2021       49283
5: FIRM_0051        2021       49280

Note

This is useful for ranking: “Who are the top taxpayers each year?”

Summarizing Data: `summarize()`

summarize() calculates summary statistics and reduces your data to a single row.

Basic example:

# Calculate overall statistics
panel_vat %>%
  summarize(
    total_vat_collected = sum(vat_outputs, na.rm = TRUE),
    average_vat = mean(vat_outputs, na.rm = TRUE),
    median_vat = median(vat_outputs, na.rm = TRUE),
    num_declarations = n()
  )

  total_vat_collected average_vat median_vat num_declarations
1            25883247    25883.25    26212.5             1000

Important

Always use na.rm = TRUE to remove missing values, otherwise you’ll get NA as the result!

Common Summary Functions

Function	What it calculates	Example
`sum()`	Total	Total VAT collected
`mean()`	Average	Average payment
`median()`	Middle value	Median VAT amount
`min()`	Smallest value	Minimum payment
`max()`	Largest value	Maximum payment
`n()`	Count of rows	Number of declarations
`n_distinct()`	Count unique values	Number of unique firms

# More examples
panel_vat %>%
  summarize(
    min_vat = min(vat_outputs, na.rm = TRUE),
    max_vat = max(vat_outputs, na.rm = TRUE),
    unique_firms = n_distinct(firm_id)
  )

  min_vat max_vat unique_firms
1    1014   49969          100

Grouping Data: `group_by()`

group_by() is powerful - it lets you perform operations separately for each group.

Example: Average VAT by year

# Calculate average VAT for each year separately
panel_vat %>%
  group_by(filing_year) %>%
  summarize(
    avg_vat = mean(vat_outputs, na.rm = TRUE),
    num_declarations = n(),
    .groups = "drop"
  )

# A tibble: 3 × 3
  filing_year avg_vat num_declarations
        <int>   <dbl>            <int>
1        2021  26088.              329
2        2022  25410.              332
3        2023  26148.              339

Note

Think of group_by() as creating separate “buckets” - one for each unique value. Then calculations happen within each bucket.

Grouping by Multiple Variables

# Average VAT by year AND quarter
panel_vat %>%
  group_by(filing_year, filing_quarter) %>%
  summarize(
    avg_vat = mean(vat_outputs, na.rm = TRUE),
    num_declarations = n(),
    .groups = "drop"
  ) %>%
  head(8)

# A tibble: 8 × 4
  filing_year filing_quarter avg_vat num_declarations
        <int>          <int>   <dbl>            <int>
1        2021              1  25976.               72
2        2021              2  27633.               77
3        2021              3  25752.               95
4        2021              4  25158.               85
5        2022              1  25747.               90
6        2022              2  24686.               78
7        2022              3  27352.               86
8        2022              4  23604.               78

Why .groups = "drop"?

After summarize(), R keeps the grouping structure by default. Adding .groups = "drop" removes this grouping, which:

Prevents unexpected behavior in later operations
Is considered best practice

Always add .groups = "drop" after group_by() + summarize()!

Counting: `count()`

count() is a shortcut for group_by() + summarize(n = n()).

Simple frequency table:

# How many declarations per year?
panel_vat %>%
  count(filing_year)

   filing_year     n
         <int> <int>
1:        2021   329
2:        2022   332
3:        2023   339

Sort by frequency:

# Most common years
panel_vat %>%
  count(filing_year, sort = TRUE)

   filing_year     n
         <int> <int>
1:        2023   339
2:        2022   332
3:        2021   329

Counting Unique Values

# How many unique firms filed each year?
panel_vat %>%
  group_by(filing_year) %>%
  summarize(unique_firms = n_distinct(firm_id), .groups = "drop")

# A tibble: 3 × 2
  filing_year unique_firms
        <int>        <int>
1        2021           96
2        2022           95
3        2023           97

Removing Duplicates: `distinct()`

distinct() keeps only unique rows.

Example: Get unique firms

# List of unique firms in the data
panel_vat %>%
  distinct(firm_id) %>%
  head(5)

     firm_id
      <char>
1: FIRM_0096
2: FIRM_0081
3: FIRM_0024
4: FIRM_0091
5: FIRM_0014

Keep all columns:

# Remove duplicate firm-year combinations, keep first occurrence
panel_vat %>%
  distinct(firm_id, filing_year, .keep_all = TRUE) %>%
  head(3)

     firm_id declaration_date vat_inputs vat_outputs reimbursement_date
      <char>           <Date>      <int>       <int>             <Date>
1: FIRM_0096       2023-01-31      45262       45914         2023-12-31
2: FIRM_0081       2022-05-31       9225       29661         2021-07-31
3: FIRM_0024       2023-03-31      15289       10005         2023-09-30
   filing_year filing_quarter filing_month
         <int>          <int>        <int>
1:        2023              1            1
2:        2022              2            5
3:        2023              1            3

Tip

.keep_all = TRUE means “keep all the other columns too, not just the ones I specified”.

Putting It All Together

Let’s combine multiple operations in one pipeline:

# Complete analysis: Top 5 firms by average VAT in 2023
panel_vat %>%
  filter(filing_year == 2023) %>%                       # Only 2023
  mutate(net_vat = vat_outputs - vat_inputs) %>%        # Calculate net VAT
  group_by(firm_id) %>%                                 # Group by firm
  summarize(avg_net_vat = mean(net_vat, na.rm = TRUE), .groups = "drop") %>%  # Average per firm
  arrange(desc(avg_net_vat)) %>%                        # Sort highest first
  head(5)                                               # Top 5

# A tibble: 5 × 2
  firm_id   avg_net_vat
  <chr>           <dbl>
1 FIRM_0060       40856
2 FIRM_0053       27003
3 FIRM_0012       21663
4 FIRM_0022       21581
5 FIRM_0018       21551

Note

Read this from top to bottom like a recipe - each step builds on the previous one!

Exercise 1: Core dplyr Operations

You can find the exercise in the folder “Exercises/exercise_01_template.R”

10:00

Your tasks:

Filtering: Filter panel_vat to declarations with vat_outputs greater than 30,000 and save as high_vat. Check how many rows using nrow().
Selecting: From panel_vat, select only firm_id, declaration_date, and vat_outputs. Display the first few rows with head().
Mutating: Create a new column called net_vat (vat_outputs - vat_inputs) and a flag is_refund (TRUE if net_vat < 0). Display the first few rows.
Arranging: Sort panel_cit by tax_paid (highest first) and show the top 10 taxpayers.
Summarizing: Calculate total, average, and median taxable_income from panel_cit.
Grouping: Calculate average vat_outputs by filing_year and count declarations per year using group_by() and summarize().
Challenge: Find the top 3 firms by total net VAT paid in 2022, showing firm_id and total amount.

Exercise 1: Solutions

# Load required packages
library(dplyr)
library(data.table)
library(here)
library(lubridate)

# Load data
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))
panel_cit <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_cit.csv"))

# Convert dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
panel_cit$declaration_date <- as.Date(panel_cit$declaration_date)

# 1. Filtering
high_vat <- panel_vat %>%
  filter(vat_outputs > 30000)

nrow(high_vat)

# 2. Selecting
vat_selected <- panel_vat %>%
  select(firm_id, declaration_date, vat_outputs)

head(vat_selected)

# 3. Mutating
vat_with_flags <- panel_vat %>%
  mutate(
    net_vat = vat_outputs - vat_inputs,
    is_refund = if_else(net_vat < 0, TRUE, FALSE)
  )

head(vat_with_flags)

# 4. Arranging
top_taxpayers <- panel_cit %>%
  arrange(desc(tax_paid)) %>%
  head(10)

top_taxpayers

# 5. Summarizing
cit_summary <- panel_cit %>%
  summarize(
    total_income = sum(taxable_income, na.rm = TRUE),
    avg_income = mean(taxable_income, na.rm = TRUE),
    median_income = median(taxable_income, na.rm = TRUE)
  )

cit_summary

# 6. Grouping
vat_by_year <- panel_vat %>%
  mutate(filing_year = year(declaration_date)) %>%
  group_by(filing_year) %>%
  summarize(
    avg_vat = mean(vat_outputs, na.rm = TRUE),
    num_declarations = n(),
    .groups = "drop"
  )

vat_by_year

# 7. Challenge
top_firms_2022 <- panel_vat %>%
  mutate(filing_year = year(declaration_date)) %>%
  filter(filing_year == 2022) %>%
  mutate(net_vat = vat_outputs - vat_inputs) %>%
  group_by(firm_id) %>%
  summarize(total_vat = sum(net_vat, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_vat)) %>%
  head(3)

top_firms_2022

Working with Dates

Why Date Operations Matter in Tax Data

Tax data is inherently time-based:

📆 Filing dates - to check if declarations are late

📆 Tax periods - to calculate quarterly, annual totals

📆 Payment deadlines - to calculate penalties

📆 Time gaps - to identify irregular filing patterns

We need to extract, calculate, and analyze dates properly!

Date Operations: The `lubridate` Package

lubridate makes working with dates much easier!

Common tasks:

Parse dates from text
Extract parts of dates (year, month, quarter)
Calculate differences between dates
Add/subtract time periods

Parsing Dates from Text

Dates are often stored as text. We need to convert them to Date format:

# Different date formats
dates_ymd <- c("2023-01-15", "2023-06-20", "2023-12-31")
dates_dmy <- c("15/01/2023", "20/06/2023", "31/12/2023")
dates_mdy <- c("01/15/2023", "06/20/2023", "12/31/2023")

# Parse them (lubridate automatically handles the format!)
ymd(dates_ymd)

[1] "2023-01-15" "2023-06-20" "2023-12-31"

dmy(dates_dmy)

[1] "2023-01-15" "2023-06-20" "2023-12-31"

mdy(dates_mdy)

[1] "2023-01-15" "2023-06-20" "2023-12-31"

Tip

Use the function that matches your format: ymd() for year-month-day, dmy() for day-month-year, etc.

Extracting Date Components

Pull out specific parts of a date:

# Example date
example_date <- ymd("2023-06-15")

# Extract components
year(example_date)

[1] 2023

month(example_date)

[1] 6

quarter(example_date)

[1] 2

day(example_date)

[1] 15

Extracting Date Components (Real Data)

# Add year, month, quarter to VAT data
panel_vat %>%
  select(firm_id, declaration_date, filing_year, filing_month, filing_quarter) %>%
  head(5)

     firm_id declaration_date filing_year filing_month filing_quarter
      <char>           <Date>       <int>        <int>          <int>
1: FIRM_0096       2023-01-31        2023            1              1
2: FIRM_0081       2022-05-31        2022            5              2
3: FIRM_0024       2023-03-31        2023            3              1
4: FIRM_0091       2023-08-31        2023            8              3
5: FIRM_0014       2023-09-30        2023            9              3

Note

This is useful for grouping declarations by time period!

Date Arithmetic: Calculating Differences

Calculate how many days between dates:

# Example: How late was the filing?
deadline <- ymd("2023-04-30")
filing_date <- ymd("2023-05-10")

# Days late
filing_date - deadline

Time difference of 10 days

With real data:

# How many days between two dates in our data?
first_date <- min(panel_vat$declaration_date)
last_date <- max(panel_vat$declaration_date)

# Total days covered by our data
last_date - first_date

Time difference of 1064 days

Note

Date arithmetic in R is straightforward - just subtract one date from another!

Date Arithmetic: Adding/Subtracting Time

# Add 30 days to a date
filing_deadline <- ymd("2023-04-30")
payment_deadline <- filing_deadline + 30
payment_deadline

[1] "2023-05-30"

# Subtract days
ymd("2023-12-31") - 90  # 90 days before Dec 31

[1] "2023-10-02"

Exercise 2: Date Operations

You can find the exercise in the folder “Exercises/exercise_02_template.R”

10:00

Your tasks:

Parse dates: Load panel_vat and convert declaration_date to proper Date format using as.Date(). Check with class().
Extract components: Create three new columns:
- filing_year using year()
- filing_quarter using quarter()
- filing_month using month()
Save as panel_vat_dates and display first few rows.
Calculate differences: For each firm, calculate days_since_last (days between consecutive declarations) using arrange(), group_by(), lag(), and ungroup(). Display results.
Filing analysis: Calculate quarter_end, filing_deadline (45 days after quarter end), and days_late. Then summarize: how many declarations were late?
Recent filers: Find firms that filed in the last 180 days from today() using filter() and distinct().

Exercise 2: Solutions

# Load required packages
library(dplyr)
library(lubridate)
library(data.table)
library(here)

# Load data
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))

# 1. Parse dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
class(panel_vat$declaration_date)

# 2. Extract date components
panel_vat_dates <- panel_vat %>%
  mutate(
    filing_year = year(declaration_date),
    filing_quarter = quarter(declaration_date),
    filing_month = month(declaration_date)
  )

head(panel_vat_dates)

# 3. Calculate days between declarations
panel_vat_gaps <- panel_vat_dates %>%
  arrange(firm_id, declaration_date) %>%
  group_by(firm_id) %>%
  mutate(days_since_last = as.numeric(declaration_date - lag(declaration_date))) %>%
  ungroup()

panel_vat_gaps %>%
  select(firm_id, declaration_date, days_since_last) %>%
  head(10)

# 4. Filing analysis
panel_vat_deadlines <- panel_vat_dates %>%
  mutate(
    quarter_end = ceiling_date(declaration_date, "quarter") - 1,
    filing_deadline = quarter_end + 45,
    days_late = as.numeric(declaration_date - filing_deadline)
  )

panel_vat_deadlines %>%
  summarize(
    total_declarations = n(),
    late_filers = sum(days_late > 0, na.rm = TRUE),
    pct_late = round(100 * late_filers / total_declarations, 2)
  )

# 5. Recent filers
recent_filers <- panel_vat_dates %>%
  filter(declaration_date >= (today() - 180)) %>%
  distinct(firm_id)

nrow(recent_filers)
head(recent_filers)

# Combine all transformations
panel_vat_dates <- panel_vat_deadlines %>%
  left_join(
    panel_vat_gaps %>% select(firm_id, declaration_date, days_since_last),
    by = c("firm_id", "declaration_date")
  )

Building Complete Workflows

Chaining Multiple Operations

The real power of dplyr comes from combining operations in a logical sequence:

# Complete workflow example
analysis_result <- panel_vat %>%
  # Step 1: Calculate new variables
  mutate(
    net_vat = vat_outputs - vat_inputs,
    filing_quarter = quarter(declaration_date)
  ) %>%
  
  # Step 2: Filter to recent data
  filter(filing_year == 2023) %>%
  
  # Step 3: Analyze by quarter
  group_by(filing_quarter) %>%
  summarize(
    total_vat = sum(net_vat, na.rm = TRUE),
    avg_vat = mean(net_vat, na.rm = TRUE),
    num_firms = n_distinct(firm_id),
    .groups = "drop"
  )

analysis_result

# A tibble: 4 × 4
  filing_quarter total_vat avg_vat num_firms
           <int>     <int>   <dbl>     <int>
1              1    442405   5395.        54
2              2     15185    167.        59
3              3   -104600  -1113.        62
4              4    179021   2486.        49

When to Break Into Steps

Sometimes it’s better to break long pipelines into smaller pieces:

Good for complex workflows:

# Step 1: Add derived variables
vat_enhanced <- panel_vat %>%
  mutate(
    net_vat = vat_outputs - vat_inputs,
    filing_quarter = quarter(declaration_date)
  )

# Step 2: Analyze
vat_summary <- vat_enhanced %>%
  group_by(filing_year, filing_quarter) %>%
  summarize(avg_vat = mean(net_vat, na.rm = TRUE), .groups = "drop")

head(vat_summary)

# A tibble: 6 × 3
  filing_year filing_quarter avg_vat
        <int>          <int>   <dbl>
1        2021              1   -519.
2        2021              2   2447.
3        2021              3    219.
4        2021              4    442.
5        2022              1    100.
6        2022              2  -1321.

Tip

Break into steps when:

The pipeline is getting too long (>10 operations)
You want to check intermediate results
You’ll reuse intermediate datasets

Creating Derived Variables

Build up your analysis with meaningful calculated columns:

# Create a rich dataset with many derived variables
panel_vat_enhanced <- panel_vat %>%
  mutate(
    # Financial calculations
    net_vat = vat_outputs - vat_inputs,
    vat_ratio = vat_inputs / vat_outputs,
    
    # Business logic flags
    is_refund = if_else(net_vat < 0, TRUE, FALSE),
    large_taxpayer = if_else(vat_outputs > 40000, TRUE, FALSE),
    high_ratio = if_else(vat_ratio > 0.9, TRUE, FALSE)
  )

# Check what we created
panel_vat_enhanced %>%
  select(firm_id, net_vat, is_refund, large_taxpayer, high_ratio) %>%
  head(5)

     firm_id net_vat is_refund large_taxpayer high_ratio
      <char>   <int>    <lgcl>         <lgcl>     <lgcl>
1: FIRM_0096     652     FALSE           TRUE       TRUE
2: FIRM_0081   20436     FALSE          FALSE      FALSE
3: FIRM_0024   -5284      TRUE          FALSE       TRUE
4: FIRM_0091    5063     FALSE          FALSE      FALSE
5: FIRM_0014  -13939      TRUE          FALSE       TRUE

Saving Transformed Outputs

Save your enhanced datasets for use in later modules:

# Save enhanced VAT panel to Clean folder
panel_vat_clean <- panel_vat %>%
  mutate(
    net_vat = vat_outputs - vat_inputs,
    filing_quarter = quarter(declaration_date),
    is_refund = net_vat < 0,
    large_taxpayer = vat_outputs > 40000
  )

fwrite(panel_vat_clean,
       here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))

# Save enhanced CIT panel to Clean folder
panel_cit_clean <- panel_cit %>%
  mutate(
    filing_quarter = quarter(declaration_date),
    effective_tax_rate = tax_paid / taxable_income
  )

fwrite(panel_cit_clean,
       here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))

File Organization Best Practice

Keep your files organized:

r_training_datax/
└── Exercises/
    └── data/
        ├── Raw/              # Original, untouched data
        │   ├── firm_characteristics.csv
        │   ├── vat_declarations.dta
        │   └── cit_declarations.xlsx
        ├── intermediate/     # After Module 2 cleaning
        │   ├── dt_firms.csv
        │   ├── panel_vat.csv
        │   └── panel_cit.csv
        └── Clean/            # After Module 3 transformation
            ├── panel_vat_clean.csv
            └── panel_cit_clean.csv

Tip

Save enhanced versions to Clean folder so you can use them in future modules!

Exercise 3: Complete Transformation Workflow

You can find the exercise in the folder “Exercises/exercise_03_template.R”

15:00

Your tasks:

Create fully transformed datasets ready for analysis:

Transform panel_vat:
- Add net_vat, filing_year, filing_quarter, filing_month
- Calculate vat_ratio and days_since_last per firm
- Create flags: is_refund, large_taxpayer (>40000), high_ratio (>0.9)
- Save as data/Clean/panel_vat_clean.csv using fwrite()
Transform panel_cit:
- Add filing_year, filing_quarter
- Calculate effective_tax_rate (tax_paid / taxable_income)
- Create flag: has_adjustments (adjustments != 0)
- Save as data/Clean/panel_cit_clean.csv using fwrite()
Transform dt_firms:
- Rename columns to lowercase using clean_names()
- Save as data/Clean/dt_firms_clean.csv using fwrite()
Verify your work:
- Check that all three files exist in the Clean folder using file.exists()
- Load them back and verify the new columns are present using names() and glimpse()

Remember: These cleaned datasets will be used in Modules 4, 5, and 6!

Exercise 3: Solutions

# Load required packages
library(dplyr)
library(lubridate)
library(data.table)
library(here)
library(janitor)

# Load data from intermediate folder
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))
panel_cit <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_cit.csv"))
dt_firms <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "dt_firms.csv"))

# Convert dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
panel_cit$declaration_date <- as.Date(panel_cit$declaration_date)

# 1. Transform panel_vat
panel_vat_clean <- panel_vat %>%
  # Add date components
  mutate(
    filing_year = year(declaration_date),
    filing_quarter = quarter(declaration_date),
    filing_month = month(declaration_date)
  ) %>%
  # Calculate financial metrics
  mutate(
    net_vat = vat_outputs - vat_inputs,
    vat_ratio = vat_inputs / vat_outputs
  ) %>%
  # Create business flags
  mutate(
    is_refund = if_else(net_vat < 0, TRUE, FALSE),
    large_taxpayer = if_else(vat_outputs > 40000, TRUE, FALSE),
    high_ratio = if_else(vat_ratio > 0.9, TRUE, FALSE, missing = FALSE)
  ) %>%
  # Calculate days between declarations per firm
  arrange(firm_id, declaration_date) %>%
  group_by(firm_id) %>%
  mutate(days_since_last = as.numeric(declaration_date - lag(declaration_date))) %>%
  ungroup()

# Save to Clean folder
fwrite(panel_vat_clean,
       here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))

# 2. Transform panel_cit
panel_cit_clean <- panel_cit %>%
  # Add date components
  mutate(
    filing_year = year(declaration_date),
    filing_quarter = quarter(declaration_date)
  ) %>%
  # Calculate financial metrics
  mutate(
    effective_tax_rate = tax_paid / taxable_income
  ) %>%
  # Create business flags
  mutate(
    has_adjustments = if_else(adjustments != 0, TRUE, FALSE)
  )

# Save to Clean folder
fwrite(panel_cit_clean,
       here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))

# 3. Transform dt_firms
dt_firms_clean <- dt_firms %>%
  clean_names()

# Save to Clean folder
fwrite(dt_firms_clean,
       here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))

# 4. Verify the work
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))

# Load them back
vat_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
cit_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))
firms_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))

# Check columns
names(vat_verify)
names(cit_verify)
names(firms_verify)

# Quick summary
glimpse(vat_verify)
glimpse(cit_verify)
glimpse(firms_verify)

What We’ve Learned

Core dplyr verbs:

filter() - select rows
select() - choose columns
mutate() - create/modify columns
arrange() - sort data
summarize() + group_by() - aggregate
count() - frequencies
distinct() - remove duplicates

Date operations:

Parse dates with lubridate
Extract date components
Calculate date differences
Create time-based variables

Best Practices Recap

✅ Always use pipes (%>%) to chain operations clearly

✅ Save your work with <- for variables you’ll use again

✅ Include na.rm = TRUE in calculations

✅ Use meaningful variable names (net_vat, not x1)

✅ Comment your code so others understand it

✅ Save transformed datasets to Clean folder for future modules

✅ Add .groups = "drop" after group_by() + summarize()

Data Wrangling with dplyr

Introduction

Welcome!

Why dplyr?

Storing Your Work: Assignment in R

How Assignment Works

When to Create New vs. Overwrite

Practice: Keyboard Shortcut

The Pipe Operator: %>%

Pipe Example: Morning Coffee ☕

Our Tax Datasets

Our Tax Datasets (continued)

Core Single-Table Operations

The Six Core Verbs

Filtering Rows: filter()

Common Filter Conditions

Multiple Conditions

Useful Filter Helpers

Useful Filter Helpers (continued)

Selecting Columns: select()

Dropping Columns

Creating New Columns: mutate()

Multiple Calculations

Conditional Logic: if_else()

Sorting Data: arrange()

Sorting by Multiple Columns

Summarizing Data: summarize()

Common Summary Functions

Grouping Data: group_by()

Grouping by Multiple Variables

Counting: count()

Counting Unique Values

Removing Duplicates: distinct()

Putting It All Together

Exercise 1: Core dplyr Operations

Exercise 1: Solutions

Working with Dates

Why Date Operations Matter in Tax Data

Date Operations: The lubridate Package

Parsing Dates from Text

Extracting Date Components

Extracting Date Components (Real Data)

Date Arithmetic: Calculating Differences

Date Arithmetic: Adding/Subtracting Time

Exercise 2: Date Operations

Exercise 2: Solutions

Building Complete Workflows

Chaining Multiple Operations

When to Break Into Steps

Creating Derived Variables

Saving Transformed Outputs

File Organization Best Practice

Exercise 3: Complete Transformation Workflow

Exercise 3: Solutions

What We’ve Learned

Best Practices Recap

Data Wrangling with `dplyr`

Why `dplyr`?

The Pipe Operator: `%>%`

Filtering Rows: `filter()`

Selecting Columns: `select()`

Creating New Columns: `mutate()`

Conditional Logic: `if_else()`

Sorting Data: `arrange()`

Summarizing Data: `summarize()`

Grouping Data: `group_by()`

Counting: `count()`

Removing Duplicates: `distinct()`

Date Operations: The `lubridate` Package