R for Data Analysis

Get started with R, tidyverse, and statistical analysis workflows

120 minutes
4 min read
Updated January 15, 2026

R was built by statisticians, for statisticians. It's the language behind academic research, clinical trials, and peer-reviewed journals worldwide. By the end of this guide, you'll be analyzing real data with the same tools used in research labs at Stanford and the CDC.

What You'll Build
A complete R analysis workflow

Prerequisites

Before starting, make sure you have:

  • R installed (Mac or Windows)
  • RStudio or VS Code with R extension
  • Basic terminal knowledge

Part 1: Setting Up R

R has a steeper setup than Python, but the payoff is worth it. Let's get everything running.

  1. Install Core Packages

    Open RStudio and run this in the console:

    R
    # The tidyverse is your toolkit - includes dplyr, ggplot2, and more
    install.packages("tidyverse")
    # Additional helpful packages
    install.packages(c("readxl", "writexl", "lubridate"))

    This takes a few minutes. Grab coffee.

  2. Verify Installation

    R
    # Load tidyverse
    library(tidyverse)
    # You should see a message about loaded packages

    If you see the tidyverse welcome message, you're ready.

  3. VS Code Setup (Optional)

    If you prefer VS Code over RStudio:

    1. Install the "R" extension by REditorSupport
    2. Open Command Palette and configure R path
    3. Create .R files and run with Cmd+Enter (Mac) or Ctrl+Enter (Windows)
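
If an install step failed, library() stops with an error. A minimal sketch for checking the packages from step 1 before loading them; the variable names `required`, `is_installed`, and `missing_pkgs` are illustrative and not part of the original setup:

```r
# Packages installed in step 1
required <- c("tidyverse", "readxl", "writexl", "lubridate")

# requireNamespace() returns TRUE/FALSE instead of stopping with an error
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- required[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Anything listed as missing can be passed to install.packages() again.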

Part 2: R Fundamentals

R has some quirks that trip up beginners. Let's cover them upfront.

Variables and Assignment

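R code conventionally uses <- for assignment. The = operator also works at the top level, but most style guides reserve it for naming function arguments: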

R
# The R way
project_budget <- 50000
team_size <- 8
cost_per_person <- project_budget / team_size
# This also works, but is less idiomatic
name = "Analysis Project"
# Print values
print(project_budget)
print(paste("Cost per person:", cost_per_person))
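
The two operators differ inside function calls, which is one reason <- is preferred. A small sketch of that difference; the function name `assignment_demo` and its values are made up for illustration:

```r
assignment_demo <- function() {
  # `=` inside a call only names the argument; no variable `x` is created
  avg <- mean(x = c(4, 8, 15))

  # `<-` inside a call creates `y` AND passes its value to sum()
  total <- sum(y <- c(4, 8, 15))

  list(
    avg = avg,
    total = total,
    x_exists = exists("x", envir = environment(), inherits = FALSE),
    y_exists = exists("y", envir = environment(), inherits = FALSE)
  )
}

result <- assignment_demo()
str(result)
```

Here `x_exists` comes back FALSE and `y_exists` comes back TRUE.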

Data Types

R
# Numeric (integers and decimals)
sample_size <- 150
p_value <- 0.042
# Character (text/strings)
study_name <- "Customer Retention Analysis"
department <- "Marketing"
# Logical (TRUE/FALSE)
is_significant <- TRUE
needs_review <- FALSE
# Check types with class()
class(sample_size) # "numeric"
class(study_name) # "character"
class(is_significant) # "logical"
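
One detail behind the "numeric" label: a number typed without a suffix is stored as a double, and R has a separate integer type. A short sketch, with variable names chosen only for illustration:

```r
whole_number <- 150     # no suffix: stored as a double
true_integer <- 150L    # the L suffix asks for an integer

class(whole_number)     # "numeric"
class(true_integer)     # "integer"
typeof(whole_number)    # "double"

# Values read from text often arrive as character and need converting
p_value_text <- "0.042"
p_value <- as.numeric(p_value_text)
is.numeric(p_value)     # TRUE
```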

Vectors: R's Building Block

Nearly all data in R is built on vectors. Even a single value is a vector of length 1.

R
# Create vectors with c() - "combine"
temperatures <- c(72, 75, 68, 71, 73, 76, 74)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
sunny <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
# Vector operations are automatic
fahrenheit <- c(72, 75, 68)
celsius <- (fahrenheit - 32) * 5/9 # Applies to ALL elements
# Built-in statistics
mean(temperatures) # 72.71...
max(temperatures) # 76
min(temperatures) # 68
sd(temperatures) # Standard deviation

R
# Indexing
days[1] # "Mon" (not [0]!)
days[6:7] # "Sat", "Sun"
days[-1] # Everything EXCEPT first element
# Logical indexing
temperatures[temperatures > 73] # Values above 73
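
Real data usually has gaps, and missing values (NA) change how the functions above behave. A minimal sketch using a hypothetical `readings` vector:

```r
# Hypothetical readings with one missing value
readings <- c(72, NA, 68, 74)

mean(readings)                 # NA: the missing value propagates
mean(readings, na.rm = TRUE)   # 71.33...
sum(is.na(readings))           # 1 missing value

# Logical indexing keeps the NA position unless you exclude it
with_na <- readings[readings > 70]                         # 72 NA 74
without_na <- readings[!is.na(readings) & readings > 70]   # 72 74
```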

Data Frames: Your Main Tool

Data frames are tables—like Excel, but programmable.

R
# Create a data frame
weather <- data.frame(
  day = days,
  temp = temperatures,
  sunny = sunny
)
# View it
print(weather)
head(weather, 3) # First 3 rows
str(weather) # Structure (types, dimensions)
summary(weather) # Statistics for each column
# Access columns
weather$temp
weather$day
# Filter rows
weather[weather$sunny == TRUE, ]
weather[weather$temp > 72, ]
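
A few more base R operations on the same table, sketched under the assumption that `weather` looks like the data frame above (it is rebuilt here so the block runs on its own; `temp_c`, `warm_sunny`, and `avg_by_sunny` are illustrative names):

```r
weather <- data.frame(
  day   = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
  temp  = c(72, 75, 68, 71, 73, 76, 74),
  sunny = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Add a column computed from an existing one
weather$temp_c <- round((weather$temp - 32) * 5 / 9, 1)

# Combine conditions with & and pick columns by name
warm_sunny <- weather[weather$sunny & weather$temp > 72, c("day", "temp")]

# Mean temperature for sunny and non-sunny days
avg_by_sunny <- aggregate(temp ~ sunny, data = weather, FUN = mean)
print(avg_by_sunny)
```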

[Figure: Data Frame Mental Model]

Part 3: The Tidyverse Way

Base R works, but tidyverse makes data analysis readable and fast.

[Figure: The Tidyverse Ecosystem]

Creating Sample Data

R
library(tidyverse)
# Create a realistic sales dataset
set.seed(42) # For reproducibility
sales <- tibble(
  date = seq(as.Date("2025-01-01"), as.Date("2025-01-31"), by = "day"),
  product = sample(c("Laptop", "Phone", "Tablet"), 31, replace = TRUE),
  quantity = sample(1:10, 31, replace = TRUE),
  unit_price = case_when(
    product == "Laptop" ~ 999,
    product == "Phone" ~ 699,
    product == "Tablet" ~ 399
  ),
  region = sample(c("West", "East", "North", "South"), 31, replace = TRUE)
)
# tibble is tidyverse's improved data.frame
glimpse(sales)

The Pipe: %>%

The pipe (%>%) chains operations together. Read it as "then."

R
# Without pipe (nested, hard to read)
arrange(filter(select(sales, date, product, quantity), product == "Laptop"), desc(quantity))
# With pipe (clear, step-by-step)
sales %>%
  select(date, product, quantity) %>%
  filter(product == "Laptop") %>%
  arrange(desc(quantity))
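
R 4.1 and later also ship a native pipe, |>, which needs no package and behaves the same way for simple chains like these. A small sketch using only base functions; `temps_f`, `nested`, and `piped` are illustrative names:

```r
temps_f <- c(72, 75, 68)

# Nested calls read inside-out
nested <- round(mean((temps_f - 32) * 5 / 9), 1)

# The native pipe reads top to bottom, like %>%
piped <- ((temps_f - 32) * 5 / 9) |>
  mean() |>
  round(1)

identical(nested, piped)  # TRUE
```

The rest of this guide keeps using %>%, which is what most existing tidyverse code uses.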

The Five Core dplyr Verbs

[Figure: dplyr Verbs]

1. filter() — Keep Rows

R
# Single condition
laptops <- sales %>%
  filter(product == "Laptop")
# Multiple conditions (AND)
big_laptop_sales <- sales %>%
  filter(product == "Laptop", quantity > 5)
# OR conditions
mobile_devices <- sales %>%
  filter(product == "Phone" | product == "Tablet")
# Date filtering
first_week <- sales %>%
  filter(date <= as.Date("2025-01-07"))
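
Three filter() patterns that come up often in practice: matching a set of values, matching a range, and handling missing values. A sketch on a small hypothetical `orders` table rather than the `sales` data:

```r
library(dplyr)

# Small hypothetical dataset, including one missing quantity
orders <- tibble(
  product  = c("Laptop", "Phone", "Tablet", "Phone", "Laptop"),
  quantity = c(2, 7, NA, 4, 9)
)

# %in% replaces a chain of OR conditions
mobile <- orders %>%
  filter(product %in% c("Phone", "Tablet"))

# between() is inclusive on both ends
mid_sized <- orders %>%
  filter(between(quantity, 4, 7))

# filter() drops rows where the condition is NA; ask for them explicitly
missing_qty <- orders %>%
  filter(is.na(quantity))
```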

2. select() — Keep Columns

R
# Keep specific columns
sales %>%
  select(date, product, quantity)
# Drop columns
sales %>%
  select(-region)
# Select range
sales %>%
  select(product:unit_price)
# Rename while selecting
sales %>%
  select(sale_date = date, item = product)
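
select() also accepts helper functions, and two related verbs cover cases where you want to keep every column. A sketch on a hypothetical `orders` table:

```r
library(dplyr)

orders <- tibble(
  order_date = as.Date(c("2025-01-01", "2025-01-02")),
  product    = c("Laptop", "Phone"),
  quantity   = c(2, 7),
  unit_price = c(999, 699)
)

# Columns whose names start with a prefix
date_cols <- orders %>% select(starts_with("order"))

# Columns of a given type
numeric_cols <- orders %>% select(where(is.numeric))

# rename() changes names but keeps every column
renamed <- orders %>% rename(item = product)

# relocate() moves columns without dropping any
reordered <- orders %>% relocate(product, .before = order_date)
```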

3. mutate() — Create Columns

R
# Add calculated column
sales %>%
  mutate(revenue = quantity * unit_price)
# Multiple columns at once
sales %>%
  mutate(
    revenue = quantity * unit_price,
    weekday = weekdays(date),
    month = format(date, "%B")
  )
# Conditional column
sales %>%
  mutate(
    size = case_when(
      quantity >= 8 ~ "Large",
      quantity >= 4 ~ "Medium",
      TRUE ~ "Small"
    )
  )
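
Within a single mutate() call, later columns can use earlier ones. A brief sketch on a hypothetical `orders` table; the 10% discount and the threshold of 5 units are made-up values for illustration:

```r
library(dplyr)

orders <- tibble(
  quantity   = c(2, 7, 4),
  unit_price = c(999, 699, 399)
)

orders_priced <- orders %>%
  mutate(
    revenue = quantity * unit_price,
    # Columns created earlier in the same mutate() are available right away
    discounted = revenue * 0.9,
    # if_else() is the two-outcome counterpart of case_when()
    order_type = if_else(quantity >= 5, "Bulk", "Standard"),
    # Window functions such as cumsum() work row by row, in current order
    running_total = cumsum(revenue)
  )
```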

4. group_by() + summarize() — Aggregate

This is where R shines for analysis:

R
# Revenue by product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(product) %>%
  summarize(
    total_sales = n(), # Number of orders (rows), not units or dollars
    total_units = sum(quantity),
    total_revenue = sum(revenue),
    avg_order_size = mean(quantity)
  )
# Multiple grouping variables
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(region, product) %>%
  summarize(
    total_revenue = sum(revenue),
    .groups = "drop" # Ungroup after
  )
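
Two common follow-ups to a grouped summary are counting rows per group and turning totals into shares. A sketch on a hypothetical `orders` table:

```r
library(dplyr)

orders <- tibble(
  product  = c("Laptop", "Phone", "Laptop", "Tablet", "Phone", "Laptop"),
  quantity = c(2, 7, 1, 4, 3, 3)
)

# count() is shorthand for group_by() + summarize(n = n())
order_counts <- orders %>% count(product)

# Once the grouping is dropped, mutate() sees all rows,
# so sum(units) is the grand total
unit_share <- orders %>%
  group_by(product) %>%
  summarize(units = sum(quantity), .groups = "drop") %>%
  mutate(share = units / sum(units))
```

If the grouping were still in place, sum(units) would be computed per group and every share would equal 1.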

5. arrange() — Sort Rows

R
# Sort ascending
sales %>%
  arrange(quantity)
# Sort descending
sales %>%
  arrange(desc(quantity))
# Multiple sort keys
sales %>%
  arrange(product, desc(quantity))
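
Sorting is often a step toward "top N" questions. dplyr's slice_max() answers those directly; a sketch on a hypothetical `orders` table, with `with_ties = FALSE` so exactly n rows come back:

```r
library(dplyr)

orders <- tibble(
  product  = c("Laptop", "Phone", "Laptop", "Tablet", "Phone"),
  quantity = c(2, 7, 9, 4, 3)
)

# Two largest orders overall
top_two <- orders %>%
  slice_max(quantity, n = 2, with_ties = FALSE)

# Largest order within each product
top_per_product <- orders %>%
  group_by(product) %>%
  slice_max(quantity, n = 1, with_ties = FALSE) %>%
  ungroup()
```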

Putting It All Together

R
# Complete analysis pipeline
sales_analysis <- sales %>%
  # Calculate revenue
  mutate(revenue = quantity * unit_price) %>%
  # Focus on laptops
  filter(product == "Laptop") %>%
  # Group by region
  group_by(region) %>%
  # Summary statistics
  summarize(
    orders = n(),
    units = sum(quantity),
    revenue = sum(revenue),
    avg_order = mean(quantity)
  ) %>%
  # Sort by revenue
  arrange(desc(revenue))
print(sales_analysis)

Part 4: Visualization with ggplot2

ggplot2 creates publication-quality graphics. The syntax follows "grammar of graphics."

[Figure: ggplot2 Structure]

The Basic Pattern

R
# Every ggplot follows this pattern:
ggplot(data, aes(x = variable1, y = variable2)) +
  geom_something() +
  labs(title = "Title", x = "X Label", y = "Y Label") +
  theme_minimal()
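
The pattern above is a template, not runnable code. One concrete instance, sketched with the `mtcars` dataset that ships with R (the dataset choice and the name `fuel_plot` are assumptions made for illustration):

```r
library(ggplot2)

# mtcars ships with R: wt is weight (1000 lbs), mpg is miles per gallon
fuel_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel Efficiency by Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

# Typing the object name (or calling print) draws the plot
fuel_plot
```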

Common Plot Types

Scatter Plot

R
# Basic scatter
ggplot(sales, aes(x = date, y = quantity)) +
  geom_point() +
  labs(title = "Daily Sales Quantity")
# With color by product
ggplot(sales, aes(x = date, y = quantity, color = product)) +
  geom_point(size = 3) +
  labs(title = "Sales by Product") +
  theme_minimal()

Bar Chart

R
# Revenue by product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(product) %>%
  summarize(total = sum(revenue)) %>%
  ggplot(aes(x = product, y = total, fill = product)) +
  geom_col() +
  labs(
    title = "Total Revenue by Product",
    y = "Revenue ($)"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Line Chart

R
# Daily revenue trend
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(date) %>%
  summarize(daily_revenue = sum(revenue)) %>%
  ggplot(aes(x = date, y = daily_revenue)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "darkblue", size = 2) +
  labs(title = "Daily Revenue Trend") +
  theme_minimal()

Box Plot

R
# Distribution by product
ggplot(sales, aes(x = product, y = quantity, fill = product)) +
  geom_boxplot() +
  labs(title = "Order Size Distribution") +
  theme_minimal()

Histogram

R
# Distribution of order sizes
ggplot(sales, aes(x = quantity)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  labs(title = "Order Size Distribution") +
  theme_minimal()

Faceting: Multiple Plots

R
# Separate plot for each product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  ggplot(aes(x = date, y = revenue)) +
  geom_line() +
  facet_wrap(~ product) +
  labs(title = "Revenue by Product") +
  theme_minimal()

Saving Plots

R
# Save the last plot
ggsave("revenue_by_product.png", width = 10, height = 6, dpi = 300)
# Save a specific plot
my_plot <- ggplot(sales, aes(x = date, y = quantity)) + geom_point()
ggsave("my_plot.png", plot = my_plot)

Part 5: Reading and Writing Data

CSV Files

R
library(readr)
# Write
write_csv(sales, "january_sales.csv")
# Read
imported_sales <- read_csv("january_sales.csv")
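
By default read_csv() guesses each column's type from the data. When the types are known in advance, they can be declared instead. A sketch that writes a small hypothetical file to a temporary path and reads it back; the column names mirror the `sales` data but the file itself is made up:

```r
library(readr)

# Write a small hypothetical file to a temporary location
csv_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    date     = as.Date(c("2025-01-01", "2025-01-02")),
    product  = c("Laptop", "Phone"),
    quantity = c(2L, 7L)
  ),
  csv_path
)

# Declare the column types instead of letting read_csv() guess
typed_sales <- read_csv(
  csv_path,
  col_types = cols(
    date     = col_date(format = "%Y-%m-%d"),
    product  = col_character(),
    quantity = col_integer()
  )
)

# Show the column specification that was applied
spec(typed_sales)
```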

Excel Files

R
library(readxl)
library(writexl)
# Write
write_xlsx(sales, "january_sales.xlsx")
# Read
excel_data <- read_excel("january_sales.xlsx")
# Read specific sheet
read_excel("workbook.xlsx", sheet = "Sales")

Multiple Files

R
# Read all CSVs in a folder
library(purrr)
# pattern is a regular expression, not a shell glob
all_files <- list.files("data/", pattern = "\\.csv$", full.names = TRUE)
combined_data <- map_dfr(all_files, read_csv)
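
When combining files it often helps to record which file each row came from. A self-contained sketch that creates two hypothetical monthly files in a temporary folder and stacks them with dplyr's bind_rows(); the folder, file names, and `source_file` column are assumptions for illustration:

```r
library(readr)
library(dplyr)

# Create two small hypothetical monthly files in a temporary folder
data_dir <- file.path(tempdir(), "monthly_sales")
dir.create(data_dir, showWarnings = FALSE)
write_csv(data.frame(product = c("Laptop", "Phone"), quantity = c(2, 7)),
          file.path(data_dir, "january.csv"))
write_csv(data.frame(product = "Tablet", quantity = 4),
          file.path(data_dir, "february.csv"))

# Match file names ending in ".csv"
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Name each data frame by its file so bind_rows() can record the source
tables <- lapply(csv_files, read_csv, show_col_types = FALSE)
names(tables) <- basename(csv_files)
combined <- bind_rows(tables, .id = "source_file")
```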

Part 6: Complete Analysis Pipeline

Let's put everything together:

R
# Full Analysis Script
# -------------------
library(tidyverse)
# 1. Load Data
sales <- read_csv("data/sales.csv")
# 2. Clean & Transform
sales_clean <- sales %>%
  # Handle missing values
  drop_na() %>%
  # Create calculated columns
  mutate(
    revenue = quantity * unit_price,
    weekday = weekdays(date),
    # wday() with week_start = 1 numbers Monday as 1, so 6 and 7 are the
    # weekend; unlike matching day names, this does not depend on the locale
    is_weekend = lubridate::wday(date, week_start = 1) >= 6
  ) %>%
  # Remove obvious errors
  filter(quantity > 0, unit_price > 0)
# 3. Analysis
summary_stats <- sales_clean %>%
  group_by(product) %>%
  summarize(
    total_revenue = sum(revenue),
    avg_order_size = mean(quantity),
    order_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))
print(summary_stats)
# 4. Visualization
revenue_plot <- summary_stats %>%
  ggplot(aes(x = reorder(product, total_revenue), y = total_revenue, fill = product)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Revenue by Product",
    x = NULL,
    y = "Total Revenue ($)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
# Make sure the output folder exists before saving
dir.create("outputs", showWarnings = FALSE)
ggsave("outputs/revenue_by_product.png", revenue_plot, width = 8, height = 5, dpi = 300)
# 5. Report
cat("\n=== KEY FINDINGS ===\n")
cat(paste("Total Revenue:", scales::dollar(sum(sales_clean$revenue)), "\n"))
cat(paste("Top Product:", summary_stats$product[1], "\n"))
cat(paste("Total Orders:", nrow(sales_clean), "\n"))
Next Steps

You've got the R fundamentals. Here's where to go deeper:

Want to...           Learn
Automate reports     R Markdown, Quarto
Build dashboards     Shiny
Machine learning     tidymodels
Bigger data          data.table, arrow
