R for Data Analysis

R was built by statisticians, for statisticians. It is widely used in academic research, clinical trials, and peer-reviewed publications. By the end of this guide, you'll be analyzing data with the same tools used in research labs at Stanford and the CDC.

What You'll Build

A complete R analysis workflow

Prerequisites

Before starting, make sure you have:

R installed (macOS, Windows, or Linux)
RStudio or VS Code with R extension
Basic terminal knowledge

Part 1: Setting Up R

R has a steeper setup than Python, but the payoff is worth it. Let's get everything running.

Install Core Packages

Open RStudio and run this in the console:

# The tidyverse is your toolkit - includes dplyr, ggplot2, and more
install.packages("tidyverse")

# Additional helpful packages
install.packages(c("readxl", "writexl", "lubridate"))

This takes a few minutes. Grab coffee.

Verify Installation

# Load tidyverse
library(tidyverse)
 
# You should see a message about loaded packages

If you see the tidyverse welcome message, you're ready.
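
For a check that doesn't depend on spotting the startup message, a minimal sketch like the following reports whether each package from the install step can be found. It uses only base R, and the variable names (pkgs, installed, missing_pkgs) are illustrative.

```r
# Packages installed in the previous step
pkgs <- c("tidyverse", "readxl", "writexl", "lubridate")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still need install.packages()
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```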

VS Code Setup (Optional)
If you prefer VS Code over RStudio:
1. Install the "R" extension by REditorSupport
2. Open Command Palette and configure R path
3. Create .R files and run with Cmd+Enter (Mac) or Ctrl+Enter (Windows)

Pro Tip

RStudio vs VS Code: RStudio is purpose-built for R with better debugging and plot viewing. VS Code is better if you also code in Python/JavaScript. Many data scientists use both.

Part 2: R Fundamentals

R has some quirks that trip up beginners. Let's cover them upfront.

Variables and Assignment

R idiomatically uses <- for assignment rather than =:

# The R way
project_budget <- 50000
team_size <- 8
cost_per_person <- project_budget / team_size

# This also works, but is less idiomatic
name = "Analysis Project"

# Print values
print(project_budget)
print(paste("Cost per person:", cost_per_person))

Note

Why <- instead of =? Historical reasons. R predates many modern languages. Use <- for assignment and = for function arguments. Claude Code can help convert code between styles.
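
One practical difference, sketched below with made-up names (scores, avg_equal, avg_arrow): inside a function call, = only names an argument, while <- also creates a variable as a side effect.

```r
# Inside a call, = matches the value to the argument named x; no variable is created
avg_equal <- mean(x = c(80, 90, 100))

# Inside a call, <- assigns first, so `scores` now exists in the workspace
avg_arrow <- mean(scores <- c(80, 90, 100))

print(avg_equal)  # 90
print(avg_arrow)  # 90
print(scores)     # 80 90 100
```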

Data Types

# Numeric (stored as a double by default; write 150L for a true integer)
sample_size <- 150
p_value <- 0.042

# Character (text/strings)
study_name <- "Customer Retention Analysis"
department <- "Marketing"

# Logical (TRUE/FALSE)
is_significant <- TRUE
needs_review <- FALSE

# Check types with class()
class(sample_size)    # "numeric"
class(study_name)     # "character"
class(is_significant) # "logical"
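
Plain numbers such as 150 are stored as doubles even when they look like integers. The sketch below (variable names are illustrative) shows how typeof() reveals the storage type and how the as.*() functions convert between types.

```r
count_double <- 150    # no suffix: stored as a double
count_int <- 150L      # L suffix: stored as an integer

class(count_double)    # "numeric"
typeof(count_double)   # "double"
class(count_int)       # "integer"

# Converting between types
as.numeric("0.042")    # 0.042
as.character(150)      # "150"
as.integer(TRUE)       # 1

# Mixing types in one vector coerces everything to the most flexible type
mixed <- c(1, "a", TRUE)
class(mixed)           # "character"
```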

Vectors: R's Building Block

Almost all data in R lives in vectors. Even single values are vectors of length 1.

# Create vectors with c() - "combine"
temperatures <- c(72, 75, 68, 71, 73, 76, 74)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
sunny <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)

# Vector operations are automatic
fahrenheit <- c(72, 75, 68)
celsius <- (fahrenheit - 32) * 5/9  # Applies to ALL elements

# Built-in statistics
mean(temperatures)  # 72.71...
max(temperatures)   # 76
min(temperatures)   # 68
sd(temperatures)    # Standard deviation
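
To make the "length 1" point and the element-wise behavior concrete, here is a small sketch with invented values (prices, quantities):

```r
# A single value is just a vector with one element
single_value <- 42
length(single_value)   # 1
single_value[1]        # 42

# Arithmetic between equal-length vectors pairs up elements
prices <- c(10, 20, 30)
quantities <- c(1, 2, 3)
line_totals <- prices * quantities
print(line_totals)     # 10 40 90

# A length-1 vector is recycled across the longer one
doubled <- prices * 2
print(doubled)         # 20 40 60
```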

Warning

R uses 1-based indexing! The first element is [1], not [0]. This trips up Python users constantly.

# Indexing
days[1]      # "Mon" (not [0]!)
days[6:7]    # "Sat", "Sun"
days[-1]     # Everything EXCEPT first element

# Logical indexing
temperatures[temperatures > 73]  # Values above 73
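
A logical test on one vector can also index another vector of the same length. This sketch repeats the days and temperatures vectors from above so it runs on its own; is_warm and warm_days are illustrative names.

```r
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
temperatures <- c(72, 75, 68, 71, 73, 76, 74)

# TRUE wherever the temperature exceeded 73
is_warm <- temperatures > 73

# Use that logical vector to pick the matching day names
warm_days <- days[is_warm]
print(warm_days)   # "Tue" "Sat" "Sun"

# which() returns the positions instead of the values
which(is_warm)     # 2 6 7

# Indexing past the end returns NA rather than raising an error
days[8]            # NA
```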

Data Frames: Your Main Tool

Data frames are tables—like Excel, but programmable.

# Create a data frame
weather <- data.frame(
  day = days,
  temp = temperatures,
  sunny = sunny
)

# View it
print(weather)
head(weather, 3)   # First 3 rows
str(weather)       # Structure (types, dimensions)
summary(weather)   # Statistics for each column

# Access columns
weather$temp
weather$day

# Filter rows
weather[weather$sunny, ]       # sunny is already TRUE/FALSE
weather[weather$temp > 72, ]

Data Frame Mental Model
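
One way to picture a data frame is as a named list of equal-length vectors, one per column. The sketch below uses a three-row version of the weather data to show this. It relies only on base R and assumes R 4.0 or later, where text columns stay as character by default.

```r
weather <- data.frame(
  day = c("Mon", "Tue", "Wed"),
  temp = c(72, 75, 68),
  sunny = c(TRUE, TRUE, FALSE)
)

is.list(weather)        # TRUE: a data frame is a list of columns
length(weather)         # 3 columns
nrow(weather)           # 3 rows

# Each column is an ordinary vector with its own type
sapply(weather, class)  # day: "character", temp: "numeric", sunny: "logical"

# [row, column] indexing picks a single cell
weather[2, "temp"]      # 75

# [[ ]] and $ both pull out a whole column as a vector
identical(weather[["temp"]], weather$temp)  # TRUE
```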

Part 3: The Tidyverse Way

Base R works, but tidyverse makes data analysis readable and fast.

Creating Sample Data

library(tidyverse)

# Create a realistic sales dataset
set.seed(42)  # For reproducibility

sales <- tibble(
  date = seq(as.Date("2025-01-01"), as.Date("2025-01-31"), by = "day"),
  product = sample(c("Laptop", "Phone", "Tablet"), 31, replace = TRUE),
  quantity = sample(1:10, 31, replace = TRUE),
  unit_price = case_when(
    product == "Laptop" ~ 999,
    product == "Phone" ~ 699,
    product == "Tablet" ~ 399
  ),
  region = sample(c("West", "East", "North", "South"), 31, replace = TRUE)
)

# tibble is tidyverse's improved data.frame
glimpse(sales)

The Pipe: %>%

The pipe (%>%) chains operations together. Read it as "then."

# Without pipe (nested, hard to read)
arrange(filter(select(sales, date, product, quantity), product == "Laptop"), desc(quantity))

# With pipe (clear, step-by-step)
sales %>%
  select(date, product, quantity) %>%
  filter(product == "Laptop") %>%
  arrange(desc(quantity))

Pro Tip

Keyboard shortcut: Type %>% with Cmd+Shift+M (Mac) or Ctrl+Shift+M (Windows) in RStudio.
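
R 4.1 and later also ship a native pipe, |>, that needs no package. The sketch below uses base R functions only and shows that the nested and piped forms give the same result; the vector is the temperatures example from Part 2.

```r
temperatures <- c(72, 75, 68, 71, 73, 76, 74)

# Nested calls: read from the inside out
nested_result <- round(mean(temperatures), 1)

# Native pipe: read left to right, "take temperatures, then mean, then round"
piped_result <- temperatures |>
  mean() |>
  round(1)

print(piped_result)                       # 72.7
identical(nested_result, piped_result)    # TRUE
```

For simple chains like the ones in this guide, %>% and |> can be used interchangeably.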

The Five Core dplyr Verbs

1. filter() — Keep Rows

# Single condition
laptops <- sales %>%
  filter(product == "Laptop")

# Multiple conditions (AND)
big_laptop_sales <- sales %>%
  filter(product == "Laptop", quantity > 5)

# OR conditions
mobile_devices <- sales %>%
  filter(product == "Phone" | product == "Tablet")

# Date filtering
first_week <- sales %>%
  filter(date <= as.Date("2025-01-07"))
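
When an OR condition tests the same column against several values, %in% is a more compact alternative. A minimal sketch, using a small hand-made tibble (orders) rather than the sales data:

```r
library(dplyr)

orders <- tibble(
  product = c("Laptop", "Phone", "Tablet", "Phone"),
  quantity = c(2, 5, 1, 8)
)

# Spelled out with |
with_or <- orders %>%
  filter(product == "Phone" | product == "Tablet")

# Same rows using %in%
with_in <- orders %>%
  filter(product %in% c("Phone", "Tablet"))

nrow(with_in)                                    # 3
identical(with_or$quantity, with_in$quantity)    # TRUE
```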

2. select() — Keep Columns

# Keep specific columns
sales %>%
  select(date, product, quantity)

# Drop columns
sales %>%
  select(-region)

# Select range
sales %>%
  select(product:unit_price)

# Rename while selecting
sales %>%
  select(sale_date = date, item = product)

3. mutate() — Create Columns

# Add calculated column
sales %>%
  mutate(revenue = quantity * unit_price)

# Multiple columns at once
sales %>%
  mutate(
    revenue = quantity * unit_price,
    weekday = weekdays(date),
    month = format(date, "%B")
  )

# Conditional column
sales %>%
  mutate(
    size = case_when(
      quantity >= 8 ~ "Large",
      quantity >= 4 ~ "Medium",
      TRUE ~ "Small"
    )
  )
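
case_when() checks its conditions from top to bottom and uses the first one that matches, which is why the conditional column above lists the largest threshold first. The sketch below, with an invented quantity vector, shows what happens when the order is reversed.

```r
library(dplyr)

quantity <- c(9, 5, 2)

# Largest threshold first: each value lands in the intended bucket
correct_order <- case_when(
  quantity >= 8 ~ "Large",
  quantity >= 4 ~ "Medium",
  TRUE ~ "Small"
)
print(correct_order)   # "Large" "Medium" "Small"

# Smaller threshold first: 9 matches ">= 4" before ">= 8" is ever checked
wrong_order <- case_when(
  quantity >= 4 ~ "Medium",
  quantity >= 8 ~ "Large",
  TRUE ~ "Small"
)
print(wrong_order)     # "Medium" "Medium" "Small"
```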

4. group_by() + summarize() — Aggregate

This is where R shines for analysis:

# Revenue by product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(product) %>%
  summarize(
    order_count = n(),  # n() counts rows, i.e. orders
    total_units = sum(quantity),
    total_revenue = sum(revenue),
    avg_order_size = mean(quantity)
  )

# Multiple grouping variables
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(region, product) %>%
  summarize(
    total_revenue = sum(revenue),
    .groups = "drop"  # Ungroup after
  )
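
The .groups argument controls whether the result is still grouped. By default, summarize() removes only the last grouping variable, so a result grouped by two columns stays grouped by the first. A minimal sketch with a hand-made tibble (orders is illustrative):

```r
library(dplyr)

orders <- tibble(
  region = c("West", "West", "East", "East"),
  product = c("Laptop", "Phone", "Laptop", "Laptop"),
  revenue = c(999, 699, 999, 1998)
)

# "drop_last" is the default behavior, written out here to silence the message:
# only the `product` grouping is removed, so the result is still grouped by region
still_grouped <- orders %>%
  group_by(region, product) %>%
  summarize(total_revenue = sum(revenue), .groups = "drop_last")
group_vars(still_grouped)   # "region"

# .groups = "drop" returns a plain, ungrouped tibble
ungrouped <- orders %>%
  group_by(region, product) %>%
  summarize(total_revenue = sum(revenue), .groups = "drop")
group_vars(ungrouped)       # character(0)

nrow(ungrouped)             # 3 region/product combinations
```

Leftover grouping silently changes how later mutate() or summarize() calls behave, which is a reason to drop groups once the summary is done.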

5. arrange() — Sort Rows

# Sort ascending
sales %>%
  arrange(quantity)

# Sort descending
sales %>%
  arrange(desc(quantity))

# Multiple sort keys
sales %>%
  arrange(product, desc(quantity))

Putting It All Together

# Complete analysis pipeline
sales_analysis <- sales %>%
  # Calculate revenue
  mutate(revenue = quantity * unit_price) %>%
  # Focus on laptops
  filter(product == "Laptop") %>%
  # Group by region
  group_by(region) %>%
  # Summary statistics
  summarize(
    orders = n(),
    units = sum(quantity),
    revenue = sum(revenue),
    avg_order = mean(quantity)
  ) %>%
  # Sort by revenue
  arrange(desc(revenue))

print(sales_analysis)

Part 4: Visualization with ggplot2

ggplot2 creates publication-quality graphics. Its syntax follows the "grammar of graphics": you build a plot by adding layers.

The Basic Pattern

# Every ggplot follows this pattern (a template: swap in your own data, columns, and geom)
ggplot(data, aes(x = variable1, y = variable2)) +
  geom_something() +
  labs(title = "Title", x = "X Label", y = "Y Label") +
  theme_minimal()
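
Because layers are joined with +, a plot can be stored in a variable and extended one piece at a time. This sketch fills in the pattern with a tiny made-up data frame (df) so it runs as written:

```r
library(ggplot2)

df <- data.frame(
  week = 1:5,
  units = c(12, 18, 15, 22, 20)
)

# Step 1: data and aesthetic mappings (on its own this draws empty axes)
p <- ggplot(df, aes(x = week, y = units))

# Step 2: add a geometry layer
p <- p + geom_line()

# Step 3: add labels and a theme
p <- p +
  labs(title = "Units Sold per Week", x = "Week", y = "Units") +
  theme_minimal()

# Printing the object renders the plot
print(p)
```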

Common Plot Types

Scatter Plot

# Basic scatter
ggplot(sales, aes(x = date, y = quantity)) +
  geom_point() +
  labs(title = "Daily Sales Quantity")

# With color by product
ggplot(sales, aes(x = date, y = quantity, color = product)) +
  geom_point(size = 3) +
  labs(title = "Sales by Product") +
  theme_minimal()

Bar Chart

# Revenue by product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(product) %>%
  summarize(total = sum(revenue)) %>%
  ggplot(aes(x = product, y = total, fill = product)) +
  geom_col() +
  labs(
    title = "Total Revenue by Product",
    y = "Revenue ($)"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Line Chart

# Daily revenue trend
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(date) %>%
  summarize(daily_revenue = sum(revenue)) %>%
  ggplot(aes(x = date, y = daily_revenue)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "darkblue", size = 2) +
  labs(title = "Daily Revenue Trend") +
  theme_minimal()

Box Plot

# Distribution by product
ggplot(sales, aes(x = product, y = quantity, fill = product)) +
  geom_boxplot() +
  labs(title = "Order Size Distribution") +
  theme_minimal()

Histogram

# Distribution of order sizes
ggplot(sales, aes(x = quantity)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  labs(title = "Order Size Distribution") +
  theme_minimal()

Faceting: Multiple Plots

# Separate plot for each product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  ggplot(aes(x = date, y = revenue)) +
  geom_line() +
  facet_wrap(~ product) +
  labs(title = "Revenue by Product") +
  theme_minimal()

Saving Plots

# Save the last plot
ggsave("revenue_by_product.png", width = 10, height = 6, dpi = 300)

# Save a specific plot
my_plot <- ggplot(sales, aes(x = date, y = quantity)) + geom_point()
ggsave("my_plot.png", plot = my_plot)
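
ggsave() writes to the path it is given, and depending on the ggplot2 version it may stop with an error if the target folder does not exist. A cautious sketch is to create the folder first. Here the output goes under tempdir() so the example leaves no files in your project; out_dir, out_path, and demo_plot are illustrative names.

```r
library(ggplot2)

# Create the output folder if it is not there yet
out_dir <- file.path(tempdir(), "outputs")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Build a small plot from a built-in dataset
demo_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# The file format is taken from the extension (.png here)
out_path <- file.path(out_dir, "demo_plot.png")
ggsave(out_path, plot = demo_plot, width = 6, height = 4, dpi = 150)

file.exists(out_path)   # TRUE
```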

Part 5: Reading and Writing Data

CSV Files

library(readr)

# Write
write_csv(sales, "january_sales.csv")

# Read
imported_sales <- read_csv("january_sales.csv")
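
read_csv() guesses each column's type from the data. When a column must keep a particular type, such as IDs with leading zeros, the types can be stated explicitly with col_types. A minimal sketch using a temporary file and an invented orders table:

```r
library(readr)

orders <- data.frame(
  order_id = c("001", "002", "003"),
  quantity = c(3, 7, 2)
)

# Write to a temporary file so nothing is left behind
csv_path <- tempfile(fileext = ".csv")
write_csv(orders, csv_path)

# State the column types instead of relying on guessing
imported <- read_csv(
  csv_path,
  col_types = cols(
    order_id = col_character(),
    quantity = col_integer()
  )
)

imported$order_id          # "001" "002" "003"
class(imported$quantity)   # "integer"
```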

Excel Files

library(readxl)
library(writexl)

# Write
write_xlsx(sales, "january_sales.xlsx")

# Read
excel_data <- read_excel("january_sales.xlsx")

# Read specific sheet
read_excel("workbook.xlsx", sheet = "Sales")

Multiple Files

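# Read all CSVs in a folder
library(readr)
library(purrr)

# pattern is a regular expression, so match a literal ".csv" at the end of the name
all_files <- list.files("data/", pattern = "\\.csv$", full.names = TRUE)
# Read each file, then stack the results (list_rbind() needs purrr 1.0.0 or later)
combined_data <- list_rbind(map(all_files, read_csv))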
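
list.files() treats pattern as a regular expression, not a shell-style wildcard, so a glob like *.csv does not strictly mean "ends in .csv". The base R sketch below compares an anchored regular expression with glob2rx(), which converts a wildcard into the equivalent regex; the file names are made up.

```r
file_names <- c("jan.csv", "feb.csv", "notes_csv.txt", "jan.csv.bak")

# Anchored regex: a literal dot, then "csv", then end of string
regex_match <- grepl("\\.csv$", file_names)
print(regex_match)    # TRUE TRUE FALSE FALSE

# glob2rx() translates a wildcard into a regex you can pass as pattern
csv_regex <- glob2rx("*.csv")
glob_match <- grepl(csv_regex, file_names)
print(glob_match)     # TRUE TRUE FALSE FALSE
```

Either form can be passed to list.files(), for example list.files("data/", pattern = "\\.csv$", full.names = TRUE).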

Part 6: Complete Analysis Pipeline

Let's put everything together:

# Full Analysis Script
# -------------------

library(tidyverse)

# 1. Load Data
sales <- read_csv("data/sales.csv")

# 2. Clean & Transform
sales_clean <- sales %>%
  # Handle missing values
  drop_na() %>%
  # Create calculated columns
  mutate(
    revenue = quantity * unit_price,
    weekday = weekdays(date),
    is_weekend = weekday %in% c("Saturday", "Sunday")
  ) %>%
  # Remove obvious errors
  filter(quantity > 0, unit_price > 0)

# 3. Analysis
summary_stats <- sales_clean %>%
  group_by(product) %>%
  summarize(
    total_revenue = sum(revenue),
    avg_order_size = mean(quantity),
    order_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

print(summary_stats)

# 4. Visualization
revenue_plot <- summary_stats %>%
  ggplot(aes(x = reorder(product, total_revenue), y = total_revenue, fill = product)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Revenue by Product",
    x = NULL,
    y = "Total Revenue ($)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Make sure the output folder exists before saving
dir.create("outputs", showWarnings = FALSE)
ggsave("outputs/revenue_by_product.png", revenue_plot, width = 8, height = 5, dpi = 300)

# 5. Report
cat("\n=== KEY FINDINGS ===\n")
cat(paste("Total Revenue:", scales::dollar(sum(sales_clean$revenue)), "\n"))
cat(paste("Top Product:", summary_stats$product[1], "\n"))
cat(paste("Total Orders:", nrow(sales_clean), "\n"))
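
One caveat for the is_weekend step: weekdays() returns day names in the session's language, so comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch uses the %u format code, which gives the ISO weekday number (1 = Monday through 7 = Sunday). The dates below are illustrative.

```r
dates <- as.Date(c("2025-01-03", "2025-01-04", "2025-01-05", "2025-01-06"))

# %u is the weekday as a number: 1 = Monday ... 7 = Sunday
weekday_number <- as.integer(format(dates, "%u"))
print(weekday_number)   # 5 6 7 1

# Saturday (6) and Sunday (7) count as the weekend
is_weekend <- weekday_number >= 6
print(is_weekend)       # FALSE TRUE TRUE FALSE
```

Inside the pipeline, the same expression could replace the weekday comparison in mutate(), for example is_weekend = as.integer(format(date, "%u")) >= 6.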

Next Steps

You've got the R fundamentals. Here's where to go deeper:

Want to...	Learn
Automate reports	RMarkdown, Quarto
Build dashboards	Shiny
Machine learning	tidymodels
Bigger data	data.table, arrow

Resources

R for Data Science — Free online book
tidyverse documentation — Official reference
ggplot2 gallery — Visual examples
RStudio cheatsheets — Quick reference PDFs

Success

You're ready! Open RStudio, load some data, and start exploring. When you get stuck, Claude can help debug R code just like any other language.