R for Data Analysis
Get started with R, tidyverse, and statistical analysis workflows
R was built by statisticians, for statisticians. It's the language behind academic research, clinical trials, and peer-reviewed journals worldwide. By the end of this guide, you'll be analyzing real data with the same tools used in research labs at Stanford and the CDC.
Prerequisites
Before starting, make sure you have:
Part 1: Setting Up R
R has a steeper setup than Python, but the payoff is worth it. Let's get everything running.
1. Install Core Packages
Open RStudio and run this in the console:
```r
# The tidyverse is your toolkit: it includes dplyr, ggplot2, and more
install.packages("tidyverse")

# Additional helpful packages
install.packages(c("readxl", "writexl", "lubridate"))
```
This takes a few minutes. Grab coffee.
2. Verify Installation
```r
# Load tidyverse
library(tidyverse)
# You should see a message about loaded packages
```
If you see the tidyverse welcome message, you're ready.
3. VS Code Setup (Optional)
If you prefer VS Code over RStudio:
- Install the "R" extension by REditorSupport
- Open Command Palette and configure R path
- Create .R files and run with Cmd+Enter (Mac) or Ctrl+Enter (Windows)
Pro Tip
RStudio vs VS Code: RStudio is purpose-built for R with better debugging and plot viewing. VS Code is better if you also code in Python/JavaScript. Many data scientists use both.
Part 2: R Fundamentals
R has some quirks that trip up beginners. Let's cover them upfront.
Variables and Assignment
R's idiomatic assignment operator is <- (= also works at the top level, but is less common):
```r
# The R way
project_budget <- 50000
team_size <- 8
cost_per_person <- project_budget / team_size

# This also works, but is less idiomatic
name = "Analysis Project"

# Print values
print(project_budget)
print(paste("Cost per person:", cost_per_person))
```
Note
Why <- instead of =? Historical reasons. R predates many modern languages. Use <- for assignment and = for function arguments. Claude Code can help convert code between styles.
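The practical difference between the two operators shows up inside function calls. The sketch below uses only base R; the variable names (avg_named, scores, and so on) are made up for illustration.

```r
# `=` inside a call matches an argument name; no variable is created.
avg_named <- mean(x = c(2, 4, 6))
x_created_by_equals <- exists("x", inherits = FALSE)  # FALSE in a fresh session

# `<-` inside a call assigns first, then passes the value along.
avg_assigned <- mean(scores <- c(2, 4, 6))
scores_created_by_arrow <- exists("scores", inherits = FALSE)

print(avg_named)
print(scores)
```

Both calls compute the same mean, but only the second leaves a new variable behind, which is one reason to reserve = for arguments.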
Data Types
```r
# Numeric (integers and decimals)
sample_size <- 150
p_value <- 0.042

# Character (text/strings)
study_name <- "Customer Retention Analysis"
department <- "Marketing"

# Logical (TRUE/FALSE)
is_significant <- TRUE
needs_review <- FALSE

# Check types with class()
class(sample_size)     # "numeric"
class(study_name)      # "character"
class(is_significant)  # "logical"
```
Vectors: R's Building Block
Almost every basic value in R is a vector. Even a single number or string is a vector of length 1.
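A quick way to check this in the console, as a minimal base R sketch (single_value and prices are illustrative names):

```r
# A lone number is already a vector of length 1
single_value <- 42
is_vec <- is.vector(single_value)
n_elements <- length(single_value)
first_element <- single_value[1]

# Because of that, a single value combines with a longer vector element by element
prices <- c(10, 20, 30)
doubled <- prices * 2

print(is_vec)
print(n_elements)
print(doubled)
```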
```r
# Create vectors with c() - "combine"
temperatures <- c(72, 75, 68, 71, 73, 76, 74)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
sunny <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)

# Vector operations are automatic
fahrenheit <- c(72, 75, 68)
celsius <- (fahrenheit - 32) * 5/9  # Applies to ALL elements

# Built-in statistics
mean(temperatures)  # 72.71...
max(temperatures)   # 76
min(temperatures)   # 68
sd(temperatures)    # Standard deviation
```
Warning
R uses 1-based indexing! The first element is [1], not [0]. This trips up Python users constantly.
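For readers arriving from Python, a few edge cases behave differently enough to be worth trying once. This is a small base R sketch; letters_vec is a made-up example vector.

```r
letters_vec <- c("a", "b", "c")

first_item <- letters_vec[1]   # the first element
zero_index <- letters_vec[0]   # an empty vector, not an error
past_end   <- letters_vec[5]   # NA: reading past the end does not error

# A negative index drops elements, so "last element" needs length()
last_item    <- letters_vec[length(letters_vec)]
without_last <- letters_vec[-length(letters_vec)]

print(last_item)
print(without_last)
```

Note that index 0 and out-of-range indices fail silently, so an off-by-one mistake can pass unnoticed until an empty vector or NA shows up downstream.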
```r
# Indexing
days[1]    # "Mon" (not [0]!)
days[6:7]  # "Sat", "Sun"
days[-1]   # Everything EXCEPT first element

# Logical indexing
temperatures[temperatures > 73]  # Values above 73
```
Data Frames: Your Main Tool
Data frames are tables—like Excel, but programmable.
```r
# Create a data frame
weather <- data.frame(
  day = days,
  temp = temperatures,
  sunny = sunny
)

# View it
print(weather)
head(weather, 3)  # First 3 rows
str(weather)      # Structure (types, dimensions)
summary(weather)  # Statistics for each column

# Access columns
weather$temp
weather$day

# Filter rows
weather[weather$sunny == TRUE, ]
weather[weather$temp > 72, ]
```
Part 3: The Tidyverse Way
Base R works, but tidyverse makes data analysis readable and fast.
Creating Sample Data
```r
library(tidyverse)

# Create a realistic sales dataset
set.seed(42)  # For reproducibility

sales <- tibble(
  date = seq(as.Date("2025-01-01"), as.Date("2025-01-31"), by = "day"),
  product = sample(c("Laptop", "Phone", "Tablet"), 31, replace = TRUE),
  quantity = sample(1:10, 31, replace = TRUE),
  unit_price = case_when(
    product == "Laptop" ~ 999,
    product == "Phone" ~ 699,
    product == "Tablet" ~ 399
  ),
  region = sample(c("West", "East", "North", "South"), 31, replace = TRUE)
)

# tibble is tidyverse's improved data.frame
glimpse(sales)
```
The Pipe: %>%
The pipe (%>%) chains operations together. Read it as "then."
```r
# Without pipe (nested, hard to read)
arrange(filter(select(sales, date, product, quantity), product == "Laptop"), desc(quantity))

# With pipe (clear, step-by-step)
sales %>%
  select(date, product, quantity) %>%
  filter(product == "Laptop") %>%
  arrange(desc(quantity))
```
Pro Tip
Keyboard shortcut: Type %>% with Cmd+Shift+M (Mac) or Ctrl+Shift+M (Windows) in RStudio.
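Base R also ships its own pipe, |>, which reads the same way. The sketch below assumes R 4.1 or later, where |> was introduced, and needs no packages; values, nested, and piped are illustrative names.

```r
values <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Native pipe: take values, then square-root them, then sum
piped <- values |> sqrt() |> sum()

print(piped)
print(identical(nested, piped))
```

The rest of this guide sticks with %>%, since that is what the tidyverse examples here use.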
The Five Core dplyr Verbs
1. filter() — Keep Rows
```r
# Single condition
laptops <- sales %>%
  filter(product == "Laptop")

# Multiple conditions (AND)
big_laptop_sales <- sales %>%
  filter(product == "Laptop", quantity > 5)

# OR conditions
mobile_devices <- sales %>%
  filter(product == "Phone" | product == "Tablet")

# Date filtering
first_week <- sales %>%
  filter(date <= as.Date("2025-01-07"))
```
2. select() — Keep Columns
```r
# Keep specific columns
sales %>%
  select(date, product, quantity)

# Drop columns
sales %>%
  select(-region)

# Select range
sales %>%
  select(product:unit_price)

# Rename while selecting
sales %>%
  select(sale_date = date, item = product)
```
3. mutate() — Create Columns
```r
# Add calculated column
sales %>%
  mutate(revenue = quantity * unit_price)

# Multiple columns at once
sales %>%
  mutate(
    revenue = quantity * unit_price,
    weekday = weekdays(date),
    month = format(date, "%B")
  )

# Conditional column
sales %>%
  mutate(
    size = case_when(
      quantity >= 8 ~ "Large",
      quantity >= 4 ~ "Medium",
      TRUE ~ "Small"
    )
  )
```
4. group_by() + summarize() — Aggregate
This is where R shines for analysis:
```r
# Revenue by product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(product) %>%
  summarize(
    total_sales = n(),  # number of sales records (rows), not revenue
    total_units = sum(quantity),
    total_revenue = sum(revenue),
    avg_order_size = mean(quantity)
  )

# Multiple grouping variables
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(region, product) %>%
  summarize(
    total_revenue = sum(revenue),
    .groups = "drop"  # Ungroup after
  )
```
5. arrange() — Sort Rows
```r
# Sort ascending
sales %>%
  arrange(quantity)

# Sort descending
sales %>%
  arrange(desc(quantity))

# Multiple sort keys
sales %>%
  arrange(product, desc(quantity))
```
Putting It All Together
```r
# Complete analysis pipeline
sales_analysis <- sales %>%
  # Calculate revenue
  mutate(revenue = quantity * unit_price) %>%
  # Focus on laptops
  filter(product == "Laptop") %>%
  # Group by region
  group_by(region) %>%
  # Summary statistics
  summarize(
    orders = n(),
    units = sum(quantity),
    revenue = sum(revenue),
    avg_order = mean(quantity)
  ) %>%
  # Sort by revenue
  arrange(desc(revenue))

print(sales_analysis)
```
Part 4: Visualization with ggplot2
ggplot2 creates publication-quality graphics. Its syntax follows the "grammar of graphics": a plot is built by adding layers (data, aesthetic mappings, geoms, labels, themes) with +.
The Basic Pattern
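```r
# Every ggplot follows this pattern.
# data, variable1, variable2, and geom_something() are placeholders:
# swap in your own data frame, columns, and a real geom such as geom_point().
ggplot(data, aes(x = variable1, y = variable2)) +
  geom_something() +
  labs(title = "Title", x = "X Label", y = "Y Label") +
  theme_minimal()
```
Common Plot Types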
Scatter Plot
```r
# Basic scatter
ggplot(sales, aes(x = date, y = quantity)) +
  geom_point() +
  labs(title = "Daily Sales Quantity")

# With color by product
ggplot(sales, aes(x = date, y = quantity, color = product)) +
  geom_point(size = 3) +
  labs(title = "Sales by Product") +
  theme_minimal()
```
Bar Chart
```r
# Revenue by product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(product) %>%
  summarize(total = sum(revenue)) %>%
  ggplot(aes(x = product, y = total, fill = product)) +
  geom_col() +
  labs(
    title = "Total Revenue by Product",
    y = "Revenue ($)"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
```
Line Chart
```r
# Daily revenue trend
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(date) %>%
  summarize(daily_revenue = sum(revenue)) %>%
  ggplot(aes(x = date, y = daily_revenue)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "darkblue", size = 2) +
  labs(title = "Daily Revenue Trend") +
  theme_minimal()
```
Box Plot
```r
# Distribution by product
ggplot(sales, aes(x = product, y = quantity, fill = product)) +
  geom_boxplot() +
  labs(title = "Order Size Distribution") +
  theme_minimal()
```
Histogram
```r
# Distribution of order sizes
ggplot(sales, aes(x = quantity)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  labs(title = "Order Size Distribution") +
  theme_minimal()
```
Faceting: Multiple Plots
```r
# Separate plot for each product
sales %>%
  mutate(revenue = quantity * unit_price) %>%
  ggplot(aes(x = date, y = revenue)) +
  geom_line() +
  facet_wrap(~ product) +
  labs(title = "Revenue by Product") +
  theme_minimal()
```
Saving Plots
```r
# Save the last plot
ggsave("revenue_by_product.png", width = 10, height = 6, dpi = 300)

# Save a specific plot
my_plot <- ggplot(sales, aes(x = date, y = quantity)) +
  geom_point()
ggsave("my_plot.png", plot = my_plot)
```
Part 5: Reading and Writing Data
CSV Files
```r
library(readr)

# Write
write_csv(sales, "january_sales.csv")

# Read
imported_sales <- read_csv("january_sales.csv")
```
Excel Files
```r
library(readxl)
library(writexl)

# Write
write_xlsx(sales, "january_sales.xlsx")

# Read
excel_data <- read_excel("january_sales.xlsx")

# Read specific sheet
read_excel("workbook.xlsx", sheet = "Sales")
```
Multiple Files
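```r
# Read all CSVs in a folder
library(purrr)
library(readr)

# list.files() expects a regular expression, not a shell glob,
# so match names ending in ".csv" with "\\.csv$"
all_files <- list.files("data/", pattern = "\\.csv$", full.names = TRUE)
combined_data <- map_dfr(all_files, read_csv)
```
Part 6: Complete Analysis Pipeline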
Let's put everything together:
```r
# Full Analysis Script
# -------------------

library(tidyverse)

# 1. Load Data
sales <- read_csv("data/sales.csv")

# 2. Clean & Transform
sales_clean <- sales %>%
  # Handle missing values
  drop_na() %>%
  # Create calculated columns
  mutate(
    revenue = quantity * unit_price,
    weekday = weekdays(date),
    is_weekend = weekday %in% c("Saturday", "Sunday")
  ) %>%
  # Remove obvious errors
  filter(quantity > 0, unit_price > 0)

# 3. Analysis
summary_stats <- sales_clean %>%
  group_by(product) %>%
  summarize(
    total_revenue = sum(revenue),
    avg_order_size = mean(quantity),
    order_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

print(summary_stats)

# 4. Visualization
revenue_plot <- summary_stats %>%
  ggplot(aes(x = reorder(product, total_revenue), y = total_revenue, fill = product)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Revenue by Product",
    x = NULL,
    y = "Total Revenue ($)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# ggsave() does not create missing folders, so make sure outputs/ exists
dir.create("outputs", showWarnings = FALSE)
ggsave("outputs/revenue_by_product.png", revenue_plot, width = 8, height = 5, dpi = 300)

# 5. Report
cat("\n=== KEY FINDINGS ===\n")
cat(paste("Total Revenue:", scales::dollar(sum(sales_clean$revenue)), "\n"))
cat(paste("Top Product:", summary_stats$product[1], "\n"))
cat(paste("Total Orders:", nrow(sales_clean), "\n"))
```
Troubleshooting
Next Steps
You've got the R fundamentals. Here's where to go deeper:
| Want to... | Learn |
|---|---|
| Automate reports | RMarkdown, Quarto |
| Build dashboards | Shiny |
| Machine learning | tidymodels |
| Bigger data | data.table, arrow |
Resources
- R for Data Science — Free online book
- tidyverse documentation — Official reference
- ggplot2 gallery — Visual examples
- RStudio cheatsheets — Quick reference PDFs
Success
You're ready! Open RStudio, load some data, and start exploring. When you get stuck, Claude can help debug R code just like any other language.