
Python for Data Analysis

Get started with Python, pandas, and data analysis workflows

120 minutes
Updated January 15, 2026


You're about to learn the same tools used by data scientists at Netflix, Airbnb, and Google. By the end, you'll analyze real data, create visualizations, and actually understand what your code is doing.

What You'll Build
A complete analysis pipeline: load, clean, explore, visualize, and report on sales data

Prerequisites

Before starting, make sure you have:

  • Python installed (Mac or Windows)
  • VS Code with Claude Code extension
  • Basic terminal knowledge (cd, ls, mkdir)

Part 1: Project Setup

Every analysis starts with a clean, organized project. Let's build one.

1. Create Project Folder

    Bash
    mkdir sales-analysis
    cd sales-analysis
2. Initialize Git

    Bash
    git init

    Track your work from the start. You'll thank yourself later.

3. Create Virtual Environment

    Virtual environments isolate your project's packages. No more "it works on my machine" problems.
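
    A minimal setup using Python's built-in venv module (the activation command differs by OS):

    Bash
    python -m venv venv
    source venv/bin/activate    # Mac/Linux
    # On Windows: venv\Scripts\activate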

    You'll see (venv) in your prompt—that means it's active.

4. Install the Data Science Stack

    Bash
    pip install pandas numpy matplotlib jupyter seaborn
    Package | What It Does
    pandas | Data manipulation (your main tool)
    numpy | Numerical operations
    matplotlib | Basic plotting
    seaborn | Beautiful statistical plots
    jupyter | Interactive notebooks
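
    If you want to confirm everything installed into the active environment, a quick sanity check:

    Bash
    python -c "import pandas, numpy, matplotlib, seaborn; print(pandas.__version__)"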
5. Create Project Structure

    Bash
    mkdir data notebooks scripts outputs
    touch README.md CLAUDE.md .gitignore
6. Set Up .gitignore

    Don't commit things that shouldn't be committed. Add these lines to .gitignore:

    gitignore
    venv/
    __pycache__/
    *.pyc
    .ipynb_checkpoints/
    data/*.csv
    !data/sample.csv
    outputs/
    .env

Part 2: Loading Data

Let's load some data and see what we're working with.

The Core Workflow

Load the file, check what you have (row count, columns, types), then preview the first rows. The script below walks through each step.

Your First Script

Create scripts/load_data.py:

Python
import pandas as pd

# Load the data
df = pd.read_csv("data/sales.csv")

# What do we have?
print(f"Rows: {len(df):,}")
print(f"Columns: {list(df.columns)}")
print()

# First look
print(df.head())
print()

# Data types and missing values (info() prints directly; no print() needed)
df.info()

Run It

Bash
python scripts/load_data.py

Part 3: Cleaning Data

Real data is messy. Here's how to fix common problems.

Common Issues (And How to Fix Them)

Problem | Solution
Missing values | df.dropna() or df.fillna(value)
Wrong data types | pd.to_datetime(df['date'])
Duplicates | df.drop_duplicates()
Inconsistent text | df['name'].str.lower().str.strip()
Outliers | Filter or cap values (see the sketch below)
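
For outliers, one common approach is capping extreme values at a percentile rather than dropping rows. A minimal sketch (the 'revenue' column and the 99th-percentile cutoff are illustrative choices, not rules):

Python
# Cap revenue at the 99th percentile to limit the influence of extreme values
cap = df['revenue'].quantile(0.99)
df['revenue'] = df['revenue'].clip(upper=cap)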

Ask Claude to Clean Your Data

Prompt
I have sales data with these issues:
- Date column is a string like "Jan 15, 2024"
- Some revenue values are missing
- Product names have inconsistent capitalization
- There are duplicate rows
Write a cleaning function that fixes all of these.

Example Cleaning Script

Python
import pandas as pd

def clean_sales_data(df):
    """Clean the raw sales data."""
    # Work on a copy (don't modify the original)
    df = df.copy()

    # Fix dates
    df['date'] = pd.to_datetime(df['date'])

    # Handle missing revenue (fill with the median)
    df['revenue'] = df['revenue'].fillna(df['revenue'].median())

    # Standardize product names
    df['product'] = df['product'].str.lower().str.strip()

    # Remove duplicates
    df = df.drop_duplicates()

    # Remove obvious errors (zero or negative quantities)
    df = df[df['quantity'] > 0]

    return df

# Usage
df = pd.read_csv("data/sales.csv")
df_clean = clean_sales_data(df)
print(f"Rows before: {len(df)}, after: {len(df_clean)}")

Part 4: Exploring Data

Before you analyze, you need to understand. Here's the exploration toolkit.

Quick Summary

Python
# Overview
df.describe() # Statistics for numeric columns
df.info() # Data types, missing values
df.shape # (rows, columns)
# Specific columns
df['category'].value_counts() # Count by category
df['revenue'].mean() # Average revenue
df['date'].min(), df['date'].max() # Date range
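
df.info() reports non-null counts; to see missing values per column directly:

Python
# Count missing values per column, largest first
df.isna().sum().sort_values(ascending=False)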

Grouping and Aggregation

This is where pandas shines:

Python
# Total revenue by product
df.groupby('product')['revenue'].sum()

# Multiple aggregations
df.groupby('category').agg({
    'revenue': 'sum',
    'quantity': 'mean',
    'order_id': 'count'
})

# Monthly totals
df.groupby(df['date'].dt.month)['revenue'].sum()

Ask Claude for Exploration Help

Prompt
I have sales data with: date, product, category, quantity, price, region.
What are the most important things to explore first?
Write the pandas code for each.

Part 5: Visualizations

A good chart is worth a thousand .head() calls.

Choosing the Right Chart

Line charts show trends over time, bar charts compare categories, and histograms show the distribution of a single variable.

Basic Plots

Python
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-whitegrid')

# Line chart (trends)
df.groupby('date')['revenue'].sum().plot(kind='line')
plt.title('Daily Revenue')
plt.savefig('outputs/daily_revenue.png')
plt.show()

# Bar chart (comparisons)
df.groupby('category')['revenue'].sum().plot(kind='bar')
plt.title('Revenue by Category')
plt.savefig('outputs/category_revenue.png')
plt.show()

# Histogram (distribution)
df['revenue'].hist(bins=30)
plt.title('Revenue Distribution')
plt.savefig('outputs/revenue_dist.png')
plt.show()

Ask Claude for Better Charts

Prompt
Create a visualization showing monthly revenue trends.
Make it:
- Easy to read (larger fonts)
- Professional looking (clean style)
- Saved as PNG at 300 DPI
- Include a trend line

Part 6: Putting It Together

Let's build a complete analysis workflow.

The Full Script

Python
"""
Sales Analysis Pipeline
Run with: python scripts/analyze.py
"""
import pandas as pd
import matplotlib.pyplot as plt

# 1. Load
print("Loading data...")
df = pd.read_csv("data/sales.csv")

# 2. Clean
print("Cleaning data...")
df['date'] = pd.to_datetime(df['date'])
df = df.dropna()
df = df[df['quantity'] > 0]

# 3. Analyze
print("Analyzing...")
monthly = df.groupby(df['date'].dt.to_period('M')).agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'order_id': 'count'
}).rename(columns={'order_id': 'num_orders'})

# 4. Visualize
print("Creating charts...")
fig, ax = plt.subplots(figsize=(10, 6))
monthly['revenue'].plot(ax=ax)
ax.set_title('Monthly Revenue', fontsize=14)
ax.set_xlabel('Month')
ax.set_ylabel('Revenue ($)')
plt.tight_layout()
plt.savefig('outputs/monthly_revenue.png', dpi=300)

# 5. Report
print("\n=== KEY FINDINGS ===")
print(f"Total Revenue: ${monthly['revenue'].sum():,.2f}")
print(f"Best Month: {monthly['revenue'].idxmax()}")
print(f"Average Monthly Revenue: ${monthly['revenue'].mean():,.2f}")
print("\nChart saved to outputs/monthly_revenue.png")

Commit Your Work

Bash
git add scripts/analyze.py outputs/
git commit -m "feat: add complete sales analysis pipeline"

Part 7: Jupyter Notebooks

For interactive exploration, Jupyter notebooks are your friend.

Start Jupyter

Bash
jupyter notebook

This opens a browser. Create a new Python 3 notebook.
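
A typical first cell handles imports and loads the data. A minimal sketch (the display option is a convenience, not a requirement, and the path assumes Jupyter was started from the project root):

Python
import pandas as pd
import matplotlib.pyplot as plt

# Show more columns when previewing wide DataFrames
pd.set_option('display.max_columns', 50)

df = pd.read_csv("data/sales.csv")
df.head()  # The last expression in a cell renders as a table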

Notebook Best Practices

Do | Don't
Use markdown headers | Run cells out of order
Keep cells small and focused | Put all code in one cell
Restart and run all before sharing | Leave broken cells
Include explanatory text | Assume code is self-explanatory

Common Patterns

Keep these in your back pocket:

Filter Rows

Python
# Single condition
high_revenue = df[df['revenue'] > 1000]
# Multiple conditions
q4_big_sales = df[(df['date'].dt.quarter == 4) & (df['revenue'] > 500)]
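
The same filters can be written with DataFrame.query, which some find easier to read (the thresholds here are illustrative):

Python
# Equivalent filters expressed as query strings
high_revenue = df.query("revenue > 1000")
big_sales = df.query("revenue > 500 and quantity > 2")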

Create New Columns

Python
# From a calculation
df['profit'] = df['revenue'] - df['cost']
df['profit_margin'] = df['profit'] / df['revenue']

# From a date
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()

# Into categories (numeric bins)
df['size'] = pd.cut(df['revenue'], bins=[0, 100, 500, float('inf')],
                    labels=['small', 'medium', 'large'])

Join DataFrames

Python
# Merge on common column
df_full = pd.merge(orders, customers, on='customer_id')
# Concatenate vertically
all_months = pd.concat([jan_df, feb_df, mar_df])
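
Note that pd.merge defaults to an inner join, which silently drops rows without a match; pass how= to control that:

Python
# Keep every order, even those without a matching customer record
df_full = pd.merge(orders, customers, on='customer_id', how='left')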

Troubleshooting

Problem | Fix
ModuleNotFoundError for pandas or another package | Activate the virtual environment (you should see (venv) in your prompt), then reinstall if needed
FileNotFoundError: data/sales.csv | Run scripts from the project root (sales-analysis/), not from inside scripts/
Chart fails to save | Make sure the outputs/ folder from Part 1 exists

Next Steps

You've got the fundamentals. Here's where to go deeper:

Want to... | Learn
Build dashboards | Streamlit, Dash
More statistics | scipy, statsmodels
Machine learning | scikit-learn
Bigger data | Dask, PySpark
Automate reports | Automation Track
