Python for Data Analysis
Get started with Python, pandas, and data analysis workflows
You're about to learn the same tools used by data scientists at Netflix, Airbnb, and Google. By the end, you'll analyze real data, create visualizations, and actually understand what your code is doing.
Prerequisites
Before starting, make sure you have:
- Python installed (Mac or Windows)
- VS Code with Claude Code extension
- Basic terminal knowledge (cd, ls, mkdir)
Part 1: Project Setup
Every analysis starts with a clean, organized project. Let's build one.
1. Create Project Folder

```bash
mkdir sales-analysis
cd sales-analysis
```

2. Initialize Git

```bash
git init
```

Track your work from the start. You'll thank yourself later.
3. Create Virtual Environment

Virtual environments isolate your project's packages. No more "it works on my machine" problems.
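To create and activate one (this project keeps the environment in a folder named venv, matching the venv/ entry in .gitignore below):

```bash
# Create a virtual environment in a folder named venv
python -m venv venv

# Activate it (macOS/Linux)
source venv/bin/activate

# Activate it (Windows)
# venv\Scripts\activate
```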
You'll see (venv) in your prompt—that means it's active.

4. Install the Data Science Stack

```bash
pip install pandas numpy matplotlib jupyter seaborn
```

| Package | What It Does |
|---|---|
| pandas | Data manipulation (your main tool) |
| numpy | Numerical operations |
| matplotlib | Basic plotting |
| seaborn | Beautiful statistical plots |
| jupyter | Interactive notebooks |
5. Create Project Structure

```bash
mkdir data notebooks scripts outputs
touch README.md CLAUDE.md .gitignore
```

6. Set Up .gitignore

Don't commit things that shouldn't be committed:

```
venv/
__pycache__/
*.pyc
.ipynb_checkpoints/
data/*.csv
!data/sample.csv
outputs/
.env
```
Pro Tip
Ask Claude to generate your CLAUDE.md:
```
Create a CLAUDE.md for a sales data analysis project using Python and pandas.
Include common commands and example prompts.
```

Part 2: Loading Data
Let's load some data and see what we're working with.
The Core Workflow

Load → Clean → Explore → Visualize → Report. Each part of this guide walks through one stage of that loop.
Your First Script
Create scripts/load_data.py:
```python
import pandas as pd

# Load the data
df = pd.read_csv("data/sales.csv")

# What do we have?
print(f"Rows: {len(df):,}")
print(f"Columns: {list(df.columns)}")
print()

# First look
print(df.head())
print()

# Data types and missing values
df.info()
```

Run It

```bash
python scripts/load_data.py
```

Pro Tip
No data yet? Ask Claude:
```
Generate sample sales data with columns: date, product, category, quantity, price, region.
Save it as a CSV I can use for practice.
```
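Prefer to create the file yourself? Here's a minimal sketch using pandas and numpy. The column names match the prompt above, the values are random placeholders, and the path data/sales.csv matches the loading script. Some later snippets also reference revenue, cost, and order_id columns, so extend the sketch (or your prompt) as needed.

```python
import numpy as np
import pandas as pd

# Reproducible random generator
rng = np.random.default_rng(42)
n = 1_000

# Illustrative values only—swap in whatever products and regions make sense for you
df = pd.DataFrame({
    "date": rng.choice(pd.date_range("2024-01-01", "2024-12-31"), n),
    "product": rng.choice(["widget", "gadget", "gizmo"], n),
    "category": rng.choice(["hardware", "accessories"], n),
    "quantity": rng.integers(1, 10, n),
    "price": rng.uniform(5, 500, n).round(2),
    "region": rng.choice(["north", "south", "east", "west"], n),
})

df.to_csv("data/sales.csv", index=False)
print(f"Wrote {len(df)} rows to data/sales.csv")
```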
Part 3: Cleaning Data

Real data is messy. Here's how to fix common problems.
Common Issues (And How to Fix Them)
| Problem | Solution |
|---|---|
| Missing values | df.dropna() or df.fillna(value) |
| Wrong data types | pd.to_datetime(df['date']) |
| Duplicates | df.drop_duplicates() |
| Inconsistent text | df['name'].str.lower().str.strip() |
| Outliers | Filter or cap values |
Ask Claude to Clean Your Data
```
I have sales data with these issues:
- Date column is a string like "Jan 15, 2024"
- Some revenue values are missing
- Product names have inconsistent capitalization
- There are duplicate rows

Write a cleaning function that fixes all of these.
```

Example Cleaning Script

```python
import pandas as pd

def clean_sales_data(df):
    """Clean the raw sales data."""
    # Make a copy (don't modify original)
    df = df.copy()

    # Fix dates
    df['date'] = pd.to_datetime(df['date'])

    # Handle missing revenue (fill with median)
    df['revenue'] = df['revenue'].fillna(df['revenue'].median())

    # Standardize product names
    df['product'] = df['product'].str.lower().str.strip()

    # Remove duplicates
    df = df.drop_duplicates()

    # Remove obvious errors (negative quantities)
    df = df[df['quantity'] > 0]

    return df

# Usage
df = pd.read_csv("data/sales.csv")
df_clean = clean_sales_data(df)
print(f"Rows before: {len(df)}, after: {len(df_clean)}")
```

Part 4: Exploring Data
Before you analyze, you need to understand. Here's the exploration toolkit.
Quick Summary
```python
# Overview
df.describe()                       # Statistics for numeric columns
df.info()                           # Data types, missing values
df.shape                            # (rows, columns)

# Specific columns
df['category'].value_counts()       # Count by category
df['revenue'].mean()                # Average revenue
df['date'].min(), df['date'].max()  # Date range
```

Grouping and Aggregation
This is where pandas shines:
```python
# Total revenue by product
df.groupby('product')['revenue'].sum()

# Multiple aggregations
df.groupby('category').agg({
    'revenue': 'sum',
    'quantity': 'mean',
    'order_id': 'count'
})

# Monthly totals
df.groupby(df['date'].dt.month)['revenue'].sum()
```

Ask Claude for Exploration Help
```
I have sales data with: date, product, category, quantity, price, region.
What are the most important things to explore first?
Write the pandas code for each.
```

Part 5: Visualizations
A good chart is worth a thousand .head() calls.
Basic Plots
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-whitegrid')

# Line chart (trends)
df.groupby('date')['revenue'].sum().plot(kind='line')
plt.title('Daily Revenue')
plt.savefig('outputs/daily_revenue.png')
plt.show()

# Bar chart (comparisons)
df.groupby('category')['revenue'].sum().plot(kind='bar')
plt.title('Revenue by Category')
plt.savefig('outputs/category_revenue.png')
plt.show()

# Histogram (distribution)
df['revenue'].hist(bins=30)
plt.title('Revenue Distribution')
plt.savefig('outputs/revenue_dist.png')
plt.show()
```

Ask Claude for Better Charts
```
Create a visualization showing monthly revenue trends. Make it:
- Easy to read (larger fonts)
- Professional looking (clean style)
- Saved as PNG at 300 DPI
- Include a trend line
```
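If you'd rather write it yourself, here's a rough sketch of one way to do it. It assumes df['date'] is already a datetime column and df['revenue'] exists, and the output filename is just an example:

```python
import numpy as np
import matplotlib.pyplot as plt

# Monthly totals (assumes df['date'] is datetime and df['revenue'] exists)
monthly = df.groupby(df['date'].dt.to_period('M'))['revenue'].sum()

fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(monthly))
ax.plot(x, monthly.values, marker='o', label='Monthly revenue')

# Simple linear trend line fitted with numpy
slope, intercept = np.polyfit(x, monthly.values, 1)
ax.plot(x, slope * x + intercept, linestyle='--', label='Trend')

ax.set_xticks(x)
ax.set_xticklabels(monthly.index.astype(str), rotation=45)
ax.set_title('Monthly Revenue Trend', fontsize=16)
ax.set_ylabel('Revenue ($)', fontsize=12)
ax.legend()
plt.tight_layout()
plt.savefig('outputs/monthly_revenue_trend.png', dpi=300)
plt.show()
```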
Part 6: Putting It Together

Let's build a complete analysis workflow.
The Full Script
"""Sales Analysis PipelineRun with: python scripts/analyze.py"""import pandas as pdimport matplotlib.pyplot as plt # 1. Loadprint("Loading data...")df = pd.read_csv("data/sales.csv") # 2. Cleanprint("Cleaning data...")df['date'] = pd.to_datetime(df['date'])df = df.dropna()df = df[df['quantity'] > 0] # 3. Analyzeprint("Analyzing...")monthly = df.groupby(df['date'].dt.to_period('M')).agg({ 'revenue': 'sum', 'quantity': 'sum', 'order_id': 'count'}).rename(columns={'order_id': 'num_orders'}) # 4. Visualizeprint("Creating charts...")fig, ax = plt.subplots(figsize=(10, 6))monthly['revenue'].plot(ax=ax)ax.set_title('Monthly Revenue', fontsize=14)ax.set_xlabel('Month')ax.set_ylabel('Revenue ($)')plt.tight_layout()plt.savefig('outputs/monthly_revenue.png', dpi=300) # 5. Reportprint("\n=== KEY FINDINGS ===")print(f"Total Revenue: ${monthly['revenue'].sum():,.2f}")print(f"Best Month: {monthly['revenue'].idxmax()}")print(f"Average Monthly Revenue: ${monthly['revenue'].mean():,.2f}")print(f"\nChart saved to outputs/monthly_revenue.png")Commit Your Work
```bash
git add scripts/analyze.py outputs/
git commit -m "feat: add complete sales analysis pipeline"
```

Part 7: Jupyter Notebooks
For interactive exploration, Jupyter notebooks are your friend.
Start Jupyter
```bash
jupyter notebook
```

This opens a browser. Create a new Python 3 notebook.
Notebook Best Practices
| Do | Don't |
|---|---|
| Use markdown headers | Run cells out of order |
| Keep cells small and focused | Put all code in one cell |
| Restart and run all before sharing | Leave broken cells |
| Include explanatory text | Assume code is self-explanatory |
Pro Tip
Ask Claude to structure your notebook:
```
Create a Jupyter notebook structure for analyzing customer purchase patterns.
Include markdown sections and placeholder code cells.
```

Common Patterns
Keep these in your back pocket:
Filter Rows
```python
# Single condition
high_revenue = df[df['revenue'] > 1000]

# Multiple conditions
q4_big_sales = df[(df['date'].dt.quarter == 4) & (df['revenue'] > 500)]
```

Create New Columns
```python
# From calculation
df['profit'] = df['revenue'] - df['cost']
df['profit_margin'] = df['profit'] / df['revenue']

# From date
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()

# From categories
df['size'] = pd.cut(df['revenue'], bins=[0, 100, 500, float('inf')],
                    labels=['small', 'medium', 'large'])
```

Join DataFrames
```python
# Merge on common column
df_full = pd.merge(orders, customers, on='customer_id')

# Concatenate vertically
all_months = pd.concat([jan_df, feb_df, mar_df])
```

Troubleshooting
Next Steps
You've got the fundamentals. Here's where to go deeper:
| Want to... | Learn |
|---|---|
| Build dashboards | Streamlit, Dash |
| More statistics | scipy, statsmodels |
| Machine learning | scikit-learn |
| Bigger data | Dask, PySpark |
| Automate reports | Automation Track |
Resources
- pandas documentation — Official reference
- Python Data Science Handbook — Free online book
- Kaggle — Practice datasets and competitions
- Real Python — Quality tutorials
Success
You're ready! Grab a dataset, open VS Code, and start exploring. When you get stuck, Claude is there to help.