From raw CSV to actionable insights — learn data cleaning, exploration, and visualization with Pandas and Matplotlib.
Start with pd.read_csv() to load your dataset. Then use df.head(), df.info(), and df.describe() to check the first rows, the column types and null counts, and the summary statistics.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape) # rows × columns
df.info() # dtypes + nulls (info() prints directly and returns None)
print(df.describe()) # statistical summary
Use df.isnull().sum() to find gaps. Drop rows with dropna() or fill them with fillna(). Choose based on how much data you can afford to lose.
Dropping rows when more than about 20% of a column is missing can bias your analysis, particularly when the values are not missing at random. Consider imputation instead.
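As a rough sketch of the drop-or-fill decision, the snippet below measures the share of missing values per column, imputes the columns that exceed the threshold, and drops the few remaining incomplete rows. The sample DataFrame, its column names, and the THRESHOLD constant are illustrative assumptions. Median and mode imputation are only one simple option, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are made up for illustration.
df = pd.DataFrame({
    "revenue": [120.0, np.nan, 80.0, 100.0, np.nan, 90.0],
    "category": ["a", "b", None, "a", "a", "b"],
    "order_id": [1, 2, 3, 4, 5, 6],
})

# Fraction of missing values per column (0.0 to 1.0).
missing_share = df.isnull().mean()
print(missing_share)

# Assumed cut-off, taken from the 20% rule of thumb in the text.
THRESHOLD = 0.20

cleaned = df.copy()

# Columns with many gaps: impute rather than drop.
for col in missing_share[missing_share > THRESHOLD].index:
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        # Numeric column: fill with the median of the observed values.
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    else:
        # Non-numeric column: fill with the most frequent value.
        cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

# Any gaps left are in columns with few missing values, so drop those rows.
cleaned = cleaned.dropna()
print(cleaned)
```

Here revenue is missing in a third of the rows, so it is imputed, while the single gap in category costs only one row when dropped.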
groupby() is one of the most useful tools in Pandas. Combine it with agg() to compute several statistics per group in a single call.
result = df.groupby('category').agg(
total_sales=('revenue', 'sum'),
avg_order=('revenue', 'mean'),
order_count=('order_id', 'count')
).reset_index()
Use Matplotlib for quick plots and Seaborn for statistical visualisations. Always label axes and titles: a chart without context is useless in a report.
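To make the labelling advice concrete, here is a minimal Matplotlib sketch of a bar chart with labelled axes and a title. The aggregated values, the label text, and the output file name are assumptions for illustration. The Agg backend is selected only so the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; works without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical aggregated data, shaped like a groupby/agg result.
result = pd.DataFrame({
    "category": ["Books", "Electronics", "Toys"],
    "total_sales": [1200.0, 5400.0, 800.0],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(result["category"], result["total_sales"])

# Give the reader the context: what is on each axis and what the chart shows.
ax.set_xlabel("Product category")
ax.set_ylabel("Total sales")
ax.set_title("Total sales by category")

fig.tight_layout()
fig.savefig("sales_by_category.png")  # assumed output file name
```

In a notebook you would typically drop the backend line and call plt.show() instead of saving to a file.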
For datasets over 1M rows, consider switching from Pandas to Polars. It is written in Rust and can be 5–10x faster for many operations, though the actual speedup depends on the workload.
Data scientist and Python educator.