Python Data Cleaning Cookbook

Data cleaning is the unsung hero of data science. While machine learning models and visualization dashboards receive the most attention, practitioners routinely report spending most of their time, often cited as around 80%, on cleaning and preparing data. This isn’t just busywork; it’s critical infrastructure.

Dirty data leads to flawed insights, inaccurate predictions, and costly business decisions. A single missing value in the wrong place can skew your entire analysis. Duplicate records can inflate your metrics. Outliers can distort summary statistics and degrade model performance.

In this comprehensive Python data cleaning cookbook, you’ll learn practical techniques to detect and remove dirty data using pandas, machine learning algorithms, and even ChatGPT. Whether you’re a data analyst, data scientist, or Python developer, these battle-tested methods will help you transform messy datasets into clean, analysis-ready data.

Understanding Common Data Quality Issues

Before diving into solutions, let’s identify the enemies you’ll face in real-world datasets:

Missing Values: Empty cells, NaN values, or placeholder values like -999 or “N/A” that represent absent data.

Duplicate Records: Identical or near-identical rows that can artificially inflate your analysis results.

Outliers: Extreme values that deviate significantly from the normal pattern, which may be errors or legitimate anomalies.

Inconsistent Formatting: Dates in different formats, inconsistent capitalization, or varying units of measurement.

Data Type Issues: Numbers stored as strings, dates stored as objects, or categorical data encoded incorrectly.

Invalid Values: Data that violates business rules or logical constraints (e.g., negative ages, future birthdates).
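
To make these issues concrete, here is a tiny, made-up DataFrame (hypothetical column names and values) that exhibits each of them; the rest of this cookbook shows how to clean data like this:

import pandas as pd
import numpy as np

# A tiny, hypothetical dataset showing each issue described above
messy = pd.DataFrame({
    'customer_id': [101, 102, 102, 103, 104],           # 102 appears twice (duplicate record)
    'name': ['  alice ', 'BOB', 'BOB', 'Carol', None],  # inconsistent case/whitespace, missing value
    'age': ['34', '29', '29', '-5', '41'],              # numbers stored as strings, invalid negative age
    'signup_date': ['2023-01-15', '01/20/2023', '01/20/2023',
                    '2023-02-30', '2023-03-01'],         # mixed formats and one impossible date
    'income': [52000, np.nan, np.nan, 61000, 9999999],  # missing values and an extreme outlier
})

print(messy)
print(messy.dtypes)  # 'age' is stored as object (text), not a numeric type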

Setting Up Your Python Data Cleaning Environment

First, let’s install and import the essential libraries for data cleaning:

# Installation (run in terminal)
# pip install pandas numpy scikit-learn matplotlib seaborn

# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

Loading and Initial Data Assessment

The first step in any data cleaning project is understanding what you’re working with:

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Initial data exploration
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

# Get comprehensive data information
print("\nData Info:")
print(df.info())

# Statistical summary
print("\nStatistical Summary:")
print(df.describe(include='all'))

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Calculate missing percentage
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nMissing Percentage:")
print(missing_percentage[missing_percentage > 0])

Handling Missing Data with Pandas

Missing data is the most common data quality issue. Pandas provides powerful methods to detect and handle it effectively.

Strategy 1: Remove Missing Data

Use this approach when missing data is minimal (typically < 5% of your dataset):

# Remove rows with any missing values
df_cleaned = df.dropna()

# Remove rows where specific columns have missing values
df_cleaned = df.dropna(subset=['important_column1', 'important_column2'])

# Remove columns with more than 50% missing values
threshold = len(df) * 0.5
df_cleaned = df.dropna(thresh=threshold, axis=1)

# Remove rows where all values are missing
df_cleaned = df.dropna(how='all')

Strategy 2: Fill Missing Data

When you can’t afford to lose data, intelligent imputation is the answer:

# Note: assign the result back to the column; fillna(..., inplace=True) on a
# column selection is unreliable in recent pandas, and fillna(method=...) is deprecated.

# Fill with a constant value
df['column_name'] = df['column_name'].fillna(0)

# Fill with mean (for numerical data)
df['age'] = df['age'].fillna(df['age'].mean())

# Fill with median (more robust to outliers)
df['income'] = df['income'].fillna(df['income'].median())

# Fill with mode (for categorical data)
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Forward fill (carry forward the last valid observation)
df['time_series_data'] = df['time_series_data'].ffill()

# Backward fill
df['time_series_data'] = df['time_series_data'].bfill()

# Fill with interpolation (for time series)
df['temperature'] = df['temperature'].interpolate(method='linear')

Strategy 3: Advanced Imputation with Machine Learning

For sophisticated missing data handling, use predictive imputation:

from sklearn.impute import SimpleImputer, KNNImputer

# Simple imputer with strategy
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent', 'constant'
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])

# KNN Imputer (uses k-nearest neighbors to predict missing values)
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df.select_dtypes(include=[np.number])),
    columns=df.select_dtypes(include=[np.number]).columns
)

Detecting and Removing Duplicate Records

Duplicates can severely distort your analysis by overcounting observations:

# Identify duplicate rows
duplicates = df.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

# View duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
print(duplicate_rows)

# Remove duplicate rows (keep first occurrence)
df_no_duplicates = df.drop_duplicates()

# Remove duplicates based on specific columns
df_no_duplicates = df.drop_duplicates(subset=['user_id', 'transaction_date'], keep='first')

# Keep last occurrence instead
df_no_duplicates = df.drop_duplicates(keep='last')

# Identify duplicates in specific columns only
duplicate_ids = df[df.duplicated(subset=['customer_id'], keep=False)]
print(f"Customers with duplicate records: {duplicate_ids['customer_id'].nunique()}")

Outlier Detection Using Statistical Methods

Statistical approaches are fast and effective for univariate outlier detection:

# Method 1: Z-Score (for normally distributed data)
from scipy import stats

def detect_outliers_zscore(df, column, threshold=3):
    """Detect outliers using the z-score method"""
    values = df[column].dropna()
    z_scores = np.abs(stats.zscore(values))
    return values[z_scores > threshold]  # mask is aligned with the non-null values

outliers = detect_outliers_zscore(df, 'price', threshold=3)
print(f"Outliers detected: {len(outliers)}")

# Method 2: IQR (Interquartile Range) - more robust
def detect_outliers_iqr(df, column):
    """Detect outliers using IQR method"""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers

outliers_iqr = detect_outliers_iqr(df, 'price')
print(f"Outliers using IQR: {len(outliers_iqr)}")

# Visualize outliers with boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='price')
plt.title('Outlier Detection with Boxplot')
plt.show()

# Remove outliers based on IQR
def remove_outliers_iqr(df, columns):
    """Remove outliers from specified columns"""
    df_clean = df.copy()
    for column in columns:
        Q1 = df_clean[column].quantile(0.25)
        Q3 = df_clean[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df_clean = df_clean[(df_clean[column] >= lower_bound) & 
                            (df_clean[column] <= upper_bound)]
    return df_clean

df_no_outliers = remove_outliers_iqr(df, ['price', 'quantity', 'revenue'])

Machine Learning for Outlier Detection

For multivariate outlier detection and anomaly detection in complex datasets, machine learning excels:

Isolation Forest Algorithm

Isolation Forest is highly effective for detecting anomalies in high-dimensional data:

from sklearn.ensemble import IsolationForest

# Select numerical features for outlier detection
numerical_features = df.select_dtypes(include=[np.number]).columns
X = df[numerical_features].dropna()

# Initialize Isolation Forest
iso_forest = IsolationForest(
    contamination=0.05,  # Expected proportion of outliers (5%)
    random_state=42,
    n_estimators=100
)

# Fit and predict (-1 for outliers, 1 for inliers)
outlier_predictions = iso_forest.fit_predict(X)

# Add predictions to the dataframe, aligned on X's index
# (rows dropped because of missing values keep NaN in the 'outlier' column)
df.loc[X.index, 'outlier'] = outlier_predictions
outliers_ml = df[df['outlier'] == -1]

print(f"Outliers detected by Isolation Forest: {len(outliers_ml)}")
print("\nOutlier statistics:")
print(outliers_ml[numerical_features].describe())

# Visualize outliers (for 2D data; 'feature1' and 'feature2' below are placeholder column names)
plt.figure(figsize=(12, 6))
plt.scatter(df[df['outlier'] == 1]['feature1'], 
           df[df['outlier'] == 1]['feature2'], 
           c='blue', label='Normal', alpha=0.6)
plt.scatter(df[df['outlier'] == -1]['feature1'], 
           df[df['outlier'] == -1]['feature2'], 
           c='red', label='Outlier', alpha=0.6)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Outlier Detection with Isolation Forest')
plt.legend()
plt.show()

DBSCAN Clustering for Outlier Detection

DBSCAN identifies outliers as points that don’t belong to any cluster:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Prepare and scale data (keep the non-null rows so results can be aligned back)
X_num = df[numerical_features].dropna()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)

# Points labeled as -1 are outliers
df.loc[X_num.index, 'cluster'] = clusters
outliers_dbscan = df[df['cluster'] == -1]

print(f"Outliers detected by DBSCAN: {len(outliers_dbscan)}")
print(f"Number of clusters found: {len(set(clusters)) - (1 if -1 in clusters else 0)}")

# Visualize clusters and outliers
plt.figure(figsize=(12, 6))
for cluster_id in set(clusters):
    if cluster_id == -1:
        cluster_data = X_scaled[clusters == cluster_id]
        plt.scatter(cluster_data[:, 0], cluster_data[:, 1], 
                   c='red', label='Outliers', marker='x', s=100)
    else:
        cluster_data = X_scaled[clusters == cluster_id]
        plt.scatter(cluster_data[:, 0], cluster_data[:, 1], 
                   label=f'Cluster {cluster_id}', alpha=0.6)
plt.title('DBSCAN Clustering for Outlier Detection')
plt.legend()
plt.show()

Leveraging ChatGPT for Data Cleaning Insights

ChatGPT can be a powerful assistant in your data cleaning workflow. Here’s how to use it effectively:

1. Analyzing Data Quality Reports

Share your data quality summary with ChatGPT to get insights:

# Generate a comprehensive data quality report
def generate_data_quality_report(df):
    """Generate a detailed data quality report"""
    report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'missing_percentage': ((df.isnull().sum() / len(df)) * 100).to_dict(),
        'duplicate_rows': df.duplicated().sum(),
        'data_types': df.dtypes.astype(str).to_dict(),
        'unique_values': {col: df[col].nunique() for col in df.columns},
        'numerical_summary': df.describe().to_dict()
    }
    return report

report = generate_data_quality_report(df)
print("Data Quality Report:")
print(report)

# Prompt for ChatGPT:
# "I have a dataset with the following data quality issues: [paste report]
# Can you suggest a data cleaning strategy prioritizing the most critical issues?"

2. Generating Custom Cleaning Functions

Ask ChatGPT to create specific data cleaning functions:

Example Prompt: “Create a Python function that standardizes phone numbers in different formats (e.g., (555) 123-4567, 555-123-4567, 5551234567) into a single format.”

ChatGPT Response (example of what you’d receive):

import re

def standardize_phone_numbers(phone):
    """Standardize phone numbers to format: (XXX) XXX-XXXX"""
    if pd.isna(phone):
        return None
    
    # Remove all non-digit characters
    digits = re.sub(r'\D', '', str(phone))
    
    # Check if we have 10 digits
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) == 11 and digits[0] == '1':
        # Remove leading 1 for US numbers
        return f"({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    else:
        return phone  # Return original if format is unexpected

# Apply the function
df['phone_standardized'] = df['phone'].apply(standardize_phone_numbers)

3. Interpreting Outliers and Anomalies

Use ChatGPT to understand whether detected outliers are errors or legitimate extreme values:

Example Prompt: “I found outliers in my e-commerce dataset where some transactions have prices 10x higher than average. The product is ‘iPhone 13’. Could these be legitimate or data errors?”

This contextual analysis helps you decide whether to remove, cap, or keep outliers.
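
Once you have decided, the treatment itself is simple to apply. Here is a minimal sketch of the three options (remove, cap, or keep-and-flag), assuming the column in question is 'price' and reusing the IQR bounds introduced earlier:

# IQR bounds for the assumed 'price' column
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Option 1: remove rows flagged as outliers
df_removed = df[(df['price'] >= lower) & (df['price'] <= upper)]

# Option 2: cap (winsorize) extreme values at the IQR bounds
df_capped = df.copy()
df_capped['price'] = df_capped['price'].clip(lower=lower, upper=upper)

# Option 3: keep the values but add a flag for downstream analysis
df_flagged = df.copy()
df_flagged['price_outlier'] = ~df_flagged['price'].between(lower, upper)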

4. Generating Data Validation Rules

# Prompt ChatGPT: "Generate Python validation rules for a customer dataset 
# with columns: age, email, income, signup_date"

def validate_customer_data(df):
    """Validate customer data based on business rules"""
    validation_results = {
        'invalid_age': df[(df['age'] < 0) | (df['age'] > 120)],
        'invalid_email': df[~df['email'].str.contains('@', na=False)],
        'invalid_income': df[df['income'] < 0],
        'future_signup_date': df[df['signup_date'] > pd.Timestamp.now()],
        'missing_required_fields': df[df[['age', 'email']].isnull().any(axis=1)]
    }
    
    for issue, invalid_rows in validation_results.items():
        if len(invalid_rows) > 0:
            print(f"\n{issue}: {len(invalid_rows)} records")
            print(invalid_rows.head())
    
    return validation_results

validation_results = validate_customer_data(df)

Handling Inconsistent Data Formatting

Real-world datasets often have formatting inconsistencies that need standardization:

# Standardize text data
df['name'] = df['name'].str.strip()  # Remove whitespace
df['name'] = df['name'].str.title()  # Capitalize properly
df['category'] = df['category'].str.lower()  # Lowercase for consistency

# Standardize date formats
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Alternative: try a fixed list of formats on the raw date strings
# (use this instead of the pd.to_datetime call above when formats are mixed)
def parse_multiple_date_formats(date_string):
    """Try multiple date formats; return NaT if none match"""
    formats = ['%Y-%m-%d', '%m/%d/%Y', '%d-%m-%Y', '%Y/%m/%d']
    for fmt in formats:
        try:
            return pd.to_datetime(date_string, format=fmt)
        except (ValueError, TypeError):
            continue
    return pd.NaT

df['date'] = df['date'].apply(parse_multiple_date_formats)

# Standardize currency values
def clean_currency(value):
    """Remove currency symbols and convert to float"""
    if pd.isna(value):
        return None
    value_str = str(value).replace('$', '').replace(',', '').strip()
    try:
        return float(value_str)
    except (ValueError, TypeError):
        return None

df['price'] = df['price'].apply(clean_currency)

# Handle boolean variations (normalize case and whitespace before mapping;
# values not covered by the mapping become NaN)
boolean_mapping = {
    'yes': True, 'no': False, 'y': True, 'n': False,
    'true': True, 'false': False, '1': True, '0': False
}
df['is_active'] = (
    df['is_active'].astype(str).str.strip().str.lower().map(boolean_mapping)
)

Data Type Conversion and Validation

Ensuring correct data types is crucial for analysis:

# Convert data types explicitly
df['user_id'] = df['user_id'].astype(str)
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df['category'] = df['category'].astype('category')

# Validate data types
def validate_data_types(df, expected_types):
    """Validate that columns have expected data types"""
    type_issues = {}
    for column, expected_type in expected_types.items():
        if column in df.columns:
            actual_type = df[column].dtype
            if actual_type != expected_type:
                type_issues[column] = {
                    'expected': expected_type,
                    'actual': actual_type
                }
    return type_issues

expected_types = {
    'user_id': 'object',
    'age': 'int64',
    'income': 'float64',
    'signup_date': 'datetime64[ns]'
}

type_issues = validate_data_types(df, expected_types)
if type_issues:
    print("Data type issues found:", type_issues)

Creating a Complete Data Cleaning Pipeline

Combine all techniques into a reusable pipeline:

class DataCleaningPipeline:
    """Complete data cleaning pipeline"""
    
    def __init__(self, df):
        self.df = df.copy()
        self.cleaning_log = []
    
    def remove_duplicates(self, subset=None):
        """Remove duplicate rows"""
        initial_rows = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        removed = initial_rows - len(self.df)
        self.cleaning_log.append(f"Removed {removed} duplicate rows")
        return self
    
    def handle_missing_values(self, strategy='drop', columns=None):
        """Handle missing values with specified strategy"""
        if strategy == 'drop':
            self.df = self.df.dropna(subset=columns)
            self.cleaning_log.append(f"Dropped rows with missing values in {columns}")
        elif strategy == 'fill_mean':
            for col in columns:
                self.df[col] = self.df[col].fillna(self.df[col].mean())
            self.cleaning_log.append(f"Filled missing values with mean for {columns}")
        elif strategy == 'fill_median':
            for col in columns:
                self.df[col] = self.df[col].fillna(self.df[col].median())
            self.cleaning_log.append(f"Filled missing values with median for {columns}")
        return self
    
    def remove_outliers(self, columns, method='iqr'):
        """Remove outliers using specified method"""
        initial_rows = len(self.df)
        if method == 'iqr':
            for col in columns:
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower = Q1 - 1.5 * IQR
                upper = Q3 + 1.5 * IQR
                self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]
        removed = initial_rows - len(self.df)
        self.cleaning_log.append(f"Removed {removed} outliers from {columns}")
        return self
    
    def standardize_text(self, columns):
        """Standardize text columns"""
        for col in columns:
            self.df[col] = self.df[col].str.strip().str.lower()
        self.cleaning_log.append(f"Standardized text in {columns}")
        return self
    
    def convert_data_types(self, type_mapping):
        """Convert columns to specified data types"""
        for col, dtype in type_mapping.items():
            if dtype == 'datetime':
                self.df[col] = pd.to_datetime(self.df[col], errors='coerce')
            else:
                self.df[col] = self.df[col].astype(dtype)
        self.cleaning_log.append(f"Converted data types for {list(type_mapping.keys())}")
        return self
    
    def get_clean_data(self):
        """Return cleaned dataframe"""
        return self.df
    
    def get_cleaning_report(self):
        """Return cleaning log"""
        return self.cleaning_log

# Use the pipeline
pipeline = DataCleaningPipeline(df)
cleaned_df = (pipeline
              .remove_duplicates()
              .handle_missing_values(strategy='fill_median', columns=['age', 'income'])
              .remove_outliers(columns=['price', 'quantity'], method='iqr')
              .standardize_text(columns=['name', 'category'])
              .convert_data_types({'date': 'datetime', 'user_id': str})
              .get_clean_data())

print("\nCleaning Report:")
for log in pipeline.get_cleaning_report():
    print(f"- {log}")

Extracting Key Insights from Cleaned Data

Once your data is clean, you can extract meaningful insights:

# Summary statistics after cleaning
print("Clean Data Summary:")
print(cleaned_df.describe())

# Value distribution
print("\nCategory Distribution:")
print(cleaned_df['category'].value_counts())

# Correlation analysis
correlation_matrix = cleaned_df.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numerical Features')
plt.show()

# Time-based analysis (if you have datetime columns)
cleaned_df['month'] = cleaned_df['date'].dt.month
monthly_trends = cleaned_df.groupby('month').agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'user_id': 'nunique'
})
print("\nMonthly Trends:")
print(monthly_trends)

# Segment analysis
segment_analysis = cleaned_df.groupby('category').agg({
    'price': ['mean', 'median', 'std'],
    'quantity': 'sum',
    'revenue': 'sum'
})
print("\nSegment Analysis:")
print(segment_analysis)

Best Practices for Data Cleaning

Follow these expert guidelines to ensure robust data cleaning:

  1. Always Keep a Backup: Never overwrite your original raw data. Work on copies.
  2. Document Everything: Maintain a detailed log of all cleaning operations performed.
  3. Validate After Cleaning: Always check that your cleaning operations produced the expected results (see the sketch after this list).
  4. Set Thresholds Intelligently: Use domain knowledge to set appropriate thresholds for outlier detection.
  5. Handle Missing Data Appropriately: Understand why data is missing before deciding how to handle it.
  6. Automate Repetitive Tasks: Create reusable functions and pipelines for common cleaning operations.
  7. Visualize Before and After: Use plots to understand the impact of your cleaning operations.
  8. Test on Subsets First: Test cleaning operations on small data samples before applying to entire datasets.
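
For example, here is a minimal sketch of practices 1 and 3, under the assumption that the dataset has 'age' and 'price' columns; adapt the checks to your own schema:

# Work on a copy so the original raw data is never modified
raw_df = pd.read_csv('your_dataset.csv')
df = raw_df.copy()

# ... cleaning operations on df ...

# Post-cleaning sanity checks (hypothetical rules; adjust to your dataset)
assert df.duplicated().sum() == 0, "Duplicates remain after cleaning"
assert df['age'].between(0, 120).all(), "Ages fall outside the valid range"
assert df['price'].isnull().sum() == 0, "Missing prices remain"
print(f"Rows before cleaning: {len(raw_df)}, after cleaning: {len(df)}")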

Conclusion: Building Better Models with Clean Data

Data cleaning isn’t glamorous, but it’s the foundation of every successful data science project. By mastering pandas manipulation, leveraging machine learning for outlier detection, and using ChatGPT as an intelligent assistant, you can transform messy datasets into reliable sources of insight.

The techniques in this cookbook—from handling missing values to detecting outliers with Isolation Forest—will save you countless hours and prevent costly analytical mistakes. Remember that data cleaning is an iterative process. As you analyze your data, you’ll discover new quality issues that require attention.

Start implementing these methods today, and you’ll see immediate improvements in your model performance, analysis accuracy, and confidence in your data-driven decisions. Clean data isn’t just about removing errors—it’s about unlocking the true potential hidden within your datasets.

Quick Reference: Essential Data Cleaning Commands

# Missing data
df.isnull().sum()                          # Count missing values
df.dropna()                                 # Remove rows with missing data
df.fillna(value)                           # Fill missing data
df['col'].fillna(df['col'].mean())        # Fill with mean

# Duplicates
df.duplicated().sum()                      # Count duplicates
df.drop_duplicates()                       # Remove duplicates

# Outliers
Q1 = df['col'].quantile(0.25)             # First quartile
Q3 = df['col'].quantile(0.75)             # Third quartile
IQR = Q3 - Q1                              # Interquartile range

# Data types
df.dtypes                                  # Check data types
df['col'].astype(type)                    # Convert type
pd.to_datetime(df['col'])                 # Convert to datetime

# Text cleaning
df['col'].str.strip()                     # Remove whitespace
df['col'].str.lower()                     # Convert to lowercase
df['col'].str.replace(old, new)           # Replace text

Master these techniques, and you’ll be well-equipped to handle any data cleaning challenge that comes your way.

Learn More: Pandas: Powerful Python Data Analysis toolkit
