Data cleaning is the unsung hero of data science. While machine learning models and visualization dashboards receive the most attention, practitioners routinely report that the bulk of a data scientist's time, often cited as around 80%, goes into cleaning and preparing data. This isn't just busywork; it's critical infrastructure.
Dirty data leads to flawed insights, inaccurate predictions, and costly business decisions. A single missing value in the wrong place can skew your entire analysis. Duplicate records can inflate your metrics. Outliers can significantly impact your machine learning models.
In this comprehensive Python data cleaning cookbook, you’ll learn practical techniques to detect and remove dirty data using pandas, machine learning algorithms, and even ChatGPT. Whether you’re a data analyst, data scientist, or Python developer, these battle-tested methods will help you transform messy datasets into clean, analysis-ready data.
Understanding Common Data Quality Issues
Before diving into solutions, let’s identify the enemies you’ll face in real-world datasets (the small synthetic example after this list shows several of them side by side):
Missing Values: Empty cells, NaN values, or placeholder values like -999 or “N/A” that represent absent data.
Duplicate Records: Identical or near-identical rows that can artificially inflate your analysis results.
Outliers: Extreme values that deviate significantly from the normal pattern, which may be errors or legitimate anomalies.
Inconsistent Formatting: Dates in different formats, inconsistent capitalization, or varying units of measurement.
Data Type Issues: Numbers stored as strings, dates stored as objects, or categorical data encoded incorrectly.
Invalid Values: Data that violates business rules or logical constraints (e.g., negative ages, future birthdates).
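To make these issues concrete, here is a tiny synthetic DataFrame, with invented column names and values, that exhibits several of them at once: a duplicated record, missing and invalid ages, numbers stored as strings with an “N/A” placeholder, and dates in mixed formats.
import pandas as pd
import numpy as np
# Hypothetical example combining several data quality issues in one table
messy = pd.DataFrame({
    'customer_id': [101, 102, 102, 103, 104],                      # the row for 102 is duplicated
    'age': [34, -5, -5, np.nan, 130],                              # invalid, missing, and extreme values
    'income': ['52,000', '48000', '48000', 'N/A', '1,200,000'],    # numbers stored as strings, "N/A" placeholder
    'signup_date': ['2023-01-15', '15/02/2023', '15/02/2023',
                    '2023-03-01', '2099-12-31']                    # mixed formats and a future date
})
print(messy)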
Setting Up Your Python Data Cleaning Environment
First, let’s install and import the essential libraries for data cleaning:
# Installation (run in terminal)
# pip install pandas numpy scikit-learn matplotlib seaborn
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

Loading and Initial Data Assessment
The first step in any data cleaning project is understanding what you’re working with:
# Load your dataset
df = pd.read_csv('your_dataset.csv')
# Initial data exploration
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
# Get comprehensive data information
print("\nData Info:")
print(df.info())
# Statistical summary
print("\nStatistical Summary:")
print(df.describe(include='all'))
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Calculate missing percentage
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nMissing Percentage:")
print(missing_percentage[missing_percentage > 0])
Handling Missing Data with Pandas
Missing data is the most common data quality issue. Pandas provides powerful methods to detect and handle it effectively.
Strategy 1: Remove Missing Data
Use this approach when missing data is minimal (typically < 5% of your dataset):
# Remove rows with any missing values
df_cleaned = df.dropna()
# Remove rows where specific columns have missing values
df_cleaned = df.dropna(subset=['important_column1', 'important_column2'])
# Remove columns with more than 50% missing values
threshold = len(df) * 0.5
df_cleaned = df.dropna(thresh=threshold, axis=1)
# Remove rows where all values are missing
df_cleaned = df.dropna(how='all')
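To operationalize the “< 5%” rule of thumb above, you can restrict row dropping to columns whose missing share is genuinely small. A minimal sketch (the 0.05 cutoff mirrors the heuristic in this section; it is not a hard rule):
# Fraction of missing values per column
missing_share = df.isnull().mean()
# Columns with some, but less than 5%, missing data
low_missing_cols = missing_share[(missing_share > 0) & (missing_share < 0.05)].index.tolist()
print("Dropping rows only where these columns are missing:", low_missing_cols)
df_cleaned = df.dropna(subset=low_missing_cols)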
Strategy 2: Fill Missing Data
When you can’t afford to lose data, intelligent imputation is the answer:
# Fill with a constant value
df['column_name'] = df['column_name'].fillna(0)
# Fill with mean (for numerical data)
df['age'] = df['age'].fillna(df['age'].mean())
# Fill with median (more robust to outliers)
df['income'] = df['income'].fillna(df['income'].median())
# Fill with mode (for categorical data)
df['category'] = df['category'].fillna(df['category'].mode()[0])
# Forward fill (carry forward the last valid observation)
df['time_series_data'] = df['time_series_data'].ffill()
# Backward fill
df['time_series_data'] = df['time_series_data'].bfill()
# Fill with interpolation (for time series)
df['temperature'] = df['temperature'].interpolate(method='linear')
Strategy 3: Advanced Imputation with Machine Learning
For sophisticated missing data handling, use predictive imputation:
from sklearn.impute import SimpleImputer, KNNImputer
# Simple imputer with strategy
imputer = SimpleImputer(strategy='mean') # or 'median', 'most_frequent', 'constant'
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])
# KNN Imputer (uses k-nearest neighbors to predict missing values)
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df.select_dtypes(include=[np.number])),
    columns=df.select_dtypes(include=[np.number]).columns
)
Detecting and Removing Duplicate Records
Duplicates can severely distort your analysis by overcounting observations:
# Identify duplicate rows
duplicates = df.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")
# View duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
print(duplicate_rows)
# Remove duplicate rows (keep first occurrence)
df_no_duplicates = df.drop_duplicates()
# Remove duplicates based on specific columns
df_no_duplicates = df.drop_duplicates(subset=['user_id', 'transaction_date'], keep='first')
# Keep last occurrence instead
df_no_duplicates = df.drop_duplicates(keep='last')
# Identify duplicates in specific columns only
duplicate_ids = df[df.duplicated(subset=['customer_id'], keep=False)]
print(f"Customers with duplicate records: {duplicate_ids['customer_id'].nunique()}")
Outlier Detection Using Statistical Methods
Statistical approaches are fast and effective for univariate outlier detection:
# Method 1: Z-Score (for normally distributed data)
from scipy import stats
def detect_outliers_zscore(df, column, threshold=3):
    """Detect outliers using the z-score method"""
    col = df[column].dropna()
    z_scores = np.abs(stats.zscore(col))
    return col[z_scores > threshold]
outliers = detect_outliers_zscore(df, 'price', threshold=3)
print(f"Outliers detected: {len(outliers)}")
# Method 2: IQR (Interquartile Range) - more robust
def detect_outliers_iqr(df, column):
    """Detect outliers using the IQR method"""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers
outliers_iqr = detect_outliers_iqr(df, 'price')
print(f"Outliers using IQR: {len(outliers_iqr)}")
# Visualize outliers with boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='price')
plt.title('Outlier Detection with Boxplot')
plt.show()
# Remove outliers based on IQR
def remove_outliers_iqr(df, columns):
    """Remove outliers from specified columns"""
    df_clean = df.copy()
    for column in columns:
        Q1 = df_clean[column].quantile(0.25)
        Q3 = df_clean[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df_clean = df_clean[(df_clean[column] >= lower_bound) &
                            (df_clean[column] <= upper_bound)]
    return df_clean
df_no_outliers = remove_outliers_iqr(df, ['price', 'quantity', 'revenue'])
Machine Learning for Outlier Detection
For multivariate outlier detection and anomaly detection in complex datasets, machine learning excels:
Isolation Forest Algorithm
Isolation Forest is highly effective for detecting anomalies in high-dimensional data:
from sklearn.ensemble import IsolationForest
# Select numerical features for outlier detection
numerical_features = df.select_dtypes(include=[np.number]).columns
X = df[numerical_features].dropna()
# Initialize Isolation Forest
iso_forest = IsolationForest(
    contamination=0.05,  # Expected proportion of outliers (5%)
    random_state=42,
    n_estimators=100
)
# Fit and predict (-1 for outliers, 1 for inliers)
outlier_predictions = iso_forest.fit_predict(X)
# Add predictions back to the scored rows (X excludes rows dropped by dropna)
df.loc[X.index, 'outlier'] = outlier_predictions
outliers_ml = df[df['outlier'] == -1]
print(f"Outliers detected by Isolation Forest: {len(outliers_ml)}")
print("\nOutlier statistics:")
print(outliers_ml[numerical_features].describe())
# Visualize outliers (for 2D data)
plt.figure(figsize=(12, 6))
plt.scatter(df[df['outlier'] == 1]['feature1'],
            df[df['outlier'] == 1]['feature2'],
            c='blue', label='Normal', alpha=0.6)
plt.scatter(df[df['outlier'] == -1]['feature1'],
            df[df['outlier'] == -1]['feature2'],
            c='red', label='Outlier', alpha=0.6)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Outlier Detection with Isolation Forest')
plt.legend()
plt.show()
DBSCAN Clustering for Outlier Detection
DBSCAN identifies outliers as points that don’t belong to any cluster:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Prepare and scale data (keep the surviving rows so labels can be aligned later)
X_numeric = df[numerical_features].dropna()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_numeric)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)
# Points labeled as -1 are outliers
df.loc[X_numeric.index, 'cluster'] = clusters
outliers_dbscan = df[df['cluster'] == -1]
print(f"Outliers detected by DBSCAN: {len(outliers_dbscan)}")
print(f"Number of clusters found: {len(set(clusters)) - (1 if -1 in clusters else 0)}")
# Visualize clusters and outliers
plt.figure(figsize=(12, 6))
for cluster_id in set(clusters):
    cluster_data = X_scaled[clusters == cluster_id]
    if cluster_id == -1:
        plt.scatter(cluster_data[:, 0], cluster_data[:, 1],
                    c='red', label='Outliers', marker='x', s=100)
    else:
        plt.scatter(cluster_data[:, 0], cluster_data[:, 1],
                    label=f'Cluster {cluster_id}', alpha=0.6)
plt.title('DBSCAN Clustering for Outlier Detection')
plt.legend()
plt.show()
Leveraging ChatGPT for Data Cleaning Insights
ChatGPT can be a powerful assistant in your data cleaning workflow. Here’s how to use it effectively:
1. Analyzing Data Quality Reports
Share your data quality summary with ChatGPT to get insights:
# Generate a comprehensive data quality report
def generate_data_quality_report(df):
    """Generate a detailed data quality report"""
    report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'missing_percentage': ((df.isnull().sum() / len(df)) * 100).to_dict(),
        'duplicate_rows': df.duplicated().sum(),
        'data_types': df.dtypes.astype(str).to_dict(),
        'unique_values': {col: df[col].nunique() for col in df.columns},
        'numerical_summary': df.describe().to_dict()
    }
    return report
report = generate_data_quality_report(df)
print("Data Quality Report:")
print(report)
# Prompt for ChatGPT:
# "I have a dataset with the following data quality issues: [paste report]
# Can you suggest a data cleaning strategy prioritizing the most critical issues?"
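If you prefer to send the report to ChatGPT programmatically rather than pasting it into the chat window, a minimal sketch with the openai package could look like this (the model name and the OPENAI_API_KEY environment variable are assumptions; adapt them to your account):
import json
from openai import OpenAI  # pip install openai; the client reads OPENAI_API_KEY from the environment
client = OpenAI()
prompt = (
    "I have a dataset with the following data quality report:\n"
    f"{json.dumps(report, indent=2, default=str)}\n"
    "Can you suggest a data cleaning strategy prioritizing the most critical issues?"
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute the model you have access to
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)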
2. Generating Custom Cleaning Functions
Ask ChatGPT to create specific data cleaning functions:
Example Prompt: “Create a Python function that standardizes phone numbers in different formats (e.g., (555) 123-4567, 555-123-4567, 5551234567) into a single format.”
ChatGPT Response (example of what you’d receive):
import re
def standardize_phone_numbers(phone):
    """Standardize phone numbers to the format (XXX) XXX-XXXX"""
    if pd.isna(phone):
        return None
    # Remove all non-digit characters
    digits = re.sub(r'\D', '', str(phone))
    # Check if we have 10 digits
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) == 11 and digits[0] == '1':
        # Drop the leading 1 for US numbers
        return f"({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    else:
        return phone  # Return the original value if the format is unexpected
# Apply the function
df['phone_standardized'] = df['phone'].apply(standardize_phone_numbers)
3. Interpreting Outliers and Anomalies
Use ChatGPT to understand whether detected outliers are errors or legitimate extreme values:
Example Prompt: “I found outliers in my e-commerce dataset where some transactions have prices 10x higher than average. The product is ‘iPhone 13’. Could these be legitimate or data errors?”
This contextual analysis helps you decide whether to remove, cap, or keep outliers.
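Before sending a prompt like the one above, it helps to assemble the context ChatGPT needs, such as how far each flagged value sits from the typical value for its group. A small sketch building on the outliers_iqr result from earlier (the 'product' column is illustrative and may not exist in your data):
# Compare each flagged price to the median price of the same product
group_median = df.groupby('product')['price'].median().rename('product_median_price')
flagged = outliers_iqr.join(group_median, on='product')
flagged['times_median'] = flagged['price'] / flagged['product_median_price']
print(flagged[['product', 'price', 'product_median_price', 'times_median']]
      .sort_values('times_median', ascending=False)
      .head(10))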
4. Generating Data Validation Rules
# Prompt ChatGPT: "Generate Python validation rules for a customer dataset
# with columns: age, email, income, signup_date"
def validate_customer_data(df):
    """Validate customer data based on business rules"""
    validation_results = {
        'invalid_age': df[(df['age'] < 0) | (df['age'] > 120)],
        'invalid_email': df[~df['email'].str.contains('@', na=False)],
        'invalid_income': df[df['income'] < 0],
        'future_signup_date': df[df['signup_date'] > pd.Timestamp.now()],
        'missing_required_fields': df[df[['age', 'email']].isnull().any(axis=1)]
    }
    for issue, invalid_rows in validation_results.items():
        if len(invalid_rows) > 0:
            print(f"\n{issue}: {len(invalid_rows)} records")
            print(invalid_rows.head())
    return validation_results
validation_results = validate_customer_data(df)
Handling Inconsistent Data Formatting
Real-world datasets often have formatting inconsistencies that need standardization:
# Standardize text data
df['name'] = df['name'].str.strip() # Remove whitespace
df['name'] = df['name'].str.title() # Capitalize properly
df['category'] = df['category'].str.lower() # Lowercase for consistency
# Standardize date formats (coerce anything unparseable to NaT)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Handle multiple known date formats explicitly
# (apply this to the raw string column as an alternative to the blanket coercion above)
def parse_multiple_date_formats(date_string):
    """Try multiple date formats and return NaT if none match"""
    formats = ['%Y-%m-%d', '%m/%d/%Y', '%d-%m-%Y', '%Y/%m/%d']
    for fmt in formats:
        try:
            return pd.to_datetime(date_string, format=fmt)
        except (ValueError, TypeError):
            continue
    return pd.NaT
df['date'] = df['date'].apply(parse_multiple_date_formats)
# Standardize currency values
def clean_currency(value):
    """Remove currency symbols and convert to float"""
    if pd.isna(value):
        return None
    value_str = str(value).replace('$', '').replace(',', '').strip()
    try:
        return float(value_str)
    except ValueError:
        return None
df['price'] = df['price'].apply(clean_currency)
# Handle boolean variations
boolean_mapping = {
    'yes': True, 'no': False, 'y': True, 'n': False,
    'true': True, 'false': False, '1': True, '0': False,
    1: True, 0: False
}
df['is_active'] = df['is_active'].map(boolean_mapping)
Data Type Conversion and Validation
Ensuring correct data types is crucial for analysis:
# Convert data types explicitly
df['user_id'] = df['user_id'].astype(str)
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df['category'] = df['category'].astype('category')
# Validate data types
def validate_data_types(df, expected_types):
    """Validate that columns have expected data types"""
    type_issues = {}
    for column, expected_type in expected_types.items():
        if column in df.columns:
            actual_type = df[column].dtype
            if actual_type != expected_type:
                type_issues[column] = {
                    'expected': expected_type,
                    'actual': actual_type
                }
    return type_issues
expected_types = {
    'user_id': 'object',
    'age': 'int64',
    'income': 'float64',
    'signup_date': 'datetime64[ns]'
}
type_issues = validate_data_types(df, expected_types)
if type_issues:
print("Data type issues found:", type_issues)
Creating a Complete Data Cleaning Pipeline
Combine all techniques into a reusable pipeline:
class DataCleaningPipeline:
    """Complete data cleaning pipeline"""

    def __init__(self, df):
        self.df = df.copy()
        self.cleaning_log = []

    def remove_duplicates(self, subset=None):
        """Remove duplicate rows"""
        initial_rows = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        removed = initial_rows - len(self.df)
        self.cleaning_log.append(f"Removed {removed} duplicate rows")
        return self

    def handle_missing_values(self, strategy='drop', columns=None):
        """Handle missing values with the specified strategy"""
        if strategy == 'drop':
            self.df = self.df.dropna(subset=columns)
            self.cleaning_log.append(f"Dropped rows with missing values in {columns}")
        elif strategy == 'fill_mean':
            for col in columns:
                self.df[col] = self.df[col].fillna(self.df[col].mean())
            self.cleaning_log.append(f"Filled missing values with mean for {columns}")
        elif strategy == 'fill_median':
            for col in columns:
                self.df[col] = self.df[col].fillna(self.df[col].median())
            self.cleaning_log.append(f"Filled missing values with median for {columns}")
        return self

    def remove_outliers(self, columns, method='iqr'):
        """Remove outliers using the specified method"""
        initial_rows = len(self.df)
        if method == 'iqr':
            for col in columns:
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower = Q1 - 1.5 * IQR
                upper = Q3 + 1.5 * IQR
                self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]
        removed = initial_rows - len(self.df)
        self.cleaning_log.append(f"Removed {removed} outliers from {columns}")
        return self

    def standardize_text(self, columns):
        """Standardize text columns"""
        for col in columns:
            self.df[col] = self.df[col].str.strip().str.lower()
        self.cleaning_log.append(f"Standardized text in {columns}")
        return self

    def convert_data_types(self, type_mapping):
        """Convert columns to the specified data types"""
        for col, dtype in type_mapping.items():
            if dtype == 'datetime':
                self.df[col] = pd.to_datetime(self.df[col], errors='coerce')
            else:
                self.df[col] = self.df[col].astype(dtype)
        self.cleaning_log.append(f"Converted data types for {list(type_mapping.keys())}")
        return self

    def get_clean_data(self):
        """Return the cleaned dataframe"""
        return self.df

    def get_cleaning_report(self):
        """Return the cleaning log"""
        return self.cleaning_log

# Use the pipeline
pipeline = DataCleaningPipeline(df)
cleaned_df = (pipeline
              .remove_duplicates()
              .handle_missing_values(strategy='fill_median', columns=['age', 'income'])
              .remove_outliers(columns=['price', 'quantity'], method='iqr')
              .standardize_text(columns=['name', 'category'])
              .convert_data_types({'date': 'datetime', 'user_id': str})
              .get_clean_data())
print("\nCleaning Report:")
for log in pipeline.get_cleaning_report():
    print(f"- {log}")
Extracting Key Insights from Cleaned Data
Once your data is clean, you can extract meaningful insights:
# Summary statistics after cleaning
print("Clean Data Summary:")
print(cleaned_df.describe())
# Value distribution
print("\nCategory Distribution:")
print(cleaned_df['category'].value_counts())
# Correlation analysis
correlation_matrix = cleaned_df.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numerical Features')
plt.show()
# Time-based analysis (if you have datetime columns)
cleaned_df['month'] = cleaned_df['date'].dt.month
monthly_trends = cleaned_df.groupby('month').agg({
    'revenue': 'sum',
    'quantity': 'sum',
    'user_id': 'nunique'
})
print("\nMonthly Trends:")
print(monthly_trends)
# Segment analysis
segment_analysis = cleaned_df.groupby('category').agg({
    'price': ['mean', 'median', 'std'],
    'quantity': 'sum',
    'revenue': 'sum'
})
print("\nSegment Analysis:")
print(segment_analysis)
Best Practices for Data Cleaning
Follow these expert guidelines to ensure robust data cleaning:
- Always Keep a Backup: Never overwrite your original raw data. Work on copies.
- Document Everything: Maintain a detailed log of all cleaning operations performed.
- Validate After Cleaning: Always check that your cleaning operations produced expected results (see the short sketch after this list).
- Set Thresholds Intelligently: Use domain knowledge to set appropriate thresholds for outlier detection.
- Handle Missing Data Appropriately: Understand why data is missing before deciding how to handle it.
- Automate Repetitive Tasks: Create reusable functions and pipelines for common cleaning operations.
- Visualize Before and After: Use plots to understand the impact of your cleaning operations.
- Test on Subsets First: Test cleaning operations on small data samples before applying to entire datasets.
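One way to act on “Validate After Cleaning” and “Visualize Before and After” is a small comparison report between the raw and cleaned frames. A minimal sketch, assuming df is the raw data and cleaned_df is the pipeline output from earlier:
def compare_before_after(raw_df, clean_df, numeric_col=None):
    """Print a quick before/after sanity check and optionally plot one column's distribution."""
    print(f"Rows:       {len(raw_df)} -> {len(clean_df)}")
    print(f"Duplicates: {raw_df.duplicated().sum()} -> {clean_df.duplicated().sum()}")
    print(f"Missing:    {raw_df.isnull().sum().sum()} -> {clean_df.isnull().sum().sum()}")
    if numeric_col is not None:
        fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
        raw_df[numeric_col].plot(kind='hist', bins=30, ax=axes[0], title=f'{numeric_col} (before)')
        clean_df[numeric_col].plot(kind='hist', bins=30, ax=axes[1], title=f'{numeric_col} (after)')
        plt.show()
compare_before_after(df, cleaned_df, numeric_col='price')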
Conclusion: Building Better Models with Clean Data
Data cleaning isn’t glamorous, but it’s the foundation of every successful data science project. By mastering pandas manipulation, leveraging machine learning for outlier detection, and using ChatGPT as an intelligent assistant, you can transform messy datasets into reliable sources of insight.
The techniques in this cookbook—from handling missing values to detecting outliers with Isolation Forest—will save you countless hours and prevent costly analytical mistakes. Remember that data cleaning is an iterative process. As you analyze your data, you’ll discover new quality issues that require attention.
Start implementing these methods today, and you’ll see immediate improvements in your model performance, analysis accuracy, and confidence in your data-driven decisions. Clean data isn’t just about removing errors—it’s about unlocking the true potential hidden within your datasets.
Quick Reference: Essential Data Cleaning Commands
# Missing data
df.isnull().sum() # Count missing values
df.dropna() # Remove rows with missing data
df.fillna(value) # Fill missing data
df['col'].fillna(df['col'].mean()) # Fill with mean
# Duplicates
df.duplicated().sum() # Count duplicates
df.drop_duplicates() # Remove duplicates
# Outliers
Q1 = df['col'].quantile(0.25) # First quartile
Q3 = df['col'].quantile(0.75) # Third quartile
IQR = Q3 - Q1 # Interquartile range
# Data types
df.dtypes # Check data types
df['col'].astype(type) # Convert type
pd.to_datetime(df['col']) # Convert to datetime
# Text cleaning
df['col'].str.strip() # Remove whitespace
df['col'].str.lower() # Convert to lowercase
df['col'].str.replace(old, new) # Replace text
Master these techniques, and you’ll be well-equipped to handle any data cleaning challenge that comes your way.
Learn More: Pandas: Powerful Python Data Analysis toolkit