Data Mapping & Validation with Python
Data mapping and validation are essential steps in every data engineering workflow. Whether integrating sources, transforming data, or migrating between systems, ensuring that information is accurately mapped and rigorously validated is crucial for maintaining data quality and trust across business processes.
This hands-on tutorial walks you through setting up a Python environment, mapping your data fields for clarity and consistency, and applying robust rule-based validation checks, with clear, practical code examples you can use immediately in your own projects.
Step 1: Create Project Folder and Set Up Python Environment
Create and enter a project directory:
mkdir data-mapping-validation
cd data-mapping-validation
Create a virtual environment named data-validation:
python3 -m venv data-validation
Activate it:
macOS/Linux:
source data-validation/bin/activate
Windows:
data-validation\Scripts\activate
Confirm that python now resolves to the interpreter inside the virtual environment:
which python
(On Windows, use where python instead.)
Step 2: Install Dependencies
pip install pandas cerberus
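If you want the setup to be reproducible, you can optionally record the exact installed versions in a requirements file:
pip freeze > requirements.txt
Anyone can then recreate the same environment later with pip install -r requirements.txt.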
Step 3: Create Python Scripts Using Nano
Data Mapping (step3_data_mapping.py)
Create the file with Nano:
nano step3_data_mapping.py
Paste:
import pandas as pd

try:
    # Sample sales data keyed by two-letter country codes
    data = {
        'CountryCode': ['US', 'GB', 'FR'],
        'Sales': [1500, 1200, 1300]
    }
    df = pd.DataFrame(data)

    # Lookup table translating codes into descriptive names
    country_map = {
        'US': 'United States',
        'GB': 'United Kingdom',
        'FR': 'France'
    }

    # Series.map applies the lookup; any code missing from the
    # dictionary would become NaN
    df['CountryName'] = df['CountryCode'].map(country_map)

    print("Data mapping result:")
    print(df)
except Exception as e:
    print(f"An error occurred during data mapping: {e}")
Save (Ctrl+O, then Enter) and exit (Ctrl+X).
Run:
python step3_data_mapping.py
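You should see output along these lines (column spacing can differ slightly between pandas versions):
Data mapping result:
  CountryCode  Sales     CountryName
0          US   1500   United States
1          GB   1200  United Kingdom
2          FR   1300          France
Note that Series.map returns NaN for any code missing from country_map, so gaps in the lookup table surface immediately in the result.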
Data Validation (step4_data_validation.py)
Create:
nano step4_data_validation.py
Paste:
import pandas as pd
from cerberus import Validator

try:
    # Already-mapped sample data to validate
    data = {
        'CountryCode': ['US', 'GB', 'FR'],
        'Sales': [1500, 1200, 1300],
        'CountryName': ['United States', 'United Kingdom', 'France']
    }
    df = pd.DataFrame(data)

    # Cerberus schema: two uppercase letters for the code,
    # non-negative integer sales, and a required country name
    schema = {
        'CountryCode': {'type': 'string', 'regex': '^[A-Z]{2}$', 'required': True},
        'Sales': {'type': 'integer', 'min': 0, 'required': True},
        'CountryName': {'type': 'string', 'required': True}
    }

    v = Validator(schema)
    # Validate row by row; v.errors describes each failure
    for index, row in df.iterrows():
        record = row.to_dict()
        if v.validate(record):
            print(f"Row {index} is valid.")
        else:
            print(f"Row {index} errors: {v.errors}")
except Exception as e:
    print(f"An error occurred during data validation: {e}")
Save and exit.
Run:
python step4_data_validation.py
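Every sample row satisfies the schema, so the expected output is:
Row 0 is valid.
Row 1 is valid.
Row 2 is valid.
One caveat: depending on your pandas version and dtypes, row.to_dict() can hand Cerberus numpy scalar types (such as numpy.int64) rather than native Python int, which the 'integer' type check rejects. If you see unexpected "must be of integer type" errors, convert the record to native types first; a minimal sketch:
record = {k: (v.item() if hasattr(v, 'item') else v) for k, v in row.to_dict().items()}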
Step 4: Prepare Input CSV File Manually
Create input_data.csv:
nano input_data.csv
Paste:
StatusCode,Priority
N,3
O,2
C,5
X,1
N,0
Save and exit. Note that the last two rows are intentionally problematic: 'X' has no entry in the status mapping, and a priority of 0 falls below the minimum that the validation schema in the next step enforces.
Step 5: Integrated Workflow Script With Error Handling (data_mapping_validation_workflow.py)
Create:
nano data_mapping_validation_workflow.py
Paste:
import pandas as pd
from cerberus import Validator

try:
    print("Loading 'input_data.csv'...")
    df = pd.read_csv('input_data.csv')
    print("CSV loaded successfully:")
    print(df.head())

    # Map StatusCode to human-readable descriptions; unmapped
    # codes (such as 'X') become NaN and will fail validation
    status_map = {
        'N': 'New',
        'O': 'Open',
        'C': 'Closed'
    }
    df['StatusDescription'] = df['StatusCode'].map(status_map)
    print("\nMapped StatusDescription column added.")

    # Schema: only known status codes, a priority of at least 1,
    # and a non-null description produced by the mapping step
    schema = {
        'StatusCode': {
            'type': 'string',
            'allowed': list(status_map.keys()),
            'required': True
        },
        'Priority': {
            'type': 'integer',
            'min': 1,
            'required': True
        },
        'StatusDescription': {
            'type': 'string',
            'required': True,
            'nullable': False
        }
    }

    v = Validator(schema)
    valid_rows = []
    invalid_rows = []

    print("\nStarting validation...")
    for index, row in df.iterrows():
        record = row.to_dict()
        if v.validate(record):
            valid_rows.append(row)
        else:
            print(f"Row {index} invalid: {v.errors}")
            invalid_rows.append((index, row, v.errors))

    valid_df = pd.DataFrame(valid_rows)
    print(f"\nValidation complete. Valid records: {len(valid_df)}")
    print(f"Invalid records: {len(invalid_rows)}")

    if invalid_rows:
        print("\nInvalid rows details:")
        for idx, row, errors in invalid_rows:
            print(f"Row {idx}: {row.to_dict()}")
            print(f"Errors: {errors}\n")
except FileNotFoundError:
    print("Error: 'input_data.csv' not found. Please create the CSV file in the working directory.")
except pd.errors.EmptyDataError:
    print("Error: 'input_data.csv' is empty.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
Save and exit.
Step 6: Run the Integrated Workflow
Make sure environment is active, then run:
python data_mapping_validation_workflow.py
You will see:
- Confirmation that the CSV file loaded and its content preview.
- Confirmation that the mapped StatusDescription column was added.
- Validation progress with details on any invalid rows and error messages.
- A summary count of valid and invalid records.
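With the sample input_data.csv above, the first three rows pass and the last two fail, so the run should end with something like this (exact error wording comes from Cerberus and may vary between versions):
Row 3 invalid: {'StatusCode': ['unallowed value X'], 'StatusDescription': ['must be of string type']}
Row 4 invalid: {'Priority': ['min value is 1']}

Validation complete. Valid records: 3
Invalid records: 2
followed by the detailed dump of the two invalid rows.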
Why Adding Error Handling and Feedback Matters
- Error messages help quickly identify missing files, empty data, or logic issues.
- Progress prints provide a clear picture of the workflow’s state so you know what happens at each step.
- Validation feedback shows exactly which rows fail and why, enabling effective data correction.
- Together, these practices make your data pipeline robust, transparent, and easier to maintain, increasing trust in data integrity.
Why This Matters for Data Quality and Trust
- Mapping makes your data more understandable by replacing obscure codes with descriptive names. This helps clarify data context to users and systems.
- Validation enforces rules like correct formats, required fields, and acceptable values, catching errors before they propagate downstream. This prevents corrupt or misleading data from impacting decisions.
- Separating valid and invalid records lets you review data quality issues and maintain reliable pipelines (see the sketch below for one way to persist the two groups).
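As one illustration of that last point, the workflow script could be extended after its validation loop to write the two groups to disk. This is a minimal sketch, not part of the original script, and the file names valid_output.csv and invalid_output.csv are just illustrative choices:
# Append inside the try block of data_mapping_validation_workflow.py,
# after the validation loop has filled valid_rows and invalid_rows.
valid_df.to_csv('valid_output.csv', index=False)  # clean records for downstream use

# Quarantine failed records together with their error messages for review
invalid_df = pd.DataFrame(
    [{**row.to_dict(), 'errors': str(errors)} for _, row, errors in invalid_rows]
)
invalid_df.to_csv('invalid_output.csv', index=False)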
References
- Pandas Documentation: data manipulation with Python, including mapping and transformations
- Cerberus: official documentation and usage examples for schema-based data validation in Python
- Nano Editor Manual: guide to creating and editing files in the Nano text editor
- Python Virtual Environments: best practices for creating isolated environments and managing dependencies (https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)
- Effective Data Validation Techniques: blog on strategies for validating data to ensure quality and integrity
- Python Mapping and Transformation Techniques: tutorial on using Python map functions and pandas for data transformations
