Data Mapping & Validation with Python

Data mapping and validation are essential steps in every data engineering workflow. Whether integrating sources, transforming data, or migrating between systems, ensuring that information is accurately mapped and rigorously validated is crucial for maintaining data quality and trust across business processes.

This hands-on tutorial walks you through setting up a Python environment, mapping your data fields for clarity and consistency, and applying robust rule-based validation checks, with clear, practical code examples you can use immediately in your own projects.

Step 1: Create Project Folder and Set Up Python Environment

Create and enter a project directory:

mkdir data-mapping-validation
cd data-mapping-validation

Create a virtual environment named data-validation:

python3 -m venv data-validation

Activate:

  • macOS/Linux:

    source data-validation/bin/activate
  • Windows:

    data-validation\Scripts\activate

Confirm that the environment's interpreter is the one in use (on Windows, where python serves the same purpose):

which python

Step 2: Install Dependencies

pip install pandas cerberus
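
If you want the install to be reproducible later, you can optionally record the versions pip resolved and reinstall from that file in another environment (the file name is just a convention):

pip freeze > requirements.txt
pip install -r requirements.txt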

Step 3: Create Python Scripts Using Nano

Data Mapping (step3_data_mapping.py)

Create the file with Nano:

nano step3_data_mapping.py

Paste:

import pandas as pd

try:
    data = {
        'CountryCode': ['US', 'GB', 'FR'],
        'Sales': [1500, 1200, 1300]
    }
    df = pd.DataFrame(data)

    country_map = {
        'US': 'United States',
        'GB': 'United Kingdom',
        'FR': 'France'
    }

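    # Series.map substitutes each code with its dictionary value; codes missing from country_map become NaN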
    df['CountryName'] = df['CountryCode'].map(country_map)

    print("Data mapping result:")
    print(df)

except Exception as e:
    print(f"An error occurred during data mapping: {e}")

Save (Ctrl+O, then Enter) and exit (Ctrl+X).

Run:

python step3_data_mapping.py
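
If everything is set up correctly, the script prints the mapped DataFrame (exact column spacing may differ slightly):

Data mapping result:
  CountryCode  Sales     CountryName
0          US   1500   United States
1          GB   1200  United Kingdom
2          FR   1300          France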

Data Validation (step4_data_validation.py)

Create:

nano step4_data_validation.py

Paste:

import pandas as pd
from cerberus import Validator

try:
    data = {
        'CountryCode': ['US', 'GB', 'FR'],
        'Sales': [1500, 1200, 1300],
        'CountryName': ['United States', 'United Kingdom', 'France']
    }
    df = pd.DataFrame(data)

    schema = {
        'CountryCode': {'type': 'string', 'regex': '^[A-Z]{2}$', 'required': True},
        'Sales': {'type': 'integer', 'min': 0, 'required': True},
        'CountryName': {'type': 'string', 'required': True}
    }

    v = Validator(schema)

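    # Validate each row as a plain dict; v.errors describes the most recent failed document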
    for index, row in df.iterrows():
        record = row.to_dict()
        if v.validate(record):
            print(f"Row {index} is valid.")
        else:
            print(f"Row {index} errors: {v.errors}")

except Exception as e:
    print(f"An error occurred during data validation: {e}")

Save and exit.

Run:

python step4_data_validation.py
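
Because every row in the hard-coded sample satisfies the schema, the output should simply be:

Row 0 is valid.
Row 1 is valid.
Row 2 is valid.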

Step 4: Prepare Input CSV File Manually

Create input_data.csv:

nano input_data.csv

Paste:

StatusCode,Priority
N,3
O,2
C,5
X,1
N,0

Save and exit. Note that the last two rows will not pass the rules defined in the workflow script: 'X' has no entry in the status map, and a Priority of 0 falls below the allowed minimum, so both will be flagged during validation.


Step 5: Integrated Workflow Script With Error Handling (data_mapping_validation_workflow.py)

Create:

nano data_mapping_validation_workflow.py

Paste:

import pandas as pd
from cerberus import Validator

try:
    print("Loading 'input_data.csv'...")
    df = pd.read_csv('input_data.csv')
    print("CSV loaded successfully:")
    print(df.head())

    status_map = {
        'N': 'New',
        'O': 'Open',
        'C': 'Closed'
    }

    # Map StatusCode to descriptions
    df['StatusDescription'] = df['StatusCode'].map(status_map)
    print("\nMapped StatusDescription column added.")

    schema = {
        'StatusCode': {
            'type': 'string',
            'allowed': list(status_map.keys()),
            'required': True
        },
        'Priority': {
            'type': 'integer',
            'min': 1,
            'required': True
        },
        'StatusDescription': {
            'type': 'string',
            'required': True,
            'nullable': False
        }
    }

    v = Validator(schema)
    valid_rows = []
    invalid_rows = []

    print("\nStarting validation...")
    for index, row in df.iterrows():
        record = row.to_dict()
        if v.validate(record):
            valid_rows.append(row)
        else:
            print(f"Row {index} invalid: {v.errors}")
            invalid_rows.append((index, row, v.errors))

    valid_df = pd.DataFrame(valid_rows)

    print(f"\nValidation complete. Valid records: {len(valid_df)}")
    print(f"Invalid records: {len(invalid_rows)}")

    if invalid_rows:
        print("\nInvalid rows details:")
        for idx, row, errors in invalid_rows:
            print(f"Row {idx}: {row.to_dict()}")
            print(f"Errors: {errors}\n")

except FileNotFoundError:
    print("Error: 'input_data.csv' not found. Please create the CSV file in the working directory.")
except pd.errors.EmptyDataError:
    print("Error: 'input_data.csv' is empty.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Save and exit.


Step 6: Run the Integrated Workflow

Make sure the virtual environment is active, then run:

python data_mapping_validation_workflow.py

You will see:

  • Confirmation that the CSV file loaded, along with a preview of its contents.
  • Confirmation that the mapped StatusDescription column was added.
  • Validation progress, with details and error messages for any invalid rows.
  • A summary count of valid and invalid records (a sample run follows below).
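
For the sample CSV above, the run should look roughly like this; the exact Cerberus error wording and DataFrame formatting may vary between versions:

Loading 'input_data.csv'...
CSV loaded successfully:
  StatusCode  Priority
0          N         3
1          O         2
2          C         5
3          X         1
4          N         0

Mapped StatusDescription column added.

Starting validation...
Row 3 invalid: {'StatusCode': ['unallowed value X'], 'StatusDescription': ['must be of string type']}
Row 4 invalid: {'Priority': ['min value is 1']}

Validation complete. Valid records: 3
Invalid records: 2

Invalid rows details:
Row 3: {'StatusCode': 'X', 'Priority': 1, 'StatusDescription': nan}
Errors: {'StatusCode': ['unallowed value X'], 'StatusDescription': ['must be of string type']}

Row 4: {'StatusCode': 'N', 'Priority': 0, 'StatusDescription': 'New'}
Errors: {'Priority': ['min value is 1']}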

Why Adding Error Handling and Feedback Matters

  • Error messages help quickly identify missing files, empty data, or logic issues.
  • Progress prints provide a clear picture of the workflow’s state so you know what happens at each step.
  • Validation feedback shows exactly which rows fail and why, enabling effective data correction.
  • Together, these practices make your data pipeline robust, transparent, and easier to maintain, increasing trust in data integrity.

Why This Matters for Data Quality and Trust

  • Mapping makes your data more understandable by replacing obscure codes with descriptive names. This helps clarify data context to users and systems.
  • Validation enforces rules like correct formats, required fields, and acceptable values, catching errors before they propagate downstream. This prevents corrupt or misleading data from impacting decisions.
  • Separating valid and invalid records lets you review data quality issues and maintain reliable pipelines; a small export sketch follows below.
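
As a minimal sketch of that last point, the helper below (a hypothetical addition, not part of the workflow script) writes the valid rows and a flat error report for the invalid rows to separate CSV files, assuming the valid_rows and invalid_rows lists were built exactly as in Step 5:

import pandas as pd

def export_validation_results(valid_rows, invalid_rows,
                              valid_path='valid_records.csv',
                              invalid_path='invalid_records.csv'):
    # valid_rows: list of pandas Series collected for rows that passed validation
    # invalid_rows: list of (index, row, errors) tuples for rows that failed
    pd.DataFrame(valid_rows).to_csv(valid_path, index=False)

    # Flatten each failed row plus its Cerberus errors into one reviewable record
    report = [{**row.to_dict(), 'Errors': str(errors)} for _, row, errors in invalid_rows]
    pd.DataFrame(report).to_csv(invalid_path, index=False)

    print(f"Wrote {len(valid_rows)} valid and {len(report)} invalid records.")

Calling export_validation_results(valid_rows, invalid_rows) at the end of the try block in Step 5 would produce the two files alongside input_data.csv.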

References

  1. Pandas Documentation — Data manipulation with Python, including mapping and transformations

    https://pandas.pydata.org/docs/

  2. Cerberus: Data validation for Python — Official documentation and usage examples for schema-based validation

    https://docs.python-cerberus.org/en/stable/

  3. Nano Editor Manual — Guide to creating and editing files in the Nano text editor

    https://www.nano-editor.org/docs.php

  4. Python Virtual Environments — Best practices for creating isolated environments and managing dependencies

    https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/

  5. Effective Data Validation Techniques — Blog on strategies for validating data to ensure quality and integrity

    https://www.dataquest.io/blog/data-validation-in-python/

  6. Python Mapping and Transformation Techniques — Tutorial on using Python map functions and pandas for data transformations

    https://realpython.com/python-mappings/

Next

Generating ETL Test Data with Python Faker