Generating ETL Test Data with Python Faker

Generating realistic test data is crucial for developing and testing ETL (Extract, Transform, Load) pipelines. However, creating comprehensive datasets manually can be time-consuming and error-prone. This tutorial introduces you to the Python Faker library, a powerful tool that lets you easily generate synthetic data that looks and behaves like real-world data.

Using Faker, you can quickly create large volumes of customer records including names, emails, addresses, dates of birth, and other relevant fields. This generated data can be used as reliable source data for ETL processes. We will also cover how to apply simple transformation logic, validate the data to ensure correctness, and export it to CSV files for downstream testing or integration.

By following this step-by-step guide, you will build a reusable workflow to automate ETL testing with realistic sample data, helping improve data quality and pipeline robustness while saving development time.

Step 1: Create a Python Virtual Environment Named faker-test

mkdir etl_test_project
cd etl_test_project
python3 -m venv faker-test

Activate the environment:

  • macOS/Linux:
    source faker-test/bin/activate
  • Windows:
    .\faker-test\Scripts\activate

Step 2: Install Dependencies

pip install faker pandas

Step 3: Create the Python Test Script from Terminal

Use a terminal text editor like nano or vim to create the file.

Using nano:

nano generate_test_data.py

This opens the nano editor. Paste the following Python code into the editor:

from faker import Faker
import pandas as pd
from datetime import datetime

fake = Faker()

def generate_source_data(num_records=100):
    data = []
    for _ in range(num_records):
        record = {
            "customer_id": fake.unique.random_int(min=1000, max=9999),
            "name": fake.name(),
            "email": fake.email(),
            "address": fake.address().replace("\n", ", "),
            "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=80).strftime("%Y-%m-%d"),
            "join_date": fake.date_between(start_date='-5y', end_date='today').strftime("%Y-%m-%d"),
        }
        data.append(record)
    return pd.DataFrame(data)

def transform_data(df):
    df = df.copy()
    today = datetime.today()
    # Approximate age by year difference only (ignores whether the
    # birthday has occurred yet this year; validate_data applies the
    # same rule, so the two stay consistent)
    df['age'] = df['date_of_birth'].apply(lambda dob: today.year - datetime.strptime(dob, "%Y-%m-%d").year)
    df['email'] = df['email'].str.lower()
    return df

def validate_data(df):
    errors = []
    for idx, row in df.iterrows():
        dob_year = int(row['date_of_birth'].split("-")[0])
        calculated_age = datetime.today().year - dob_year
        if row['age'] != calculated_age:
            errors.append(f"Age mismatch at index {idx}")
        if row['email'] != row['email'].lower():
            errors.append(f"Email case error at index {idx}")

    if errors:
        print("Validation errors found:")
        for error in errors:
            print(error)
    else:
        print("All validations passed!")

if __name__ == "__main__":
    source_df = generate_source_data(10)
    print("Source Data Sample:")
    print(source_df.head())

    transformed_df = transform_data(source_df)
    print("\nTransformed Data Sample:")
    print(transformed_df.head())

    validate_data(transformed_df)

    # Export source data to CSV
    source_df.to_csv("source_data.csv", index=False)
    print("Source data exported to source_data.csv")

    # Export transformed data to CSV
    transformed_df.to_csv("transformed_data.csv", index=False)
    print("Transformed data exported to transformed_data.csv")

To save and exit nano, press:

  • Ctrl + O then Enter (to save)
  • Ctrl + X (to exit)

Step 4: Run the Python Script from Terminal

Run the script while your virtual environment is activated:

python generate_test_data.py

You will see output showing the source data, the transformed data, and the validation results. The script also exports two CSV files to your etl_test_project directory:

  • source_data.csv
  • transformed_data.csv
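To sanity-check the exports, you can read a file back with pandas and compare shapes. The snippet below is a self-contained sketch of that round trip using a tiny illustrative frame; in practice you would read the source_data.csv written by the script.

```python
import pandas as pd

# Round-trip check: write a small frame, read it back, compare shapes
df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "email": ["a@example.com", "b@example.com"],
})
df.to_csv("source_data.csv", index=False)

reloaded = pd.read_csv("source_data.csv")
assert reloaded.shape == df.shape
print(f"{len(reloaded)} rows round-tripped successfully")
```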


Summary

  • Create your project folder and virtual environment (faker-test).
  • Activate the environment and install required packages.
  • Use nano or vim from terminal to create generate_test_data.py.
  • Paste the ETL testing code into the file and save it.
  • Run the script with Python to test and view generated data.
  • Review the exported CSV files, source_data.csv and transformed_data.csv.
