Generating ETL Test Data with Python Faker

Generating realistic test data is crucial for developing and testing ETL (Extract, Transform, Load) pipelines. However, creating comprehensive datasets manually can be time-consuming and error-prone. This tutorial introduces you to the Python Faker library, a powerful tool that lets you easily generate synthetic data that looks and behaves like real-world data.

Using Faker, you can quickly create large volumes of customer records including names, emails, addresses, dates of birth, and other relevant fields. This generated data can be used as reliable source data for ETL processes. We will also cover how to apply simple transformation logic, validate the data to ensure correctness, and export it to CSV files for downstream testing or integration.

By following this step-by-step guide, you will build a reusable workflow to automate ETL testing with realistic sample data, helping improve data quality and pipeline robustness while saving development time.

Step 1: Create a Python Virtual Environment Named faker-test

mkdir etl_test_project
cd etl_test_project
python3 -m venv faker-test

Activate the environment:

  • macOS/Linux:
    source faker-test/bin/activate
  • Windows:
    .\faker-test\Scripts\activate

Step 2: Install Dependencies

pip install faker pandas

Step 3: Create the Python Test Script from Terminal

Use a terminal text editor like nano or vim to create the file.

Using nano:

nano generate_test_data.py

This opens the nano editor. Paste the following Python code into the editor:

from faker import Faker
import pandas as pd
from datetime import datetime

fake = Faker()

def generate_source_data(num_records=100):
    data = []
    for _ in range(num_records):
        record = {
            "customer_id": fake.unique.random_int(min=1000, max=9999),
            "name": fake.name(),
            "email": fake.email(),
            "address": fake.address().replace("\n", ", "),
            "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=80).strftime("%Y-%m-%d"),
            "join_date": fake.date_between(start_date='-5y', end_date='today').strftime("%Y-%m-%d"),
        }
        data.append(record)
    return pd.DataFrame(data)

def transform_data(df):
    df = df.copy()
    today = datetime.today()
    # Approximate age by year difference only (ignores whether the
    # birthday has occurred yet this year; validate_data applies the
    # same rule, so the two stay consistent)
    df['age'] = df['date_of_birth'].apply(lambda dob: today.year - datetime.strptime(dob, "%Y-%m-%d").year)
    df['email'] = df['email'].str.lower()
    return df

def validate_data(df):
    errors = []
    for idx, row in df.iterrows():
        dob_year = int(row['date_of_birth'].split("-")[0])
        calculated_age = datetime.today().year - dob_year
        if row['age'] != calculated_age:
            errors.append(f"Age mismatch at index {idx}")
        if row['email'] != row['email'].lower():
            errors.append(f"Email case error at index {idx}")

    if errors:
        print("Validation errors found:")
        for error in errors:
            print(error)
    else:
        print("All validations passed!")

if __name__ == "__main__":
    source_df = generate_source_data(10)
    print("Source Data Sample:")
    print(source_df.head())

    transformed_df = transform_data(source_df)
    print("\nTransformed Data Sample:")
    print(transformed_df.head())

    validate_data(transformed_df)

    # Export source data to CSV
    source_df.to_csv("source_data.csv", index=False)
    print("Source data exported to source_data.csv")

    # Export transformed data to CSV
    transformed_df.to_csv("transformed_data.csv", index=False)
    print("Transformed data exported to transformed_data.csv")

To save and exit nano, press:

  • Ctrl + O then Enter (to save)
  • Ctrl + X (to exit)

Step 4: Run the Python Script from Terminal

Run the script while your virtual environment is activated:

python generate_test_data.py

You will see output showing the source data, the transformed data, and the validation results. The script also exports two CSV files to your etl_test_project directory:

  • source_data.csv
  • transformed_data.csv
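To sanity-check the exports, you can read a file back with pandas and compare shapes. The snippet below is a self-contained sketch of that round trip using a tiny illustrative frame; in practice you would read the source_data.csv written by the script.

```python
import pandas as pd

# Round-trip check: write a small frame, read it back, compare shapes
df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "email": ["a@example.com", "b@example.com"],
})
df.to_csv("source_data.csv", index=False)

reloaded = pd.read_csv("source_data.csv")
assert reloaded.shape == df.shape
print(f"{len(reloaded)} rows round-tripped successfully")
```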


Summary

  • Create your project folder and virtual environment (faker-test).
  • Activate the environment and install required packages.
  • Use nano or vim from terminal to create generate_test_data.py.
  • Paste the ETL testing code into the file and save it.
  • Run the script with Python to test and view generated data.
  • Review the exported CSV files, source_data.csv and transformed_data.csv.
