---
author: paul-scanlon
enableTableOfContents: true
createdAt: '2025-02-14T00:00:00.000Z'
updatedOn: '2025-02-14T00:00:00.000Z'
title: How to Create a Reliable Testing Dataset with pg_dump and pg_restore
subtitle: A practical guide to extracting a test dataset from Postgres using pg_dump, pg_restore and psql
---

As your Postgres database grows, you'll likely need a way to generate a smaller, 'good-enough' dataset that preserves the structure and referential integrity of production but is better suited for testing. There are several ways to do this, but here's a straightforward approach using `pg_dump`, `pg_restore`, `psql` and GitHub Actions.

## Running partial data dumps inside GitHub Actions

You can run `pg_dump`, `pg_restore`, and `psql` from the command line, but sometimes an automated, reproducible approach is more convenient. To better control when data dumps occur, I use a [scheduled GitHub Action](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#schedule) to export data from my production database and restore it to a testing database.

This method works across different Postgres database providers, but if you're looking for a cost-effective testing environment, consider trying Neon. Check out our [getting started guide](/docs/get-started/signing-up#sign-up) to see how easy it is to set up.

## What is a scheduled GitHub Action?

A scheduled GitHub Action runs automatically at a time you define. Since you're dumping data from a production database, you'll likely want to run this job when the system isn't under heavy load, typically outside of business hours. I usually schedule these jobs for midnight, but since I don't want to stay up that late, a scheduled GitHub Action takes care of it while I sleep.

## Getting started with GitHub Actions

To create a GitHub Action, you'll need a GitHub repository to store it. If you don't have one yet, create one now and clone it to your machine for local development.

In the root of your project, create a `.github` directory. Inside it, add another directory called `workflows`. Then, within `workflows`, create a new file named `dump-test-data.yml`, for example:

```
.github
  |-- workflows
      |-- dump-test-data.yml
```

Now add the following code. There's a lot happening here, so before I get to the `pg_dump`, `pg_restore` and `psql` steps, let me briefly explain what this first part does.

```yml
name: Dump Test Data

on:
  schedule:
    - cron: '0 0 * * *' # Runs at midnight UTC
  workflow_dispatch:

env:
  PROD_DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }} # Production or staging database
  DEV_DATABASE_URL: ${{ secrets.DEV_DATABASE_URL }} # Development or testing database
  PG_VERSION: '17'

jobs:
  dump-and-restore:
    runs-on: ubuntu-latest
    steps:
      - name: Install PostgreSQL
        run: |
          sudo apt update
          yes '' | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
          sudo apt install -y postgresql-${{ env.PG_VERSION }}

      - name: Set PostgreSQL binary path
        run: echo "POSTGRES=/usr/lib/postgresql/${{ env.PG_VERSION }}/bin" >> $GITHUB_ENV
```

### name

This name will appear in the **Actions** section of the GitHub UI. Regardless of what you name your file, this is the name that will be displayed.

### on

This section of the workflow determines when the Action will run. The `schedule` field includes a `cron` expression, which uses [POSIX cron syntax](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html#tag_20_25_07) to specify how often the Action should execute.
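Any standard five-field cron expression works here. For example, if a daily run at midnight doesn't suit your workload, schedules like the following (illustrative values, not part of the workflow above) run the job weekly or monthly instead:

```yml
schedule:
  # Every day at 00:00 UTC (the schedule used in this guide)
  - cron: '0 0 * * *'
  # Every Sunday at 03:00 UTC
  # - cron: '0 3 * * 0'
  # At 01:30 UTC on the first day of every month
  # - cron: '30 1 1 * *'
```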
I've also included the `workflow_dispatch` field, which lets you manually trigger this Action from the GitHub UI—a useful feature while developing and testing the workflow.

### env

This workflow requires two environment variables. The first is the Postgres connection string for the source database, typically your production or staging database. The second is the connection string for the target database, which will serve as your testing database. Both need to use the same version of Postgres.

Both of these variables will also need to be added to your GitHub repository's secrets. To do this, navigate to **Settings** > **Secrets and variables** > **Actions** and add them under **Repository secrets**.

![Screenshot of GitHub repository secrets](/guides/images/reliable-testing-dataset-with-pg-dump-and-pg-restore/screenshot-of-github-respository-secrets.jpg)

The last variable defines the Postgres version to install in the Action environment. Since `pg_dump`, `pg_restore`, and `psql` depend on Postgres, you'll need to install it within the Action—I'll cover this in more detail later. It's also worth noting that the version of Postgres you install here should be the same version used by both your source and target database. In my example, all use [Postgres 17](/blog/postgres-17).

### jobs/steps

The job is named `dump-and-restore`, which will be displayed in the GitHub UI when the Action is running. You can choose any name you prefer.

The first step in the job is to install Postgres. While there are various methods and alternative options available in the [GitHub Marketplace](https://github.com/marketplace?query=Postgres), I prefer to install directly from Apt (Advanced Packaging Tool) for added security, especially since you're providing direct access to your production database.

The next step is to define a variable that is needed when using `pg_dump`, `pg_restore`, and `psql`. This variable is named `POSTGRES` and will be referenced later as `$POSTGRES/pg_dump`.

Before I jump into the dump/restore parts, I'll quickly explain the schema I've used in this example. It's important to note the foreign key relationships between the tables. In my example, the foreign key relationships are as follows:

- The **transactions** table has a foreign key `user_id` that references the `user_id` column in the **users** table. This establishes a relationship where each transaction is linked to a specific user.
- The **transactions** table is linked to the **products** table through the `product_id` foreign key. This establishes a relationship where each transaction is associated with a specific product.

### users

This is the schema used to create the `users` table.

```sql
CREATE TABLE users (
  user_id SERIAL PRIMARY KEY,
  first_name VARCHAR(100) NOT NULL,
  last_name VARCHAR(100) NOT NULL,
  email VARCHAR(150) UNIQUE NOT NULL,
  password_hash VARCHAR(255) NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

### products

This is the schema used to create the `products` table.

```sql
CREATE TABLE products (
  product_id SERIAL PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  description TEXT,
  price DECIMAL(10, 2) NOT NULL,
  stock_quantity INT DEFAULT 0,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

### transactions

This is the schema used to create the `transactions` table.
```sql
CREATE TABLE transactions (
  transaction_id SERIAL PRIMARY KEY,
  user_id INT REFERENCES users(user_id) ON DELETE CASCADE,
  product_id INT REFERENCES products(product_id) ON DELETE CASCADE,
  quantity INT NOT NULL,
  total_amount DECIMAL(10, 2) NOT NULL,
  transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  status VARCHAR(50) NOT NULL CHECK (status IN ('pending', 'completed', 'failed'))
);
```

The **transactions** table in my example relies on data from both the **users** and **products** tables. When performing a partial data dump, it's important that every transaction row can still resolve its `user_id` in the **users** table and its `product_id` in the **products** table. With this in mind, I'll start with the `transactions` table when deciding which data to include in the partial dump.

## Dump and restore partial data

Add the following code after the **Set PostgreSQL binary path** step.

```yml {26-49}
name: Dump Test Data

on:
  schedule:
    - cron: '0 0 * * *' # Runs at midnight UTC
  workflow_dispatch:

env:
  PROD_DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }} # Production or staging database
  DEV_DATABASE_URL: ${{ secrets.DEV_DATABASE_URL }} # Development or testing database
  PG_VERSION: '17'

jobs:
  dump-and-restore:
    runs-on: ubuntu-latest
    steps:
      - name: Install PostgreSQL
        run: |
          sudo apt update
          yes '' | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
          sudo apt install -y postgresql-${{ env.PG_VERSION }}

      - name: Set PostgreSQL binary path
        run: echo "POSTGRES=/usr/lib/postgresql/${{ env.PG_VERSION }}/bin" >> $GITHUB_ENV

      - name: Dump schema
        run: |
          $POSTGRES/pg_dump -Fc --schema-only -f "${{ github.workspace }}/all-schema.bak" "${{ env.PROD_DATABASE_URL }}"

      - name: Dump data
        run: |
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM transactions ORDER BY transaction_id DESC LIMIT 50) TO '${{ github.workspace }}/transactions-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM products WHERE product_id IN (SELECT product_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)) TO '${{ github.workspace }}/products-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM users WHERE user_id IN (SELECT user_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)) TO '${{ github.workspace }}/users-subset.csv' WITH CSV HEADER"

      - name: Drop tables and schema
        run: |
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "DROP SCHEMA IF EXISTS public CASCADE;"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "CREATE SCHEMA public;"

      - name: Restore schema
        run: |
          $POSTGRES/pg_restore --clean --no-owner --no-acl --if-exists --schema-only -d "${{ env.DEV_DATABASE_URL }}" "${{ github.workspace }}/all-schema.bak"

      - name: Restore data
        run: |
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.users FROM '${{ github.workspace }}/users-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.products FROM '${{ github.workspace }}/products-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.transactions FROM '${{ github.workspace }}/transactions-subset.csv' WITH CSV HEADER"
```

The above code snippet might look a bit complicated at first, but it's actually not that bad—let me break it down for you.

### Dump schema

In this step, I use `pg_dump` to export the entire schema from the production database and save it to the GitHub workspace as a file named `all-schema.bak`, so it can be accessed later by the **Restore schema** step towards the end of the job.
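If you'd like to sanity-check the schema dump locally before wiring it into the workflow, you can run the same command from your terminal. This is just a sketch; it assumes the Postgres 17 client tools are on your `PATH` and that `PROD_DATABASE_URL` is exported in your shell:

```bash
# Dump only the schema, in custom format, to all-schema.bak
pg_dump -Fc --schema-only -f all-schema.bak "$PROD_DATABASE_URL"

# List the archive's contents without restoring anything
pg_restore --list all-schema.bak
```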
The flags used in this step are explained below:

| Flag            | Meaning                                                                           |
| --------------- | --------------------------------------------------------------------------------- |
| `-Fc`           | Dumps the database in a custom format.                                            |
| `--schema-only` | Dumps only the schema (table structures, indexes, constraints) without any data.  |
| `-f`            | Specifies the output file where the schema dump will be stored.                   |

### Dump data

In this step, I use `psql` to query the data. This is the most complex step, involving three SQL queries, each targeting one of the three tables. The queries are as follows:

#### Transactions query

This query selects the 50 most recent **transactions** from the `transactions` table. Depending on your requirements, you can increase the `LIMIT` or modify the query.

```sql
SELECT * FROM transactions ORDER BY transaction_id DESC LIMIT 50
```

The results are saved to the GitHub workspace as a file called `transactions-subset.csv`, which will be used in a later step.

#### Products query

This query selects **products**, but only those with a `product_id` present in the 50 most recent **transactions**:

```sql
SELECT * FROM products WHERE product_id IN (SELECT product_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)
```

The results are saved to the GitHub workspace as a file called `products-subset.csv`, which will be used in a later step.

#### Users query

This query selects **users**, but only those with a `user_id` present in the 50 most recent **transactions**:

```sql
SELECT * FROM users WHERE user_id IN (SELECT user_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)
```

The results are saved to the GitHub workspace as a file called `users-subset.csv`, which will be used in a later step.

### Drop tables and schema

In this step, I use `psql` to drop the schema and create a fresh one. Since this Action runs on a schedule, this cleanup ensures the target database is ready for the new schema and data, avoiding any errors caused by schema changes since the last run. The queries used in this step are explained below:

| Step                                     | Effect                                                    |
| ---------------------------------------- | --------------------------------------------------------- |
| `DROP SCHEMA IF EXISTS public CASCADE;`  | Removes the public schema and everything inside it.       |
| `CREATE SCHEMA public;`                  | Recreates the public schema for a clean restore process.  |

### Restore schema

In this step, I use `pg_restore` to restore the schema from the `all-schema.bak` file, which was saved to the GitHub workspace during the **Dump schema** step. The flags used in this step are explained below:

| Flag            | Meaning                                                                                                         |
| --------------- | ---------------------------------------------------------------------------------------------------------------- |
| `--clean`       | Drops existing database objects before recreating them, ensuring a clean restore.                                |
| `--no-owner`    | Ignores ownership information in the dump file, so restored objects are owned by the user running the restore.   |
| `--no-acl`      | Excludes access control (GRANT/REVOKE) statements from the restore, preventing permission changes.               |
| `--if-exists`   | Ensures that DROP commands (used with `--clean`) only execute if the object exists, preventing errors.           |
| `--schema-only` | Restores only the schema (table structures, indexes, constraints) without inserting any data.                    |
| `-d`            | Specifies the target database to restore into.                                                                   |

### Restore data

In this step, I use `psql` to restore the data to the target database from the `.csv` files generated in the **Dump data** step. The rows are copied in dependency order, **users** and **products** first, then **transactions**, so the foreign key constraints are satisfied.
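As an optional sanity check (not part of the workflow above), you can run a couple of queries against the testing database, for example in the Neon SQL Editor or with `psql`, to confirm the subset restored cleanly and that no transaction references a missing user or product:

```sql
-- Row counts for each restored table
SELECT 'users' AS table_name, COUNT(*) AS row_count FROM users
UNION ALL
SELECT 'products', COUNT(*) FROM products
UNION ALL
SELECT 'transactions', COUNT(*) FROM transactions;

-- Transactions pointing at a missing user or product; this should return zero rows
SELECT t.transaction_id
FROM transactions t
LEFT JOIN users u ON u.user_id = t.user_id
LEFT JOIN products p ON p.product_id = t.product_id
WHERE u.user_id IS NULL OR p.product_id IS NULL;
```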
## Finished

Once the Action completes successfully, your target database will have a fresh test dataset ready for use!

This Action is part of our [Dev/Test use case](/use-cases/dev-test), widely used by Neon customers who face limitations with traditional databases for testing. By using a dedicated Neon database for testing while leaving production environments where they are, developers gain access to Neon's full suite of features, including the [built-in SQL editor](/docs/get-started/query-with-neon-sql-editor), [table explorer](/docs/guides/tables), and [branching](/docs/introduction/branching).

If you'd like to learn more about using Neon for testing, check out our [dev/test use case](/use-cases/dev-test) or contact our [sales team](/contact-sales).