Make GDPR-safe backups for WooCommerce

WordPress, GDPR, GitHub Actions, Continuous Integration

A client of mine runs WooCommerce on a WordPress site, with a lot of orders going through it. One thing I bumped my head against for a long time was how to create GDPR-safe backups that I can use in my local development environment.

The challenge with WordPress

The hard part with WordPress is that many things resolve back to a post: a post is a post (duh), but a page is also a post, and so are WooCommerce products, and even orders. This means that the data for all of these types is saved in two tables: wp_posts and wp_postmeta. That makes for a pretty flexible programming experience: adding extra data to any of these types is easy. But getting this data safely out of the database is a different story.
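
You can see this in the database itself: counting rows per post type shows orders (shop_order) sitting right next to pages and products. A quick check, assuming the default wp_ table prefix:

-- Orders share a table with pages, products, and regular posts
SELECT post_type, COUNT(*) AS total
FROM wp_posts
GROUP BY post_type
ORDER BY total DESC;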

Smile GdprDump

After investigating some tools, I started to build my own tool for this. But as I was inspecting how other tools did the database dumping part, I figured out I could use an existing tool for this purpose: Smile GdprDump.

This tool lets you configure how dumps are created. There are several options: dump only part of a table, use Faker to anonymize data, or dump only the table structure without the data.
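
As an illustration, anonymizing a column with Faker looks roughly like this. This is a sketch based on the converter examples in the gdpr-dump documentation, not part of my final configuration:

tables:
  wp_users:
    converters:
      user_email:
        converter: faker
        parameters:
          formatter: safeEmail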

In my case I didn't need converters: I removed the order data from the dump entirely, as I don't need the orders on my local system.

Installation is simple; just require it through Composer:

composer require --dev smile/gdpr-dump

Configuration

I ended up with this configuration:

---
database:
  host: '%env(DB_HOST)%'
  user: '%env(DB_USER)%'
  password: '%env(DB_PASSWORD)%'
  name: '%env(DB_NAME)%'

dump:
  output: 'wordpress-development-dump.sql.gz'
  compress: 'gzip'

tables:
  wp_comments:
    truncate: true

  wp_wc_customer_lookup:
    truncate: true

  wp_woocommerce_order_itemmeta:
    truncate: true

  wp_posts:
    filters:
      - ['post_type', 'in', ['nav_menu_item', 'page', 'post', 'product', 'product_variation']]

  wp_postmeta:
    filters:
      - ['post_id', 'in', 'expr: select ID from wp_posts where post_type IN ("nav_menu_item", "page", "post", "product", "product_variation")']

  wp_users:
    filters:
      - ['user_email', 'in', 'expr: select user_email from wp_users where user_email like "%@controlaltdelete.nl"']

  wp_usermeta:
    filters:
      - ['user_id', 'in', 'expr: select ID from wp_users where user_email like "%@controlaltdelete.nl"']

This does a few things:

  • It creates a gzipped dump file named wordpress-development-dump.sql.gz.

  • It truncates wp_comments, wp_wc_customer_lookup, and wp_woocommerce_order_itemmeta. This means the table structure is included in the dump, but the data is not.

  • For wp_posts and wp_postmeta it only dumps rows with post type nav_menu_item, page, post, product, or product_variation. All other post types, including orders, are skipped.

  • For wp_users and wp_usermeta it only dumps the data for users whose email address ends in @controlaltdelete.nl. All other users are skipped.

It is always a good idea to check the project you are working on; there may be more privacy-sensitive data in it. I like to search for @gmail.com in the dump to find extra tables that need attention.
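
Since the dump is gzipped, zgrep works well for this:

# Search the compressed dump for real-looking email addresses;
# any hits point at tables that still need filtering or truncating
zgrep "@gmail.com" wordpress-development-dump.sql.gz | head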

Running the GDPR dumper

I have created a little bash script that runs this for me. This is the content of that file:

# Database credentials, read by gdpr-dump through the %env()% placeholders in the config
export DB_HOST=db
export DB_USER=wordpress
export DB_PASSWORD=wordpress
export DB_NAME=wordpress

# Run the dumper with the configuration shown above
/var/www/html/vendor/bin/gdpr-dump /var/www/html/gdpr-dump/wordpress.yml

I'm running all my projects in Docker, so the path is always the same. Calling it looks like this:

sh create-gdpr-safe-backup.sh

Automating the GDPR dump

Now that this works, we can take it a step further. I've created a GitHub Actions workflow that takes these steps:

  • Download a backup from Amazon S3. This is an automated backup that is created daily.

  • Start a MySQL server (within GitHub Actions), and import this dump into it.

  • Run the GDPR dump program.

  • Upload the result back to S3 in a separate folder.

The file looks like this:

name: Create development dump
on:
  workflow_dispatch:
  schedule:
    - cron: '33 9 * * 1'

jobs:
  create-development-dump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1

      - name: Download latest backup
        run: |
          BUCKET="s3://supersecret.com-backup/"
          OBJECT="$(aws s3 ls $BUCKET --recursive | grep ".zip" | sort | tail -n 1 | awk '{print $4}')"
          aws s3 cp s3://supersecret.com-backup/$OBJECT gdpr-dump/ --no-progress
          unzip gdpr-dump/$OBJECT supersecret.com.sql -d gdpr-dump/backup/

      - name: Start mysql & import database
        run: |
          sudo /etc/init.d/mysql start
          mysql -e 'CREATE DATABASE wordpress;' -uroot -proot
          mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root';" -uroot -proot
          cat gdpr-dump/backup/supersecret.com.sql | mysql --binary-mode -uroot -proot wordpress

      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.0'

      - name: Run Composer Install
        run: |
          composer install --no-interaction --no-progress

      - name: Run GDPR dump
        run: |
          DB_HOST=localhost DB_NAME=wordpress DB_USER=root DB_PASSWORD=root vendor/bin/gdpr-dump gdpr-dump/wordpress.yml

      - name: Upload to S3
        run: aws s3 cp ./wordpress-development-dump.sql.gz s3://supersecret.com-backup/development-dump/

For safety purposes, I have created an IAM user that has read-only access to the full backups and read/write access to the development-dump folder.
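
A minimal sketch of what that IAM policy could look like, assuming the bucket name from the workflow above (this is an illustration, not my exact policy):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::supersecret.com-backup"
    },
    {
      "Sid": "ReadFullBackups",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::supersecret.com-backup/*"
    },
    {
      "Sid": "WriteDevelopmentDump",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::supersecret.com-backup/development-dump/*"
    }
  ]
}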

This workflow runs once a week (Mondays at 09:33 UTC, per the cron expression), as that's more than enough for me.

What's next?

I don't need this automated for development purposes: I don't reset my local database that often. But my end goal is to automate the WordPress (plugin) updates. For that, I need end-to-end tests that can run in the GitHub Actions pipeline. This backup is a big part of that, as it will be imported into the temporary environment before the tests run.
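
As a rough sketch, that future pipeline step could reuse the patterns from the workflow above; the step name and target database are hypothetical:

      - name: Import development dump
        run: |
          # Fetch the GDPR-safe dump and load it into the test database
          aws s3 cp s3://supersecret.com-backup/development-dump/wordpress-development-dump.sql.gz .
          gunzip wordpress-development-dump.sql.gz
          mysql -uroot -proot wordpress < wordpress-development-dump.sql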

About the author

Michiel Gerritsen

Founder of Control Alt Delete, a Magento agency specialised in testing, CI/CD and checkout integrations. Working with Magento since 2015, and board member of Mage-OS.
