If you are running a PostgreSQL database server, you will surely need to do backups at some point. If you already do, you are probably using the pg_dump and pg_restore combo. These commands work well for small to medium sized databases, but not for big ones. You also cannot run pg_dump continuously, so you can really lose new data if something bad happens between dumps.

Fortunately, PostgreSQL uses a technique called Write-Ahead Logging (WAL) to ensure data integrity. The simple explanation is: for every data change, a log entry has to be written and flushed to permanent storage first. The WAL is therefore a continuous log of changes which can be used to restore the database to a consistent state. (If you disable fsync, you no longer have the guarantee that your database can be recovered to its state before a crash.) Now how do we use this WAL to do continuous backups? PostgreSQL allows us to specify a custom command to run for every completed WAL segment (archive). While this is great, it is still quite complex to set up continuous archiving by hand.

Enter WAL-E

WAL-E is a program designed to perform continuous archiving of PostgreSQL WAL files and base backups.

WAL-E is written in Python and supports most of the major cloud providers. So without further ado, let's jump right into the process of setting it up. Please note that the examples in this article were tested on Ubuntu 14.04 with Python 2.7, WAL-E 9.2 and PostgreSQL 9.4.

Setting up WAL-E

Before installing the Python package we need to install some system dependencies:

apt-get install python-dev python-virtualenv libevent-dev pv lzop daemontools
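
A quick sanity check that the helper tools WAL-E relies on (lzop for compression, pv for rate display and envdir from daemontools) are now available on the PATH:

which pv lzop envdir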

It is recommended to install WAL-E into its own virtualenv in order to prevent dependency version collisions:

virtualenv -p python2.7 /opt/wal-e

Now we just need to install WAL-E using pip, which is located in the virtualenv's bin directory:

/opt/wal-e/bin/pip install wal-e
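
To confirm that the installation worked, you can ask the freshly installed command for its help text, which also lists the available subcommands:

/opt/wal-e/bin/wal-e --help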

In order to set up WAL-E, we need to set some environment variables that configure the backup process. Head over to your Rackspace cloud admin and create a new container (or use the API). Rackspace Cloud Files is powered behind the scenes by OpenStack Swift, which is why the variables below carry the SWIFT prefix. Pick a container name, region and endpoint type; these correspond to the environment variables WALE_SWIFT_PREFIX, SWIFT_REGION and SWIFT_ENDPOINT_TYPE respectively. To avoid bandwidth charges, choose the same region as the cloud resource that will access the files, and use the internalURL endpoint if possible, because access through the publicURL usually incurs traffic charges. To find your SWIFT_TENANT, go to the account settings and look for the Account Number field. The last two variables, SWIFT_USER and SWIFT_PASSWORD, are the login credentials of your account.

WALE_SWIFT_PREFIX="swift://my_container_name"  # don't forget the swift:// prefix
SWIFT_AUTHURL="https://identity.api.rackspacecloud.com/v2.0/"  # just copy this
SWIFT_ENDPOINT_TYPE="internalURL"  # change to publicURL if required
SWIFT_REGION="IAD"  # set your region here
SWIFT_TENANT="my_user_id"  # this is your numeric account id in rackspace
SWIFT_USER="my_user_name"  # this is your login username in rackspace
SWIFT_PASSWORD="my_user_password"  # this is your login password in rackspace

I recommend setting up envdir (part of the daemontools package we installed earlier) for managing these environment variables. The first step is to create a directory that will store our envdir variable files:

mkdir -p /etc/wal-e.d/env

Now, for each variable, create a file named after the variable and put its value on the first line of that file (for example using echo, as shown after the listing below). The directory structure should look similar to this:

/etc/wal-e.d/
└── env
    ├── SWIFT_AUTHURL
    ├── SWIFT_ENDPOINT_TYPE
    ├── SWIFT_PASSWORD
    ├── SWIFT_REGION
    ├── SWIFT_TENANT
    ├── SWIFT_USER
    └── WALE_SWIFT_PREFIX
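
One way to create these files is with echo; the values below are just the placeholders from the example above, so substitute your own. Since SWIFT_PASSWORD ends up on disk in plain text, it is also a good idea to lock down the directory (here assuming the archiver runs as the postgres system user):

echo "swift://my_container_name" > /etc/wal-e.d/env/WALE_SWIFT_PREFIX
echo "https://identity.api.rackspacecloud.com/v2.0/" > /etc/wal-e.d/env/SWIFT_AUTHURL
echo "internalURL" > /etc/wal-e.d/env/SWIFT_ENDPOINT_TYPE
echo "IAD" > /etc/wal-e.d/env/SWIFT_REGION
echo "my_user_id" > /etc/wal-e.d/env/SWIFT_TENANT
echo "my_user_name" > /etc/wal-e.d/env/SWIFT_USER
echo "my_user_password" > /etc/wal-e.d/env/SWIFT_PASSWORD

# restrict access: owned by root, readable by the postgres group only
chown -R root:postgres /etc/wal-e.d
chmod -R o-rwx /etc/wal-e.d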

To test that envdir is picking up our environment variables, you can use the env command (you should see a list of all available environment variables):

/usr/bin/envdir /etc/wal-e.d/env env
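
If the full environment listing is too noisy, you can narrow it down to just the variables we created (a convenience check, nothing more):

/usr/bin/envdir /etc/wal-e.d/env env | grep -E 'SWIFT|WALE'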

Testing connection to Cloud Files

After configuring our WAL-E setup, we can test that it works by running the backup-list command. This command outputs a list of uploaded base backups on success. If anything fails at this point, just stop and go over the setup steps again.

/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-list
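
The list will be empty until the first base backup has been pushed. Instead of waiting for cron, you can trigger one manually; this is a sketch that assumes the default 9.4 data directory and that the command runs as a user able to read it and connect to the server (e.g. postgres):

sudo -u postgres /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-push /var/lib/postgresql/9.4/main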

Setting up cron jobs for periodic backups

Now that we can communicate with Rackspace Cloud Files, we need to set up some periodic jobs. The first command, backup-push, will be responsible for uploading base backups periodically (every Monday in our example). The second command will clean up old base backups and WAL logs; in the example we use the delete retain subcommand to keep the latest 5 base backups and the corresponding WAL logs. Note that we also need to pass the --confirm option in order to actually delete the files (it runs in dry-run mode by default).

# create a base backup every Monday
0 0 * * 1 /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-push /var/lib/postgresql/9.4/main

# clean up old backups every day (keep the last 5)
0 0 * * * /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e delete --confirm retain 5
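
These entries are in plain crontab format (no user column), so they belong in the crontab of a user that can read both the data directory and the envdir files. One way to install them, assuming the jobs should run as the postgres system user:

sudo crontab -u postgres -e   # paste the two lines above and save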

Updating PostgreSQL config

The last missing bit is the actual PostgreSQL configuration. The %p in the archive command is replaced by the path of the completed WAL segment file every time the command is called.

# enable WAL archiving
archive_mode = on

# local shell command to execute to archive a completed WAL file segment
archive_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-push %p'

# when this parameter is greater than zero, the server will switch to a new WAL segment file
# whenever this many seconds have elapsed since the last segment file switch
archive_timeout = 60

# determines how much information is written to the WAL
wal_level = hot_standby

The following wal_level values can be set (archive_mode = on requires at least archive):

  • minimal: adds only the information needed to recover from a crash or immediate shutdown
  • archive: adds the logging required for WAL archiving
  • hot_standby: further adds the information required to run read-only queries on a standby server
  • logical: adds the information necessary to support logical decoding

After updating the config, you will need to restart the server (a reload is not enough, since archive_mode and wal_level only change on restart). Right after the restart, head over to your postgresql.log file and look for archiving messages. If everything is working you should see lines similar to these:

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-push".
        STRUCTURED: time=2016-05-16T12:33:36.413415-00 pid=17223
wal_e.worker.upload INFO     MSG: begin archiving a file
        DETAIL: Uploading "pg_xlog/00000001000005C80000002A" to "swift://my_container_name/wal_005/00000001000005C80000002A.lzo".
        STRUCTURED: time=2016-05-16T12:33:36.469265-00 pid=17223 action=push-wal key=swift://my_container_name/wal_005/00000001000005C80000002A.lzo prefix= seg=00000001000005C80000002A state=begin
wal_e.worker.upload INFO     MSG: completed archiving to a file
        DETAIL: Archiving to "swift://my_container_name/wal_005/00000001000005C80000002A.lzo" complete at 3105.82KiB/s.
        STRUCTURED: time=2016-05-16T12:33:37.525434-00 pid=17223 action=push-wal key=swift://my_container_name/wal_005/00000001000005C80000002A.lzo prefix= rate=3105.82 seg=00000001000005C80000002A state=complete
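
On Ubuntu 14.04 the restart and the log check might look like this (assuming the default main cluster and the stock log location):

service postgresql restart
tail -f /var/log/postgresql/postgresql-9.4-main.log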

Be aware that if your archive_command keeps failing, PostgreSQL will hold on to the unarchived WAL segments in pg_xlog, so a long outage can eventually fill up the disk of the database server (the wal_keep_segments parameter only controls how many extra segments are kept around for streaming replication standbys, so it will not help here). You also need to be sure that your cron cleanup job is working, otherwise you can quickly run out of space in the container. Either way, you should set up some sort of monitoring to detect failures early and be able to react accordingly. You now have a working continuous WAL archiving setup, and it can easily be tweaked to support replication to another PostgreSQL server.
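
To actually use the backups, for example when restoring onto a fresh server or seeding a standby, WAL-E provides backup-fetch and wal-fetch as counterparts to the push commands. A rough sketch, assuming an empty 9.4 data directory on the target machine and the same envdir configuration:

# download the latest base backup into the empty data directory
/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-fetch /var/lib/postgresql/9.4/main LATEST

# recovery.conf in the data directory then tells PostgreSQL to replay WAL from the archive
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "%f" "%p"'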