I was born the son of a lawless man
Always spoke my mind with a gun in my hand
Lived nine lives
Gunned down ten
Gonna ride like the wind
Ride Like The Wind, Christopher Cross
Recently, I upgraded the Groups.io database from postgres version 9.6 to 14, using pglogical to do so. These are my rough notes, mainly for myself to refer to the next time I need to do an upgrade, but perhaps they will be useful to others as well.
Postgres major versions are supported for 5 years. We had been on 9.6 since it was released; support ended a year ago. Upgrading Postgres major versions can be done in place, with a bit of downtime. But I wanted to be able to test the upgrade before switching over, and an in place upgrade would not allow that easily. Enter pglogical, an extension that streams SQL commands between different Postgres instances. That means that you can sync different major versions of Postgres.
The plan was to create a new Postgres 14 cluster, and replicate to it from our main cluster using pglogical. We could then test the new cluster over several weeks, by routing read-only production traffic to it, and see how it behaved in production. Once we were convinced that the new cluster was behaving well, we would have a short downtime and switch production writes over to it, and retire the old cluster. In the end, this worked out very well for us.
Setting Up Pglogical
Install the pglogical extension on your current database and make sure it’s included in the shared_preload_libraries line of postgresql.conf.
Set up a new Postgres 14 instance, also with pglogical installed.
Ensure that the 14 instance is listed in the 9.6 database’s pg_hba.conf file with replication privileges.
Run the following on the 14 instance for each database, to get the database schema:
This creates the sync link between databases and starts the sync process, and waits until the initial sync is complete. You can monitor the sync process via the log on the 14 instance.
At this point, I replicated the 14 instance to create the new production cluster. I then gradually moved read-only production queries over to it and monitored how it behaved over the course of a couple of weeks.
One thing that surprised me is that autovacuum will not be triggered on the 14 cluster and none of the database stats that you can monitor related to dead tuples will be updated. This is ok. Once you break the pglogical sync, autovacuum will proceed as expected.
We use logical replication to stream database changes to our elasticsearch cluster. Our setup is detailed here, and it’s worked great over the years. Unfortunately, the logical slots used for this are not replicated using pglogical, so you have to recreate them on the 14 instance. This is not difficult, but there was one thing that tripped me up.
I created a small utility to recreate the logical slots on the new instance and used it to recreate the slots on the 14 instance. When connecting to a logical slot, you provide a starting LSN to begin replication. What tripped me up was that I was initially using the LSN from the 9.6 instance, and was seeing no replication at all. But LSNs are different on the new instance. So when initially connecting to the logical slot on the new instance, use 0 for the starting LSN.
I wrote an upgrade checklist for things to do to move to the new instance. Once I was confident in the new cluster, I announced a one hour downtime on the site.
Final Production Cutover
Bring up a test web server that talks to the new cluster, and ensure that it works.
Take the site down.
Turn off pgbouncer to ensure no one was talking with the 9.6 database.
On the 9.6 instance you need to do a final sync of all replicated sequences:
SELECT pglogical.synchronize_sequence( seqoid ) FROM pglogical.sequence_state;
I then ran a script that dumped a bunch of values from the 9.6 and 14 instances and compared them, as a sanity check.
Turn off pglogical replication, on the 14 instance:
DROP extension pglogical;
Breaking the sync should cause autovacuum to run immediately; check to make sure it does.
Log into the test web server and make some changes, ensure that they work.
Turn off the 9.6 cluster.
Update the database config files to point to the new 14 cluster.
Bring up the site.
The downtime lasted about half an hour (it was announced as an hour downtime). I took the rest of the day off to decompress (upgrades are nerve wracking!).
Groups.io is the best tool to get a bunch of people organized and sharing knowledge. Start a free trial group today.