I had always heard that once you performed either a switchover or a failover, you needed to create a fresh backup of your Postgres cluster. I suppose that's why I assumed you couldn't PITR through a timeline change. Well, today I learned that you can!
Some context
Let’s assume you have a queen (elinor) and a princess (merida). If you’re not familiar with the queen/princess/worker terms, please read Tapoueh’s blog post about that.
Create the backup and add some data
First, you create a simple backup of your Postgres cluster, using pg_basebackup:
elinor:$ pg_basebackup -D /var/lib/postgresql/backup -Ft -z -P
And send it to the princess, just in case:
elinor:$ rsync -r /var/lib/postgresql/backup merida:/var/lib/postgresql/backup
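For the restore later in this post to work, WAL archiving must already be in place. Here is a minimal sketch of the kind of setup I assume, with /var/lib/postgresql/wals matching the archive directory you'll see in the restore logs below; the exact archive_command is an assumption, adapt it to your own environment:
# postgresql.conf on elinor -- a sketch; the archive must be reachable from merida too
archive_mode = on
archive_command = 'rsync -a %p /var/lib/postgresql/wals/%f'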
You create a simple test table on elinor:
test=# create table test(id integer generated always as identity, value text);
CREATE TABLE
And add some data to your table:
test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1
You can query that data on merida (the princess):
test=# select * from test;
id | value
----+-------
1 | a
2 | a
3 | a
(3 rows)
Promote the princess
So, you shut down the queen and promote the princess:
elinor:$ pg_ctlcluster 12 main stop
merida:$ pg_ctlcluster 12 main promote
And then you add some rows to your test table on your new queen (merida):
test=# insert into test(value) values ('b');
INSERT 0 1
test=# insert into test(value) values ('b');
INSERT 0 1
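If you're curious, you can check that the new queen now writes its WAL on timeline 2: the first 8 hexadecimal digits of a WAL file name are the timeline ID. A quick way to see it (nothing specific to this setup):
test=# select left(pg_walfile_name(pg_current_wal_lsn()), 8) as timeline;
 timeline
----------
 00000002
(1 row)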
Restore the backup
Now, let’s stop merida and wipe the cluster clean.
merida:$ pg_dropcluster 12 main --stop
Let’s restore our backup:
merida:$ mkdir /var/lib/postgresql/12/main
merida:$ tar xzf base.tar.gz -C /var/lib/postgresql/12/main
merida:$ touch /var/lib/postgresql/12/main/recovery.signal
merida:$ chown -R postgres:postgres /var/lib/postgresql/12/main
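One detail that's easy to forget: recovery.signal alone is not enough. Since Postgres 12, the recovery settings live in postgresql.conf, and the cluster needs a restore_command able to fetch the archived WAL and .history files. Something along these lines (a sketch; the exact command depends on your archiving setup, the path here simply mirrors the rsync calls in the logs below):
# postgresql.conf on merida -- assumed restore_command, matching the archive used above
restore_command = 'rsync -a /var/lib/postgresql/wals/%f %p'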
And we just need to start Postgres:
merida:$ pg_ctlcluster 12 main start
Is my data back?
test=# select * from test;
id | value
----+-------
1 | a
2 | a
3 | a
4 | b
5 | b
(5 rows)
Yes! Everything is back. No data loss!
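Note also that the restored cluster is not on timeline 2 anymore: archive recovery ended by selecting a brand new timeline, as you can check from SQL (pg_control_checkpoint() has been available since Postgres 9.6):
test=# select timeline_id from pg_control_checkpoint();
 timeline_id
-------------
           3
(1 row)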
What did Postgres do?
Here is the logfile from that startup (yes, my log_line_prefix should be better than that):
2020-03-24 13:27:36.835 UTC [7826] LOG: starting PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
2020-03-24 13:27:36.835 UTC [7826] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-03-24 13:27:36.835 UTC [7826] LOG: listening on IPv6 address "::", port 5432
2020-03-24 13:27:36.842 UTC [7826] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-24 13:27:36.869 UTC [7827] LOG: database system was interrupted; last known up at 2020-03-24 10:07:35 UTC
2020-03-24 13:27:36.889 UTC [7827] LOG: restored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.891 UTC [7827] LOG: starting archive recovery
2020-03-24 13:27:36.894 UTC [7827] LOG: restored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000E" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.955 UTC [7827] LOG: restored log file "00000001000000000000000E" from archive
2020-03-24 13:27:36.985 UTC [7827] LOG: redo starts at 0/E000028
2020-03-24 13:27:36.994 UTC [7827] LOG: consistent recovery state reached at 0/E000100
2020-03-24 13:27:36.995 UTC [7826] LOG: database system is ready to accept read only connections
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000F" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.046 UTC [7827] LOG: restored log file "00000001000000000000000F" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000010" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.120 UTC [7827] LOG: restored log file "000000010000000000000010" from archive
2020-03-24 13:27:37.198 UTC [7827] LOG: restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000012" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.211 UTC [7827] LOG: redo done at 0/11000400
2020-03-24 13:27:37.211 UTC [7827] LOG: last completed transaction was at log time 2020-03-24 10:09:38.087444+00
2020-03-24 13:27:37.274 UTC [7827] LOG: restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.288 UTC [7827] LOG: selected new timeline ID: 3
2020-03-24 13:27:37.314 UTC [7827] LOG: archive recovery complete
2020-03-24 13:27:37.317 UTC [7827] LOG: restored log file "00000002.history" from archive
2020-03-24 13:27:37.382 UTC [7826] LOG: database system is ready to accept connections
- Postgres found out there was a timeline change (with the 00000002.history file)
- Postgres found out there was no 00000003.history file, meaning the current timeline was 2 (so that at the end of this recovery its timeline number will be 3)
- Postgres looked for the WAL file 000000000000000E in timeline 2 but couldn't find it
- Postgres looked for the WAL file 000000000000000E in timeline 1, found it, and restored it, reaching the recovery point 0/E000100
- Postgres continued restoring with the WAL files 00000001000000000000000F, 000000010000000000000010 and 000000020000000000000011
- Postgres couldn't find the file 000000020000000000000012
- Postgres finished recovering
- Postgres changed the timeline number
- Database is ready!
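If you have never looked at one, a timeline history file is just a tiny text file: one line per timeline switch, giving the parent timeline ID, the LSN where the switch happened, and a reason. Something like this (the LSN and reason below are illustrative, not copied from this cluster):
merida:$ cat /var/lib/postgresql/wals/00000002.history
1	0/11000510	no recovery target specified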
Why do we need to back up after a promotion?
First, a failover means there was an incident in production. It's always a good move to take a backup when you think something fishy happened. If it's a switchover, it means you're about to change things, and a backup is still a good move.
Second, I can't count how many times I've seen no archive_command set on the princess. In that situation, if your princess dies, you can't get back the .history files that tell Postgres there was a timeline change, and without those files you won't be able to recover your data.
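A simple way to reduce that risk is to configure archiving on the princess too, for example (a sketch, reusing the same archive directory as earlier in this post; adapt to your own setup):
# postgresql.conf on merida -- assumed settings
archive_mode = always
archive_command = 'rsync -a %p /var/lib/postgresql/wals/%f'
With archive_mode = always, the princess archives the WAL she receives even while she is still a standby, and once promoted she archives her new .history file like any other WAL file.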
So stay safe, backup your databases (and test your backups)!