Restoring Through timeline change

Β· 968 words Β· 5 minute read

I always heard that once you had performed either a switchover or a failover, you needed to create a fresh backup of your Postgres cluster. I suppose that’s why I assumed you couldn’t PITR through a timeline change. Today, I learned you could PITR through a timeline change!

Some context πŸ”—

Let’s assume you have a queen (elinor) and a princess (merida). If you’re not familiar with the queen/princess/worker terms, please read Tapoueh’s blog post about that.

Create the backup and add some data πŸ”—

First, you create a simple backup of your Postgres cluster, using pg_basebackup:

elinor:$ pg_basebackup -D /var/lib/postgresql/backup -Ft -z -P

And send it to the princess, just in case:

elinor:$ rsync -r /var/lib/postgresql/backup merida:/var/lib/postgresql/backup

You create a simple test table on elinor:

test=# create table test(id integer generated always as identity, value text);
CREATE TABLE

And add some data in your table:

test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1
test=# insert into test(value) values ('a');
INSERT 0 1

You can query those data on merida (the princess):

test=# select * from test;
 id | value 
 ----+-------
  1 | a
  2 | a
  3 | a
(3 rows)

Promote the princess πŸ”—

So, you shut down the queen and promote the princess:

elinor:$ pg_ctlcluster 12 main stop

merida:$ pg_ctlcluster 12 main promote

And then you add some rows in your test table from your new queen (merida):

test=# insert into test(value) values ('b');
INSERT 0 1
test=# insert into test(value) values ('b');
INSERT 0 1

Restore the backup πŸ”—

Now, let’s stop merida and wipe the cluster clean.

merida:$ pg_dropcluster 12 main --stop

Let’s restore our backup:

merida$: mkdir /var/lib/postgresql/12/main
merida$: tar xzf base.tar.gz -C /var/lib/postgresql/12/main
merida$: touch /var/lib/postgresql/12/main/recovery.signal
merida$: chown -R postgres:postgres /var/lib/postgresql/12/main

And we just need to start Postgres:

merida:$ pg_ctlcluster 12 main start

Are my data back?

test=# select * from test;
 id | value 
----+-------
  1 | a
  2 | a
  3 | a
  4 | b
  5 | b
(5 rows) 

Yes! Everything is back. No data loss!

What did Postgres? πŸ”—

Here is my starting logfile (yes, my prefix should be better than that):

2020-03-24 13:27:36.835 UTC [7826] LOG:  starting PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
2020-03-24 13:27:36.835 UTC [7826] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-03-24 13:27:36.835 UTC [7826] LOG:  listening on IPv6 address "::", port 5432
2020-03-24 13:27:36.842 UTC [7826] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-24 13:27:36.869 UTC [7827] LOG:  database system was interrupted; last known up at 2020-03-24 10:07:35 UTC
2020-03-24 13:27:36.889 UTC [7827] LOG:  restored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.891 UTC [7827] LOG:  starting archive recovery
2020-03-24 13:27:36.894 UTC [7827] LOG:  restored log file "00000002.history" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000E" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:36.955 UTC [7827] LOG:  restored log file "00000001000000000000000E" from archive
2020-03-24 13:27:36.985 UTC [7827] LOG:  redo starts at 0/E000028
2020-03-24 13:27:36.994 UTC [7827] LOG:  consistent recovery state reached at 0/E000100
2020-03-24 13:27:36.995 UTC [7826] LOG:  database system is ready to accept read only connections
rsync: link_stat "/var/lib/postgresql/wals/00000002000000000000000F" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.046 UTC [7827] LOG:  restored log file "00000001000000000000000F" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000010" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.120 UTC [7827] LOG:  restored log file "000000010000000000000010" from archive
2020-03-24 13:27:37.198 UTC [7827] LOG:  restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/000000020000000000000012" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.211 UTC [7827] LOG:  redo done at 0/11000400
2020-03-24 13:27:37.211 UTC [7827] LOG:  last completed transaction was at log time 2020-03-24 10:09:38.087444+00
2020-03-24 13:27:37.274 UTC [7827] LOG:  restored log file "000000020000000000000011" from archive
rsync: link_stat "/var/lib/postgresql/wals/00000003.history" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
2020-03-24 13:27:37.288 UTC [7827] LOG:  selected new timeline ID: 3
2020-03-24 13:27:37.314 UTC [7827] LOG:  archive recovery complete
2020-03-24 13:27:37.317 UTC [7827] LOG:  restored log file "00000002.history" from archive
2020-03-24 13:27:37.382 UTC [7826] LOG:  database system is ready to accept connections
  1. Postgres found out there was a timeline change (with the 00000002.history file
  2. Postgres found out there was no 00000003.history file, meaning the current timeline was 2 (so that at the end of this recovery its timeline number will be 3)
  3. Postgres looked for the wal file 000000000000000E in timeline 2 but couldn’t find it
  4. Postgres looked for the wal file 000000000000000E in timeline 1, found it, and restored it, reaching the recovery point 0/E000100
  5. Postgres continue restoring with wal files 00000001000000000000000F, 000000010000000000000010 and 000000020000000000000011
  6. Postgres can’t find file 000000020000000000000012
  7. Postgres has finished recovering
  8. Postgres changes the timeline number
  9. Database is ready!

Why do we need to back up after a promotion? πŸ”—

First, a failover means an incident in production. It’s always a good move to backup if you think something fishy happened. If it’s a switchover, it means you’ll change things and again backup is still a good move.

Second, I don’t count how many times I’ve seen no archive_command settings on the princess and if your princess dies and you can’t get the .history files which tells Postgres that there was a timeline change and without those files, you won’t be able to recover your data.

So stay safe, backup your databases (and test your backups)!