If you’re one of my customers, you might know how I insist on monitoring your Postgres logs and digging into them to find precious insights into what’s going on with your Postgres. For a long time now, there has been pgBadger. For PGSQL Phriday #010, Alicja asks us to focus on pgBadger.
You might be surprised to find out I am not using pgBadger. I will explain why later, but rest assured that I do think pgBadger is a good tool. It helps DBAs get better performance and keep track of how their instance is doing before a serious problem shows up.
My problem with Badger is that when I am called by a customer, it’s because they couldn’t figure out what’s wrong by themselves, so it’s not a very simple problem. I found out that Badger was not flexible enough for my very particular use case. On top of that, I recommend my customers output their Postgres logs in CSV so that they are easier to work with afterward. Even though Badger can handle CSV files, it requires an additional Perl library, and setting up the environment is not that easy.
Log settings 🔗
I always recommend that my customers be very greedy with Postgres logging. The more information you get, the more accurate your view of what’s happening.
Here are the typical settings I recommend:
log_destination
: csvlog. You want the flexibility the CSV format gives. On top of that, it automatically gives you all the metadata a log line should have.

logging_collector
: on, if you want any logs at all. It’s mandatory with the csvlog format.

log_min_duration_statement
: for an OLTP workload, I recommend 250ms. If that gives you too many slow queries, set a higher value, fix the slowest queries, and then lower the parameter gradually. For an OLAP workload, the value can be between 1 and 5 seconds.

log_autovacuum_min_duration
: 0. You want to see all autovacuum runs, whatever their duration is.

log_checkpoints
: on. You want to log all checkpoints.

log_connections
: on. You want to see all connections (and connection attempts).

log_disconnections
: on. This will help you understand your transactions if there is no connection pooler. I still recommend setting this parameter to on even if you’re using a connection pooler.

log_lock_waits
: on. You want to log when a query is waiting for a lock to be released.

log_temp_files
: 0. You want all temporary files logged, to build a good analysis of the work_mem setting value.
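Put together, those recommendations look roughly like this in postgresql.conf (the 250ms threshold is the OLTP starting point discussed above; treat every value as a starting point, not a rule):

```ini
# Logging settings discussed above -- adjust thresholds to your workload
log_destination = 'csvlog'          # CSV output: structured, with all the metadata
logging_collector = on              # required for csvlog
log_min_duration_statement = 250    # in ms; OLTP starting point, 1000-5000 for OLAP
log_autovacuum_min_duration = 0     # log every autovacuum run
log_checkpoints = on                # log all checkpoints
log_connections = on                # log every connection (attempt)
log_disconnections = on             # log session ends, even behind a pooler
log_lock_waits = on                 # log queries waiting on locks
log_temp_files = 0                  # log every temporary file
```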
Log analysis 🔗
So, how do I analyze logs, and why can’t Badger do it for me? What I normally do is use the CSV output logs and create a simple FDW (Foreign Data Wrapper) setup to query the logs as if they were tables in my database.
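As a sketch of that setup, assuming the file_fdw extension and the csvlog column layout of PostgreSQL 13 (newer versions add columns such as leader_pid and query_id, so check the file_fdw documentation for your version; the filename below is just an example path):

```sql
-- Expose a csvlog file as a foreign table (column list must match your version)
CREATE EXTENSION IF NOT EXISTS file_fdw;

CREATE SERVER pglog FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE pglog (
    log_time               timestamp(3) with time zone,
    user_name              text,
    database_name          text,
    process_id             integer,
    connection_from        text,
    session_id             text,
    session_line_num       bigint,
    command_tag            text,
    session_start_time     timestamp with time zone,
    virtual_transaction_id text,
    transaction_id         bigint,
    error_severity         text,
    sql_state_code         text,
    message                text,
    detail                 text,
    hint                   text,
    internal_query         text,
    internal_query_pos     integer,
    context                text,
    query                  text,
    query_pos              integer,
    location               text,
    application_name       text,
    backend_type           text
) SERVER pglog
OPTIONS (filename 'log/postgresql.csv', format 'csv');
```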
The first things I do are the same things Badger does (a few example queries follow the list):
- counting the number of PANIC, FATAL, ERROR messages in the log
- counting the number of connections/disconnections
- checking the checkpoint frequency
- …
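With the pglog foreign table sketched above, those first checks are a few lines of SQL (the message prefixes assume logs written in English):

```sql
-- Errors by severity
SELECT error_severity, count(*)
FROM pglog
WHERE error_severity IN ('PANIC', 'FATAL', 'ERROR')
GROUP BY error_severity
ORDER BY count(*) DESC;

-- Connections vs disconnections
SELECT count(*) FILTER (WHERE message LIKE 'connection authorized%') AS connections,
       count(*) FILTER (WHERE message LIKE 'disconnection%')         AS disconnections
FROM pglog;

-- Checkpoints per hour
SELECT date_trunc('hour', log_time) AS hour, count(*) AS checkpoints
FROM pglog
WHERE message LIKE 'checkpoint starting%'
GROUP BY hour
ORDER BY hour;
```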
But then, I can get the same indicators during identified workload peaks. If you want that with Badger, you’d have to isolate the portion of the log file you want to analyze and run the analysis again. Here, I simply adjust my WHERE conditions.
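For example, the same error count restricted to a peak is only one extra condition (the time window below is obviously made up):

```sql
-- Same indicator, limited to an identified workload peak
SELECT error_severity, count(*)
FROM pglog
WHERE error_severity IN ('PANIC', 'FATAL', 'ERROR')
  AND log_time BETWEEN '2023-07-07 09:00' AND '2023-07-07 11:00'
GROUP BY error_severity
ORDER BY count(*) DESC;
```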
Also, for some indicators, like temporary files, I like to have more granularity. For example, I’d like to know the percentiles of temporary file usage. If I find one query using a huge temporary file, I like to set that query apart, re-run the analysis on the others, and find a good value for work_mem.
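Here is a sketch of that kind of analysis. It assumes the standard log_temp_files message format (`temporary file: path "...", size N`) and relies on the query column of the csvlog output to hold the statement that created the file:

```sql
-- Percentiles of temporary file sizes logged by log_temp_files
WITH temp_files AS (
    SELECT log_time,
           query,
           substring(message FROM 'size (\d+)')::bigint AS size_bytes
    FROM pglog
    WHERE message LIKE 'temporary file:%'
)
SELECT count(*)                                AS nb_temp_files,
       max(size_bytes)                         AS max_size_bytes,
       percentile_cont(ARRAY[0.5, 0.9, 0.99])
         WITHIN GROUP (ORDER BY size_bytes)    AS p50_p90_p99_bytes
FROM temp_files;
```

Once the statement behind the biggest file is identified, excluding it is just another condition on the query column before re-running the percentiles to choose a work_mem value that covers the rest.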
I also love the flexibility SQL gives me when the time comes to look at slow queries; it gives me much more precision.
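A minimal sketch, assuming log_min_duration_statement entries whose message starts with `duration: … ms`:

```sql
-- Slowest statements captured by log_min_duration_statement
SELECT log_time,
       user_name,
       database_name,
       substring(message FROM 'duration: ([\d.]+) ms')::numeric AS duration_ms,
       message
FROM pglog
WHERE message LIKE 'duration:%'
ORDER BY duration_ms DESC
LIMIT 20;
```

From there, it’s easy to slice by user, database, application_name, or time window, which is exactly the precision I’m after.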
Badger as a log analyzer for production 🔗
That’s why I don’t use Badger. Again, Badger is not a bad tool, and I encourage people working in production daily to install it, use it for cold analysis, and get a report on how their instance is performing overall. Moreover, I encourage production DBAs to watch how the different dashboards they get evolve over time.
If you need to deep dive into some particular activities in your database, then play with SQL and you will have all the flexibility you need!