Community Blog
Get the latest updates on the Splunk Community, including member experiences, product education, events, and more!

Monitoring Postgres with OpenTelemetry

CaitlinHalla
Splunk Employee
Splunk Employee

Behind every business-critical application, you’ll find databases. These behind-the-scenes stores power everything from login and checkout to content lookups and “likes,” so issues with slow queries, too many full table scans (or too few index scans), incorrectly configured indices, or resource exhaustion directly impact application reliability and user experience. Thankfully, we can capture key database metrics to expose such issues and ensure optimal performance, efficient troubleshooting, and the overall reliability of our applications. 

In this post, we’ll explore monitoring the open-source relational database PostgreSQL. Postgres is widely used in enterprise applications for its scalability, extensibility, and support. It also collects and reports a huge amount of information about internal server activity with its statistics collector. We’ll harness these stats using the OpenTelemetry Collector and first focus on the database and infrastructure itself in Splunk Observability Cloud. Then we’ll see how everything connects to our application performance data.

Which metrics matter and why

Monitoring database metrics is critical to proactively identifying issues, performance optimizations, and database reliability, but with so many stats coming from the statistics collector, it can be difficult to determine what to focus on. How do we isolate what’s critical to monitor effectively? It can help to focus on operation-critical key metrics like those related to:

  • Query performance (query throughput/latency, locks, query errors, index hit rate)
  • Resource utilization (connections, CPU, memory, disk space, table/index size, disk I/O, cache hits)
  • Database health (replication lag, deadlocks, rollbacks, autovacuum performance)

Query Performance

Slow, resource-intensive queries or queries with high throughput can decrease the response time of our applications and degrade user experience. To prevent things like slow page load time, we want to focus on metrics related to query time –  total response time, index scans per second, and database latency. These metrics will indicate if our database has the right or wrong indexes, absent indexes, if our tables are fragmented, or have too many locks, etc.   

Resource Utilization

Exceeding resource thresholds can halt application operations altogether. If total active connections are too high resources might be exhausted, and users might not be able to interact with our application at all. Monitoring resource usage like CPU, memory, and table/index size can keep our databases up and running, while also allowing for accurate capacity planning and optimal user experience. 

Database health

Things like a high rollback to commit rate can indicate user experience issues, for example, users might be unable to complete product checkout on an e-commerce site. An increase in the number of dead rows can lead to degraded query performance or resource exhaustion with similar effects. Proactively monitoring these metrics helps easily identify inefficiencies, eliminate bottlenecks, reduce database bloat, and ultimately improve user experience.  

How to get the metrics

So how do we get these metrics from PostgreSQL to the OpenTelemetry Collector? The first step is installing the OpenTelemetry Collector. If you’re working with the Splunk Distribution of the OpenTelemetry Collector, you can follow the guided install docs. I’m using Docker Compose to set up my application, Postgres service, and OpenTelemetry Collector, so here’s how I added the Splunk Distribution of the OTel Collector: 

docker compose.png

If you already have your OpenTelemetry Collector configuration file ready to edit, you can proceed to add a PostgreSQL receiver to the receivers block so you can start collecting telemetry data from Postgres. Because I set up the Collector with Docker Compose, I manually created my Collector configuration file (otel-collector-config.yaml). Here’s the PostgreSQL receiver I added to my Collector config: 

receiver.png

Note: generally, your database and microservices would be behind network and API security layers so your databases and services would talk to each other unencrypted, which is why I have tls set to insecure: true. If your database requires an authenticated connection, you’ll need to supply a certificate similar to what’s shown in the documentation’s sample configuration

I’m also exporting data for my application to my Splunk Observability Cloud backend, so I’ve added an exporter for that and added both my new receiver and new exporter to my metrics pipeline: 

whole config.png

If you’re not using the Splunk Distribution of the OpenTelemetry Collector or not exporting data to Splunk Observability Cloud, configuring the PostgreSQL receiver block will still follow the example shown, but you’ll need to configure a different exporter and add it to the metrics pipeline. 

That’s it! Now either build, start, or restart your service (I did a docker compose up --build) and watch your database metrics flow into your backend observability platform of choice. 

Note: If you’re working with a complex service architecture and the Splunk Distribution of the OpenTelemetry Collector, you might want to consider using automatic discovery. This allows the Collector to automatically detect and instrument services and their data sources. Depending on your environment and Collector installation method, you can follow the appropriate docs (Linux, Windows, Kubernetes) to deploy the Collector with automatic discovery.

How to see the data in Splunk Observability Cloud

Now that we’re collecting Postgres data, let’s jump over to Splunk Observability Cloud Infrastructure to visualize our telemetry data. We can select the Datastores section and open up either our PostgreSQL databases for database-level metrics or PostgreSQL hosts for metrics related to the infrastructure hosting your PostgreSQL database(s):

infrastructure landing.png

Going into the PostgreSQL databases navigator, we can see the metrics related to all of our databases:

databases overview.png

databases overview 2.png

Here we see those key metrics that can hint at performance issues like total operations, index scans per second, and rollbacks. If total operations are high, we’ll know at a glance if our database resources can handle the current workload intensity. If our index scans per second drop, this can suggest we’re not using indexes efficiently. Databases with a high number of rollbacks could be experiencing an increase in transaction failures or deadlocks. All of these things can lead to slow or unreliable performance for our users.

Clicking into our database we see database-specific metrics:

my_app overview.png

We can monitor index size for efficient resource optimation and right-size indexes. Dead row monitoring helps ensure efficient vacuuming to decrease table bloat and increase performance. It looks like we have 18 total operations per second, but 0 index scans per second, which might mean we aren’t indexing and could have some query performance inefficiencies. 

Going into the PostgreSQL hosts navigator, we can view things like changes in operations per second, transactions, and disk usage to ensure our system can handle current workloads and maintain consistent performance: 

my_app hosts 1.png my_app hosts 2.png

We can also click into a specific host to view individual host metrics like how many transactions succeeded, failed, or were rolled back and the cache hits versus disk hits, both of which impact overall performance: 

individual host metrics.png

Moving between Infrastructure, APM, and Log Observer

Our database monitoring journey will most likely start at the service level or with the applications they back, so let’s dig into query performance and how to view its impacts on overall application performance. 

From within our PostgreSQL host navigator, if we select a specific host, we can view logs or related content in APM to view services that have a dependency on the currently selected host: 

related apm content.png

We can then jump to the Database Query Performance to view and analyze query time, latency, and errors to see which specific areas are impacting response time and user experience and where we might be able to optimize our query performance: 

database query performance.png

Closing this out, we can see our Service Map and where the current database sits so that we can investigate specific errors, traces, or related logs:

service map db.png

traces.png

We moved from Infrastructure to Application Performance Monitoring, but we could have just as easily started with our Service Map and began troubleshooting database performance issues from there using Database Query Performance, database requests/errors, or traces. 

Wrap Up

Monitoring key metrics from the databases that power our applications is critical to the performance and reliability that our users count on. Configuring the OpenTelemetry Collector to receive PostgreSQL telemetry data and export this data to a backend observability platform is an easy process that provides invaluable visibility into the databases that back our services. If you’d like to try exporting your Postgres data to Splunk Observability Cloud, try it free for 14 days!  

Resources

Get Updates on the Splunk Community!

Enterprise Security Content Update (ESCU) | New Releases

In December, the Splunk Threat Research Team had 1 release of new security content via the Enterprise Security ...

Why am I not seeing the finding in Splunk Enterprise Security Analyst Queue?

(This is the first of a series of 2 blogs). Splunk Enterprise Security is a fantastic tool that offers robust ...

Index This | What are the 12 Days of Splunk-mas?

December 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with another ...