| 1 | +# PostgresAI monitoring reference documentation |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +PostgresAI monitoring is a comprehensive Postgres database monitoring solution built on pgwatch, Grafana, and Prometheus. This system provides real-time insights into Postgres database performance, health, and operations through a set of specialized dashboards. |
| 6 | + |
| 7 | +## Architecture |
| 8 | + |
| 9 | +The monitoring stack consists of: |
| 10 | +- **pgwatch**: Postgres monitoring agent that collects metrics |
| 11 | +- **Grafana**: Visualization and dashboard platform |
| 12 | +- **Flask Backend**: Additional API services for enhanced functionality |
| 13 | +- **Prometheus and Postgres**: Storage for metrics and query texts |
| 14 | + |
| 15 | +## Dashboard Reference |
| 16 | + |
| 17 | +### Dashboard 1: Node Performance Overview |
| 18 | +**Purpose**: High-level overview of Postgres database performance and health |
| 19 | + |
| 20 | +**Key Metrics**: |
| 21 | +- **Active session history**: Database wait events by type (CPU, locks, I/O) |
| 22 | +- **Sessions**: Connection states (Active, Idle, Idle-in-transaction, Waiting) |
| 23 | +- **Transactions**: Commit vs rollback ratios and rates |
| 24 | +- **Query performance**: Calls, execution time, and latency metrics |
| 25 | +- **Buffer cache**: Hit ratios and I/O patterns |
| 26 | +- **WAL activity**: Write-ahead log generation and archiving |
| 27 | + |
| 28 | +### Dashboard 2: Aggregated Query Analysis |
| 29 | +**Purpose**: Identify the most frequently run and most resource-intensive queries across the database |
| 30 | + |
| 31 | +**Key Metrics**: |
| 32 | +- **Detailed table view**: Table of stats for each query from pg_stat_statements |
| 33 | +- **Top queries by calls**: Most frequently executed queries |
| 34 | +- **Top queries by execution time**: Queries consuming most total time |
| 35 | +- **Top queries by latency**: Queries with highest average execution time per call |
| 36 | +- **I/O analysis**: Queries with highest disk read/write activity |
| 37 | +- **Buffer usage**: Queries with best/worst cache efficiency |
| 38 | +- **Temp file usage**: Queries spilling to disk for sorting/hashing |
| 39 | +- **WAL generation**: Queries generating most write-ahead log data |
| 40 | + |
| 41 | + |
| 42 | +### Dashboard 3: Single Query Analysis |
| 43 | +**Purpose**: Deep-dive analysis of individual queries by query ID |
| 44 | + |
| 45 | +**Key Metrics**: |
| 46 | +- **Execution Timeline**: Calls and execution time over time |
| 47 | +- **Wait Events**: Specific wait types for this query |
| 48 | +- **Resource Usage**: Buffer hits, disk I/O, WAL generation |
| 49 | +- **Performance Metrics**: Latency, rows returned, temp file usage |
| 50 | +- **Per-Call Analysis**: Average metrics per query execution |
| 51 | + |
| 52 | + |
| 53 | +### Dashboard 4: Wait sampling dashboard |
| 54 | +**Purpose**: Detailed analysis of database wait events and blocking |
| 55 | + |
| 56 | +**Key Metrics**: |
| 57 | +- **Active session history**: All wait events including background processes |
| 58 | +- **Active session history by event type**: Detailed categorization by event type |
| 59 | +- **Active session history by event type and event**: Wait events correlated with specific queries |
| 60 | + |
| 61 | +### Dashboard 5: Backup stats |
| 62 | +**Purpose**: Monitor backup and recovery processes |
| 63 | + |
| 64 | +**Key Metrics**: |
| 65 | +- **Archive success and errors**: Rate of successful WAL archives versus failed archive attempts |
| 66 | +- **Archive lag**: Amount of WAL data in bytes that has been generated but not yet archived |
| 67 | +- **WAL archive success rate**: Percentage of successful WAL archive operations |
| 68 | + |
| 69 | +### Dashboard 7: Autovacuum and bloat |
| 70 | +**Purpose**: Monitor Postgres maintenance processes and table health |
| 71 | + |
| 72 | +**Key Metrics**: |
| 73 | +- **Vacuum Timeline**: Autovacuum progress through different phases |
| 74 | + |
| 75 | + |
| 76 | +### Dashboard 8: Index health |
| 77 | +**Purpose**: Monitor index performance and maintenance needs |
| 78 | + |
| 79 | +**Key Metrics**: |
| 80 | +- **Index Bloat** |
| 81 | +- **Index Size** |
| 82 | + |
| 83 | + |
| 84 | +### Dashboard 9: Table stats |
| 85 | +**Purpose**: Monitor table-level operations and data patterns |
| 86 | + |
| 87 | +**Key Metrics**: |
| 88 | +- **CRUD operations**: Insert, update, delete rates by table |
| 89 | + |
| 90 | + |
| 91 | +## Complete Graph Inventory |
| 92 | + |
| 93 | +### Dashboard 1: Node Performance Overview (36 graphs) |
| 94 | +1. **Active session history** - Shows database wait events by type (CPU, locks, I/O) to identify performance bottlenecks |
| 95 | +2. **Host stats** - Displays system-level metrics like CPU, memory, and disk usage |
| 96 | +3. **Postgres stats** - Core Postgres instance metrics and version information |
| 97 | +4. **Sessions** - Connection states (Active, Idle, Idle-in-transaction, Waiting) with max_connections limit |
| 98 | +5. **Non-idle sessions** - Active database connections excluding idle ones for workload monitoring |
| 99 | +6. **Calls (pg_stat_statements)** - Total SQL statement executions per second across all queries |
| 100 | +7. **Transactions** - Transaction commit vs rollback rates and overall transaction activity |
| 101 | +9. **Commit vs rollback ratio** - Ratio of successful vs failed transactions indicating application health |
| 102 | +10. **Statements total time (pg_stat_statements)** - Total execution time per second for all SQL statements |
| 103 | +11. **Statements time per call (pg_stat_statements) aka latency** - Average execution time per query call (key latency metric) |
| 104 | +12. **Total rows (pg_stat_statements)** - Total rows returned per second across all queries |
| 105 | +13. **Rows per call (pg_stat_statements)** - Average rows returned per query execution |
| 106 | +14. **blk_read_time vs blk_write_time (s/s) (pg_stat_statements)** - Time spent reading/writing disk blocks per second |
| 107 | +15. **blk_read_time vs blk_write_time per call (pg_stat_statements)** - Average disk I/O time per query execution |
| 108 | +16. **shared_blks_hit (bytes) (pg_stat_statements)** - Data read from shared buffer cache (good performance indicator) |
| 109 | +17. **shared_blks_hit (bytes) per call (pg_stat_statements)** - Average cache hits per query execution |
| 110 | +18. **shared_blks_read (bytes) (pg_stat_statements)** - Data read from disk (cache misses - expensive operations) |
| 111 | +19. **shared_blks_read (bytes) per call (pg_stat_statements)** - Average disk reads per query execution |
| 112 | +20. **shared_blks_written (bytes) (pg_stat_statements)** - Data written from buffers to disk per second |
| 113 | +21. **shared_blks_written (bytes) per call (pg_stat_statements)** - Average buffer writes per query execution |
| 114 | +22. **shared_blks_dirtied (bytes) (pg_stat_statements)** - Buffer blocks modified (dirtied) per second |
| 115 | +23. **shared_blks_dirtied (bytes) per call (pg_stat_statements)** - Average buffer modifications per query |
| 116 | +24. **shared_blks_read_ratio (pg_stat_statements)** - Cache miss ratio (values below roughly 10-20% indicate good cache efficiency) |
| 117 | +25. **WAL bytes (pg_current_wal_lsn)** - Write-ahead log generation rate (affects replication and recovery) |
| 118 | +26. **WAL bytes per call (pg_current_wal_lsn)** - Average WAL generation per query execution |
| 119 | +27. **WAL fpi (pg_stat_statements)** - WAL full page images generated per second |
| 120 | +28. **WAL fpi per call (pg_current_wal_lsn)** - Average full page images per query execution |
| 121 | +29. **temp_bytes_read vs temp_bytes_written (pg_stat_statements)** - Temporary file I/O operations |
| 122 | +30. **temp_bytes_read vs temp_bytes_written per call (pg_stat_statements)** - Average temp file usage per query |
| 123 | +31. **Locks by mode** - Active locks by type (AccessShareLock, RowExclusiveLock, etc.) |
| 124 | +32. **Longest non-idle transaction age, > 1 min** - Age of oldest active transaction (>1min threshold) |
| 125 | +33. **Age of the oldest transaction ID that has not been frozen** - Transaction ID age (watch for wraparound issues) |
| 126 | +34. **Age of the oldest multi-transaction ID that has not been frozen** - Multi-transaction ID age monitoring |
| 127 | +35. **bgwriter and checkpointer** - Background writer vs checkpointer activity comparison |
| 128 | +36. **Vacuum timeline** - VACUUM operation progress through different phases |
| 129 | + |
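Many of the Dashboard 1 panels above are per-second rates, per-call averages, or ratios derived from cumulative counters. The following is a minimal sketch of that arithmetic, assuming two snapshots of `pg_stat_statements`-style counters summed over all statements. The `Snapshot` type, the `derive` function, and the 8192-byte block size (the Postgres default) are illustrative assumptions, not the queries the dashboards actually run.

```python
from dataclasses import dataclass

BLOCK_SIZE = 8192  # assumption: default Postgres block size, in bytes


@dataclass
class Snapshot:
    """Hypothetical sample of cumulative counters summed over all statements."""
    ts: float                  # sample time, in seconds
    calls: int
    total_exec_time_ms: float
    shared_blks_hit: int
    shared_blks_read: int


def derive(prev: Snapshot, cur: Snapshot) -> dict:
    """Turn two cumulative samples into rate, per-call, and ratio values."""
    dt = cur.ts - prev.ts
    if dt <= 0:
        raise ValueError("snapshots must be in increasing time order")
    d_calls = cur.calls - prev.calls
    d_time = cur.total_exec_time_ms - prev.total_exec_time_ms
    d_hit = cur.shared_blks_hit - prev.shared_blks_hit
    d_read = cur.shared_blks_read - prev.shared_blks_read
    blocks = d_hit + d_read
    return {
        "calls_per_s": d_calls / dt,
        # Average latency: execution time divided by executions in the window.
        "exec_ms_per_call": d_time / d_calls if d_calls else 0.0,
        # Block counters converted to bytes, as in the "(bytes)" panels.
        "hit_bytes_per_s": d_hit * BLOCK_SIZE / dt,
        "read_bytes_per_s": d_read * BLOCK_SIZE / dt,
        # Cache miss ratio: blocks read from outside shared buffers / all blocks.
        "read_ratio": d_read / blocks if blocks else 0.0,
    }
```

Dividing deltas rather than raw totals keeps the values meaningful after long uptimes; a counter reset (for example via `pg_stat_statements_reset()`) would produce negative deltas and would need separate handling.
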
| 130 | +### Dashboard 2: Aggregated Query Analysis (25 graphs) |
| 131 | +1. **Detailed table view (pg_stat_statements)** - Tabular view of query performance metrics with sorting and filtering |
| 132 | +2. **Top $top_n queries analysis (pg_stat_statements)** - Overview of most significant queries by multiple metrics |
| 133 | +3. **Top $top_n statements by calls (pg_stat_statements)** - Most frequently executed queries (call frequency) |
| 134 | +4. **Top $top_n statements by execution time (pg_stat_statements)** - Queries consuming most total execution time |
| 135 | +5. **Top $top_n statements by execution time per call (pg_stat_statements)** - Queries with highest average execution time per call |
| 136 | +6. **Top $top_n statements by planning time (pg_stat_statements)** - Queries with highest total query planning time |
| 137 | +7. **Top $top_n statements by planning time per call (pg_stat_statements)** - Queries with slowest planning per execution |
| 138 | +8. **Top $top_n statements by rows (pg_stat_statements)** - Queries returning most total rows |
| 139 | +9. **Top $top_n statements by rows per call (pg_stat_statements)** - Queries with highest average rows per execution |
| 140 | +10. **Top $top_n statements by shared_blks_hit (in bytes) (pg_stat_statements)** - Queries reading most data from the shared buffer cache (most hits) |
| 141 | +11. **Top $top_n statements by shared_blks_hit (in bytes) per call (pg_stat_statements)** - Best average cache hits per query |
| 142 | +12. **Top $top_n statements by shared_blks_read (in bytes) (pg_stat_statements)** - Queries causing most disk reads (worst cache performance) |
| 143 | +13. **Top $top_n statements by shared_blks_read (in bytes) per call (pg_stat_statements)** - Highest average disk reads per query |
| 144 | +14. **Top $top_n statements by shared_blks_written (in bytes) (pg_stat_statements)** - Queries writing most data from shared buffers to disk |
| 145 | +15. **Top $top_n statements by shared_blks_written (in bytes) per call (pg_stat_statements)** - Highest average buffer writes per query |
| 146 | +16. **Top $top_n statements by shared_blks_dirtied (in bytes) per call (pg_stat_statements)** - Highest average buffer data dirtied per query |
| 147 | +17. **Top $top_n statements by WAL bytes (pg_stat_statements)** - Queries generating most write-ahead log data |
| 148 | +18. **Top $top_n statements by WAL bytes per call (pg_stat_statements)** - Highest average WAL generation per query |
| 149 | +19. **Top $top_n statements by WAL fpi (pg_stat_statements)** - Queries generating most WAL full page images |
| 150 | +20. **Top $top_n statements by WAL fpi per call (pg_stat_statements)** - Highest average FPI generation per query |
| 151 | +21. **Top $top_n statements by temp bytes read (pg_stat_statements)** - Queries reading most from temporary files |
| 152 | +22. **Top $top_n statements by temp bytes read per call (pg_stat_statements)** - Highest average temp file reads per query |
| 153 | +23. **Top $top_n statements by temp bytes written (pg_stat_statements)** - Queries writing most to temporary files |
| 154 | +24. **Top $top_n statements by temp bytes written per call (pg_stat_statements)** - Highest average temp file writes per query |
| 155 | +25. **Query Analysis panels (multiple instances)** - Drill-down analysis panels for individual queries |
| 156 | + |
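The "Top $top_n" panels above all follow one pattern: rank query IDs by a metric, either as a total or divided by the number of calls. A minimal sketch of that ranking follows, assuming the per-query deltas for the selected time window are already available as a dictionary. The `top_n` helper and the shape of `stats` are hypothetical, and the actual panels may well compute this in the query layer instead.

```python
import heapq


def top_n(stats: dict, metric: str, n: int = 10, per_call: bool = False) -> list:
    """Rank query IDs by a metric, optionally averaged per call.

    `stats` maps a query ID to a dict of counters for the time window,
    for example {"calls": 100, "total_exec_time": 200.0}.
    """
    scored = []
    for queryid, row in stats.items():
        value = row.get(metric, 0.0)
        if per_call:
            calls = row.get("calls", 0)
            if not calls:
                continue  # no executions in the window, nothing to average
            value = value / calls
        scored.append((queryid, value))
    # nlargest returns the n highest-scoring entries in descending order.
    return heapq.nlargest(n, scored, key=lambda item: item[1])
```

The totals ranking and the per-call ranking often surface different queries: a cheap statement executed very often can dominate total time while never appearing among the highest per-call averages.
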
| 157 | +### Dashboard 3: Single Query Analysis (17 graphs) |
| 158 | +1. **Active session history** - Wait events specifically for the selected query ID |
| 159 | +2. **Calls (pg_stat_statements)** - Execution frequency of the specific query over time |
| 160 | +3. **Execution time (pg_stat_statements)** - Total execution time for the specific query per second |
| 161 | +4. **Execution time per call (pg_stat_statements)** - Average execution time per call for the specific query |
| 162 | +5. **Rows (pg_stat_statements)** - Total rows returned by the specific query per second |
| 163 | +6. **Rows per call (pg_stat_statements)** - Average rows returned per execution of the specific query |
| 164 | +7. **shared_blks_hit (in bytes) (pg_stat_statements)** - Cache efficiency for the specific query (bytes from memory) |
| 165 | +8. **shared_blks_hit (in bytes) per call (pg_stat_statements)** - Average cache hits per execution of the specific query |
| 166 | +9. **WAL bytes (pg_stat_statements)** - WAL generation rate for the specific query |
| 167 | +10. **WAL bytes per call (pg_stat_statements)** - Average WAL generation per execution of the specific query |
| 168 | +11. **WAL fpi (in bytes) (pg_stat_statements)** - Full page images generated by the specific query |
| 169 | +12. **WAL fpi per call (pg_stat_statements)** - Average FPI generation per execution of the specific query |
| 170 | +13. **Temp bytes read (pg_stat_statements)** - Temporary file reads for the specific query |
| 171 | +14. **Temp bytes read per call (pg_stat_statements)** - Average temp file reads per execution of the specific query |
| 172 | +15. **Temp bytes written (pg_stat_statements)** - Temporary file writes for the specific query |
| 173 | +16. **Temp bytes written per call (pg_stat_statements)** - Average temp file writes per execution of the specific query |
| 174 | +17. **Query Analysis panels (multiple instances)** - Detailed analysis panels for the selected query |
| 175 | + |
| 176 | +### Dashboard 4: Wait sampling dashboard (4 graphs) |
| 177 | +1. **Active session history** - Comprehensive view of all database wait events including background processes |
| 178 | +2. **Active session history by event type** - Wait events grouped by category (CPU, I/O, locks, etc.) |
| 179 | +3. **Active session history by event type and event** - Detailed breakdown with specific event names and query IDs |
| 180 | +4. **Query Analysis** - Drill-down analysis for queries associated with wait events |
| 181 | + |
| 182 | +### Dashboard 5: Backup stats (3 graphs) |
| 183 | +1. **Archive success and errors** - Rate of successful vs failed WAL archive operations |
| 184 | +2. **WAL archive success rate** - Percentage of successful archive operations (should be 100%) |
| 185 | +3. **Archive lag** - Amount of WAL data waiting to be archived (data loss window) |
| 186 | + |
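The success rate and lag panels above reduce to simple arithmetic. The sketch below assumes the archived and failed counts are available as integers (for example from `pg_stat_archiver`) and that both WAL positions are available as LSN strings in the usual `high/low` hexadecimal form. The function names are hypothetical, and how the dashboard itself derives the archived position is not specified here.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash holds the upper 32 bits, the part after the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def wal_lag_bytes(current_lsn: str, archived_lsn: str) -> int:
    """Bytes of WAL generated but not yet archived (never negative)."""
    return max(0, lsn_to_int(current_lsn) - lsn_to_int(archived_lsn))


def archive_success_rate(archived: int, failed: int) -> float:
    """Percentage of archive attempts that succeeded; 100.0 when there were none."""
    total = archived + failed
    return 100.0 if total == 0 else 100.0 * archived / total
```

Treating a window with no archive attempts as 100% is a design choice made for this sketch so that an idle system does not look unhealthy.
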
| 187 | +### Dashboard 7: Autovacuum and bloat (1 graph) |
| 188 | +1. **Vacuum timeline** - Progress of VACUUM operations through phases (scanning, vacuuming, cleaning, etc.) |
| 189 | + |
| 190 | +### Dashboard 8: Index health (6 graphs) |
| 191 | +1. **Detailed index view** - Tabular view of all indexes with bloat, size, and usage statistics |
| 192 | +2. **Top $top_n index analysis** - Overview of most problematic indexes by various metrics |
| 193 | +3. **Top $top_n indexes by estimated bloat %** - Indexes with highest percentage of wasted space |
| 194 | +4. **Top $top_n indexes by estimated bloat size** - Indexes with largest absolute amount of wasted space |
| 195 | +5. **Top $top_n indexes by size** - Largest indexes by total size (memory and disk impact) |
| 196 | +6. **Query Analysis panels (multiple instances)** - Detailed analysis for index-related queries |
| 197 | + |
| 198 | +### Dashboard 9: Table stats (7 graphs) |
| 199 | +1. **Tuple operations** - Total tuple write operations (insert, update, delete, hot update) across all tables |
| 200 | +2. **Tuple operations (%)** - Percentage breakdown of different operation types |
| 201 | +3. **Number of inserted tuples by table** - Insert rates for individual tables over time |
| 202 | +4. **Number of updated tuples by table** - Update rates for individual tables (watch for bloat impact) |
| 203 | +5. **Number of hot updated tuples by table** - HOT updates by table (efficient updates avoiding index updates) |
| 204 | +6. **Number of deleted tuples by table** - Delete rates by table (deleted tuples remain as dead tuples until vacuum removes them) |
| 205 | +7. **Table details panels (multiple instances)** - Detailed statistics and metrics for individual tables |
| 206 | + |
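A percentage breakdown such as the "Tuple operations (%)" panel above can be sketched as follows. This assumes per-window deltas of the `pg_stat_user_tables` counters `n_tup_ins`, `n_tup_upd`, `n_tup_hot_upd`, and `n_tup_del`. Because HOT updates are a subset of all updates, the sketch splits updates into HOT and non-HOT parts; whether the panel does the same is an assumption, and the function name is hypothetical.

```python
def tuple_operation_shares(ins: int, upd: int, hot_upd: int, deleted: int) -> dict:
    """Percentage share of each tuple operation type.

    `upd` counts all updates and `hot_upd` is the HOT subset of those,
    so non-HOT updates are `upd - hot_upd`.
    """
    total = ins + upd + deleted
    if total == 0:
        return {"insert": 0.0, "update_non_hot": 0.0, "hot_update": 0.0, "delete": 0.0}
    return {
        "insert": 100.0 * ins / total,
        "update_non_hot": 100.0 * (upd - hot_upd) / total,
        "hot_update": 100.0 * hot_upd / total,
        "delete": 100.0 * deleted / total,
    }
```

A low HOT share on an update-heavy table suggests that most updates also have to touch indexes, which increases both index bloat and WAL volume.
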