Commit 842f289

docs: add PostgresAI monitoring reference documentation
Add comprehensive monitoring reference guide that documents:
- PostgresAI monitoring architecture and components
- Detailed dashboard descriptions and key metrics
- Complete graph inventory across all 9 dashboards
- Updated to follow PostgresAI documentation standards:
  * Sentence-style capitalization throughout
  * Consistent terminology (Postgres vs PostgreSQL)
  * Professional formatting and structure
1 parent 6a9a33d commit 842f289

File tree

1 file changed

+206
-0
lines changed

MONITORING_REFERENCE.md

Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
1+
# PostgresAI monitoring reference documentation
2+
3+
## Overview
4+
5+
PostgresAI monitoring is a comprehensive Postgres database monitoring solution built on pgwatch, Grafana, and Prometheus. This system provides real-time insights into Postgres database performance, health, and operations through a set of specialized dashboards.
6+
7+
## Architecture
8+
9+
The monitoring stack consists of:
10+
- **pgwatch**: Postgres monitoring agent that collects metrics
11+
- **Grafana**: Visualization and dashboard platform
12+
- **Flask backend**: Additional API services for enhanced functionality
13+
- **Prometheus and Postgres**: Storage for metrics and query texts
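
As a rough illustration of how a client might read collected metrics back out of the Prometheus storage layer, the sketch below builds a request URL for the standard Prometheus instant-query endpoint (`GET /api/v1/query`). The host, port, and metric name are hypothetical placeholders, not names taken from the PostgresAI stack.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical host and metric name, used for illustration only.
url = build_instant_query_url(
    "http://localhost:9090",
    "rate(example_pg_stat_statements_calls_total[5m])",
)
print(url)
```

Grafana panels normally issue such queries through a configured data source, so a hand-built URL like this is mainly useful for ad hoc checks.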
14+
15+
## Dashboard reference
16+
17+
### Dashboard 1: Node Performance Overview
18+
**Purpose**: High-level overview of Postgres database performance and health
19+
20+
**Key metrics**:
21+
- **Active session history**: Database wait events by type (CPU, locks, I/O)
22+
- **Sessions**: Connection states (Active, Idle, Idle-in-transaction, Waiting)
23+
- **Transactions**: Commit vs rollback ratios and rates
24+
- **Query performance**: Calls, execution time, and latency metrics
25+
- **Buffer cache**: Hit ratios and I/O patterns
26+
- **WAL activity**: Write-ahead log generation and archiving
27+
28+
### Dashboard 2: Aggregated Query Analysis
29+
**Purpose**: Identify the most frequent, most resource-intensive, and otherwise problematic queries across the database
30+
31+
**Key metrics**:
32+
- **Detailed table view**: Table of stats for each query from pg_stat_statements
33+
- **Top queries by calls**: Most frequently executed queries
34+
- **Top queries by execution time**: Queries consuming most total time
35+
- **Top queries by latency**: Queries with the highest average execution time per call
36+
- **I/O analysis**: Queries with highest disk read/write activity
37+
- **Buffer usage**: Queries with best/worst cache efficiency
38+
- **Temp file usage**: Queries spilling to disk for sorting/hashing
39+
- **WAL generation**: Queries generating most write-ahead log data
40+
41+
42+
### Dashboard 3: Single Query Analysis
43+
**Purpose**: Deep-dive analysis of individual queries by query ID
44+
45+
**Key metrics**:
46+
- **Execution timeline**: Calls and execution time over time
47+
- **Wait events**: Specific wait types for this query
48+
- **Resource usage**: Buffer hits, disk I/O, WAL generation
49+
- **Performance metrics**: Latency, rows returned, temp file usage
50+
- **Per-call analysis**: Average metrics per query execution
51+
52+
53+
### Dashboard 4: Wait sampling dashboard
54+
**Purpose**: Detailed analysis of database wait events and blocking
55+
56+
**Key metrics**:
57+
- **Active session history**: All wait events including background processes
58+
- **Active session history by event type**: Detailed categorization by event type
59+
- **Active session history by event type and event**: Wait events correlated with specific queries
60+
61+
### Dashboard 5: Backup stats
62+
**Purpose**: Monitor backup and recovery processes
63+
64+
**Key metrics**:
65+
- **Archive success and errors**: Rate of successful WAL archives versus failed archive attempts
66+
- **Archive lag**: Amount of WAL data in bytes that has been generated but not yet archived
67+
- **WAL archive success rate**: Percentage of successful WAL archive operations
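
The two derived metrics above can be sketched as follows. This is a minimal illustration, not the dashboard's actual queries: it assumes the current WAL position and the last archived position are both available as LSN strings, and that cumulative archived and failed counts (as exposed by `pg_stat_archiver`) are at hand. The function names are hypothetical.

```python
from typing import Optional


def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def archive_lag_bytes(current_lsn: str, last_archived_lsn: str) -> int:
    """WAL bytes generated but not yet archived (never negative)."""
    return max(0, lsn_to_bytes(current_lsn) - lsn_to_bytes(last_archived_lsn))


def archive_success_rate(archived: int, failed: int) -> Optional[float]:
    """Percentage of archive attempts that succeeded, or None if there were none."""
    total = archived + failed
    if total == 0:
        return None
    return 100.0 * archived / total


print(archive_lag_bytes("1/0", "0/FF000000"))
print(archive_success_rate(99, 1))
```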
68+
69+
### Dashboard 7: Autovacuum and bloat
70+
**Purpose**: Monitor Postgres maintenance processes and table health
71+
72+
**Key metrics**:
73+
- **Vacuum timeline**: Autovacuum progress through different phases
74+
75+
76+
### Dashboard 8: Index health
77+
**Purpose**: Monitor index performance and maintenance needs
78+
79+
**Key metrics**:
80+
- **Index bloat**
81+
- **Index size**
82+
83+
84+
### Dashboard 9: Table stats
85+
**Purpose**: Monitor table-level operations and data patterns
86+
87+
**Key metrics**:
88+
- **CRUD operations**: Insert, update, delete rates by table
89+
90+
91+
## Complete graph inventory
92+
93+
### Dashboard 1: Node Performance Overview (36 graphs)
94+
1. **Active session history** - Shows database wait events by type (CPU, locks, I/O) to identify performance bottlenecks
95+
2. **Host stats** - Displays system-level metrics like CPU, memory, and disk usage
96+
3. **Postgres stats** - Core Postgres instance metrics and version information
97+
4. **Sessions** - Connection states (Active, Idle, Idle-in-transaction, Waiting) with max_connections limit
98+
5. **Non-idle sessions** - Active database connections excluding idle ones for workload monitoring
99+
6. **Calls (pg_stat_statements)** - Total SQL statement executions per second across all queries
100+
7. **Transactions** - Transaction commit vs rollback rates and overall transaction activity
101+
9. **Commit vs rollback ratio** - Ratio of committed to rolled-back transactions, an indicator of application health
102+
10. **Statements total time (pg_stat_statements)** - Total execution time per second for all SQL statements
103+
11. **Statements time per call (pg_stat_statements) aka latency** - Average execution time per query call (key latency metric)
104+
12. **Total rows (pg_stat_statements)** - Total rows returned per second across all queries
105+
13. **Rows per call (pg_stat_statements)** - Average rows returned per query execution
106+
14. **blk_read_time vs blk_write_time (s/s) (pg_stat_statements)** - Time spent reading/writing disk blocks per second
107+
15. **blk_read_time vs blk_write_time per call (pg_stat_statements)** - Average disk I/O time per query execution
108+
16. **shared_blks_hit (bytes) (pg_stat_statements)** - Data read from shared buffer cache (good performance indicator)
109+
17. **shared_blks_hit (bytes) per call (pg_stat_statements)** - Average cache hits per query execution
110+
18. **shared_blks_read (bytes) (pg_stat_statements)** - Data read from disk (cache misses - expensive operations)
111+
19. **shared_blks_read (bytes) per call (pg_stat_statements)** - Average disk reads per query execution
112+
20. **shared_blks_written (bytes) (pg_stat_statements)** - Data written from buffers to disk per second
113+
21. **shared_blks_written (bytes) per call (pg_stat_statements)** - Average buffer writes per query execution
114+
22. **shared_blks_dirtied (bytes) (pg_stat_statements)** - Buffer blocks modified (dirtied) per second
115+
23. **shared_blks_dirtied (bytes) per call (pg_stat_statements)** - Average buffer modifications per query
116+
24. **shared_blks_read_ratio (pg_stat_statements)** - Cache miss ratio: share of shared-buffer block accesses that required a read rather than a cache hit (below roughly 10-20% indicates good cache efficiency)
117+
25. **WAL bytes (pg_current_wal_lsn)** - Write-ahead log generation rate (affects replication and recovery)
118+
26. **WAL bytes per call (pg_current_wal_lsn)** - Average WAL generation per query execution
119+
27. **WAL fpi (pg_stat_statements)** - WAL full page images generated per second
120+
28. **WAL fpi per call (pg_current_wal_lsn)** - Average full page images per query execution
121+
29. **temp_bytes_read vs temp_bytes_written (pg_stat_statements)** - Temporary file I/O operations
122+
30. **temp_bytes_read vs temp_bytes_written per call (pg_stat_statements)** - Average temp file usage per query
123+
31. **Locks by mode** - Active locks by type (AccessShareLock, RowExclusiveLock, etc.)
124+
32. **Longest non-idle transaction age, > 1 min** - Age of oldest active transaction (>1min threshold)
125+
33. **Age of the oldest transaction ID that has not been frozen** - Transaction ID age (watch for wraparound issues)
126+
34. **Age of the oldest multi-transaction ID that has not been frozen** - Multi-transaction ID age monitoring
127+
35. **bgwriter and checkpointer** - Background writer vs checkpointer activity comparison
128+
36. **Vacuum timeline** - VACUUM operation progress through different phases
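
Many of the graphs above come in pairs: a per-second rate and a per-call average. Because `pg_stat_statements` exposes cumulative counters, both can be derived from the difference between two snapshots. The sketch below shows one way to do that, assuming the default 8 KiB block size and a simplified, hypothetical snapshot structure. It does not reproduce the dashboard's actual queries.

```python
from dataclasses import dataclass
from typing import Dict, Optional

BLOCK_SIZE = 8192  # assumption: default Postgres block size in bytes


@dataclass
class Snapshot:
    """Hypothetical subset of cumulative pg_stat_statements counters."""
    ts: float                  # sample time, in seconds
    calls: int
    total_exec_time_ms: float
    shared_blks_hit: int
    shared_blks_read: int


def derive_metrics(prev: Snapshot, curr: Snapshot) -> Dict[str, Optional[float]]:
    """Turn two cumulative snapshots into rates, per-call averages, and a miss ratio."""
    elapsed = curr.ts - prev.ts
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    d_calls = curr.calls - prev.calls
    d_time_ms = curr.total_exec_time_ms - prev.total_exec_time_ms
    d_hit = curr.shared_blks_hit - prev.shared_blks_hit
    d_read = curr.shared_blks_read - prev.shared_blks_read
    accesses = d_hit + d_read
    return {
        "calls_per_s": d_calls / elapsed,
        "exec_time_s_per_s": d_time_ms / 1000.0 / elapsed,
        # Per-call values are undefined when no calls happened in the interval.
        "latency_ms_per_call": d_time_ms / d_calls if d_calls else None,
        "hit_bytes_per_s": d_hit * BLOCK_SIZE / elapsed,
        "read_ratio_pct": 100.0 * d_read / accesses if accesses else None,
    }


prev = Snapshot(ts=0.0, calls=100, total_exec_time_ms=1000.0,
                shared_blks_hit=9000, shared_blks_read=1000)
curr = Snapshot(ts=10.0, calls=200, total_exec_time_ms=1500.0,
                shared_blks_hit=18000, shared_blks_read=2000)
metrics = derive_metrics(prev, curr)
print(metrics)
```

A counter reset (for example after `pg_stat_statements_reset()`) would yield negative deltas, and this sketch does not handle that case.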
129+
130+
### Dashboard 2: Aggregated Query Analysis (25 graphs)
131+
1. **Detailed table view (pg_stat_statements)** - Tabular view of query performance metrics with sorting and filtering
132+
2. **Top $top_n queries analysis (pg_stat_statements)** - Overview of most significant queries by multiple metrics
133+
3. **Top $top_n statements by calls (pg_stat_statements)** - Most frequently executed queries (call frequency)
134+
4. **Top $top_n statements by execution time (pg_stat_statements)** - Queries consuming most total execution time
135+
5. **Top $top_n statements by execution time per call (pg_stat_statements)** - Queries with the highest average execution time per call
136+
6. **Top $top_n statements by planning time (pg_stat_statements)** - Queries with highest total query planning time
137+
7. **Top $top_n statements by planning time per call (pg_stat_statements)** - Queries with slowest planning per execution
138+
8. **Top $top_n statements by rows (pg_stat_statements)** - Queries returning most total rows
139+
9. **Top $top_n statements by rows per call (pg_stat_statements)** - Queries with highest average rows per execution
140+
10. **Top $top_n statements by shared_blks_hit (in bytes) (pg_stat_statements)** - Queries reading the most data from the shared buffer cache (most hits)
141+
11. **Top $top_n statements by shared_blks_hit (in bytes) per call (pg_stat_statements)** - Best average cache hits per query
142+
12. **Top $top_n statements by shared_blks_read (in bytes) (pg_stat_statements)** - Queries causing most disk reads (worst cache performance)
143+
13. **Top $top_n statements by shared_blks_read (in bytes) per call (pg_stat_statements)** - Highest average disk reads per query
144+
14. **Top $top_n statements by shared_blks_written (in bytes) (pg_stat_statements)** - Queries writing the most data from shared buffers to disk
145+
15. **Top $top_n statements by shared_blks_written (in bytes) per call (pg_stat_statements)** - Highest average buffer writes per query
146+
16. **Top $top_n statements by shared_blks_dirtied (in bytes) per call (pg_stat_statements)** - Queries dirtying the most buffer data on average per call
147+
17. **Top $top_n statements by WAL bytes (pg_stat_statements)** - Queries generating most write-ahead log data
148+
18. **Top $top_n statements by WAL bytes per call (pg_stat_statements)** - Highest average WAL generation per query
149+
19. **Top $top_n statements by WAL fpi (pg_stat_statements)** - Queries generating most WAL full page images
150+
20. **Top $top_n statements by WAL fpi per call (pg_stat_statements)** - Highest average FPI generation per query
151+
21. **Top $top_n statements by temp bytes read (pg_stat_statements)** - Queries reading most from temporary files
152+
22. **Top $top_n statements by temp bytes read per call (pg_stat_statements)** - Highest average temp file reads per query
153+
23. **Top $top_n statements by temp bytes written (pg_stat_statements)** - Queries writing most to temporary files
154+
24. **Top $top_n statements by temp bytes written per call (pg_stat_statements)** - Highest average temp file writes per query
155+
25. **Query Analysis panels (multiple instances)** - Drill-down analysis panels for individual queries
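
The "Top $top_n statements by ..." panels all follow the same pattern: rank statements by a metric, either in total or divided by the number of calls, and keep the first `$top_n`. A minimal sketch of that ranking, using hypothetical in-memory rows rather than the dashboard's real queries:

```python
import heapq
from typing import Any, Dict, List


def top_n_statements(rows: List[Dict[str, Any]], metric: str, n: int,
                     per_call: bool = False) -> List[Dict[str, Any]]:
    """Return the n rows with the largest metric value (optionally per call)."""
    def score(row: Dict[str, Any]) -> float:
        value = row[metric]
        if per_call:
            # Statements with zero calls cannot have a per-call average.
            return value / row["calls"] if row["calls"] else 0.0
        return value

    return heapq.nlargest(n, rows, key=score)


# Hypothetical sample data: total_exec_time is in milliseconds.
rows = [
    {"queryid": 1, "calls": 1000, "total_exec_time": 5000.0},
    {"queryid": 2, "calls": 10, "total_exec_time": 4000.0},
    {"queryid": 3, "calls": 500, "total_exec_time": 100.0},
]

by_total = [r["queryid"] for r in top_n_statements(rows, "total_exec_time", 2)]
by_call = [r["queryid"] for r in top_n_statements(rows, "total_exec_time", 2, per_call=True)]
print(by_total, by_call)
```

The example shows why both views matter: query 1 dominates total time because it runs often, while query 2 is the slowest per execution.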
156+
157+
### Dashboard 3: Single Query Analysis (17 graphs)
158+
1. **Active session history** - Wait events specifically for the selected query ID
159+
2. **Calls (pg_stat_statements)** - Execution frequency of the specific query over time
160+
3. **Execution time (pg_stat_statements)** - Total execution time for the specific query per second
161+
4. **Execution time per call (pg_stat_statements)** - Average execution time per call for the specific query
162+
5. **Rows (pg_stat_statements)** - Total rows returned by the specific query per second
163+
6. **Rows per call (pg_stat_statements)** - Average rows returned per execution of the specific query
164+
7. **shared_blks_hit (in bytes) (pg_stat_statements)** - Cache efficiency for the specific query (bytes from memory)
165+
8. **shared_blks_hit (in bytes) per call (pg_stat_statements)** - Average cache hits per execution of the specific query
166+
9. **WAL bytes (pg_stat_statements)** - WAL generation rate for the specific query
167+
10. **WAL bytes per call (pg_stat_statements)** - Average WAL generation per execution of the specific query
168+
11. **WAL fpi (in bytes) (pg_stat_statements)** - Full page images generated by the specific query
169+
12. **WAL fpi per call (pg_stat_statements)** - Average FPI generation per execution of the specific query
170+
13. **Temp bytes read (pg_stat_statements)** - Temporary file reads for the specific query
171+
14. **Temp bytes read per call (pg_stat_statements)** - Average temp file reads per execution of the specific query
172+
15. **Temp bytes written (pg_stat_statements)** - Temporary file writes for the specific query
173+
16. **Temp bytes written per call (pg_stat_statements)** - Average temp file writes per execution of the specific query
174+
17. **Query Analysis panels (multiple instances)** - Detailed analysis panels for the selected query
175+
176+
### Dashboard 4: Wait sampling dashboard (4 graphs)
177+
1. **Active session history** - Comprehensive view of all database wait events including background processes
178+
2. **Active session history by event type** - Wait events grouped by category (CPU, I/O, locks, etc.)
179+
3. **Active session history by event type and event** - Detailed breakdown with specific event names and query IDs
180+
4. **Query Analysis** - Drill-down analysis for queries associated with wait events
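
The three active session history views differ only in how sampled sessions are grouped. The sketch below counts samples by event type and by (event type, event, query ID). It assumes each sample carries the `wait_event_type` and `wait_event` values seen for an active session, and it labels sessions with no wait event as `CPU*`, a common convention that is an assumption here rather than something confirmed by this document.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional, Tuple


class Sample(NamedTuple):
    """One sampled active session (hypothetical structure)."""
    wait_event_type: Optional[str]
    wait_event: Optional[str]
    query_id: Optional[int]


CPU_LABEL = "CPU*"  # assumed label for active sessions that are not waiting


def count_by_event_type(samples: Iterable[Sample]) -> Counter:
    """Number of samples per wait event type."""
    return Counter(s.wait_event_type or CPU_LABEL for s in samples)


def count_by_event_and_query(samples: Iterable[Sample]) -> Counter:
    """Number of samples per (event type, event, query ID)."""
    return Counter(
        (s.wait_event_type or CPU_LABEL, s.wait_event or CPU_LABEL, s.query_id)
        for s in samples
    )


samples = [
    Sample(None, None, 42),
    Sample("IO", "DataFileRead", 42),
    Sample("IO", "DataFileRead", 42),
    Sample("Lock", "transactionid", 7),
]
print(count_by_event_type(samples))
print(count_by_event_and_query(samples))
```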
181+
182+
### Dashboard 5: Backup stats (3 graphs)
183+
1. **Archive success and errors** - Rate of successful vs failed WAL archive operations
184+
2. **WAL archive success rate** - Percentage of successful archive operations (should be 100%)
185+
3. **Archive lag** - Amount of WAL data waiting to be archived (data loss window)
186+
187+
### Dashboard 7: Autovacuum and bloat (1 graph)
188+
1. **Vacuum timeline** - Progress of VACUUM operations through phases (scanning, vacuuming, cleaning, etc.)
189+
190+
### Dashboard 8: Index health (6 graphs)
191+
1. **Detailed index view** - Tabular view of all indexes with bloat, size, and usage statistics
192+
2. **Top $top_n index analysis** - Overview of most problematic indexes by various metrics
193+
3. **Top $top_n indexes by estimated bloat %** - Indexes with highest percentage of wasted space
194+
4. **Top $top_n indexes by estimated bloat size** - Indexes with largest absolute amount of wasted space
195+
5. **Top $top_n indexes by size** - Largest indexes by total size (memory and disk impact)
196+
6. **Query Analysis panels (multiple instances)** - Detailed analysis for index-related queries
197+
198+
### Dashboard 9: Table stats (7 graphs)
199+
1. **Tuple operations** - Total tuple write operations (insert, update, delete, hot update) across all tables
200+
2. **Tuple operations (%)** - Percentage breakdown of different operation types
201+
3. **Number of inserted tuples by table** - Insert rates for individual tables over time
202+
4. **Number of updated tuples by table** - Update rates for individual tables (watch for bloat impact)
203+
5. **Number of hot updated tuples by table** - HOT updates by table (efficient updates avoiding index updates)
204+
6. **Number of deleted tuples by table** - Delete rates by table (deleted tuples become dead tuples that vacuum must clean up)
205+
7. **Table details panels (multiple instances)** - Detailed statistics and metrics for individual tables
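
The percentage breakdown in "Tuple operations (%)" can be sketched as below. This is an illustration under stated assumptions: the inputs are deltas of the `pg_stat_user_tables` counters `n_tup_ins`, `n_tup_upd`, `n_tup_hot_upd`, and `n_tup_del`, and HOT updates are treated as a subset of all updates. How the dashboard itself splits the categories is not stated in this document.

```python
from typing import Dict


def tuple_operation_shares(inserted: int, updated: int,
                           hot_updated: int, deleted: int) -> Dict[str, float]:
    """Percentage share of each write operation type; empty dict if no activity."""
    total = inserted + updated + deleted  # hot_updated is already counted in updated
    if total == 0:
        return {}
    non_hot_updated = updated - hot_updated
    return {
        "insert": 100.0 * inserted / total,
        "update (non-HOT)": 100.0 * non_hot_updated / total,
        "hot update": 100.0 * hot_updated / total,
        "delete": 100.0 * deleted / total,
    }


shares = tuple_operation_shares(inserted=50, updated=30, hot_updated=10, deleted=20)
print(shares)
```

A low HOT share on an update-heavy table is a hint to look at index coverage and fillfactor, since non-HOT updates must also touch every index on the table.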
206+
