# Trace Analysis Guide

This guide provides detailed instructions for querying and analyzing traces from the NBU Exporter to diagnose performance issues and optimize your monitoring setup.
## Table of Contents

- Understanding Trace Structure
- Common Queries
- Performance Analysis
- Troubleshooting Scenarios
- Best Practices

## Understanding Trace Structure

### Span Hierarchy
Each Prometheus scrape creates a trace with the following structure:

```
prometheus.scrape (root span)
├── netbackup.fetch_storage
│   └── http.request (GET /storage/storage-units)
└── netbackup.fetch_jobs
    ├── netbackup.fetch_job_page (offset=0)
    │   └── http.request (GET /admin/jobs?offset=0)
    ├── netbackup.fetch_job_page (offset=1)
    │   └── http.request (GET /admin/jobs?offset=1)
    └── netbackup.fetch_job_page (offset=N)
        └── http.request (GET /admin/jobs?offset=N)
```
### Span Attributes

#### Root Span (prometheus.scrape)

| Attribute | Type | Description | Example |
|---|---|---|---|
| scrape.duration_ms | int | Total scrape duration in milliseconds | 45230 |
| scrape.storage_metrics_count | int | Number of storage metrics collected | 12 |
| scrape.job_metrics_count | int | Number of job metrics collected | 156 |
| scrape.status | string | Overall scrape status | "success", "partial_failure" |
#### Storage Fetch (netbackup.fetch_storage)

| Attribute | Type | Description | Example |
|---|---|---|---|
| netbackup.endpoint | string | API endpoint path | "/storage/storage-units" |
| netbackup.storage_units | int | Number of storage units retrieved | 6 |
| netbackup.api_version | string | API version used | "13.0" |
#### Job Fetch (netbackup.fetch_jobs)

| Attribute | Type | Description | Example |
|---|---|---|---|
| netbackup.endpoint | string | API endpoint path | "/admin/jobs" |
| netbackup.time_window | string | Scraping interval | "1h" |
| netbackup.total_jobs | int | Total jobs retrieved | 1523 |
| netbackup.total_pages | int | Number of pages fetched | 16 |
#### HTTP Request (http.request)

| Attribute | Type | Description | Example |
|---|---|---|---|
| http.method | string | HTTP method | "GET" |
| http.url | string | Full request URL | "https://nbu:1556/netbackup/admin/jobs" |
| http.status_code | int | HTTP response status code | 200 |
| http.duration_ms | int | Request duration in milliseconds | 2341 |
## Common Queries

### Jaeger UI Queries

#### Find All Traces

#### Find Slow Scrapes (> 30 seconds)

#### Find Failed Scrapes

#### Find Specific Time Range

#### Find High Pagination
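The query bodies for this section were not preserved. As a sketch, the corresponding Jaeger UI search parameters might be the following (the service name `nbu-exporter` is taken from Step 1 below, and the tag names from the Span Attributes tables; note that Jaeger tag search is exact-match, so threshold conditions such as "more than 10 pages" need TraceQL instead):

```
Find All Traces:           Service = nbu-exporter
Find Slow Scrapes:         Service = nbu-exporter, Min Duration = 30s
Find Failed Scrapes:       Service = nbu-exporter, Tags = scrape.status=partial_failure
Find Specific Time Range:  Service = nbu-exporter, Lookback = Custom Time Range
Find High Pagination:      Service = nbu-exporter, Tags = netbackup.total_pages=16 (exact match only)
```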
### TraceQL Queries (Tempo/Grafana)

#### Find Slow API Calls

#### Find Failed HTTP Requests

#### Find Scrapes with Many Jobs

#### Find Storage Fetch Issues

#### Aggregate by Status Code
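The TraceQL bodies for this section were likewise not preserved. Assuming standard TraceQL syntax and the span attributes documented above, the five queries might look like this (one per heading, in order; the thresholds are illustrative, and the final metrics-style `rate() by ()` grouping requires a recent Tempo version):

```traceql
{ name = "http.request" && duration > 5s }

{ name = "http.request" && span.http.status_code >= 500 }

{ name = "netbackup.fetch_jobs" && span.netbackup.total_jobs > 5000 }

{ name = "netbackup.fetch_storage" && duration > 10s }

{ name = "http.request" } | rate() by (span.http.status_code)
```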
## Performance Analysis

### Identifying Bottlenecks

#### Step 1: Find Slow Traces

- Open Jaeger UI
- Select service: `nbu-exporter`
- Set "Min Duration" to 30s
- Click "Find Traces"
- Sort by duration (longest first)

#### Step 2: Analyze Span Timeline

Click on a slow trace and examine the waterfall view:
#### Example 1: Slow Storage Fetch

```
prometheus.scrape: 35.2s
├── netbackup.fetch_storage: 30.1s ⚠️ BOTTLENECK
│   └── http.request: 30.0s
└── netbackup.fetch_jobs: 5.1s
```

Diagnosis: Storage API is slow

Action: Check NetBackup server performance, verify network latency
#### Example 2: High Pagination

```
prometheus.scrape: 48.3s
├── netbackup.fetch_storage: 2.1s
└── netbackup.fetch_jobs: 46.2s ⚠️ BOTTLENECK
    ├── netbackup.fetch_job_page: 15.4s
    ├── netbackup.fetch_job_page: 15.3s
    └── netbackup.fetch_job_page: 15.5s
```

Diagnosis: Too many job pages (high job volume)

Action: Reduce scrapingInterval from 1h to 30m
#### Example 3: API Errors

```
prometheus.scrape: 25.2s
├── netbackup.fetch_storage: 2.1s
└── netbackup.fetch_jobs: 23.1s
    ├── netbackup.fetch_job_page: 5.2s (http.status_code=500) ⚠️ ERROR
    ├── netbackup.fetch_job_page: 5.3s (http.status_code=500) ⚠️ ERROR
    └── netbackup.fetch_job_page: 5.4s (http.status_code=500) ⚠️ ERROR
```

Diagnosis: NetBackup API errors

Action: Check NetBackup server logs, verify API key permissions
#### Step 3: Examine Span Attributes

Click on a span to view its attributes. Key metrics to check:

- `http.duration_ms`: Request latency
- `http.status_code`: Success/failure status
- `netbackup.total_pages`: Pagination count
- `netbackup.total_jobs`: Job volume
### Performance Metrics

#### Baseline Performance

Normal scrape (< 1000 jobs, 6 storage units):

```
prometheus.scrape: 8-12s
├── netbackup.fetch_storage: 1-2s
└── netbackup.fetch_jobs: 6-10s
    └── 1-2 pages @ 3-5s each
```

High-volume scrape (> 5000 jobs):

```
prometheus.scrape: 45-60s
├── netbackup.fetch_storage: 1-2s
└── netbackup.fetch_jobs: 43-58s
    └── 10-15 pages @ 3-5s each
```
#### Performance Targets
| Metric | Target | Warning | Critical |
|---|---|---|---|
| Total scrape duration | < 30s | 30-60s | > 60s |
| Storage fetch | < 5s | 5-10s | > 10s |
| Job page fetch | < 5s | 5-10s | > 10s |
| HTTP status codes | 200 | 4xx | 5xx |
| Total pages | < 5 | 5-10 | > 10 |
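The targets above can be applied mechanically when post-processing traces. A minimal sketch, with thresholds taken from the table (the `classify_scrape` helper is illustrative, not part of the exporter):

```python
def classify_scrape(duration_s: float, total_pages: int) -> str:
    """Classify one scrape against the performance-target table."""
    if duration_s > 60 or total_pages > 10:
        return "critical"
    if duration_s >= 30 or total_pages >= 5:
        return "warning"
    return "ok"

# Example: a baseline scrape vs. the high-pagination scrape from Example 2
print(classify_scrape(10.0, 2))    # ok
print(classify_scrape(48.3, 15))   # critical
```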
## Troubleshooting Scenarios

### Scenario 1: Scrapes Timing Out
Symptoms:
- Prometheus scrape timeout errors
- Incomplete metrics
- Traces show > 60s duration
Analysis:
Solutions:

- Reduce `scrapingInterval` from 1h to 30m
- Increase Prometheus scrape timeout
- Optimize NetBackup server performance
### Scenario 2: Intermittent Failures

Symptoms:

- Some scrapes succeed, others fail
- HTTP 500 errors in traces
- `scrape.status=partial_failure`
Analysis:
Solutions:
- Check NetBackup server logs for errors
- Verify API key has correct permissions
- Check network connectivity
- Review NetBackup server resource usage
### Scenario 3: Slow Storage Fetch

Symptoms:

- Storage metrics delayed
- `netbackup.fetch_storage` spans > 10s
Analysis:
Query: Operation: netbackup.fetch_storage, Min Duration: 10s
Look for: http.duration_ms in child spans
Solutions:
- Check NetBackup storage service status
- Verify network latency to NetBackup server
- Review NetBackup server disk I/O
- Check for storage unit issues
### Scenario 4: High Pagination

Symptoms:

- Long scrape durations
- Many `netbackup.fetch_job_page` spans
- `netbackup.total_pages` > 10
Analysis:
Solutions:

- Reduce `scrapingInterval` to fetch fewer jobs
- Consider filtering jobs by policy or status
- Optimize NetBackup job retention policies
### Scenario 5: API Version Issues
Symptoms:
- HTTP 406 errors
- Version detection failures
- Inconsistent API responses
Analysis:
Solutions:
- Verify NetBackup version compatibility
- Check API version configuration
- Enable automatic version detection
- Review NetBackup API logs
## Best Practices

### Query Optimization

Use specific time ranges:

Filter by operation:

Use tags for precision:

### Sampling Strategy

Development:

Production (normal load):

Production (high load):
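The concrete sampler settings for these three profiles are not shown above. Assuming the exporter uses the standard OpenTelemetry SDK environment variables (an assumption about this exporter's configuration), a typical progression might be:

```shell
# Development: keep every trace
export OTEL_TRACES_SAMPLER=always_on

# Production (normal load): keep a fixed fraction; children follow the root decision
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1   # 10% of scrapes

# Production (high load): sample more aggressively
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.01  # 1% of scrapes
```

Parent-based sampling keeps whole traces intact: either every span of a scrape is recorded or none is, which matters when analyzing the pagination fan-out.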
### Alerting on Traces

Create alerts for:

- Scrape duration > 60s
- HTTP status codes >= 500
- High pagination (> 10 pages)
- Partial failures

Example Prometheus alert:

```yaml
- alert: SlowNBUScrape
  expr: histogram_quantile(0.95, rate(scrape_duration_seconds_bucket[5m])) > 60
  annotations:
    summary: "NBU scrapes are slow"
    description: "95th percentile scrape duration is {{ $value }}s"
```
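Companion rules for the other conditions in the list could follow the same shape. The metric names below (`nbu_http_requests_total`, `nbu_scrape_pages`) are placeholders, not confirmed exporter metrics; substitute whatever metric names the exporter actually exposes:

```yaml
- alert: NBUHTTPServerErrors
  expr: rate(nbu_http_requests_total{code=~"5.."}[5m]) > 0   # metric name is an assumption
  annotations:
    summary: "NBU API returning 5xx errors"
- alert: NBUHighPagination
  expr: nbu_scrape_pages > 10                                # metric name is an assumption
  annotations:
    summary: "NBU scrape fetching more than 10 job pages"
```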
### Regular Analysis
Daily:
- Check for failed scrapes
- Review slow traces (> 30s)
- Monitor HTTP error rates
Weekly:
- Analyze pagination trends
- Review API version distribution
- Check sampling effectiveness
Monthly:
- Optimize scraping intervals
- Review trace retention policies
- Update performance baselines
## Advanced Analysis

### Comparing Traces
Compare before/after optimization:
- Find baseline trace (before changes)
- Note key metrics (duration, pages, etc.)
- Make configuration changes
- Find new trace (after changes)
- Compare metrics side-by-side
Example comparison:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total duration | 48.3s | 12.1s | 75% faster |
| Total pages | 15 | 3 | 80% reduction |
| Job count | 5234 | 1156 | Reduced scope |
### Correlation with Metrics
Link traces to Prometheus metrics:
- Note trace timestamp
- Query Prometheus for same time range
- Correlate trace spans with metric values
- Identify patterns
Example:

```
Trace shows: 15 pages fetched
Prometheus shows: nbu_jobs_count{} = 5234
Calculation: 5234 / 15 ≈ 349 jobs per page
```
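The correlation arithmetic can be reproduced directly (values taken from the example above):

```python
# Prometheus job count vs. pages seen in the trace
total_jobs = 5234      # nbu_jobs_count from Prometheus
pages_fetched = 15     # netbackup.fetch_job_page spans in the trace
print(round(total_jobs / pages_fetched))  # 349 jobs per page
```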
### Exporting Trace Data
Export for analysis:
- In Jaeger UI, click "JSON" on trace view
- Save trace JSON
- Analyze with custom tools
- Generate reports
Example Python analysis:

```python
import json

with open('trace.json') as f:
    trace = json.load(f)

for span in trace['spans']:
    if span['operationName'] == 'http.request':
        duration = span['duration'] / 1000  # Jaeger durations are microseconds; convert to ms
        print(f"HTTP request: {duration}ms")
```
## Span Attributes Reference

### Complete Attribute List
| Span | Attribute | Type | Description |
|---|---|---|---|
| prometheus.scrape | scrape.duration_ms | int | Total scrape duration |
| prometheus.scrape | scrape.storage_metrics_count | int | Storage metrics collected |
| prometheus.scrape | scrape.job_metrics_count | int | Job metrics collected |
| prometheus.scrape | scrape.status | string | Overall status |
| netbackup.fetch_storage | netbackup.endpoint | string | API endpoint |
| netbackup.fetch_storage | netbackup.storage_units | int | Storage units retrieved |
| netbackup.fetch_storage | netbackup.api_version | string | API version |
| netbackup.fetch_jobs | netbackup.endpoint | string | API endpoint |
| netbackup.fetch_jobs | netbackup.time_window | string | Scraping interval |
| netbackup.fetch_jobs | netbackup.total_jobs | int | Total jobs retrieved |
| netbackup.fetch_jobs | netbackup.total_pages | int | Pages fetched |
| netbackup.fetch_job_page | netbackup.page_offset | int | Page offset |
| netbackup.fetch_job_page | netbackup.jobs_in_page | int | Jobs in this page |
| http.request | http.method | string | HTTP method |
| http.request | http.url | string | Request URL |
| http.request | http.status_code | int | Response status |
| http.request | http.duration_ms | int | Request duration |
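These attributes can be aggregated straight from an exported trace. A sketch that tallies `http.status_code` values, assuming the Jaeger JSON export layout in which each span carries a list of `{key, value}` tags (the `status_code_histogram` helper is illustrative):

```python
from collections import Counter

def status_code_histogram(trace: dict) -> Counter:
    """Count http.status_code values across all http.request spans in a Jaeger trace."""
    codes = Counter()
    for span in trace.get("spans", []):
        if span.get("operationName") != "http.request":
            continue
        for tag in span.get("tags", []):
            if tag.get("key") == "http.status_code":
                codes[tag["value"]] += 1
    return codes

# Tiny in-memory example in the assumed export layout:
trace = {"spans": [
    {"operationName": "http.request", "tags": [{"key": "http.status_code", "value": 200}]},
    {"operationName": "http.request", "tags": [{"key": "http.status_code", "value": 500}]},
    {"operationName": "netbackup.fetch_jobs", "tags": []},
]}
print(status_code_histogram(trace))
```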