Module Overview
Even a correctly configured integration can fail. This module covers troubleshooting procedures, diagnostic techniques, and optimization strategies for all three data sources: LibreNMS, InfluxDB, and VictoriaLogs. You'll learn to identify root causes, resolve common errors, and apply best practices for production deployments.
What You'll Master: Systematic troubleshooting methodologies, performance optimization, security hardening, and production-ready monitoring configurations.
Time to Complete: 30-35 minutes
🌐 Section 1: LibreNMS Troubleshooting
LibreNMS integration issues typically stem from database connectivity, permission problems, or query inefficiencies. This section provides step-by-step resolution procedures for the most common scenarios.
Step 1. Issue: Database Connection Failed
Symptom: Grafana displays "error connecting to database" or "dial tcp: connect: connection refused".
Diagnostic Procedure:
Resolution Steps:
If database is not running:
If connection is refused:
If user doesn't exist or lacks permissions:
Common Firewall Issue
Port 3306 must be open between Grafana and LibreNMS servers:
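The branches above (service down, connection refused, firewall) can be told apart from the Grafana server with a plain TCP probe. The following is a minimal sketch, not part of the original procedure; the function name `diagnose_tcp` and the result labels are illustrative assumptions.

```python
import socket


def diagnose_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # something is listening and reachable
    except ConnectionRefusedError:
        return "refused"           # host reachable, nothing listening (or a REJECT rule)
    except socket.timeout:
        return "timeout"           # packets silently dropped, typical of a firewall DROP rule
    except socket.gaierror:
        return "dns_failure"       # hostname did not resolve
    except OSError:
        return "unreachable"       # no route to host or similar network error
```

Running `diagnose_tcp("librenms.example.internal", 3306)` from the Grafana server (the hostname is a placeholder) gives a first hint: "refused" usually points at the database service or its bind address, while "timeout" usually points at a firewall in between.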
Step 2. Issue: No Data Returned from Queries
Symptom: Panels display "No data" or queries return empty results despite devices being polled.
Diagnostic Commands:
LibreNMS Table Reference
Common tables used in Grafana queries:
- devices - Device inventory and status
- ports - Interface statistics and metrics
- sensors - Temperature, power, environmental data
- storage - Disk and memory utilization
Correct query format for Grafana:
Query Testing Best Practice
Always test queries in Grafana's Explore view first:
- Navigate to Explore (compass icon in left sidebar)
- Select LibreNMS data source
- Use Format: Table to see raw results
- Verify column names and data types
- Once working, copy query to dashboard
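The "verify column names and data types" step can also be done outside Grafana. The sketch below assumes nothing about the real LibreNMS schema: it uses an in-memory SQLite table as a stand-in, and the columns shown are illustrative. The helper `describe_result` is a hypothetical name; it relies only on the standard DB-API `cursor.description` attribute, so the same idea should carry over to a MySQL driver.

```python
import sqlite3


def describe_result(cursor, sql: str):
    """Run a query and return (column_names, first_row) for a quick sanity check."""
    cursor.execute(sql)
    columns = [col[0] for col in cursor.description]
    return columns, cursor.fetchone()


# In-memory stand-in for the LibreNMS database; the columns are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (device_id INTEGER, hostname TEXT, status INTEGER)")
conn.execute("INSERT INTO devices VALUES (1, 'core-sw-01', 1)")

columns, first_row = describe_result(
    conn.cursor(), "SELECT hostname, status FROM devices ORDER BY device_id"
)
print(columns)    # ['hostname', 'status']
print(first_row)  # ('core-sw-01', 1)
```

If `first_row` comes back as `None`, the table is empty or the filter matched nothing, which is a different problem from a misspelled column (that raises an error instead).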
Step 3. Issue: Slow Query Performance
Symptom: Dashboards take 10+ seconds to load, timeout errors, or database CPU spikes.
Diagnostic Analysis:
Optimization Strategies:
1. Add Database Indexes
2. Limit Data Retrieval
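To illustrate the first strategy, the sketch below shows how a query plan changes once an index exists. It is a stand-in, not the LibreNMS procedure: it uses SQLite's `EXPLAIN QUERY PLAN` on an in-memory table because that runs anywhere, whereas on MySQL/MariaDB you would run `EXPLAIN` against the real table. The table layout and index name are illustrative assumptions.

```python
import sqlite3


def query_plan(conn, sql: str) -> str:
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan detail


# Illustrative stand-in table, not the real LibreNMS schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ports (port_id INTEGER, device_id INTEGER, ifName TEXT)")
conn.executemany(
    "INSERT INTO ports VALUES (?, ?, ?)",
    [(i, i % 50, f"eth{i}") for i in range(1000)],
)

sql = "SELECT ifName FROM ports WHERE device_id = 7"
plan_before = query_plan(conn, sql)  # expected to be a full table scan
conn.execute("CREATE INDEX idx_ports_device ON ports (device_id)")
plan_after = query_plan(conn, sql)   # expected to use the new index
print(plan_before)
print(plan_after)
```

The point of the comparison is the habit rather than the engine: capture the plan, add the index, capture the plan again, and confirm the scan turned into an index lookup before declaring the query fixed.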
Performance Benchmarks
| Time Range | Target Query Time | Recommended Interval |
|---|---|---|
| Last 6 hours | < 2 seconds | 1 minute |
| Last 24 hours | < 3 seconds | 5 minutes |
| Last 7 days | < 5 seconds | 30 minutes |
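The table above can be read as a rule for picking an aggregation interval from the selected time range. A minimal sketch of that rule follows; the function names are illustrative, and the fallback for ranges longer than 7 days is an assumption, since the table gives no guidance there.

```python
from datetime import timedelta

# (maximum time range, recommended interval), taken from the table above.
BENCHMARKS = [
    (timedelta(hours=6), timedelta(minutes=1)),
    (timedelta(hours=24), timedelta(minutes=5)),
    (timedelta(days=7), timedelta(minutes=30)),
]


def recommended_interval(time_range: timedelta) -> timedelta:
    """Pick the interval of the smallest benchmark row that covers time_range."""
    for max_range, interval in BENCHMARKS:
        if time_range <= max_range:
            return interval
    # Assumption: beyond 7 days, fall back to the coarsest interval in the table.
    return BENCHMARKS[-1][1]


def point_count(time_range: timedelta, interval: timedelta) -> int:
    """Number of points a single series returns at the given interval."""
    return int(time_range / interval)


for max_range, interval in BENCHMARKS:
    print(max_range, interval, point_count(max_range, interval))
```

Each row works out to roughly 300 points per series (360, 288, and 336), which suggests the intervals were chosen to keep the result size about constant as the range grows.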
📡 Section 2: InfluxDB Troubleshooting
InfluxDB issues commonly involve authentication, query language compatibility (Flux vs InfluxQL), and time series aggregation challenges.
Step 4. Issue: 401 Unauthorized Errors
Symptom: "401 Unauthorized" or "invalid authentication credentials".
Diagnostic Procedure (InfluxDB 2.x):
Resolution Steps:
If token is invalid or expired:
InfluxDB 1.x vs 2.x Authentication
| Version | Auth Method | Grafana Config |
|---|---|---|
| InfluxDB 1.x | Username/Password | Basic Auth enabled, no token |
| InfluxDB 2.x | Token-based | Token in HTTP headers, no Basic Auth |
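The difference in the table comes down to which `Authorization` header reaches InfluxDB. The sketch below builds both variants so they can be compared with what a failing request actually sends; the function name and arguments are illustrative assumptions, and no request is made.

```python
import base64


def influx_auth_headers(version: int, *, token: str = "",
                        username: str = "", password: str = "") -> dict:
    """Build the Authorization header for InfluxDB 1.x (Basic) or 2.x (Token)."""
    if version >= 2:
        if not token:
            raise ValueError("InfluxDB 2.x needs an API token")
        return {"Authorization": f"Token {token}"}
    if not username:
        raise ValueError("InfluxDB 1.x needs a username")
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
```

A common cause of a 401 after an upgrade is sending the 1.x style header to a 2.x server; printing the header for both versions makes that mismatch easy to spot.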
Step 5. Issue: No Measurements Found
Symptom: Metric dropdowns are empty, or queries return "measurement not found".
Diagnostic Commands (InfluxDB 1.x - InfluxQL):
Problem: Telegraf not collecting data
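Before concluding that Telegraf is not collecting, it helps to confirm what InfluxDB 1.x itself reports. The sketch below builds the `SHOW MEASUREMENTS` request URL for the 1.x `/query` endpoint and parses its JSON response. The database name `telegraf`, the host, and the function names are assumptions for illustration; fetching the URL is left to `curl` or similar.

```python
import json
from urllib.parse import urlencode


def show_measurements_url(base_url: str, database: str) -> str:
    """URL for an InfluxQL SHOW MEASUREMENTS call against the 1.x /query endpoint."""
    query = urlencode({"db": database, "q": "SHOW MEASUREMENTS"})
    return f"{base_url.rstrip('/')}/query?{query}"


def parse_measurements(body: str) -> list:
    """Extract measurement names from a 1.x /query JSON response body."""
    payload = json.loads(body)
    names = []
    for result in payload.get("results", []):
        # An empty database returns a result object with no "series" key.
        for series in result.get("series", []):
            names.extend(row[0] for row in series.get("values", []))
    return names


print(show_measurements_url("http://localhost:8086", "telegraf"))
```

An empty list from `parse_measurements` means the database exists but holds no data, which is the case where checking the Telegraf service and its output configuration is the next step.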
Step 6. Issue: Query Timeout Errors
Symptom: "context deadline exceeded" or "query timeout".
Optimization Techniques:
1. Use Appropriate Time Windows
2. Implement Downsampling for Long Ranges
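Both techniques reduce to one idea: keep the number of returned points bounded regardless of the time range. The sketch below picks an aggregation window for a given range and renders a Flux query using `aggregateWindow`. The 500-point budget, the list of allowed windows, and the bucket and measurement names are assumptions chosen for illustration.

```python
import math

# Candidate window sizes in seconds (10s up to 1 day); an illustrative choice.
ALLOWED_WINDOWS = [10, 30, 60, 300, 900, 1800, 3600, 21600, 86400]


def window_seconds(range_seconds: int, max_points: int = 500) -> int:
    """Smallest allowed window that keeps a series at or under max_points."""
    needed = math.ceil(range_seconds / max_points)
    for window in ALLOWED_WINDOWS:
        if window >= needed:
            return window
    return ALLOWED_WINDOWS[-1]


def flux_query(bucket: str, measurement: str, range_seconds: int,
               max_points: int = 500) -> str:
    """Render a downsampled Flux query for the given range."""
    every = window_seconds(range_seconds, max_points)
    return (
        f'from(bucket: "{bucket}")\n'
        f"  |> range(start: -{range_seconds}s)\n"
        f'  |> filter(fn: (r) => r._measurement == "{measurement}")\n'
        f"  |> aggregateWindow(every: {every}s, fn: mean, createEmpty: false)"
    )


print(flux_query("telegraf", "cpu", 7 * 24 * 3600))
```

For a 7-day range this selects a 30-minute window, and for a 1-hour range a 10-second window, so short ranges stay detailed while long ranges stay cheap.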
📋 Section 3: VictoriaLogs Troubleshooting
VictoriaLogs troubleshooting focuses on endpoint configuration, LogsQL syntax (VictoriaLogs' own query language, not Loki's LogQL), and label-based filtering.
Step 7. Issue: Failed to Call Resource Errors
Symptom: "failed to call resource" or "404 Not Found".
Diagnostic Procedure:
VictoriaLogs URL Configuration
Correct URL format:
Problem: Port not accessible
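For the URL check above, a quick way to rule out a malformed address is to build the query URL from the base address and test it with `curl`. The sketch below assumes the VictoriaLogs HTTP query endpoint `/select/logsql/query` and a base URL of the form `http://host:port`; the hostname in the example is a placeholder, and the function name is illustrative.

```python
from urllib.parse import urlencode, urlsplit


def logsql_query_url(base_url: str, query: str, limit: int = 10) -> str:
    """Build a /select/logsql/query URL from a base address, ignoring any stray path."""
    parts = urlsplit(base_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected http(s)://host:port, got {base_url!r}")
    params = urlencode({"query": query, "limit": limit})
    return f"{parts.scheme}://{parts.netloc}/select/logsql/query?{params}"


print(logsql_query_url("http://logs.example.internal:9428/", "error", limit=5))
```

If the generated URL returns results from the Grafana server but the data source still fails, the problem is more likely in the data source settings than in the network path.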
Step 8. Issue: Label Filtering Not Working
Symptom: Queries return no results or wrong results despite logs existing
Diagnostic Steps:
Common LogsQL Syntax Issues:
LogsQL Query Examples
| Use Case | Query |
|---|---|
| All logs from a host | {host="server-01"} |
| Logs containing "error" | {} \| "error" |
| Multiple label filters | {host="server-01",level="error"} |
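Hand-typed selectors are a frequent source of filtering bugs (missing quotes, stray spaces, unescaped characters). The sketch below generates the `{label="value"}` selector form used in the table from a dictionary; the function name is an illustrative assumption. Values are passed through unchanged, so a case mismatch such as `Server-01` versus `server-01` still has to be checked against the stored label values.

```python
def stream_selector(labels: dict) -> str:
    """Render {name="value",...} with backslashes and quotes in values escaped."""
    def quote(value: str) -> str:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'

    pairs = [f"{name}={quote(value)}" for name, value in labels.items()]
    return "{" + ",".join(pairs) + "}"


print(stream_selector({"host": "server-01"}))
print(stream_selector({"host": "server-01", "level": "error"}))
```

The two printed selectors match the first and third rows of the table, which makes the helper easy to sanity-check against known-good queries.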
🔍 Section 4: Grafana Configuration Debugging
Step 9. Enable Debug Logging
Debug logging provides detailed information about data source connections and query execution.
Viewing Grafana Logs:
Debug Logging Impact
Debug logging generates significant disk I/O. Key considerations:
- Can generate 100MB+ logs per hour on busy systems
- May impact performance on resource-constrained servers
- Always revert to INFO level after troubleshooting
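Because reverting the log level is easy to forget, it can be scripted. The sketch below toggles the `level` option in the `[log]` section of a Grafana-style INI text using Python's `configparser`. It operates on a string rather than a real file, the function name is illustrative, and it assumes every option sits under a section header. Note that `configparser` drops comments when writing, so on a real `grafana.ini` you may prefer to edit the single line by hand.

```python
import configparser
import io

VALID_LEVELS = {"debug", "info", "warn", "error", "critical"}


def set_log_level(ini_text: str, level: str) -> str:
    """Return ini_text with [log] level set to `level` (comments are not preserved)."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section("log"):
        parser.add_section("log")
    parser.set("log", "level", level)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


print(set_log_level("[log]\nmode = console file\nlevel = debug\n", "info"))
```

Grafana needs a restart to pick up the change, so pair the edit with a service restart when troubleshooting is finished.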
Step 10. Using Explore Mode for Query Testing
Grafana's Explore view is the best tool for testing queries before adding them to dashboards.
Explore Mode Workflow:
Explore Mode Shortcuts
- Shift + Enter: Run query
- Split view: Click "Split" to compare multiple queries
- Share Explore: Copy URL to share exact query with team
Step 11. Network Connectivity Testing
Systematic network testing eliminates connectivity as a variable in troubleshooting.
Connectivity Test Checklist:
- Verify Grafana server can resolve data source hostname
- Test port accessibility with telnet or nc
- Verify firewall rules on both endpoints
- Test with curl from Grafana server
- Validate DNS resolution
- Confirm time synchronization (NTP)
Diagnostic Commands:
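Two items on the checklist, DNS resolution and time synchronization, can be checked with the standard library alone. The sketch below resolves a hostname with the local resolver and computes clock skew from an HTTP `Date` header (for example one captured with `curl -I` against the data source). Function names are illustrative assumptions, and the skew figure is only as precise as the header's one-second resolution.

```python
import socket
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def resolve_host(hostname: str) -> list:
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def clock_skew_seconds(http_date: str, local_now: datetime) -> float:
    """Local time minus the remote time from an HTTP Date header, in seconds."""
    remote = parsedate_to_datetime(http_date)
    return (local_now - remote).total_seconds()


print(resolve_host("127.0.0.1"))
print(clock_skew_seconds(
    "Tue, 15 Nov 1994 08:12:31 GMT",
    datetime(1994, 11, 15, 8, 12, 41, tzinfo=timezone.utc),
))  # 10.0
```

In practice you would pass `datetime.now(timezone.utc)` as `local_now`. A skew of more than a few seconds is worth fixing, since time-range queries and token validation both depend on consistent clocks.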
🛡️ Section 5: Security Best Practices
Step 12. Implement Least Privilege Access
Database users for Grafana should have only SELECT (read) permissions.
MySQL/MariaDB (LibreNMS):
InfluxDB 2.x:
Security Validation
Test that read-only user CANNOT modify data:
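Besides attempting a write, the account's grants can be inspected directly. The sketch below scans the text lines returned by MySQL's `SHOW GRANTS` and reports any privilege beyond `SELECT` and `USAGE`. The function name and the user and host in the example are illustrative assumptions, and column-level grants (privileges with a parenthesised column list) are not handled.

```python
import re

READ_ONLY_PRIVILEGES = {"SELECT", "USAGE"}


def write_privileges(grant_lines) -> list:
    """Return privileges in SHOW GRANTS output that go beyond read-only access."""
    extra = set()
    for line in grant_lines:
        match = re.match(r"GRANT\s+(.+?)\s+ON\s", line, flags=re.IGNORECASE)
        if not match:
            continue  # not a GRANT line; ignore
        for privilege in match.group(1).split(","):
            name = privilege.strip().upper()
            if name not in READ_ONLY_PRIVILEGES:
                extra.add(name)
    return sorted(extra)


grants = [
    "GRANT USAGE ON *.* TO `grafana`@`10.0.0.5`",
    "GRANT SELECT, INSERT ON `librenms`.* TO `grafana`@`10.0.0.5`",
]
print(write_privileges(grants))  # ['INSERT']
```

An empty list is the expected result for a correctly restricted Grafana user; anything else names the privilege to revoke.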
Step 13. Enable SSL/TLS Encryption
Encrypt data in transit between Grafana and data sources.
MySQL/MariaDB SSL Configuration:
Copy CA certificate to Grafana server:
Step 14. Implement Firewall Rules
Restrict data source access to only the Grafana server.
Ubuntu/Debian (UFW):
RHEL/CentOS (firewalld):
📊 Section 6: Performance Optimization
Step 15. Dashboard Refresh Rate Tuning
Aggressive refresh rates increase database load without providing value for slowly-changing metrics.
Recommended Refresh Intervals:
| Dashboard Type | Recommended Refresh | Rationale |
|---|---|---|
| Real-time monitoring | 30 seconds - 1 minute | Balance freshness with load |
| Network interface stats | 1-5 minutes | Polling typically every 5 minutes |
| Application logs | 1-2 minutes | Log volume and query complexity |
| Historical analysis | Off (manual refresh) | Static data doesn't need auto-refresh |
Query Count Optimization
Reduce the number of queries per dashboard:
- Combine related metrics into single query when possible
- Use query variables to avoid duplicated queries
- Hide panels not currently in use (reduces query execution)
- Use Grafana's query caching where available (a Grafana Enterprise and Grafana Cloud feature that must be enabled per data source)
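The effect of refresh rate and query count on database load is simple arithmetic. The sketch below estimates queries per hour under deliberately simple assumptions: one query per panel, every viewer keeps the dashboard open, and no caching. The function name and the example numbers are illustrative.

```python
def queries_per_hour(panels: int, viewers: int, refresh_seconds: int) -> int:
    """Estimate hourly query volume for one dashboard (one query per panel, no caching)."""
    refreshes_per_hour = 3600 // refresh_seconds
    return panels * viewers * refreshes_per_hour


print(queries_per_hour(panels=20, viewers=5, refresh_seconds=30))   # 12000
print(queries_per_hour(panels=20, viewers=5, refresh_seconds=300))  # 1200
```

Moving a 20-panel dashboard from a 30-second to a 5-minute refresh cuts the load tenfold, and for interface statistics polled every 5 minutes it loses no information.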
Step 16. Database Connection Pooling
Optimize database connections to handle multiple concurrent dashboard users.
MySQL/MariaDB Connection Pooling:
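Pool settings are easiest to get wrong relative to each other, so a small validation step helps. The sketch below checks a proposed pool configuration and returns it using the `maxOpenConns`, `maxIdleConns`, and `connMaxLifetime` keys of Grafana's SQL data source `jsonData`. The function name and example values are assumptions; the 28800-second default is MySQL's stock `wait_timeout` and should be replaced with your server's actual value.

```python
def pool_settings(max_open: int, max_idle: int, max_lifetime_s: int,
                  wait_timeout_s: int = 28800) -> dict:
    """Validate pool limits and return them keyed like Grafana's jsonData."""
    if max_open < 1:
        raise ValueError("max_open must be at least 1")
    if max_idle > max_open:
        raise ValueError("max_idle cannot exceed max_open")
    if max_lifetime_s >= wait_timeout_s:
        # Otherwise the server may close connections the pool still thinks are alive.
        raise ValueError("max_lifetime_s must be below the server's wait_timeout")
    return {
        "maxOpenConns": max_open,
        "maxIdleConns": max_idle,
        "connMaxLifetime": max_lifetime_s,
    }


print(pool_settings(max_open=20, max_idle=10, max_lifetime_s=14400))
```

The returned dictionary can be pasted into a data source provisioning file or compared with the values shown in the data source settings page.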
✅ Section 7: Production Deployment Checklist
Step 17. Pre-Deployment Validation
Security Checklist:
- Changed default Grafana admin password
- Created read-only database users for all data sources
- Implemented firewall rules restricting data source access
- Enabled SSL/TLS for all data source connections
- Set up regular credential rotation schedule
- Disabled anonymous Grafana access
- Enabled audit logging for access tracking
Performance Checklist:
- Optimized database queries (EXPLAIN analysis completed)
- Added indexes to frequently queried columns
- Implemented appropriate dashboard refresh rates
- Enabled query caching in Grafana
- Configured database connection pooling
- Tested dashboard load times (<5 seconds target)
- Verified no query timeout errors under normal load
Reliability Checklist:
- All data sources pass health checks
- Configured service auto-start on boot
- Implemented database backup procedures
- Created Grafana dashboard export backups
- Documented restore procedures
- Set up monitoring for Grafana itself
- Configured log rotation for all services
Step 18. Post-Deployment Monitoring
After deployment, continuously monitor the health and performance of your Grafana integration.
Key Metrics to Monitor:
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Dashboard Load Time | > 5 seconds | > 10 seconds |
| Query Error Rate | > 1% | > 5% |
| Database CPU Usage | > 70% | > 90% |
| Grafana Memory Usage | > 80% | > 95% |
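The thresholds in the table translate directly into an alert classification. The sketch below encodes them and classifies a measured value; the metric key names are illustrative assumptions, while the numbers come from the table.

```python
# metric key: (warning threshold, critical threshold), from the table above.
THRESHOLDS = {
    "dashboard_load_seconds": (5, 10),
    "query_error_rate_pct": (1, 5),
    "database_cpu_pct": (70, 90),
    "grafana_memory_pct": (80, 95),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a measured value."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(classify("database_cpu_pct", 75))       # warning
print(classify("dashboard_load_seconds", 4))  # ok
```

Both thresholds are strict ("greater than"), matching the table, so a value exactly at a threshold stays in the lower category.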
Deployment Success Criteria
Your Grafana integration is production-ready when:
- ✅ All dashboards load in under 5 seconds
- ✅ Query error rate is < 0.1%
- ✅ All security hardening steps are complete
- ✅ Backup and restore procedures are tested
- ✅ Documentation is complete and accessible
- ✅ Team members are trained on basic troubleshooting
🔄 Troubleshooting Workflow Summary
Use this systematic approach when encountering any integration issue:
1. What exact error message or behavior are you seeing?
2. Is the issue with Grafana, data source, network, or query?
3. Service running? Network accessible? Credentials valid?
4. Turn on debug logs for detailed error information
5. Use Explore mode, command-line tools, curl tests
6. Implement fix based on diagnostic results
7. Test thoroughly, document the solution
8. Add monitoring, alerts, or automation to prevent recurrence
📚 Quick Reference: Common Issues
| Symptom | Likely Cause | Quick Fix | See Section |
|---|---|---|---|
| Database connection failed | Service down or firewall | Check service status, test port | Section 1, Step 1 |
| No data in panels | Wrong table/field names | Verify query in Explore mode | Section 1, Step 2 |
| Slow dashboard loading | Unoptimized queries | Run EXPLAIN, add indexes | Section 1, Step 3 |
| 401 Unauthorized | Wrong token or password | Regenerate credentials | Section 2, Step 4 |
| No measurements found | Telegraf not collecting | Check Telegraf status | Section 2, Step 5 |
| Query timeout | Query too broad | Implement downsampling | Section 2, Step 6 |
| Failed to call resource | Wrong VictoriaLogs URL | Use /select/logsql endpoint | Section 3, Step 7 |
| Label filtering not working | Case sensitivity or syntax | Check label values, use regex | Section 3, Step 8 |
🎓 Module Summary
Congratulations! You've completed the troubleshooting and optimization module. You now have the skills to:
- ✅ Diagnose and resolve LibreNMS database connection and query issues
- ✅ Troubleshoot InfluxDB authentication and performance problems
- ✅ Fix VictoriaLogs endpoint and LogsQL syntax errors
- ✅ Use Grafana's debugging tools effectively
- ✅ Implement security best practices for production deployments
- ✅ Optimize query performance and dashboard load times
- ✅ Follow systematic troubleshooting methodologies
- ✅ Validate production-readiness using comprehensive checklists
Next Steps
You're now ready for the final module: Verification & Assessment. This module will test your knowledge through hands-on scenarios and provide a comprehensive skills assessment.