Module 7 of 8

🔧 Troubleshooting & Optimization

Diagnosing and Resolving Integration Issues

Module Overview

Even with proper configuration, integrations can encounter issues. This module provides comprehensive troubleshooting procedures, diagnostic techniques, and optimization strategies for all three data sources. You'll learn to identify root causes, resolve common errors, and implement best practices for production deployments.

What You'll Master: Systematic troubleshooting methodologies, performance optimization, security hardening, and production-ready monitoring configurations.

Time to Complete: 30-35 minutes

🌐 Section 1: LibreNMS Troubleshooting

LibreNMS integration issues typically stem from database connectivity, permission problems, or query inefficiencies. This section provides step-by-step resolution procedures for the most common scenarios.

Issue 1: Database Connection Failed

Symptom: Grafana displays "error connecting to database" or "dial tcp: connect: connection refused"

Diagnostic Procedure:

# Step 1: Verify database is running
systemctl status mariadb    # or for MySQL: systemctl status mysql

# Step 2: Test connectivity from Grafana server
telnet librenms-server 3306

# Step 3: Verify user credentials
mysql -h librenms-server -u grafana_readonly -p

# Step 4: Check user permissions
SHOW GRANTS FOR 'grafana_readonly'@'%';
+----------------------------------------------------------+
| Grants for grafana_readonly@%                             |
+----------------------------------------------------------+
| GRANT SELECT ON `librenms`.* TO `grafana_readonly`@`%`    |
+----------------------------------------------------------+

Resolution Steps:

If database is not running:

sudo systemctl start mariadb
sudo systemctl enable mariadb

If connection is refused:

# Check bind address in MySQL config
sudo nano /etc/mysql/mariadb.conf.d/50-server.cnf

# Find and modify:
bind-address = 0.0.0.0    # Allow remote connections

# Restart service
sudo systemctl restart mariadb

If user doesn't exist or lacks permissions:

# Recreate user with correct permissions
mysql -u root -p

CREATE USER 'grafana_readonly'@'%' IDENTIFIED BY 'SecurePassword123!';
GRANT SELECT ON librenms.* TO 'grafana_readonly'@'%';
FLUSH PRIVILEGES;
EXIT;

Common Firewall Issue

Port 3306 must be open between Grafana and LibreNMS servers:

# On LibreNMS server (Ubuntu/Debian):
sudo ufw allow from GRAFANA_IP to any port 3306

# On LibreNMS server (RHEL/CentOS):
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="GRAFANA_IP" port protocol="tcp" port="3306" accept'
sudo firewall-cmd --reload

Issue 2: No Data Returned from Queries

Symptom: Panels display "No data" or queries return empty results despite devices being polled.

Diagnostic Commands:

# Step 1: Verify data exists in database
mysql -h librenms-server -u grafana_readonly -p librenms

# Check recent polling data
SELECT device_id, hostname, status, last_polled
FROM devices
ORDER BY last_polled DESC
LIMIT 10;

# Check port statistics
SELECT COUNT(*) AS total_records, MAX(poll_time) AS latest_data
FROM ports;

# Verify specific device data
SELECT d.hostname, p.ifDescr, p.ifOperStatus
FROM devices d
JOIN ports p ON d.device_id = p.device_id
WHERE d.hostname = 'core-switch-01'
LIMIT 5;

LibreNMS Table Reference

Common tables used in Grafana queries:

  • devices - Device inventory and status
  • ports - Interface statistics and metrics
  • sensors - Temperature, power, environmental data
  • storage - Disk and memory utilization
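
If you are unsure which columns a table exposes, inspect the schema before writing panel queries. A minimal sketch, run from the MySQL client as the read-only user (table names from the list above; column layouts can vary between LibreNMS versions):

# Inspect available columns before building queries
SHOW COLUMNS FROM sensors;
SHOW COLUMNS FROM storage;

# Spot-check a few rows to confirm the data looks sane
SELECT * FROM sensors LIMIT 5;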

Correct query format for Grafana:

SELECT
  UNIX_TIMESTAMP(poll_time) AS time_sec,
  ifInOctets_rate * 8 AS "Inbound bps",
  ifOutOctets_rate * 8 AS "Outbound bps"
FROM ports
WHERE poll_time >= FROM_UNIXTIME($__from / 1000)
  AND poll_time <= FROM_UNIXTIME($__to / 1000)
  AND port_id = $port_id
ORDER BY poll_time ASC;

Query Testing Best Practice

Always test queries in Grafana's Explore view first:

  1. Navigate to Explore (compass icon in left sidebar)
  2. Select LibreNMS data source
  3. Use Format: Table to see raw results
  4. Verify column names and data types
  5. Once working, copy query to dashboard
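
As a starting point, here is a minimal sanity-check query you can paste into Explore with Format set to Table. It uses only columns already shown in this module, so an empty result points to a data problem rather than a query problem:

SELECT hostname, status, last_polled
FROM devices
ORDER BY last_polled DESC
LIMIT 5;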

Issue 3: Slow Query Performance

Symptom: Dashboards take 10+ seconds to load, timeout errors, or database CPU spikes.

Diagnostic Analysis:

# Step 1: Identify slow queries
mysql -u root -p
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;    # Log queries > 2 seconds

# Step 2: Analyze a specific query
EXPLAIN SELECT d.hostname, p.ifDescr, AVG(p.ifInOctets_rate) AS avg_in
FROM devices d
JOIN ports p ON d.device_id = p.device_id
WHERE p.poll_time > NOW() - INTERVAL 1 HOUR
GROUP BY d.hostname, p.ifDescr;

Optimization Strategies:

1. Add Database Indexes

# Create composite index on frequently queried columns
ALTER TABLE ports ADD INDEX idx_poll_time (poll_time, device_id);

# Add index for device lookups
ALTER TABLE devices ADD INDEX idx_hostname (hostname);

# Verify indexes were created
SHOW INDEXES FROM ports;

2. Limit Data Retrieval

# BAD: Retrieves all historical data
SELECT * FROM ports WHERE device_id = 1;

# GOOD: Only fetch what's needed
SELECT
  UNIX_TIMESTAMP(poll_time) AS time_sec,
  ifInOctets_rate,
  ifOutOctets_rate
FROM ports
WHERE device_id = 1
  AND poll_time >= FROM_UNIXTIME($__from / 1000)
  AND poll_time <= FROM_UNIXTIME($__to / 1000)
LIMIT 10000;    # Prevent runaway queries

Performance Benchmarks

Time Range    | Target Query Time | Recommended Interval
Last 6 hours  | < 2 seconds       | 1 minute
Last 24 hours | < 3 seconds       | 5 minutes
Last 7 days   | < 5 seconds       | 30 minutes
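
To stay inside these targets, let Grafana choose the aggregation bucket for you instead of returning raw rows. A sketch using the MySQL data source's $__timeGroup and $__timeFilter macros, assuming the ports schema and $port_id variable used earlier in this module:

SELECT
  $__timeGroup(poll_time, $__interval) AS time,
  AVG(ifInOctets_rate) * 8 AS "Inbound bps",
  AVG(ifOutOctets_rate) * 8 AS "Outbound bps"
FROM ports
WHERE $__timeFilter(poll_time)
  AND port_id = $port_id
GROUP BY 1
ORDER BY 1;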

📡 Section 2: InfluxDB Troubleshooting

InfluxDB issues commonly involve authentication, query language compatibility (Flux vs InfluxQL), and time series aggregation challenges.

Issue 4: 401 Unauthorized Errors

Symptom: "401 Unauthorized" or "invalid authentication credentials"

Diagnostic Procedure (InfluxDB 2.x):

# Step 1: Verify InfluxDB is running
systemctl status influxdb

# Step 2: Test API endpoint
curl http://influxdb-server:8086/health

# Step 3: Verify token validity
curl -H "Authorization: Token YOUR_TOKEN_HERE" \
  http://influxdb-server:8086/api/v2/buckets

# Step 4: List all tokens (requires admin access)
influx auth list

Resolution Steps:

If token is invalid or expired:

# Create new read token
influx auth create \
  --org YourOrgName \
  --description "Grafana Read Token" \
  --read-bucket telegraf

InfluxDB 1.x vs 2.x Authentication

Version      | Auth Method       | Grafana Config
InfluxDB 1.x | Username/Password | Basic Auth enabled, no token
InfluxDB 2.x | Token-based       | Token in HTTP headers, no Basic Auth
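
A quick way to confirm which auth style your server actually accepts is to test both from the command line. A sketch with curl (substitute your own server name and credentials):

# InfluxDB 1.x: username/password (HTTP basic auth)
curl -G 'http://influxdb-server:8086/query' \
  -u grafana_reader:YourPassword \
  --data-urlencode 'q=SHOW DATABASES'

# InfluxDB 2.x: token in the Authorization header
curl -H "Authorization: Token YOUR_TOKEN_HERE" \
  http://influxdb-server:8086/api/v2/buckets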

Issue 5: No Measurements Found

Symptom: Metric dropdowns are empty, or queries return "measurement not found"

Diagnostic Commands (InfluxDB 1.x - InfluxQL):

# Connect to database
influx -database telegraf

# List all measurements
SHOW MEASUREMENTS;

# Check measurement schema
SHOW FIELD KEYS FROM cpu;

# Verify recent data exists
SELECT * FROM cpu WHERE time > now() - 5m LIMIT 10;

Problem: Telegraf not collecting data

# Check Telegraf status
systemctl status telegraf

# View Telegraf logs
journalctl -u telegraf -n 50 --no-pager

# Test Telegraf config
telegraf --test --config /etc/telegraf/telegraf.conf

# Restart Telegraf if needed
sudo systemctl restart telegraf

Issue 6: Query Timeout Errors

Symptom: "context deadline exceeded" or "query timeout"

Optimization Techniques:

1. Use Appropriate Time Windows

// BAD: Queries all historical data (Flux)
from(bucket: "telegraf")
  |> range(start: 0)
  |> filter(fn: (r) => r._measurement == "cpu")

// GOOD: Limited time range
from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "cpu")

2. Implement Downsampling for Long Ranges

// For time ranges > 24 hours, aggregate data (Flux)
from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> filter(fn: (r) => r._field == "usage_idle")
  |> aggregateWindow(every: 5m, fn: mean)
  |> yield(name: "mean")
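
If the data source is configured for InfluxQL (InfluxDB 1.x), the equivalent downsampling pattern uses GROUP BY time() with Grafana's interval variable. A sketch assuming the same cpu measurement and usage_idle field:

SELECT mean("usage_idle")
FROM "cpu"
WHERE $timeFilter
GROUP BY time($__interval) fill(null)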

📋 Section 3: VictoriaLogs Troubleshooting

VictoriaLogs troubleshooting focuses on endpoint configuration, LogsQL syntax, and label-based filtering.

Issue 7: Failed to Call Resource Errors

Symptom: "failed to call resource" or "404 Not Found"

Diagnostic Procedure:

# Step 1: Verify VictoriaLogs is running
curl http://victorialogs-server:9428/health

# Step 2: Test the LogsQL query endpoint
curl -X POST 'http://victorialogs-server:9428/select/logsql/query' \
  -d 'query={}'

# Step 3: Verify labels endpoint
curl -X POST 'http://victorialogs-server:9428/select/logsql/labels'

VictoriaLogs URL Configuration

Correct URL format:

# In Grafana data source settings:
URL: http://victorialogs-server:9428/select/logsql

# Common mistakes to avoid:
# ❌ http://victorialogs-server:9428 (missing path)
# ❌ http://victorialogs-server:9428/select (incomplete)

Problem: Port not accessible

# Test connectivity from Grafana server
telnet victorialogs-server 9428

# Check if VictoriaLogs is listening
sudo netstat -tlnp | grep 9428

# Verify firewall rules (Ubuntu/Debian)
sudo ufw allow 9428/tcp

Issue 8: Label Filtering Not Working

Symptom: Queries return no results or wrong results despite logs existing

Diagnostic Steps:

# Step 1: List all available labels
curl -X POST 'http://victorialogs-server:9428/select/logsql/labels'

# Step 2: Get values for specific label
curl -X POST 'http://victorialogs-server:9428/select/logsql/label_values' \
  -d 'label=host'

# Step 3: Test basic query without filters
curl -X POST 'http://victorialogs-server:9428/select/logsql/query' \
  -d 'query={} | limit 10'

Common LogsQL Syntax Issues:

# ❌ WRONG: Using commas between filters
{host="server-01", service="nginx"}

# ✅ CORRECT: Filters within same label selector
{host="server-01",service="nginx"}

# ✅ CORRECT: Using regex matching
{host=~"server-.*"}

LogsQL Query Examples

  • All logs from a host: {host="server-01"}
  • Logs containing "error": {} | "error"
  • Multiple label filters: {host="server-01",level="error"}
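
When testing these from the command line, it helps to bound the time range with LogsQL's _time filter so the query doesn't scan all history. A sketch combining a stream filter, a word filter, and a limit (host value and time window are examples):

curl -X POST 'http://victorialogs-server:9428/select/logsql/query' \
  -d 'query=_time:1h {host="server-01"} error | limit 10'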

🔍 Section 4: Grafana Configuration Debugging

9. Enable Debug Logging

Debug logging provides detailed information about data source connections and query execution.

# Method 1: Edit grafana.ini configuration file
sudo nano /etc/grafana/grafana.ini

# Find and modify the [log] section:
[log]
mode = console file
level = debug

# Save and restart Grafana
sudo systemctl restart grafana-server

Viewing Grafana Logs:

# For systemd-based systems
journalctl -u grafana-server -f --since "10 minutes ago"

# For file-based logging
tail -f /var/log/grafana/grafana.log

# Filter for specific errors
journalctl -u grafana-server | grep "datasource"

Debug Logging Impact

Debug logging generates significant disk I/O. Key considerations:

  • Can generate 100MB+ logs per hour on busy systems
  • May impact performance on resource-constrained servers
  • Always revert to INFO level after troubleshooting

10. Using Explore Mode for Query Testing

Grafana's Explore view is the best tool for testing queries before adding them to dashboards.

Explore Mode Workflow:

Step 1: Click Explore icon (compass) in left sidebar
Step 2: Select data source from dropdown at top
Step 3: Build query using query builder or write custom query
Step 4: Click "Run query" button or press Shift+Enter
Step 5: Review results and check Query Inspector

Explore Mode Shortcuts

  • Shift + Enter: Run query
  • Split view: Click "Split" to compare multiple queries
  • Share Explore: Copy URL to share exact query with team

11. Network Connectivity Testing

Systematic network testing eliminates connectivity as a variable in troubleshooting.

Connectivity Test Checklist:

  • Verify Grafana server can resolve data source hostname
  • Test port accessibility with telnet or nc
  • Verify firewall rules on both endpoints
  • Test with curl from Grafana server
  • Validate DNS resolution
  • Confirm time synchronization (NTP)

Diagnostic Commands:

# Test 1: DNS resolution
nslookup librenms-server

# Test 2: Network connectivity
ping -c 4 librenms-server

# Test 3: Port accessibility
telnet librenms-server 3306

# Test 4: Full HTTP request
curl -v http://influxdb-server:8086/health

# Test 5: Time synchronization
timedatectl status

🛡️ Section 5: Security Best Practices

12. Implement Least Privilege Access

Database users for Grafana should have only SELECT (read) permissions.

MySQL/MariaDB (LibreNMS):

# Create dedicated read-only user
CREATE USER 'grafana_readonly'@'grafana-server-ip' IDENTIFIED BY 'StrongPassword123!';

# Grant SELECT only on required database
GRANT SELECT ON librenms.* TO 'grafana_readonly'@'grafana-server-ip';

# Apply changes
FLUSH PRIVILEGES;

InfluxDB 2.x:

# Create read-only token for specific bucket
influx auth create \
  --org YourOrgName \
  --description "Grafana Read-Only Token" \
  --read-bucket telegraf

Security Validation

Test that the read-only user CANNOT modify data:

# This should succeed:
SELECT * FROM devices LIMIT 1;

# These should all fail:
INSERT INTO devices VALUES (...);
UPDATE devices SET hostname='test';
DELETE FROM devices WHERE device_id=1;

13. Enable SSL/TLS Encryption

Encrypt data in transit between Grafana and data sources.

MySQL/MariaDB SSL Configuration:

# Step 1: Generate SSL certificates
sudo mysql_ssl_rsa_setup --datadir=/var/lib/mysql

# Step 2: Enable SSL in MySQL config
sudo nano /etc/mysql/mariadb.conf.d/50-server.cnf

[mysqld]
ssl-ca=/var/lib/mysql/ca.pem
ssl-cert=/var/lib/mysql/server-cert.pem
ssl-key=/var/lib/mysql/server-key.pem
require_secure_transport=ON

# Step 3: Restart MySQL
sudo systemctl restart mariadb

Copy CA certificate to Grafana server:

# On Grafana server:
sudo nano /etc/grafana/mysql-ca.pem
# Paste the CA certificate content
sudo chown grafana:grafana /etc/grafana/mysql-ca.pem
sudo chmod 400 /etc/grafana/mysql-ca.pem
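
Before switching the Grafana data source to TLS, it's worth confirming the certificate chain works from the Grafana host itself. A sketch using the mysql client and the CA file copied above:

# Connect with TLS and confirm a cipher is negotiated (empty value means no TLS)
mysql -h librenms-server -u grafana_readonly -p \
  --ssl-ca=/etc/grafana/mysql-ca.pem \
  -e "SHOW STATUS LIKE 'Ssl_cipher';"

Once this succeeds, enable the TLS options in the Grafana MySQL data source settings and supply the same CA certificate.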

14. Implement Firewall Rules

Restrict data source access to only the Grafana server.

Ubuntu/Debian (UFW):

# On LibreNMS server
sudo ufw allow from GRAFANA_IP to any port 3306 proto tcp
sudo ufw deny 3306/tcp
sudo ufw enable

# On InfluxDB server
sudo ufw allow from GRAFANA_IP to any port 8086 proto tcp
sudo ufw deny 8086/tcp

RHEL/CentOS (firewalld):

# On LibreNMS server
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="GRAFANA_IP" port protocol="tcp" port="3306" accept'
sudo firewall-cmd --reload

📊 Section 6: Performance Optimization

15. Dashboard Refresh Rate Tuning

Aggressive refresh rates increase database load without providing value for slowly-changing metrics.

Recommended Refresh Intervals:

Dashboard Type          | Recommended Refresh   | Rationale
Real-time monitoring    | 30 seconds - 1 minute | Balance freshness with load
Network interface stats | 1-5 minutes           | Polling typically every 5 minutes
Application logs        | 1-2 minutes           | Log volume and query complexity
Historical analysis     | Off (manual refresh)  | Static data doesn't need auto-refresh
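
You can also enforce a floor server-side so individual dashboards cannot be set to overly aggressive refresh rates. A sketch of the relevant grafana.ini setting (the 30s value is an example; adjust it to your polling cadence):

# /etc/grafana/grafana.ini
[dashboards]
min_refresh_interval = 30s

# Restart Grafana to apply
sudo systemctl restart grafana-server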

Query Count Optimization

Reduce the number of queries per dashboard:

  • Combine related metrics into single query when possible
  • Use query variables to avoid duplicated queries (see the sketch after this list)
  • Hide panels not currently in use (reduces query execution)
  • Enable Grafana's query caching where available (an Enterprise/Cloud feature, not on by default)
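
For the query-variable approach mentioned above, a single variable can drive many panels. A sketch of a LibreNMS device variable as a MySQL variable query; __text/__value map the display name to the id used in panel queries:

SELECT hostname AS __text, device_id AS __value
FROM devices
WHERE status = 1    # limits the list to devices currently up; adjust as needed
ORDER BY hostname;

Panels then filter on the variable (for example, device_id = $device, where $device is whatever you name the variable), so one variable query replaces a per-panel lookup.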

16. Database Connection Pooling

Optimize database connections to handle multiple concurrent dashboard users.

MySQL/MariaDB Connection Pooling:

# Tune MySQL for Grafana workload
sudo nano /etc/mysql/mariadb.conf.d/50-server.cnf

[mysqld]
max_connections = 200
max_user_connections = 50
wait_timeout = 300
interactive_timeout = 300

# InnoDB buffer pool (set to 70% of RAM)
innodb_buffer_pool_size = 2G

# Restart MySQL
sudo systemctl restart mariadb
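
To confirm the pool sizing holds up under real dashboard load, watch the connection counters while several users browse dashboards. A sketch of the MySQL-side checks:

# Current and historical peak connection counts
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';

# See what the Grafana user is actually running
SELECT id, user, host, time, state
FROM information_schema.processlist
WHERE user = 'grafana_readonly';

Grafana's MySQL data source settings also expose max open/idle connection limits; keep them comfortably below max_user_connections.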

✅ Section 7: Production Deployment Checklist

17. Pre-Deployment Validation

Security Checklist:

  • Changed default Grafana admin password
  • Created read-only database users for all data sources
  • Implemented firewall rules restricting data source access
  • Enabled SSL/TLS for all data source connections
  • Set up regular credential rotation schedule
  • Disabled anonymous Grafana access
  • Enabled audit logging for access tracking

Performance Checklist:

  • Optimized database queries (EXPLAIN analysis completed)
  • Added indexes to frequently queried columns
  • Implemented appropriate dashboard refresh rates
  • Enabled query caching in Grafana
  • Configured database connection pooling
  • Tested dashboard load times (<5 seconds target)
  • Verified no query timeout errors under normal load

Reliability Checklist:

  • All data sources pass health checks
  • Configured service auto-start on boot
  • Implemented database backup procedures
  • Created Grafana dashboard export backups (see the API sketch after this list)
  • Documented restore procedures
  • Set up monitoring for Grafana itself
  • Configured log rotation for all services
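
For the dashboard export backups item, the Grafana HTTP API can script the job. A minimal sketch, assuming a service account token with Viewer access in GRAFANA_TOKEN and jq installed for JSON parsing; the URL and backup path are examples:

# Export every dashboard's JSON to a backup directory
GRAFANA_URL="http://grafana-server:3000"
mkdir -p /backup/grafana-dashboards

for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "$GRAFANA_URL/api/dashboards/uid/$uid" \
    > "/backup/grafana-dashboards/$uid.json"
done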

18. Post-Deployment Monitoring

After deployment, continuously monitor the health and performance of your Grafana integration.

Key Metrics to Monitor:

Metric               | Warning Threshold | Critical Threshold
Dashboard Load Time  | > 5 seconds       | > 10 seconds
Query Error Rate     | > 1%              | > 5%
Database CPU Usage   | > 70%             | > 90%
Grafana Memory Usage | > 80%             | > 95%
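
Grafana itself exposes internal metrics in Prometheus text format on its /metrics endpoint (enabled by default), which can feed these thresholds. A quick sketch to confirm it is reachable from your monitoring host:

# List Grafana's internal metrics (Prometheus text format)
curl -s http://grafana-server:3000/metrics | grep '^grafana_' | head -20

Scrape this endpoint with your existing metrics pipeline (for example, a Telegraf prometheus input writing into InfluxDB) and alert on the thresholds above.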

Deployment Success Criteria

Your Grafana integration is production-ready when:

  • ✅ All dashboards load in under 5 seconds
  • ✅ Query error rate is < 0.1%
  • ✅ All security hardening steps are complete
  • ✅ Backup and restore procedures are tested
  • ✅ Documentation is complete and accessible
  • ✅ Team members are trained on basic troubleshooting

🔄 Troubleshooting Workflow Summary

Use this systematic approach when encountering any integration issue:

Step 1: Identify the Symptom

What exact error message or behavior are you seeing?

Step 2: Isolate the Component

Is the issue with Grafana, data source, network, or query?

Step 3: Check Basics

Service running? Network accessible? Credentials valid?

Step 4: Enable Debug Logging

Turn on debug logs for detailed error information

Step 5: Test in Isolation

Use Explore mode, command-line tools, curl tests

Step 6: Apply Solution

Implement fix based on diagnostic results

Step 7: Verify Resolution

Test thoroughly, document the solution

Step 8: Implement Prevention

Add monitoring, alerts, or automation to prevent recurrence

📚 Quick Reference: Common Issues

Symptom                     | Likely Cause               | Quick Fix                       | See Section
Database connection failed  | Service down or firewall   | Check service status, test port | Section 1, Issue 1
No data in panels           | Wrong table/field names    | Verify query in Explore mode    | Section 1, Issue 2
Slow dashboard loading      | Unoptimized queries        | Run EXPLAIN, add indexes        | Section 1, Issue 3
401 Unauthorized            | Wrong token or password    | Regenerate credentials          | Section 2, Issue 4
No measurements found       | Telegraf not collecting    | Check Telegraf status           | Section 2, Issue 5
Query timeout               | Query too broad            | Implement downsampling          | Section 2, Issue 6
Failed to call resource     | Wrong VictoriaLogs URL     | Use /select/logsql endpoint     | Section 3, Issue 7
Label filtering not working | Case sensitivity or syntax | Check label values, use regex   | Section 3, Issue 8

🎓 Module Summary

Congratulations! You've completed the troubleshooting and optimization module. You now have the skills to:

  • ✅ Diagnose and resolve LibreNMS database connection and query issues
  • ✅ Troubleshoot InfluxDB authentication and performance problems
  • ✅ Fix VictoriaLogs endpoint and LogQL syntax errors
  • ✅ Use Grafana's debugging tools effectively
  • ✅ Implement security best practices for production deployments
  • ✅ Optimize query performance and dashboard load times
  • ✅ Follow systematic troubleshooting methodologies
  • ✅ Validate production-readiness using comprehensive checklists

Next Steps

You're now ready for the final module: Verification & Assessment. This module will test your knowledge through hands-on scenarios and provide a comprehensive skills assessment.