Module 7 of 8

🔧 Troubleshooting & Optimization

Diagnosing and Resolving Integration Issues

Module Overview

Even with proper configuration, integrations can encounter issues. This module provides comprehensive troubleshooting procedures, diagnostic techniques, and optimization strategies for all three data sources. You'll learn to identify root causes, resolve common errors, and implement best practices for production deployments.

What You'll Master: Systematic troubleshooting methodologies, performance optimization, security hardening, and production-ready monitoring configurations.

Time to Complete: 30-35 minutes

🌐 Section 1: LibreNMS Troubleshooting

LibreNMS integration issues typically stem from database connectivity, permission problems, or query inefficiencies. This section provides step-by-step resolution procedures for the most common scenarios.

Issue 1: Database Connection Failed

Symptom: Grafana displays "error connecting to database" or "dial tcp: connect: connection refused"

Diagnostic Procedure:

# Step 1: Verify database is running
systemctl status mariadb    # or for MySQL: systemctl status mysql

# Step 2: Test connectivity from Grafana server
telnet librenms-server 3306

# Step 3: Verify user credentials
mysql -h librenms-server -u grafana_readonly -p

# Step 4: Check user permissions
SHOW GRANTS FOR 'grafana_readonly'@'%';
+----------------------------------------------------------+
| Grants for grafana_readonly@%                             |
+----------------------------------------------------------+
| GRANT SELECT ON `librenms`.* TO `grafana_readonly`@`%`    |
+----------------------------------------------------------+

Resolution Steps:

If database is not running:

sudo systemctl start mariadb
sudo systemctl enable mariadb

If connection is refused:

# Check bind address in MySQL config
sudo nano /etc/mysql/mariadb.conf.d/50-server.cnf

# Find and modify:
bind-address = 0.0.0.0    # Allow remote connections

# Restart service
sudo systemctl restart mariadb

If user doesn't exist or lacks permissions:

# Recreate user with correct permissions
mysql -u root -p

CREATE USER 'grafana_readonly'@'%' IDENTIFIED BY 'SecurePassword123!';
GRANT SELECT ON librenms.* TO 'grafana_readonly'@'%';
FLUSH PRIVILEGES;
EXIT;

Common Firewall Issue

Port 3306 must be open between Grafana and LibreNMS servers:

# On LibreNMS server (Ubuntu/Debian):
sudo ufw allow from GRAFANA_IP to any port 3306

# On LibreNMS server (RHEL/CentOS):
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="GRAFANA_IP" port protocol="tcp" port="3306" accept'
sudo firewall-cmd --reload

Issue 2: No Data Returned from Queries

Symptom: Panels display "No data" or queries return empty results despite devices being polled.

Diagnostic Commands:

# Step 1: Verify data exists in database
mysql -h librenms-server -u grafana_readonly -p librenms

# Check recent polling data
SELECT device_id, hostname, status, last_polled
FROM devices
ORDER BY last_polled DESC
LIMIT 10;

# Check port statistics
SELECT COUNT(*) AS total_records, MAX(poll_time) AS latest_data
FROM ports;

# Verify specific device data
SELECT d.hostname, p.ifDescr, p.ifOperStatus
FROM devices d
JOIN ports p ON d.device_id = p.device_id
WHERE d.hostname = 'core-switch-01'
LIMIT 5;

LibreNMS Table Reference

Common tables used in Grafana queries:

  • devices - Device inventory and status
  • ports - Interface statistics and metrics
  • sensors - Temperature, power, environmental data
  • storage - Disk and memory utilization
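
If you are unsure which columns a table exposes, inspect the schema before writing panel queries. A minimal sketch, run from the MySQL client as the read-only user (table names from the list above; column layouts can vary between LibreNMS versions):

# Inspect available columns before building queries
SHOW COLUMNS FROM sensors;
SHOW COLUMNS FROM storage;

# Spot-check a few rows to confirm the data looks sane
SELECT * FROM sensors LIMIT 5;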

Correct query format for Grafana:

SELECT
  UNIX_TIMESTAMP(poll_time) AS time_sec,
  ifInOctets_rate * 8 AS "Inbound bps",
  ifOutOctets_rate * 8 AS "Outbound bps"
FROM ports
WHERE poll_time >= FROM_UNIXTIME($__from / 1000)
  AND poll_time <= FROM_UNIXTIME($__to / 1000)
  AND port_id = $port_id
ORDER BY poll_time ASC;

Query Testing Best Practice

Always test queries in Grafana's Explore view first:

  1. Navigate to Explore (compass icon in left sidebar)
  2. Select LibreNMS data source
  3. Use Format: Table to see raw results
  4. Verify column names and data types
  5. Once working, copy query to dashboard
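
As a starting point, here is a minimal sanity-check query you can paste into Explore with Format set to Table. It uses only columns already shown in this module, so an empty result points to a data problem rather than a query problem:

SELECT hostname, status, last_polled
FROM devices
ORDER BY last_polled DESC
LIMIT 5;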

Issue 3: Slow Query Performance

Symptom: Dashboards take 10+ seconds to load, timeout errors, or database CPU spikes.

Diagnostic Analysis:

# Step 1: Identify slow queries
mysql -u root -p
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;    # Log queries > 2 seconds

# Step 2: Analyze a specific query
EXPLAIN SELECT d.hostname, p.ifDescr, AVG(p.ifInOctets_rate) AS avg_in
FROM devices d
JOIN ports p ON d.device_id = p.device_id
WHERE p.poll_time > NOW() - INTERVAL 1 HOUR
GROUP BY d.hostname, p.ifDescr;

Optimization Strategies:

1. Add Database Indexes

# Create composite index on frequently queried columns
ALTER TABLE ports ADD INDEX idx_poll_time (poll_time, device_id);

# Add index for device lookups
ALTER TABLE devices ADD INDEX idx_hostname (hostname);

# Verify indexes were created
SHOW INDEXES FROM ports;

2. Limit Data Retrieval

# BAD: Retrieves all historical data
SELECT * FROM ports WHERE device_id = 1;

# GOOD: Only fetch what's needed
SELECT
  UNIX_TIMESTAMP(poll_time) AS time_sec,
  ifInOctets_rate,
  ifOutOctets_rate
FROM ports
WHERE device_id = 1
  AND poll_time >= FROM_UNIXTIME($__from / 1000)
  AND poll_time <= FROM_UNIXTIME($__to / 1000)
LIMIT 10000;    # Prevent runaway queries

Performance Benchmarks

Time Range    | Target Query Time | Recommended Interval
Last 6 hours  | < 2 seconds       | 1 minute
Last 24 hours | < 3 seconds       | 5 minutes
Last 7 days   | < 5 seconds       | 30 minutes
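
To stay inside these targets, let Grafana choose the aggregation bucket for you instead of returning raw rows. A sketch using the MySQL data source's $__timeGroup and $__timeFilter macros, assuming the ports schema and $port_id variable used earlier in this module:

SELECT
  $__timeGroup(poll_time, $__interval) AS time,
  AVG(ifInOctets_rate) * 8 AS "Inbound bps",
  AVG(ifOutOctets_rate) * 8 AS "Outbound bps"
FROM ports
WHERE $__timeFilter(poll_time)
  AND port_id = $port_id
GROUP BY 1
ORDER BY 1;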

📡 Section 2: InfluxDB Troubleshooting

InfluxDB issues commonly involve authentication, query language compatibility (Flux vs InfluxQL), and time series aggregation challenges.

Issue 4: 401 Unauthorized Errors

Symptom: "401 Unauthorized" or "invalid authentication credentials"

Diagnostic Procedure (InfluxDB 2.x):

# Step 1: Verify InfluxDB is running
systemctl status influxdb

# Step 2: Test API endpoint
curl http://influxdb-server:8086/health

# Step 3: Verify token validity
curl -H "Authorization: Token YOUR_TOKEN_HERE" \
  http://influxdb-server:8086/api/v2/buckets

# Step 4: List all tokens (requires admin access)
influx auth list

Resolution Steps:

If token is invalid or expired:

# Create new read token
influx auth create \
  --org YourOrgName \
  --description "Grafana Read Token" \
  --read-bucket telegraf

InfluxDB 1.x vs 2.x Authentication

Version      | Auth Method       | Grafana Config
InfluxDB 1.x | Username/Password | Basic Auth enabled, no token
InfluxDB 2.x | Token-based       | Token in HTTP headers, no Basic Auth
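
A quick way to confirm which auth style your server actually accepts is to test both from the command line. A sketch with curl (substitute your own server name and credentials):

# InfluxDB 1.x: username/password (HTTP basic auth)
curl -G 'http://influxdb-server:8086/query' \
  -u grafana_reader:YourPassword \
  --data-urlencode 'q=SHOW DATABASES'

# InfluxDB 2.x: token in the Authorization header
curl -H "Authorization: Token YOUR_TOKEN_HERE" \
  http://influxdb-server:8086/api/v2/buckets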

Issue 5: No Measurements Found

Symptom: Metric dropdowns are empty, or queries return "measurement not found"

Diagnostic Commands (InfluxDB 1.x - InfluxQL):

# Connect to database
influx -database telegraf

# List all measurements
SHOW MEASUREMENTS;

# Check measurement schema
SHOW FIELD KEYS FROM cpu;

# Verify recent data exists
SELECT * FROM cpu WHERE time > now() - 5m LIMIT 10;

Problem: Telegraf not collecting data

# Check Telegraf status
systemctl status telegraf

# View Telegraf logs
journalctl -u telegraf -n 50 --no-pager

# Test Telegraf config
telegraf --test --config /etc/telegraf/telegraf.conf

# Restart Telegraf if needed
sudo systemctl restart telegraf

Issue 6: Query Timeout Errors

Symptom: "context deadline exceeded" or "query timeout"

Optimization Techniques:

1. Use Appropriate Time Windows

// BAD: Queries all historical data (Flux)
from(bucket: "telegraf")
  |> range(start: 0)
  |> filter(fn: (r) => r._measurement == "cpu")

// GOOD: Limited time range
from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "cpu")

2. Implement Downsampling for Long Ranges

// For time ranges > 24 hours, aggregate data (Flux)
from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> filter(fn: (r) => r._field == "usage_idle")
  |> aggregateWindow(every: 5m, fn: mean)
  |> yield(name: "mean")
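
If the data source is configured for InfluxQL (InfluxDB 1.x), the equivalent downsampling pattern uses GROUP BY time() with Grafana's interval variable. A sketch assuming the same cpu measurement and usage_idle field:

SELECT mean("usage_idle")
FROM "cpu"
WHERE $timeFilter
GROUP BY time($__interval) fill(null)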

📋 Section 3: VictoriaLogs Troubleshooting

VictoriaLogs troubleshooting focuses on endpoint configuration, LogsQL syntax, and label-based filtering.

Issue 7: Failed to Call Resource Errors

Symptom: "failed to call resource" or "404 Not Found"

Diagnostic Procedure:

# Step 1: Verify VictoriaLogs is running
curl http://victorialogs-server:9428/health

# Step 2: Test the LogsQL query endpoint
curl -X POST 'http://victorialogs-server:9428/select/logsql/query' \
  -d 'query={}'

# Step 3: Verify labels endpoint
curl -X POST 'http://victorialogs-server:9428/select/logsql/labels'

VictoriaLogs URL Configuration

Correct URL format:

# In Grafana data source settings:
URL: http://victorialogs-server:9428/select/logsql

# Common mistakes to avoid:
# ❌ http://victorialogs-server:9428 (missing path)
# ❌ http://victorialogs-server:9428/select (incomplete)

Problem: Port not accessible

# Test connectivity from Grafana server
telnet victorialogs-server 9428

# Check if VictoriaLogs is listening
sudo netstat -tlnp | grep 9428

# Verify firewall rules (Ubuntu/Debian)
sudo ufw allow 9428/tcp

Issue 8: Label Filtering Not Working

Symptom: Queries return no results or wrong results despite logs existing

Diagnostic Steps:

# Step 1: List all available labels
curl -X POST 'http://victorialogs-server:9428/select/logsql/labels'

# Step 2: Get values for specific label
curl -X POST 'http://victorialogs-server:9428/select/logsql/label_values' \
  -d 'label=host'

# Step 3: Test basic query without filters
curl -X POST 'http://victorialogs-server:9428/select/logsql/query' \
  -d 'query={} | limit 10'

Common LogsQL Syntax Issues:

# ❌ WRONG: Using commas between filters
{host="server-01", service="nginx"}

# ✅ CORRECT: Filters within same label selector
{host="server-01",service="nginx"}

# ✅ CORRECT: Using regex matching
{host=~"server-.*"}

LogsQL Query Examples

  • All logs from a host: {host="server-01"}
  • Logs containing "error": {} | "error"
  • Multiple label filters: {host="server-01",level="error"}
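
When testing these from the command line, it helps to bound the time range with LogsQL's _time filter so the query doesn't scan all history. A sketch combining a stream filter, a word filter, and a limit (host value and time window are examples):

curl -X POST 'http://victorialogs-server:9428/select/logsql/query' \
  -d 'query=_time:1h {host="server-01"} error | limit 10'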

🔍 Section 4: Grafana Configuration Debugging

9. Enable Debug Logging

Debug logging provides detailed information about data source connections and query execution.

# Method 1: Edit grafana.ini configuration file
sudo nano /etc/grafana/grafana.ini

# Find and modify the [log] section:
[log]
mode = console file
level = debug

# Save and restart Grafana
sudo systemctl restart grafana-server

Viewing Grafana Logs:

# For systemd-based systems
journalctl -u grafana-server -f --since "10 minutes ago"

# For file-based logging
tail -f /var/log/grafana/grafana.log

# Filter for specific errors
journalctl -u grafana-server | grep "datasource"

Debug Logging Impact

Debug logging generates significant disk I/O. Key considerations:

  • Can generate 100MB+ logs per hour on busy systems
  • May impact performance on resource-constrained servers
  • Always revert to INFO level after troubleshooting

10. Using Explore Mode for Query Testing

Grafana's Explore view is the best tool for testing queries before adding them to dashboards.

Explore Mode Workflow:

Step 1: Click Explore icon (compass) in left sidebar
Step 2: Select data source from dropdown at top
Step 3: Build query using query builder or write custom query
Step 4: Click "Run query" button or press Shift+Enter
Step 5: Review results and check Query Inspector

Explore Mode Shortcuts

  • Shift + Enter: Run query
  • Split view: Click "Split" to compare multiple queries
  • Share Explore: Copy URL to share exact query with team

11. Network Connectivity Testing

Systematic network testing eliminates connectivity as a variable in troubleshooting.

Connectivity Test Checklist:

  • Verify Grafana server can resolve data source hostname
  • Test port accessibility with telnet or nc
  • Verify firewall rules on both endpoints
  • Test with curl from Grafana server
  • Validate DNS resolution
  • Confirm time synchronization (NTP)

Diagnostic Commands:

# Test 1: DNS resolution
nslookup librenms-server

# Test 2: Network connectivity
ping -c 4 librenms-server

# Test 3: Port accessibility
telnet librenms-server 3306

# Test 4: Full HTTP request
curl -v http://influxdb-server:8086/health

# Test 5: Time synchronization
timedatectl status

🛡️ Section 5: Security Best Practices

12. Implement Least Privilege Access

Database users for Grafana should have only SELECT (read) permissions.

MySQL/MariaDB (LibreNMS):

# Create dedicated read-only user
CREATE USER 'grafana_readonly'@'grafana-server-ip' IDENTIFIED BY 'StrongPassword123!';

# Grant SELECT only on required database
GRANT SELECT ON librenms.* TO 'grafana_readonly'@'grafana-server-ip';

# Apply changes
FLUSH PRIVILEGES;

InfluxDB 2.x:

# Create read-only token for specific bucket
influx auth create \
  --org YourOrgName \
  --description "Grafana Read-Only Token" \
  --read-bucket telegraf

Security Validation

Test that the read-only user CANNOT modify data:

# This should succeed:
SELECT * FROM devices LIMIT 1;

# These should all fail:
INSERT INTO devices VALUES (...);
UPDATE devices SET hostname='test';
DELETE FROM devices WHERE device_id=1;

13. Enable SSL/TLS Encryption

Encrypt data in transit between Grafana and data sources.

MySQL/MariaDB SSL Configuration:

# Step 1: Generate SSL certificates
sudo mysql_ssl_rsa_setup --datadir=/var/lib/mysql

# Step 2: Enable SSL in MySQL config
sudo nano /etc/mysql/mariadb.conf.d/50-server.cnf

[mysqld]
ssl-ca=/var/lib/mysql/ca.pem
ssl-cert=/var/lib/mysql/server-cert.pem
ssl-key=/var/lib/mysql/server-key.pem
require_secure_transport=ON

# Step 3: Restart MySQL
sudo systemctl restart mariadb

Copy CA certificate to Grafana server:

# On Grafana server:
sudo nano /etc/grafana/mysql-ca.pem
# Paste the CA certificate content
sudo chown grafana:grafana /etc/grafana/mysql-ca.pem
sudo chmod 400 /etc/grafana/mysql-ca.pem
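
Before switching the Grafana data source to TLS, it's worth confirming the certificate chain works from the Grafana host itself. A sketch using the mysql client and the CA file copied above:

# Connect with TLS and confirm a cipher is negotiated (empty value means no TLS)
mysql -h librenms-server -u grafana_readonly -p \
  --ssl-ca=/etc/grafana/mysql-ca.pem \
  -e "SHOW STATUS LIKE 'Ssl_cipher';"

Once this succeeds, enable the TLS options in the Grafana MySQL data source settings and supply the same CA certificate.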

14. Implement Firewall Rules

Restrict data source access to only the Grafana server.

Ubuntu/Debian (UFW):

# On LibreNMS server
sudo ufw allow from GRAFANA_IP to any port 3306 proto tcp
sudo ufw deny 3306/tcp
sudo ufw enable

# On InfluxDB server
sudo ufw allow from GRAFANA_IP to any port 8086 proto tcp
sudo ufw deny 8086/tcp

RHEL/CentOS (firewalld):

# On LibreNMS server
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="GRAFANA_IP" port protocol="tcp" port="3306" accept'
sudo firewall-cmd --reload

📊 Section 6: Performance Optimization

15. Dashboard Refresh Rate Tuning

Aggressive refresh rates increase database load without providing value for slowly-changing metrics.

Recommended Refresh Intervals:

Dashboard Type          | Recommended Refresh   | Rationale
Real-time monitoring    | 30 seconds - 1 minute | Balance freshness with load
Network interface stats | 1-5 minutes           | Polling typically every 5 minutes
Application logs        | 1-2 minutes           | Log volume and query complexity
Historical analysis     | Off (manual refresh)  | Static data doesn't need auto-refresh
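
You can also enforce a floor server-side so individual dashboards cannot be set to overly aggressive refresh rates. A sketch of the relevant grafana.ini setting (the 30s value is an example; adjust it to your polling cadence):

# /etc/grafana/grafana.ini
[dashboards]
min_refresh_interval = 30s

# Restart Grafana to apply
sudo systemctl restart grafana-server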

Query Count Optimization

Reduce the number of queries per dashboard:

  • Combine related metrics into single query when possible
  • Use query variables to avoid duplicated queries (see the sketch after this list)
  • Hide panels not currently in use (reduces query execution)
  • Enable Grafana's query caching where available (an Enterprise/Cloud feature, not on by default)
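
For the query-variable approach mentioned above, a single variable can drive many panels. A sketch of a LibreNMS device variable as a MySQL variable query; __text/__value map the display name to the id used in panel queries:

SELECT hostname AS __text, device_id AS __value
FROM devices
WHERE status = 1    # limits the list to devices currently up; adjust as needed
ORDER BY hostname;

Panels then filter on the variable (for example, device_id = $device, where $device is whatever you name the variable), so one variable query replaces a per-panel lookup.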

16. Database Connection Pooling

Optimize database connections to handle multiple concurrent dashboard users.

MySQL/MariaDB Connection Pooling:

# Tune MySQL for Grafana workload
sudo nano /etc/mysql/mariadb.conf.d/50-server.cnf

[mysqld]
max_connections = 200
max_user_connections = 50
wait_timeout = 300
interactive_timeout = 300

# InnoDB buffer pool (set to 70% of RAM)
innodb_buffer_pool_size = 2G

# Restart MySQL
sudo systemctl restart mariadb
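
To confirm the pool sizing holds up under real dashboard load, watch the connection counters while several users browse dashboards. A sketch of the MySQL-side checks:

# Current and historical peak connection counts
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';

# See what the Grafana user is actually running
SELECT id, user, host, time, state
FROM information_schema.processlist
WHERE user = 'grafana_readonly';

Grafana's MySQL data source settings also expose max open/idle connection limits; keep them comfortably below max_user_connections.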

✅ Section 7: Production Deployment Checklist

17. Pre-Deployment Validation

Security Checklist:

  • Changed default Grafana admin password
  • Created read-only database users for all data sources
  • Implemented firewall rules restricting data source access
  • Enabled SSL/TLS for all data source connections
  • Set up regular credential rotation schedule
  • Disabled anonymous Grafana access
  • Enabled audit logging for access tracking

Performance Checklist:

  • Optimized database queries (EXPLAIN analysis completed)
  • Added indexes to frequently queried columns
  • Implemented appropriate dashboard refresh rates
  • Enabled query caching in Grafana
  • Configured database connection pooling
  • Tested dashboard load times (<5 seconds target)
  • Verified no query timeout errors under normal load

Reliability Checklist:

  • All data sources pass health checks
  • Configured service auto-start on boot
  • Implemented database backup procedures
  • Created Grafana dashboard export backups (see the API sketch after this list)
  • Documented restore procedures
  • Set up monitoring for Grafana itself
  • Configured log rotation for all services
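
For the dashboard export backups item, the Grafana HTTP API can script the job. A minimal sketch, assuming a service account token with Viewer access in GRAFANA_TOKEN and jq installed for JSON parsing; the URL and backup path are examples:

# Export every dashboard's JSON to a backup directory
GRAFANA_URL="http://grafana-server:3000"
mkdir -p /backup/grafana-dashboards

for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "$GRAFANA_URL/api/dashboards/uid/$uid" \
    > "/backup/grafana-dashboards/$uid.json"
done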

18. Post-Deployment Monitoring

After deployment, continuously monitor the health and performance of your Grafana integration.

Key Metrics to Monitor:

Metric               | Warning Threshold | Critical Threshold
Dashboard Load Time  | > 5 seconds       | > 10 seconds
Query Error Rate     | > 1%              | > 5%
Database CPU Usage   | > 70%             | > 90%
Grafana Memory Usage | > 80%             | > 95%
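
Grafana itself exposes internal metrics in Prometheus text format on its /metrics endpoint (enabled by default), which can feed these thresholds. A quick sketch to confirm it is reachable from your monitoring host:

# List Grafana's internal metrics (Prometheus text format)
curl -s http://grafana-server:3000/metrics | grep '^grafana_' | head -20

Scrape this endpoint with your existing metrics pipeline (for example, a Telegraf prometheus input writing into InfluxDB) and alert on the thresholds above.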

Deployment Success Criteria

Your Grafana integration is production-ready when:

  • ✅ All dashboards load in under 5 seconds
  • ✅ Query error rate is < 0.1%
  • ✅ All security hardening steps are complete
  • ✅ Backup and restore procedures are tested
  • ✅ Documentation is complete and accessible
  • ✅ Team members are trained on basic troubleshooting

🔄 Troubleshooting Workflow Summary

Use this systematic approach when encountering any integration issue:

Step 1: Identify the Symptom

What exact error message or behavior are you seeing?

Step 2: Isolate the Component

Is the issue with Grafana, data source, network, or query?

Step 3: Check Basics

Service running? Network accessible? Credentials valid?

Step 4: Enable Debug Logging

Turn on debug logs for detailed error information

Step 5: Test in Isolation

Use Explore mode, command-line tools, curl tests

Step 6: Apply Solution

Implement fix based on diagnostic results

Step 7: Verify Resolution

Test thoroughly, document the solution

Step 8: Implement Prevention

Add monitoring, alerts, or automation to prevent recurrence

📚 Quick Reference: Common Issues

Symptom                     | Likely Cause               | Quick Fix                       | See Section
Database connection failed  | Service down or firewall   | Check service status, test port | Section 1, Issue 1
No data in panels           | Wrong table/field names    | Verify query in Explore mode    | Section 1, Issue 2
Slow dashboard loading      | Unoptimized queries        | Run EXPLAIN, add indexes        | Section 1, Issue 3
401 Unauthorized            | Wrong token or password    | Regenerate credentials          | Section 2, Issue 4
No measurements found       | Telegraf not collecting    | Check Telegraf status           | Section 2, Issue 5
Query timeout               | Query too broad            | Implement downsampling          | Section 2, Issue 6
Failed to call resource     | Wrong VictoriaLogs URL     | Use /select/logsql endpoint     | Section 3, Issue 7
Label filtering not working | Case sensitivity or syntax | Check label values, use regex   | Section 3, Issue 8

🎓 Module Summary

Congratulations! You've completed the troubleshooting and optimization module. You now have the skills to:

  • ✅ Diagnose and resolve LibreNMS database connection and query issues
  • ✅ Troubleshoot InfluxDB authentication and performance problems
  • ✅ Fix VictoriaLogs endpoint and LogQL syntax errors
  • ✅ Use Grafana's debugging tools effectively
  • ✅ Implement security best practices for production deployments
  • ✅ Optimize query performance and dashboard load times
  • ✅ Follow systematic troubleshooting methodologies
  • ✅ Validate production-readiness using comprehensive checklists

Next Steps

You're now ready for the final module: Verification & Assessment. This module will test your knowledge through hands-on scenarios and provide a comprehensive skills assessment.