Module Overview
Congratulations on reaching the final module! You've built a comprehensive monitoring integration spanning LibreNMS, Telegraf/InfluxDB, and VictoriaLogs. Now it's time to validate your work, test every component, and ensure production readiness.
This module provides systematic verification procedures, performance benchmarks, hands-on exercises, and knowledge assessment quizzes. By the end, you'll have confidence that your integration is robust, optimized, and ready for production deployment.
By completing this module, you will be able to:
- Verify data source connectivity and accuracy
- Validate dashboard functionality across all panels
- Test alert rules and notification delivery
- Measure query performance and optimize bottlenecks
- Complete production readiness checklists
- Demonstrate mastery through practical exercises and quiz assessments
Estimated Completion Time: 25-30 minutes
Section 1: Data Source Verification
Before trusting your dashboards in production, you must verify that each data source is properly configured, returning accurate data, and performing within acceptable parameters. This section provides systematic testing procedures for LibreNMS, InfluxDB, and VictoriaLogs.
Step 1: LibreNMS Data Source Verification
LibreNMS serves as your network inventory and SNMP polling engine. Verify that Grafana can query device data, interface statistics, and alert information correctly.
Test 1: Verify LibreNMS connectivity and authentication
# Navigate to Grafana Data Sources
# URL: http://your-grafana-server:3000/datasources
# Click on your LibreNMS MySQL data source
# Look for green "Data source is working" message
# Test query from Explore tab:
SELECT hostname, sysName, os
FROM devices
WHERE status = 1
LIMIT 5;
# Expected output: List of 5 active devices with their OS types
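If the LibreNMS API is enabled, you can cross-check the same inventory outside Grafana. A minimal sketch, assuming an API token has been generated (the hostname and token below are placeholders):
# Query the LibreNMS API for devices that are up (placeholder host and token)
curl -s -H "X-Auth-Token: YOUR_API_TOKEN" \
  "http://librenms-server/api/v0/devices?type=up" | head -c 500
# The hostnames returned should match the MySQL query results above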
Test 2: Validate interface data accuracy
# Test query for interface statistics
SELECT
devices.hostname,
ports.ifName,
ports.ifOperStatus,
ports.ifSpeed,
ports.ifInOctets_rate,
ports.ifOutOctets_rate
FROM ports
JOIN devices ON ports.device_id = devices.device_id
WHERE devices.hostname = 'core-switch-01'
AND ports.ifOperStatus = 'up'
ORDER BY ports.ifInOctets_rate DESC
LIMIT 10;
# Expected output: Top 10 busiest interfaces with current rates
# Verify rates match SNMP walk data or switch CLI output
You should see:
- Green health status in Grafana data source configuration
- Device queries return expected hostnames and metadata
- Interface rates match known traffic patterns
- Query execution time under 2 seconds for typical device queries
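To confirm the query-time target from the command line rather than the Grafana UI, you can time the same kind of interface query directly against the LibreNMS database. A minimal sketch, assuming shell access to the MySQL/MariaDB host (the user and database name are common defaults and may differ in your install):
# Time a top-interfaces query directly against the LibreNMS database
time mysql -u librenms -p librenms -e "
SELECT devices.hostname, ports.ifName, ports.ifInOctets_rate
FROM ports
JOIN devices ON ports.device_id = devices.device_id
WHERE ports.ifOperStatus = 'up'
ORDER BY ports.ifInOctets_rate DESC
LIMIT 10;"
# The 'real' time reported should stay under the 2-second target above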
Step 2: InfluxDB Data Source Verification
InfluxDB stores time-series metrics from Telegraf agents. Verify metric collection, retention policies, and query performance.
Test 1: Verify InfluxDB connectivity and database access
# From InfluxDB server or remote client
influx -precision rfc3339
# List available databases
SHOW DATABASES
# Expected output should include: telegraf
# Use telegraf database
USE telegraf
# Show measurements (metric types)
SHOW MEASUREMENTS
# Expected output: cpu, disk, mem, net, system, etc.
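This step also covers retention policies, so confirm how long Telegraf data is kept. A minimal sketch using the InfluxDB 1.x CLI (policy names and durations depend on your configuration):
# Check retention policies on the telegraf database
influx -execute 'SHOW RETENTION POLICIES ON telegraf'
# Expected output: at least the default "autogen" policy; verify the duration
# column matches your intended retention period for raw metrics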
Test 2: Validate metric collection and data freshness
# Check most recent data point for each host
SELECT
LAST(usage_idle) as last_cpu_idle,
host
FROM cpu
WHERE time > now() - 5m
GROUP BY host
# Verify data within last 60 seconds (Telegraf default interval: 10s)
SELECT
time,
host,
usage_idle
FROM cpu
WHERE time > now() - 2m
ORDER BY time DESC
LIMIT 20
If queries return no data: Check Telegraf agent status on monitored hosts (systemctl status telegraf),
verify InfluxDB output configuration in /etc/telegraf/telegraf.conf, check firewall rules
allowing port 8086, and review InfluxDB logs for write errors (/var/log/influxdb/influxd.log).
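Those checks can be run from the shell; a minimal sketch, assuming a standard Telegraf and InfluxDB 1.x install (paths and ports shown are the defaults and may differ in your environment):
# On the monitored host: confirm the agent is running and its config is valid
systemctl status telegraf
telegraf --test --config /etc/telegraf/telegraf.conf | head -n 20
# On the InfluxDB server: confirm the write endpoint responds
curl -i http://localhost:8086/ping        # expect HTTP 204 No Content
# Look for recent write errors
grep -i error /var/log/influxdb/influxd.log | tail -n 20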
You should see:
- Telegraf database exists and contains expected measurements
- Recent data points (within last 60 seconds) for all monitored hosts
- Query execution times under 1 second for 24-hour ranges
- No write errors in InfluxDB logs
Step 3: VictoriaLogs Data Source Verification
VictoriaLogs aggregates syslog and application logs. Verify log ingestion, filtering, and query performance using LogQL syntax.
Test 1: Verify VictoriaLogs connectivity
# Check VictoriaLogs health from command line
curl -s http://victorialogs-server:9428/health
# Expected output: {"status":"ok"}
# Test basic log query in Grafana Explore
{job="syslog"} | limit 100
# Query by severity
{job="syslog"} |~ "error|critical|alert" | limit 50
For best query performance: Use label filters ({job="syslog", hostname="..."}) before line filters (|~),
limit time ranges to necessary duration, use count_over_time for aggregations rather than
retrieving all log lines.
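As a quick sanity check outside Grafana, you can also query VictoriaLogs' HTTP API directly. A minimal sketch, assuming the default query endpoint on port 9428; note that this endpoint expects VictoriaLogs' native LogsQL, so the query below uses a simple time filter plus word filter rather than the Grafana query syntax shown above:
# Retrieve error lines from the last hour and count them locally
curl -s http://victorialogs-server:9428/select/logsql/query \
  --data-urlencode 'query=_time:1h error' | wc -l
# Adjust the word filter to terms you expect in recent logs; a non-empty
# result confirms both ingestion and the query path are working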
You should see:
- Health endpoint returns {"status":"ok"}
- Recent logs appear within last 60 seconds
- Severity filtering returns appropriate log entries
- Query execution times under 3 seconds for 24-hour ranges
Step 4: Query Performance Baseline Measurements
Establish performance baselines to identify degradation over time as your monitoring environment scales.
| Query Type | Data Source | Time Range | Target Response Time |
|---|---|---|---|
| Device inventory list | LibreNMS | N/A | < 2 seconds |
| Interface top 10 bandwidth | LibreNMS | Current | < 3 seconds |
| CPU usage time series | InfluxDB | 24 hours | < 1 second |
| Memory aggregation | InfluxDB | 7 days | < 2 seconds |
| Log search by keyword | VictoriaLogs | 1 hour | < 2 seconds |
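To make these baselines repeatable, script the measurements and keep the output for comparison over time. A minimal sketch, assuming CLI access to each backend (hostnames, credentials, and databases are placeholders):
#!/usr/bin/env bash
# Rough baseline timings for each data source; run periodically and record the results
echo "--- LibreNMS (MySQL) ---"
time mysql -u librenms -p librenms -e "SELECT COUNT(*) FROM devices WHERE status = 1;" > /dev/null
echo "--- InfluxDB ---"
time influx -database telegraf -execute "SELECT MEAN(usage_idle) FROM cpu WHERE time > now() - 24h GROUP BY host" > /dev/null
echo "--- VictoriaLogs ---"
time curl -s http://victorialogs-server:9428/select/logsql/query --data-urlencode 'query=_time:1h error' > /dev/null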
Section 2: Dashboard Functionality Testing
Your dashboards are the primary interface for monitoring operations. This section provides comprehensive testing procedures to ensure every interactive element functions correctly.
Step 1: Variable Selection and Panel Updates
Dashboard variables allow dynamic filtering. Test that variable changes propagate to all dependent panels.
Test procedure:
- Navigate to your unified monitoring dashboard
- Document current variable values
- Change the variable value (select different host)
- Verify all panels update within 2-3 seconds
- Test multi-select variables with multiple values
- Test "All" option for aggregate data
You should see:
- All panels refresh within 2-3 seconds
- No panels show "No data"
- Panel titles update to reflect new variable values
- Query inspector shows updated WHERE clauses
Step 2: Time Range Picker Validation
The time range picker is crucial for historical analysis. Verify it affects all visualizations correctly.
| Time Range | Expected Behavior |
|---|---|
| Last 5 minutes | All panels show only last 5 minutes of data |
| Last 6 hours | Data compressed to 30s or 1m aggregation |
| Last 7 days | Downsampled to 5m or 10m intervals |
Step 3: Refresh and Auto-Refresh Testing
For real-time monitoring, auto-refresh is essential. Test both manual refresh and automatic intervals.
- Click refresh button - observe all panels reload
- Set auto-refresh to 10s
- Open browser console → Network tab
- Observe requests firing every 10 seconds
- Verify panels update with changing metrics
Aggressive auto-refresh (5s or less) can overload data sources. For production NOC screens, use 30s or 1m refresh rates. For troubleshooting, 10s is acceptable. Disable when not actively monitoring.
Section 3: Alert Rule Validation
Alerting is critical for proactive monitoring. This section ensures alert rules trigger correctly, notifications are delivered, and alert lifecycle functions properly.
Step 1: Alert Condition Trigger Testing
Intentionally trigger each alert to verify threshold accuracy and notification delivery.
Test high CPU alert:
# SSH to a monitored host
ssh admin@test-server-01
# Generate CPU load
stress --cpu 4 --timeout 120s
# Watch Grafana alert state transition:
# Normal → Pending → Firing
# Verify notification received
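If the stress utility is not installed on the test host, a rough equivalent can be improvised with core tools (run interactively over SSH; adjust the number of loops to the host's CPU count):
# Fallback CPU load: four busy loops for roughly two minutes, then clean up
for i in 1 2 3 4; do yes > /dev/null & done
sleep 120 && pkill -x yes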
Test interface down alert:
# Safely shut down a test interface
ssh admin@test-switch
config t
interface GigabitEthernet1/0/24
shutdown
# Wait for LibreNMS poller cycle
# Verify alert fires in Grafana
# Restore interface:
no shutdown
For each alert rule, verify:
- State transitions Normal → Pending → Firing correctly
- Alert evaluation follows configured interval
- Pending duration matches configured threshold
- Alert annotations appear on dashboards
Step 2: Notification Channel Verification
Test that alerts are delivered to all configured notification channels.
| Channel Type | Test Method |
|---|---|
| Email (SMTP) | Send test notification button |
| Slack | Send test notification button |
| PagerDuty | Trigger test alert |
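Beyond the built-in test buttons, it helps to verify the delivery path itself from the shell. A minimal sketch for a Slack incoming webhook (the webhook URL is a placeholder; email and PagerDuty have analogous out-of-band tests):
# Post a test message directly to the webhook used by the Slack contact point
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Monitoring integration test message - please ignore"}' \
  https://hooks.slack.com/services/XXXX/XXXX/XXXX
# A message appearing in the target channel confirms the webhook works independently of Grafana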
# Minimum alert notification should contain:
# - Alert name/title
# - Current metric value
# - Threshold value
# - Affected host/device
# - Timestamp
# - Dashboard link
# - Instructions/runbook link
Step 3: Alert Silencing for Maintenance
Test that you can silence alerts during planned maintenance.
- Navigate to Alerting → Silences
- Click "New silence"
- Configure matcher: hostname=test-server-01
- Set duration: 1 hour
- Trigger alert condition
- Verify alert shows "Suppressed" instead of firing
Create silences 15 minutes before maintenance. Use descriptive comments including ticket number. Set duration slightly longer than estimated maintenance. Delete silence manually if work completes early.
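Silences can also be created programmatically, which is useful for scripted maintenance windows. A hedged sketch against Grafana's Alertmanager-compatible API, assuming Grafana-managed alerting, a service account token exported as GRAFANA_TOKEN, and GNU date (the URL and matcher values are placeholders):
# Create a 1-hour silence for test-server-01 via the Grafana-managed Alertmanager API
START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)
curl -s -X POST http://your-grafana-server:3000/api/alertmanager/grafana/api/v2/silences \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"matchers\": [{\"name\": \"hostname\", \"value\": \"test-server-01\", \"isRegex\": false}],
       \"startsAt\": \"$START\", \"endsAt\": \"$END\",
       \"comment\": \"Planned maintenance - reference your ticket here\", \"createdBy\": \"ops-team\"}"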
Section 4: Knowledge Assessment Quiz
Test your understanding of concepts covered throughout this lab series.
Which query language is used to retrieve data from LibreNMS in Grafana?
LibreNMS uses a MySQL database to store device inventory and SNMP polling data. When configuring LibreNMS as a data source in Grafana, you select "MySQL" and write standard SQL queries.
To convert LibreNMS interface traffic from octets to Mbps, what calculation is required?
1 byte (octet) = 8 bits. Formula: (octets/sec * 8) / 1,000,000 = Mbps. Example: 10,000,000 octets/sec * 8 = 80,000,000 bits/sec = 80 Mbps.
What is the PRIMARY operational benefit of integrating LibreNMS, Telegraf, and VictoriaLogs?
The key value is correlation: when interface errors (LibreNMS), CPU spikes (Telegraf), and application errors (VictoriaLogs) occur simultaneously, correlation dramatically reduces MTTR.
Section 5: Hands-On Practical Exercises
Apply your knowledge through practical exercises that simulate real-world scenarios.
Exercise 1: Generate Test Load and Observe Dashboard Changes
Objective: Verify dashboards accurately reflect system load changes in real-time.
- Open your unified monitoring dashboard
- Set auto-refresh to 10 seconds
- Note baseline CPU/memory for test server
- SSH to the server and generate load: stress --cpu 2 --vm 2 --vm-bytes 512M --timeout 120s
- Observe panels update within 30 seconds
- Document peak values
- Verify metrics return to baseline
Expected: CPU spike to ~100%, memory +512MB, return to baseline within 1 minute.
Exercise 2: Correlate Data Across Sources
Scenario: Web application unreachable. Use dashboard to identify cause.
- Stop the web service: systemctl stop nginx
- Generate test traffic: curl http://test-webserver
- Check VictoriaLogs for HTTP errors
- Check LibreNMS for interface status
- Check InfluxDB for connection failures
- Correlate timestamps: which issue occurred first?
- Document analysis and resolution
Expected: Identify stopped service as root cause, demonstrate timestamp correlation.
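If you prefer to drive the correlation from the shell rather than the dashboard, the same three checks can be run directly against each backend. A hedged sketch with placeholder hostnames, credentials, and label values:
# VictoriaLogs: recent web server errors (native LogsQL time + word filters)
curl -s http://victorialogs-server:9428/select/logsql/query \
  --data-urlencode 'query=_time:15m nginx error' | tail -n 5
# LibreNMS: operational status of the test switch's ports
mysql -u librenms -p librenms -e \
  "SELECT devices.hostname, ports.ifName, ports.ifOperStatus
   FROM ports JOIN devices ON ports.device_id = devices.device_id
   WHERE devices.hostname = 'test-switch';"
# InfluxDB: most recent CPU sample from the web server
influx -database telegraf -execute \
  "SELECT LAST(usage_idle) FROM cpu WHERE host = 'test-webserver'"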
Exercise 3: Create Custom Dashboard Variable
Objective: Build multi-select variable for filtering by location.
- Edit dashboard → Settings → Variables → Add
- Name: location, Type: Query
- Data source: LibreNMS MySQL
- Query: SELECT DISTINCT location FROM devices WHERE status = 1
- Enable "Multi-value" and "Include All"
- Update panels to use: WHERE location IN ($location)
- Test selection and verify updates
Expected: Dropdown filters entire dashboard, "All" shows all devices.
Section 6: Production Readiness Checklist
Complete this comprehensive checklist before production deployment.
The checklist covers five areas:
- Infrastructure Health
- Performance Validation
- Alerting and Notifications
- Security and Access Control
- Backup and Documentation
Section 7: Next Steps & Advanced Topics
You've built a production-grade unified monitoring solution. Here are advanced topics to expand your capabilities.
1. Advanced Alerting Strategies
Predictive Alerting: Use the Grafana Machine Learning plugin or InfluxQL's HOLT_WINTERS() function to forecast metric values and predict issues before they occur.
Composite Alerts: Create alerts evaluating multiple conditions across data sources.
Dynamic Thresholds: Use percentile-based alerts that adapt to normal patterns instead of static thresholds.
2. Scaling for Enterprise
High Availability: Deploy Grafana behind load balancer with shared PostgreSQL backend.
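For the high-availability setup described above, the key requirement is that every Grafana instance shares the same database. A hedged sketch of the relevant grafana.ini section, written as a shell heredoc with placeholder connection details (in practice, apply it through your configuration management tooling):
# Point every Grafana instance at the same shared PostgreSQL backend (placeholder values)
cat >> /etc/grafana/grafana.ini <<'EOF'
[database]
type = postgres
host = postgres.example.internal:5432
name = grafana
user = grafana
password = CHANGE_ME
EOF
systemctl restart grafana-server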
Performance: Implement query result caching, database replication for read scaling, and CDN for static assets.
Multi-tenancy: Use Grafana organizations for customer/department isolation.
3. Integration Opportunities
ITSM Integration: Connect alerts to ServiceNow, Jira for automatic ticket creation.
ChatOps: Deploy Grafana Slack bot for dashboard queries from chat.
Automation: Use Terraform for infrastructure-as-code dashboard deployments.
4. Continuous Improvement
Monthly Review: Analyze alert fatigue metrics, dashboard usage statistics, and query performance trends.
Feedback Loop: Collect input from operations team on dashboard effectiveness.
Stay Updated: Follow Grafana Labs blog, join community forums, attend virtual meetups.
🎉 Congratulations!
You've completed the Grafana Dashboard Integration Lab! You've learned to:
- Configure three distinct data sources (LibreNMS, Telegraf, VictoriaLogs)
- Write queries in SQL, InfluxQL, and LogQL
- Build integrated dashboards for unified monitoring
- Troubleshoot common integration issues
- Apply production-grade best practices
- Validate deployments through comprehensive testing
What's Next? Apply these skills to your production environment, share your dashboards with your team, and continue learning through the WholeStack Solutions platform.