Module 1 of 8

🎯 Overview & Architecture

Understanding the Monitoring Stack Foundations

📚 Learning Objectives

By the end of this module, you will be able to:

  • Explain the role of each monitoring component (LibreNMS, Telegraf, VictoriaLogs) in the stack
  • Understand how data flows from collection sources through to Grafana visualization
  • Identify the query languages used for each data source (SQL, InfluxQL, LogQL)
  • Recognize the integration patterns that enable cross-source correlation
  • Describe the benefits of unified monitoring versus siloed approaches

🌟 The Challenge: Monitoring Silos

In traditional enterprise environments, monitoring systems operate in isolation. Network teams use SNMP-based tools like LibreNMS to track router interfaces and switch ports. System administrators deploy agents like Telegraf to collect CPU, memory, and disk metrics from servers. Meanwhile, DevOps teams aggregate application logs using tools like VictoriaLogs or Elasticsearch.

When an incident occurs (say, users report slow application performance), troubleshooting becomes a fragmented exercise. You might check LibreNMS for network congestion, then switch to your metrics platform for CPU spikes, and finally dig through logs to find application errors. Each tool provides a piece of the puzzle, but correlating events across systems requires manual effort, domain expertise, and valuable time.

This lab teaches you to break down these silos. By integrating LibreNMS, Telegraf, and VictoriaLogs into Grafana, you create a single pane of glass where network issues, system performance degradation, and application errors appear side-by-side, making root cause analysis faster and more intuitive.

πŸ—οΈ Architecture Overview

The Grafana Dashboard Integration architecture consists of three data collection layers feeding into a unified visualization platform. Each layer specializes in a different type of observability data, but all share common characteristics that enable integration:

🌐 Network Layer

LibreNMS uses SNMP to poll network devices, storing metrics in MySQL. It provides interface statistics, device availability, BGP session status, and environmental sensor readings.

📡 System Layer

Telegraf agents deployed on servers collect system metrics and push them to InfluxDB, a time-series database that stores CPU, memory, disk I/O, network traffic, and custom application metrics.

📝 Application Layer

VictoriaLogs aggregates logs from applications, systems, and infrastructure. It provides full-text search, label-based filtering, and log rate calculations for observability.

Complete System Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                     GRAFANA DASHBOARD LAYER (Port: 3000)                    │
│                     Unified Visualization & Correlation                     │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │  Network Health  │  │  System Metrics  │  │  Log Analytics   │           │
│  │    Dashboard     │  │    Dashboard     │  │    Dashboard     │           │
│  │                  │  │                  │  │                  │           │
│  │  • Interface     │  │  • CPU/Memory    │  │  • Error Rates   │           │
│  │    Traffic       │  │  • Disk I/O      │  │  • Log Search    │           │
│  │  • BGP Status    │  │  • Network Stats │  │  • Aggregations  │           │
│  │  • Device Health │  │  • Process Info  │  │  • Filtering     │           │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘           │
└───────────┬─────────────────────┬─────────────────────┬─────────────────────┘
            │                     │                     │
   ┌────────▼────────┐   ┌────────▼────────┐   ┌────────▼────────┐
   │    LibreNMS     │   │    InfluxDB     │   │  VictoriaLogs   │
   │   Data Source   │   │   Data Source   │   │   Data Source   │
   │                 │   │                 │   │                 │
   │  MySQL Plugin   │   │  Native         │   │  Loki Plugin    │
   │  Port: 3306     │   │  Port: 8086     │   │  Port: 9428     │
   │                 │   │                 │   │                 │
   │  Query: SQL     │   │  Query: InfluxQL│   │  Query: LogQL   │
   │                 │   │  or Flux        │   │                 │
   └────────┬────────┘   └────────┬────────┘   └────────┬────────┘
            │                     │                     │
   ┌────────▼────────┐   ┌────────▼────────┐   ┌────────▼────────┐
   │    LibreNMS     │   │    Telegraf     │   │ Vector/Promtail │
   │     Server      │   │     Agents      │   │  Log Shippers   │
   │                 │   │                 │   │                 │
   │  SNMP Poller    │   │  Collectors:    │   │  Collectors:    │
   │  MySQL Database │   │  • system       │   │  • Syslog       │
   │                 │   │  • cpu          │   │  • App Logs     │
   │  Collects:      │   │  • mem          │   │  • Container    │
   │  • Interfaces   │   │  • disk         │   │  • Audit Logs   │
   │  • Devices      │   │  • net          │   │                 │
   │  • BGP Peers    │   │  • docker       │   │                 │
   │  • Sensors      │   │  • custom       │   │                 │
   └────────┬────────┘   └────────┬────────┘   └────────┬────────┘
            │                     │                     │
   ┌────────▼─────────────────────▼─────────────────────▼────────┐
   │                  MONITORED INFRASTRUCTURE                   │
   │                                                             │
   │  ┌────────────┐      ┌────────────┐      ┌────────────┐     │
   │  │  Network   │      │   Linux    │      │  Windows   │     │
   │  │  Devices   │      │  Servers   │      │  Servers   │     │
   │  │            │      │            │      │            │     │
   │  │  • Routers │      │  • Web     │      │  • SQL     │     │
   │  │  • Switches│      │  • App     │      │  • AD/DNS  │     │
   │  │  • FW/LB   │      │  • Docker  │      │  • IIS     │     │
   │  │  • WiFi AP │      │  • K8s     │      │  • Exchange│     │
   │  └────────────┘      └────────────┘      └────────────┘     │
   └─────────────────────────────────────────────────────────────┘

💡 Key Architectural Principle

Notice that Grafana acts as the query federation layer. It doesn't store data itself; instead, it queries each specialized data store in real time, using the appropriate protocol and query language. This architecture provides flexibility: you can upgrade or replace individual components without disrupting the entire stack.

🔄 Data Flow Patterns

Understanding how data moves from collection points through storage to visualization is crucial for effective troubleshooting and optimization. Each data source follows a similar pattern but with important distinctions:

🌐 LibreNMS Data Flow

Collection Method: SNMP polling (v2c or v3) at 5-minute intervals (default)

Storage: MySQL/MariaDB relational database with normalized schema

Data Retention: Configurable, typically 1 year for raw metrics, indefinite for device inventory

Grafana Integration: MySQL data source plugin executing SELECT queries

Query Language: Standard SQL with time-series specific functions

Primary Use Cases:

  • Interface bandwidth utilization and error rates
  • Device availability and uptime tracking
  • BGP/OSPF routing protocol status
  • Environmental sensors (temperature, voltage, fan speed)
  • Network device inventory management

Data Format: Structured rows with pre-calculated rates (octets/sec) stored in dedicated tables

📡 Telegraf/InfluxDB Data Flow

Collection Method: Agent-based push model with configurable input plugins

Storage: InfluxDB time-series database optimized for high-write throughput

Data Retention: Retention policies and downsampling (e.g., 90 days full resolution, 2 years aggregated)

Grafana Integration: Native InfluxDB data source with InfluxQL or Flux support

Query Language: InfluxQL (SQL-like) or Flux (functional language)

Primary Use Cases:

  • Server resource utilization (CPU, memory, disk, network)
  • Application performance metrics (response times, request rates)
  • Container and Kubernetes metrics
  • Database performance counters
  • Custom business metrics via StatsD or HTTP inputs

Data Format: Measurements with tags (indexed) and fields (non-indexed) using line protocol
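To make this concrete, here is what a single Telegraf CPU data point looks like in InfluxDB line protocol (hostname and values are illustrative):

```
cpu,host=web01,cpu=cpu-total usage_user=4.2,usage_system=3.3,usage_idle=92.5 1700000000000000000
```

Here `cpu` is the measurement, `host` and `cpu` are tags (indexed), the `usage_*` values are fields (not indexed), and the trailing number is a nanosecond-precision timestamp.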

📝 VictoriaLogs Data Flow

Collection Method: Log shippers (Vector, Promtail, Fluentd) push via HTTP

Storage: Columnar storage format optimized for log ingestion and compression

Data Retention: Time-based or size-based limits, typically 30-90 days depending on volume

Grafana Integration: Loki data source plugin (LogQL compatibility mode)

Query Language: LogQL, which combines label filtering with a log processing pipeline

Primary Use Cases:

  • Application error tracking and debugging
  • Security event aggregation and analysis
  • Audit trail and compliance logging
  • Infrastructure change tracking
  • Correlation of events across distributed systems

Data Format: Structured logs with labels (indexed for filtering) and message content (searchable)
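As an illustration, a log entry arriving in VictoriaLogs can be thought of as a set of indexed labels plus a free-text message (all names and values here are made up):

```
{job="webapp", hostname="web01", env="prod"}  level=error msg="database connection refused"
```

The labels in braces are cheap to filter on; the message body is what full-text search operates over.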

🔗 Integration Patterns

The power of this architecture lies not just in having three data sources, but in how they work together. Several key integration patterns enable effective correlation:

1. Time-Based Correlation

All three systems use UTC timestamps, allowing you to create dashboards where panels from different sources share the same time range. When you zoom into a 5-minute window showing a network outage in LibreNMS, system metrics from Telegraf and error logs from VictoriaLogs automatically adjust to the same time period.

2. Common Label Strategy

By using consistent naming conventions across data sources, particularly for hostname, environment, and region labels, you can create dashboard variables that filter all panels at once. A single "hostname" dropdown can drive queries to LibreNMS, InfluxDB, and VictoriaLogs simultaneously.
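One way to enforce this convention on the metrics side is Telegraf's `[global_tags]` section, which attaches the same tags to every metric the agent emits (the values below are illustrative):

```toml
# telegraf.conf: tags added to every measurement this agent collects
[global_tags]
  hostname = "web01"       # use the same name LibreNMS knows the host by
  environment = "prod"
  region = "us-east"
```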

3. Cross-Source Alerting

Grafana's alerting engine can reference multiple data sources in a single alert rule. For example, you might trigger an alert when:

  • Network interface errors exceed threshold (LibreNMS)
  • AND CPU usage is above 90% (Telegraf)
  • AND error logs increase by 300% (VictoriaLogs)

This multi-signal approach reduces false positives and provides richer context for on-call engineers.

4. Unified Variable Definitions

Dashboard variables can query any data source. You might define a $datacenter variable by querying LibreNMS for device locations, then use that same variable to filter InfluxDB metrics and VictoriaLogs streams. This creates a truly unified filtering experience.
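As a sketch, a $datacenter variable could be populated with a query like the following against the LibreNMS database (table and column names vary across LibreNMS versions, so verify against your schema):

```sql
-- Grafana variable query: one dropdown entry per device location
SELECT DISTINCT location FROM devices ORDER BY location;
```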

| Integration Aspect | Implementation Detail | User Benefit |
|---|---|---|
| Unified Time Series | All sources use UTC timestamps, synced via NTP | Accurate event correlation across systems |
| Common Labels | hostname, environment, region standardized | Single-click filtering across all data |
| Variable Templates | Dashboard variables query any data source | Dynamic, context-aware dashboards |
| Alert Correlation | Rules can reference multiple data sources | Smarter alerts with reduced noise |
| Annotation Integration | Events from logs appear on metric graphs | Visual correlation of cause and effect |

📊 Query Language Overview

Each data source uses a different query language optimized for its data model. Understanding these languages is essential for building effective dashboards:

SQL for LibreNMS (MySQL)

Standard SQL with time-series patterns. LibreNMS stores data in normalized tables with pre-calculated rates. Common patterns include JOINs between devices and ports tables, time-based WHERE clauses, and aggregation functions.

Example Use Case: Calculate total bandwidth across all interfaces on a device
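A hedged sketch of that query, using the typical LibreNMS `devices` and `ports` tables (column names may differ slightly between versions; the `_rate` columns hold the pre-calculated octets/sec values mentioned above):

```sql
-- Total bandwidth across all interfaces on one device, in bits/sec
SELECT d.hostname,
       SUM(p.ifInOctets_rate)  * 8 AS total_in_bps,
       SUM(p.ifOutOctets_rate) * 8 AS total_out_bps
FROM ports p
JOIN devices d ON d.device_id = p.device_id
WHERE d.hostname = 'core-sw-01'      -- illustrative device name
GROUP BY d.hostname;
```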

InfluxQL for Telegraf/InfluxDB

SQL-like query language designed for time-series data. Key concepts include measurements (like SQL tables), fields (values), and tags (indexed dimensions). Supports aggregation windows and time-based grouping.

Example Use Case: Show 95th percentile CPU usage per host over 24 hours
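Assuming Telegraf's default `cpu` input (measurement `cpu`, field `usage_user`, tag `host`), that use case might look like this in InfluxQL:

```sql
-- 95th percentile of user CPU per host over the last 24 hours
SELECT PERCENTILE("usage_user", 95)
FROM "cpu"
WHERE time > now() - 24h
GROUP BY "host"
```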

LogQL for VictoriaLogs

Combines label-based filtering (like Prometheus) with log processing pipelines. Queries start with label selectors, then pipe through filters, parsers, and aggregations. Supports regex, JSON extraction, and metric generation from logs.

Example Use Case: Calculate error rate per service from application logs
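A minimal sketch in LogQL, assuming the logs carry a `service` label and error lines contain the string "error": the selector picks the streams, the `|=` filter keeps matching lines, and `rate(...[5m])` turns them into a per-second rate.

```
sum by (service) (
  rate({env="prod"} |= "error" [5m])
)
```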

🎓 Learning Path

Don't worry if these query languages are new to you. Modules 3-5 provide step-by-step examples with detailed explanations. You'll start with simple queries and progressively build more complex visualizations. By Module 6, you'll be combining all three query languages in a single dashboard.

✨ Benefits of Unified Monitoring

Why invest time in integrating these systems? The benefits extend far beyond convenience:

Faster Mean Time to Resolution (MTTR)

When an incident occurs, having all relevant data in one place dramatically speeds up troubleshooting. Instead of logging into three different systems, correlating timestamps, and switching context, engineers see the complete picture immediately. Industry research, such as the annual DORA State of DevOps reports, consistently links mature observability practices with faster recovery from incidents.

Proactive Problem Detection

Correlation dashboards reveal patterns that single-source views miss. For example, you might notice that network packet loss (LibreNMS) consistently precedes disk I/O spikes (Telegraf) and database connection errors in logs (VictoriaLogs). This pattern might indicate a storage replication issue that's invisible when viewing systems in isolation.

Improved Collaboration

Network engineers, system administrators, and application developers often speak different technical languages and use different tools. Unified dashboards become a common ground: a shared vocabulary for discussing system health. During incident calls, everyone literally looks at the same graphs.

Cost Optimization

While this lab uses open-source tools, the integration patterns you'll learn apply to commercial solutions too. Understanding how to correlate data across specialized tools means you can avoid expensive "all-in-one" monitoring platforms that try to do everything but excel at nothing.

Historical Analysis and Capacity Planning

Unified dashboards make it easier to identify long-term trends across multiple dimensions. You might discover that network bandwidth growth (LibreNMS) correlates with specific application deployment patterns (logs) and requires CPU upgrades (Telegraf) six months before hitting capacity limits.

⚠️ Important Considerations

  • Data Retention: Each system has different retention policies. Plan accordingly: LibreNMS might keep interface stats for 1 year, but you might only retain detailed logs for 30 days.
  • Query Performance: Cross-source dashboards can generate significant load. Use dashboard refresh rates wisely (30s-1m is typical) and implement query caching where appropriate.
  • Time Synchronization: All systems MUST use NTP for accurate correlation. Even a few seconds of clock drift can make troubleshooting confusing.
  • Network Segmentation: Ensure Grafana can reach all data sources. Firewall rules and network policies must allow the necessary traffic flows.
  • Security: Each integration point is a potential security boundary. Use read-only credentials, implement network segmentation, and regularly audit access.

🎯 What's Next?

Now that you understand the architecture and integration patterns, you're ready to begin hands-on configuration. Module 2 will guide you through preparing your lab environment, verifying prerequisites, and ensuring all systems are accessible before you start connecting data sources to Grafana.

The journey from here follows a logical progression:

  1. Verify environment and prerequisites (Module 2)
  2. Configure each data source individually (Modules 3-5)
  3. Build integrated dashboards (Module 6)
  4. Learn troubleshooting techniques (Module 7)
  5. Validate your work and assess knowledge (Module 8)