Module Overview
InfluxDB is a time-series database purpose-built for handling high-write and query loads of time-stamped data. When paired with Telegraf (an agent for collecting and reporting metrics), it becomes a powerful system monitoring solution. This module will guide you through configuring InfluxDB as a Grafana data source and creating queries to visualize system metrics collected by Telegraf agents deployed across your infrastructure.
What you'll learn:
- Differences between InfluxDB 1.x and 2.x architectures
- Creating API tokens and authentication for both versions
- Writing Flux queries for InfluxDB 2.x
- Writing InfluxQL queries for InfluxDB 1.x
- Understanding measurements, fields, and tags
- Time aggregation and automatic interval adjustment
- Creating reusable dashboard variables
- Performance optimization and retention policies
Estimated completion time: 25-30 minutes
InfluxDB Version Comparison
Before configuring your data source, it's essential to understand which version of InfluxDB you're working with. The two versions have significantly different query languages, authentication methods, and organizational structures.
InfluxDB 1.x (InfluxQL)
- Query Language: InfluxQL (SQL-like)
- Authentication: Username/Password
- Organization: Database → Retention Policy
- Port: 8086 (HTTP)
- Best For: Legacy systems, simpler queries
- Support Status: Maintenance mode
InfluxDB 2.x (Flux)
- Query Language: Flux (functional)
- Authentication: API Tokens
- Organization: Organization → Bucket
- Port: 8086 (HTTP)
- Best For: Complex transformations, new deployments
- Support Status: Active development
How to Check Your Version
Run the command that matches your installation on the InfluxDB server:
influx -version (1.x prints "InfluxDB shell version: 1.x.x")
influx version (2.x prints "Influx CLI 2.x.x")
Or check the web UI: InfluxDB 2.x serves a built-in web interface at http://server:8086, while 1.x requires Chronograf for a UI.
InfluxDB 2.x Configuration (Recommended)
InfluxDB 2.x represents the modern architecture with improved performance, better query capabilities, and a unified authentication system. We'll start with this version as it's the current standard for new deployments.
Create an API Token in InfluxDB 2.x
Before configuring Grafana, you need to create an API token in InfluxDB. This token will authenticate Grafana's requests and define what data it can access.
Using the Web UI (Recommended):
1. Navigate to your InfluxDB 2.x web interface: http://your-influxdb-server:8086
2. Log in with your InfluxDB credentials
3. Click on Data (left sidebar) → API Tokens
4. Click + Generate API Token → Read/Write Token
5. Configure the token permissions:
   - Description: "Grafana Read Token"
   - Read Buckets: Select "telegraf" (or your metrics bucket)
   - Write Buckets: None (Grafana only needs read access)
6. Click Generate and immediately copy the token - you won't be able to see it again!
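Using the CLI:
Run influx auth create with --org set to your organization, --read-bucket set to the ID of your metrics bucket, and --description set to a label such as "Grafana Read Token". The command prints the new token.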
Token Security
Store your API token securely! If you lose it, you'll need to generate a new one. Never commit tokens to version control or share them in plain text. Consider using environment variables or secrets management systems in production.
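One way to follow this advice is to read the token from an environment variable at runtime instead of hard-coding it. The sketch below is a minimal Python illustration, not a definitive implementation: the variable name INFLUX_TOKEN and the helper name are assumptions made for this example, while the "Token <value>" Authorization scheme is the one the InfluxDB 2.x HTTP API accepts.

```python
import os


def influx_auth_header(env_var: str = "INFLUX_TOKEN") -> dict:
    """Build an HTTP Authorization header from a token held in the environment.

    INFLUX_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail loudly rather than sending an unauthenticated request.
        raise RuntimeError(f"{env_var} is not set")
    return {"Authorization": f"Token {token}"}
```

A script that queries InfluxDB directly would pass the returned dictionary as its request headers, so the token never appears in source control.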
Add InfluxDB 2.x Data Source in Grafana
Now that you have your API token, let's configure Grafana to connect to InfluxDB 2.x.
Navigation Path:
Connections → Data Sources → Add data source → InfluxDB
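Configuration Settings:
- Query Language: Flux
- URL: http://your-influxdb-server:8086
- Organization: your InfluxDB organization name
- Token: the API token created in the previous step
- Default Bucket: telegraf (or your metrics bucket)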
Testing the Connection:
After entering all settings, scroll down and click Save & Test. You should see:
Connection Successful
Data source is working
• Found 1 bucket: telegraf
• InfluxDB version: 2.x.x
Common Connection Errors
"Unauthorized" - Check your API token has read permissions for the bucket
"Not Found" - Verify organization name and bucket name are correct (case-sensitive)
"Connection Refused" - Check URL, port (8086), and firewall rules
"Timeout" - Increase timeout in Advanced settings or check network connectivity
Understanding Flux Query Structure
Flux is a functional data scripting language designed for querying, analyzing, and acting on time-series data. Unlike SQL, Flux uses a pipeline model where data flows through a series of functions.
Basic Flux Query Pattern:
Key Flux Concepts:
- |> - Pipe operator: passes output of one function to input of next
- from() - Specifies which bucket to query
- range() - Defines the time window for data retrieval
- filter() - Reduces data based on conditions
- _measurement - The metric type (cpu, mem, disk, etc.)
- _field - The actual metric value being measured
- Tags - Metadata for grouping (hostname, cpu, interface, etc.)
Grafana Variables in Flux
v.timeRangeStart - Dashboard time picker start (automatic)
v.timeRangeStop - Dashboard time picker end (automatic)
v.windowPeriod - Calculated aggregation interval based on time range
${variable} - Custom dashboard variable (e.g., ${hostname})
Query Example 1: CPU Usage by Core
This query retrieves CPU usage for all cores and displays idle time inverted to show utilization percentage.
Query Explanation:
- Line 3: Filters for CPU measurement type
- Line 4: Selects the idle percentage field (easier to invert than sum of all usage types)
- Line 5: Excludes the cpu-total aggregate, showing individual cores (cpu0, cpu1, etc.)
- Line 6: Aggregates data points into time windows using mean average
- Line 7: Inverts idle to usage (100 - idle = used)
- Line 8: Names the output stream for reference
Expected Output:
Time series showing CPU usage percentage for each core, typically resulting in multiple lines on a graph (one per CPU core).
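The "aggregate with mean, then invert" order described in the explanation can be illustrated outside the database. The following is a minimal Python sketch of that arithmetic for a single core, not the query itself: the function name, the sample data, and the choice to label each window by its start time are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def usage_per_window(samples, window_s):
    """Average idle percentages per time window, then convert to usage.

    samples: iterable of (unix_seconds, usage_idle_percent) for one core.
    Returns {window_start_seconds: usage_percent}.
    """
    buckets = defaultdict(list)
    for ts, idle in samples:
        # Assign each sample to the window that contains its timestamp.
        buckets[ts - ts % window_s].append(idle)
    # Invert after averaging: 100 - mean(idle) = mean usage for the window.
    return {start: 100.0 - mean(vals) for start, vals in sorted(buckets.items())}


# Hypothetical samples: idle drops from 90% to 60% over 30 seconds.
samples = [(0, 90.0), (10, 80.0), (20, 70.0), (30, 60.0)]
print(usage_per_window(samples, 20))  # {0: 15.0, 20: 35.0}
```

Each additional core would produce its own dictionary, which corresponds to one line per core on the graph.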
Query Example 2: Total CPU Usage with Alert Threshold
This query shows overall system CPU usage and can be used with Grafana alerts.
Query Features:
- Uses cpu-total for system-wide CPU percentage
- Includes hostname variable for multi-server dashboards
- =~ operator allows regex matching for flexible filtering
- Single line output perfect for gauge panels or alert rules
Creating the $hostname Variable
Dashboard Settings → Variables → Add variable:
Name: hostname
Type: Query
Query:
import "influxdata/influxdb/schema"
schema.tagValues(bucket: "telegraf", tag: "host")
Query Example 3: Memory Usage Percentage
Displays memory utilization as a percentage of total available memory.
Alternative: Memory Usage in Bytes
Panel Configuration Tip:
For the bytes query, configure the Y-axis with Unit: Data → bytes (IEC) in the panel settings. This will automatically display values as KiB, MiB, or GiB for better readability.
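To make the IEC scaling concrete, here is a minimal Python sketch of the kind of conversion the panel unit performs: divide by 1024 until the value fits the next unit. The function name format_iec and the one-decimal rounding are assumptions for illustration; Grafana's own rounding may differ.

```python
def format_iec(n_bytes):
    """Render a byte count using binary (IEC) units, one decimal place."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n_bytes)
    for unit in units:
        # Stop at the first unit where the value is below 1024,
        # or at the largest unit this sketch knows about.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


print(format_iec(1536))          # 1.5 KiB
print(format_iec(8 * 1024**3))   # 8.0 GiB
```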
Query Example 4: Disk Usage by Mount Point
Monitor disk space across multiple filesystems with automatic mount point detection.
Understanding Disk Metrics:
| Field | Description | Use Case |
|---|---|---|
| used_percent | Percentage of disk space used | Quick capacity check, alerting |
| used | Bytes used on filesystem | Capacity planning, trend analysis |
| free | Bytes available on filesystem | Free space monitoring |
| total | Total filesystem size in bytes | Inventory, capacity documentation |
| inodes_used | Number of inodes consumed | File count monitoring (Linux/Unix) |
Common Tag Filters:
- path: Mount point (/, /home, /var, etc.)
- device: Physical device (/dev/sda1, etc.)
- fstype: Filesystem type (ext4, xfs, ntfs, etc.)
- mode: Mount mode (rw, ro)
Query Example 5: Network Interface Traffic
Track inbound and outbound network traffic rates across all interfaces.
Query Breakdown:
- bytes_sent: Cumulative outbound bytes (TX)
- bytes_recv: Cumulative inbound bytes (RX)
- derivative(): Calculates rate of change (bytes per second)
- nonNegative: true: Handles counter resets (e.g., interface restart)
- unit: 1s: Normalizes to bytes per second
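The rate calculation in the breakdown above can be sketched in a few lines of Python. This is an illustration of the idea under stated assumptions, not the database's implementation: the function name and sample counters are hypothetical, and the sketch simply drops any interval where the counter decreased, whereas the value a particular InfluxDB version emits for that interval may differ.

```python
def non_negative_rate(points):
    """Turn a cumulative byte counter into a per-second rate.

    points: list of (unix_seconds, cumulative_bytes), sorted by time.
    Returns a list of (unix_seconds, bytes_per_second).
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        delta = v1 - v0
        if delta < 0 or t1 == t0:
            # Counter reset (or duplicate timestamp): skip this interval.
            continue
        rates.append((t1, delta / (t1 - t0)))
    return rates


# Hypothetical counter that resets between t=10 and t=20.
points = [(0, 1000), (10, 6000), (20, 500), (30, 2500)]
print(non_negative_rate(points))  # [(10, 500.0), (30, 200.0)]
```

Without the reset handling, the interval ending at t=20 would show a large negative spike, which is why the non-negative option matters for interface counters.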
Panel Configuration:
Set the unit to Data rate → bytes/sec (IEC) to display values as KiB/s, MiB/s, and so on. Use Transform: Organize fields to rename "bytes_sent" to "TX" and "bytes_recv" to "RX" for clarity.
Advanced: Show TX as Negative for Traffic Graph
This creates the traditional up/down network traffic visualization.
Query Example 6: System Load Average
Monitor system load averages to identify performance bottlenecks.
Understanding Load Average:
- load1: 1-minute load average (immediate activity)
- load5: 5-minute load average (recent trend)
- load15: 15-minute load average (long-term trend)
Interpretation: Load average represents the average number of processes that are running or waiting for CPU time. Compare it to the CPU core count: a load of 4.0 on a 4-core system means the cores are fully utilized, while a load of 8.0 on a 4-core system means significant queuing and potential performance issues.
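The comparison against core count is a single division. As a minimal Python sketch (the function name is an assumption; load1 and n_cpus are fields Telegraf reports in the system measurement):

```python
def load_per_core(load_avg, n_cpus):
    """Normalize a load average by core count; 1.0 means fully utilized."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return load_avg / n_cpus


print(load_per_core(4.0, 4))  # 1.0 -> cores fully utilized
print(load_per_core(8.0, 4))  # 2.0 -> about one process queued per running one
```

Plotting this ratio instead of the raw load makes one alert threshold usable across hosts with different core counts.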
Query Example 7: Disk I/O Operations
Track read and write operations per second to identify I/O bottlenecks.
Related Disk I/O Metrics:
| Field | Description | Unit |
|---|---|---|
| reads | Cumulative read operations | Counter (use derivative for IOPS) |
| writes | Cumulative write operations | Counter (use derivative for IOPS) |
| read_bytes | Cumulative bytes read | Bytes (use derivative for throughput) |
| write_bytes | Cumulative bytes written | Bytes (use derivative for throughput) |
| io_time | Time spent doing I/Os | Milliseconds |
Query Example 8: Process Count by State
Monitor process states to identify zombie processes or system resource exhaustion.
Process States Explained:
- running: Actively executing or waiting for CPU time
- sleeping: Waiting for an event (normal state for most processes)
- stopped: Halted by signal (debugging or job control)
- zombies: Completed but parent hasn't acknowledged (should be 0)
- blocked: Waiting for I/O to complete (high values indicate I/O bottleneck)
Zombie Process Alert
If zombie count is consistently above 0, it indicates processes not properly cleaned up by their parent. While not immediately harmful, large numbers of zombies can exhaust the process table. Identify the parent process and restart it if necessary.
InfluxDB 1.x Configuration (Legacy)
While InfluxDB 1.x is in maintenance mode, many production environments still run this version. The configuration process differs significantly from 2.x, using InfluxQL (SQL-like) queries instead of Flux.
Create Read-Only User in InfluxDB 1.x
Connect to your InfluxDB 1.x instance and create a dedicated user for Grafana with read-only permissions.
InfluxDB 1.x Authentication
If authentication is not enabled in your InfluxDB 1.x instance, you can skip user creation and leave the username/password fields empty in Grafana. However, this is not recommended for production.
To enable authentication, edit /etc/influxdb/influxdb.conf:
[http]
auth-enabled = true
Then restart InfluxDB: systemctl restart influxdb
Add InfluxDB 1.x Data Source in Grafana
Configure Grafana to connect using InfluxQL instead of Flux.
Navigation Path:
Connections → Data Sources → Add data source → InfluxDB
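Configuration Settings:
- Query Language: InfluxQL
- URL: http://your-influxdb-server:8086
- Database: telegraf
- User: the read-only user created above
- Password: that user's password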
Testing the Connection:
Click Save & Test. You should see:
Data source is working
• Database: telegraf
• InfluxDB version: 1.x.x
Understanding InfluxQL Query Structure
InfluxQL is similar to SQL, making it familiar for those with relational database experience. It uses SELECT, FROM, WHERE, and GROUP BY clauses.
Basic InfluxQL Query Pattern:
Key InfluxQL Concepts:
- Aggregation Functions: mean(), sum(), min(), max(), count(), median(), stddev()
- $timeFilter: Grafana variable for dashboard time range (required)
- $__interval: Automatically calculated aggregation interval
- =~ / !~: Regex match and non-match operators
- /^$variable$/: Grafana variable wrapped in an anchored regex, used in a WHERE clause to match the selected value exactly
Grafana Variables in InfluxQL
$timeFilter - Required in WHERE clause for time filtering
$__interval - Auto-calculated GROUP BY time interval
$hostname - Dashboard variable (create in Variables settings)
$__interval_ms - Interval in milliseconds
InfluxQL Query Example 1: CPU Usage
Calculate CPU usage percentage from idle time, similar to the Flux example.
Query Components:
- 100 - mean("usage_idle"): Converts idle to usage percentage
- AS "CPU Usage %": Sets display name for the series
- cpu-total: System-wide aggregate (use 'cpu0', 'cpu1', etc. for per-core)
- FILL(null): Handles gaps in data (alternatives: FILL(0), FILL(previous))
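The difference between the three fill options is easiest to see on a short series with empty windows. The following Python sketch mimics the behavior described above; the function name and the use of None for an empty window are assumptions made for this illustration.

```python
def fill_gaps(values, mode="null"):
    """Apply an InfluxQL-style fill to per-window results.

    values: list of window results where None marks a window with no data.
    mode: "null" (leave gaps), "0" (replace gaps with zero),
          or "previous" (carry the last seen value forward).
    """
    if mode == "null":
        return list(values)
    if mode == "0":
        return [0 if v is None else v for v in values]
    if mode == "previous":
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            # A gap before any data has nothing to carry forward.
            filled.append(last)
        return filled
    raise ValueError(f"unknown fill mode: {mode}")


windows = [None, 12.5, None, 30.0]
print(fill_gaps(windows, "0"))         # [0, 12.5, 0, 30.0]
print(fill_gaps(windows, "previous"))  # [None, 12.5, 12.5, 30.0]
```

For a usage percentage, filling with 0 can make an outage look like an idle system, so leaving gaps as null is often the safer default.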
Alternative: All CPU Cores
InfluxQL Query Example 2: Memory Usage
Display memory utilization with both percentage and absolute values.
Alternative: Memory in Bytes with Labels
Configure the panel with Unit: bytes (IEC) for automatic KiB/MiB/GiB formatting.
InfluxQL Query Example 3: Disk Usage by Partition
Monitor disk space across all mounted filesystems with mount point labels.
Using Subqueries for Free Space Alerts:
This query shows free space in GB for the root partition, useful for capacity alerts.
InfluxQL Query Example 4: Network Traffic Rate
Calculate network throughput using the DERIVATIVE() function.
Understanding DERIVATIVE():
- non_negative_derivative(): Calculates rate of change, ignoring decreases (counter resets)
- 1s parameter: Normalizes rate to per-second (use 1m for per-minute)
- Combined with mean(): First aggregates, then calculates rate
Panel Configuration for Traffic Graph
Unit: Data rate → bytes/sec (SI) or bytes/sec (IEC)
Transform: Add "Organize fields" to rename series
Override: For TX series, use negative Y-axis to create symmetrical graph
InfluxQL Advanced: Using Subqueries
InfluxQL supports nested queries for complex calculations and filtering.
Example: Highest CPU Usage in Last Hour
Example: Moving Average
The moving average smooths out short-term fluctuations and highlights longer-term trends.
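As a rough illustration of the smoothing, here is a minimal Python sketch of a simple moving average over the last n values. The function name and sample data are assumptions for this example; like a rolling window in the database, the first result appears only once n values are available.

```python
def moving_average(values, n):
    """Average each value with the n - 1 values before it."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [
        sum(values[i - n + 1 : i + 1]) / n
        for i in range(n - 1, len(values))
    ]


# Hypothetical CPU usage samples with one spike.
cpu = [10, 20, 90, 20, 10]
print(moving_average(cpu, 3))  # [40.0, 43.333333333333336, 40.0]
```

The spike of 90 is spread across three outputs, which is the trade-off: less noise, but slower reaction to real changes.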
Telegraf Measurements Reference
Telegraf organizes metrics into measurements, which are similar to database tables. Each measurement contains fields (actual values) and tags (metadata for grouping).
| Measurement | Key Fields | Common Tags | Use Case |
|---|---|---|---|
| cpu | usage_idle, usage_system, usage_user, usage_iowait | host, cpu (cpu0, cpu-total) | CPU performance monitoring, capacity planning |
| mem | used_percent, available_percent, used, available, total | host | Memory utilization tracking, OOM prevention |
| disk | used_percent, free, total, used | host, path, device, fstype | Disk capacity monitoring, storage planning |
| diskio | reads, writes, read_bytes, write_bytes, io_time | host, name (device name) | I/O performance analysis, bottleneck detection |
| net | bytes_sent, bytes_recv, packets_sent, packets_recv, err_in, err_out | host, interface | Network traffic monitoring, bandwidth analysis |
| netstat | tcp_established, tcp_listen, tcp_close_wait | host | Connection state tracking, troubleshooting |
| system | load1, load5, load15, n_cpus, uptime | host | System load monitoring, uptime tracking |
| processes | running, sleeping, stopped, zombies, blocked, total | host | Process state monitoring, zombie detection |
| swap | used_percent, used, free, total, in, out | host | Swap usage monitoring, memory pressure detection |
| kernel | context_switches, interrupts, processes_forked | host | Low-level system performance analysis |
Query Performance Optimization
Optimizing your InfluxDB queries is crucial for dashboard responsiveness, especially with high-cardinality data or long time ranges.
Best Practices for Fast Queries
1. Use Appropriate Time Ranges
Avoid querying more data than necessary. For real-time monitoring, use shorter time ranges (last 1-6 hours). For trend analysis, use downsampling or continuous queries.
2. Filter Early and Often
3. Leverage Aggregation Windows
Use aggregateWindow() in Flux or GROUP BY time($__interval) in InfluxQL to reduce data points returned to Grafana. For a 24-hour view, 1-minute aggregation is usually sufficient.
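The saving from aggregation is simple arithmetic: points per series equals the time range divided by the interval. The Python sketch below works through the 24-hour case; the 10-second raw interval used for comparison is an assumption (it matches Telegraf's default agent interval, but your configuration may differ).

```python
import math


def points_per_series(range_seconds, interval_seconds):
    """Number of points one series returns for a given range and interval."""
    return math.ceil(range_seconds / interval_seconds)


day = 24 * 3600
print(points_per_series(day, 10))  # 8640 raw points at an assumed 10s interval
print(points_per_series(day, 60))  # 1440 points with 1-minute aggregation
```

Multiply the result by the number of series in a panel to estimate how much data Grafana has to fetch and render.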
4. Avoid High-Cardinality Grouping
Grouping by tags with many unique values (like process ID or transaction ID) can create thousands of series. Limit grouping to essential tags like hostname, interface, or mount point.
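A quick way to estimate the risk is to multiply the number of unique values of each grouped tag, which gives an upper bound on the series a query can return. The sketch below is a back-of-the-envelope Python illustration; the function name and the tag counts are made up for the example, and real data usually contains fewer combinations than the bound.

```python
from math import prod


def max_series(tag_cardinalities):
    """Upper bound on series count when grouping by the given tags.

    tag_cardinalities: {tag_name: number_of_unique_values}
    """
    return prod(tag_cardinalities.values())


# Hypothetical fleet: 50 hosts with 4 interfaces each.
print(max_series({"host": 50, "interface": 4}))               # 200
# Adding a process-ID tag with 3000 unique values.
print(max_series({"host": 50, "interface": 4, "pid": 3000}))  # 600000
```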
5. Use Retention Policies (InfluxDB 1.x) or Buckets (InfluxDB 2.x)
Configure automatic data downsampling for older data to maintain query performance:
InfluxDB 2.x Tasks for Downsampling
In InfluxDB 2.x, use Tasks instead of Continuous Queries:
6. Monitor Query Execution Time
Enable query logging in InfluxDB to identify slow queries. In InfluxDB 2.x, check the Task runs page. In InfluxDB 1.x, enable query logs in the configuration file.
Security Best Practices
Securing your InfluxDB connection and data is critical, especially in production environments.
Critical Security Reminders
- Never use admin credentials in Grafana - Always create read-only users/tokens
- Enable TLS/SSL for production - Encrypt data in transit
- Rotate API tokens regularly - Especially if they may have been exposed
- Use network segmentation - InfluxDB should not be directly exposed to the internet
- Enable authentication - Never run InfluxDB without auth in production
- Monitor access logs - Track who queries what data and when
Enabling TLS for InfluxDB 2.x
Then update your Grafana data source URL to https://influxdb-server:8086 and configure TLS certificate validation.
Troubleshooting Common Issues
Issue: "No data" in Grafana Panels
Diagnosis Steps:
- Verify data exists in InfluxDB
  InfluxDB 2.x (Flux):
  from(bucket: "telegraf")
    |> range(start: -1h)
    |> filter(fn: (r) => r["_measurement"] == "cpu")
    |> limit(n: 5)
  InfluxDB 1.x (InfluxQL):
  SELECT * FROM "cpu" LIMIT 5
- Check time range - Ensure dashboard time picker matches data availability
- Verify filters - Check that host/tag filters match actual tag values (case-sensitive)
- Review query syntax - Look for syntax errors in Query Inspector (panel menu → Inspect → Query)
Issue: Queries Running Slowly
Solutions:
- Reduce time range or increase aggregation interval
- Add more specific tag filters to reduce cardinality
- Create continuous queries or tasks for pre-aggregated data
- Check InfluxDB server resources (CPU, memory, disk I/O)
- Consider upgrading InfluxDB or scaling horizontally (Enterprise)
Issue: "Unsupported protocol scheme" Error
Cause:
This error occurs when the URL is malformed or uses an incorrect protocol.
Solution:
- Ensure URL starts with http:// or https://
- Do not include trailing slashes: http://server:8086 (not http://server:8086/)
- Verify port number (default is 8086)
- Check that InfluxDB is actually running: systemctl status influxdb
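The URL checks in this list can be automated before pasting a value into Grafana. The following Python sketch uses only the standard library; the function name and the wording of the messages are assumptions for illustration, and it checks the URL's shape only, not whether the server is reachable.

```python
from urllib.parse import urlsplit


def check_influx_url(url):
    """Return a list of problems found in a data source URL (empty if none)."""
    problems = []
    parts = urlsplit(url)
    has_valid_scheme = parts.scheme in ("http", "https")
    if not has_valid_scheme:
        problems.append("URL must start with http:// or https://")
    if url.endswith("/"):
        problems.append("remove the trailing slash")
    # Only inspect the port once the scheme is known to be valid.
    if has_valid_scheme and parts.port is None:
        problems.append("no explicit port; InfluxDB listens on 8086 by default")
    return problems


print(check_influx_url("http://server:8086"))   # []
print(check_influx_url("http://server:8086/"))  # ['remove the trailing slash']
```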
Next Steps
You've now configured InfluxDB as a data source and learned how to query system metrics collected by Telegraf. Here are some recommended next steps:
- Create a comprehensive system dashboard - Combine CPU, memory, disk, and network panels
- Set up alerts - Configure Grafana alerts for critical thresholds (CPU > 90%, disk > 85%)
- Explore Telegraf plugins - Add custom input plugins for application-specific metrics
- Implement data retention - Configure retention policies or tasks to manage storage
- Build team-specific views - Create dashboards tailored for dev, ops, and management
Pro Tips for Production
- Use dashboard variables for multi-server views (create $hostname dropdown)
- Set default time range to "Last 6 hours" for operational dashboards
- Enable auto-refresh (30s or 1m) for NOC dashboards
- Use alert annotations to mark incidents on graphs
- Create separate dashboards for real-time monitoring vs. capacity planning
Module 4 Complete!
You've successfully configured InfluxDB/Telegraf integration and created multiple query examples. You're now ready to visualize system metrics and move on to log aggregation with VictoriaLogs.