Module 4 of 8

📡 Telegraf/InfluxDB Integration

System Metrics Collection and Visualization

📚 Module Overview

InfluxDB is a time-series database purpose-built to handle the high write and query loads of time-stamped data. Paired with Telegraf (an agent that collects and reports metrics), it becomes a powerful system-monitoring solution. This module guides you through configuring InfluxDB as a Grafana data source and creating queries to visualize system metrics collected by Telegraf agents deployed across your infrastructure.

What you'll learn:

  • Differences between InfluxDB 1.x and 2.x architectures
  • Creating API tokens and authentication for both versions
  • Writing Flux queries for InfluxDB 2.x
  • Writing InfluxQL queries for InfluxDB 1.x
  • Understanding measurements, fields, and tags
  • Time aggregation and automatic interval adjustment
  • Creating reusable dashboard variables
  • Performance optimization and retention policies

Estimated completion time: 25-30 minutes

šŸ” InfluxDB Version Comparison

Before configuring your data source, it's essential to understand which version of InfluxDB you're working with. The two versions have significantly different query languages, authentication methods, and organizational structures.

InfluxDB 1.x (InfluxQL)

  • Query Language: InfluxQL (SQL-like)
  • Authentication: Username/Password
  • Organization: Database → Retention Policy
  • Port: 8086 (HTTP)
  • Best For: Legacy systems, simpler queries
  • Support Status: Maintenance mode

InfluxDB 2.x (Flux)

  • Query Language: Flux (functional)
  • Authentication: API Tokens
  • Organization: Organization → Bucket
  • Port: 8086 (HTTP)
  • Best For: Complex transformations, new deployments
  • Support Status: Active development

💡 How to Check Your Version

Run the influx CLI version command on your InfluxDB server:

influx -version   # 1.x CLI prints "InfluxDB shell version: 1.x.x"
influx version    # 2.x CLI prints "Influx CLI 2.x.x"

Or check the web UI: InfluxDB 2.x has a modern web interface at http://server:8086, while 1.x requires Chronograf for UI.
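
If the CLI is not available, the HTTP API offers another quick check: the /ping endpoint returns the server version in a response header on both 1.x and 2.x (the host and port below are placeholders).

# Returns 204 No Content plus an X-Influxdb-Version header
curl -sI http://your-influxdb-server:8086/ping | grep -i x-influxdb-version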

🔧 InfluxDB 2.x Configuration (Recommended)

InfluxDB 2.x represents the modern architecture with improved performance, better query capabilities, and a unified authentication system. We'll start with this version as it's the current standard for new deployments.

Step 1: Create an API Token in InfluxDB 2.x

Before configuring Grafana, you need to create an API token in InfluxDB. This token will authenticate Grafana's requests and define what data it can access.

Using the Web UI (Recommended):

1. Navigate to your InfluxDB 2.x web interface: http://your-influxdb-server:8086

2. Log in with your InfluxDB credentials

3. Click on Data (left sidebar) → API Tokens

4. Click + Generate API Token → Read/Write Token

5. Configure the token permissions:

  • Description: "Grafana Read Token"
  • Read Buckets: Select "telegraf" (or your metrics bucket)
  • Write Buckets: None (Grafana only needs read access)

6. Click Generate and immediately copy the token - you won't be able to see it again!

Using the CLI:

# Create a read-only token for Grafana
influx auth create \
  --org your-org-name \
  --read-bucket telegraf \
  --description "Grafana Read Token"

# Output will show your token - save it immediately:
# ID             Description         Token                 User Name  User ID           Permissions
# 0abc123def456  Grafana Read Token  rW1pT3k...FaS8xYnM=   admin      0123456789abcdef  [read:orgs/...]

āš ļø Token Security

Store your API token securely! If you lose it, you'll need to generate a new one. Never commit tokens to version control or share them in plain text. Consider using environment variables or secrets management systems in production.

Step 2: Add InfluxDB 2.x Data Source in Grafana

Now that you have your API token, let's configure Grafana to connect to InfluxDB 2.x.

Navigation Path:

Connections → Data Sources → Add data source → InfluxDB

Configuration Settings:

# Basic Settings
Name: InfluxDB-Telegraf
Default: [✓]  Toggle ON if this is your primary metrics source

# Query Language
Query Language: Flux

# HTTP Settings
URL: http://influxdb-server:8086
  Note: Use http:// for internal networks, https:// for production.
  Use localhost or 127.0.0.1 if InfluxDB is on the same server as Grafana.
Access: Server (default)
  - Server = Grafana server connects to InfluxDB
  - Browser = User's browser connects directly (rarely used)

# InfluxDB Details
Organization: your-org-name
  Note: Find this in InfluxDB UI → Settings → Organization Settings
Token: your-api-token-here
  Paste the token you created in Step 1
Default Bucket: telegraf
  This is where Telegraf writes system metrics by default

# Advanced HTTP Settings (Optional)
Timeout: 60s (increase if you have slow queries)
Custom HTTP Headers: Leave empty unless required

# TLS Settings (for HTTPS connections)
Skip TLS Verify: OFF (keep security on in production)
With CA Cert: Enable if using self-signed certificates

Testing the Connection:

After entering all settings, scroll down and click Save & Test. You should see:

✓ Connection Successful

Data source is working

• Found 1 bucket: telegraf

• InfluxDB version: 2.x.x

āš ļø Common Connection Errors

"Unauthorized" - Check your API token has read permissions for the bucket

"Not Found" - Verify organization name and bucket name are correct (case-sensitive)

"Connection Refused" - Check URL, port (8086), and firewall rules

"Timeout" - Increase timeout in Advanced settings or check network connectivity

Step 3: Understanding Flux Query Structure

Flux is a functional data scripting language designed for querying, analyzing, and acting on time-series data. Unlike SQL, Flux uses a pipeline model where data flows through a series of functions.

Basic Flux Query Pattern:

from(bucket: "telegraf") // 1. Select data source |> range(start: v.timeRangeStart, stop: v.timeRangeStop) // 2. Time range |> filter(fn: (r) => r["_measurement"] == "cpu") // 3. Filter measurement |> filter(fn: (r) => r["_field"] == "usage_idle") // 4. Filter field |> filter(fn: (r) => r["cpu"] == "cpu-total") // 5. Filter tag |> aggregateWindow(every: v.windowPeriod, fn: mean) // 6. Aggregate data

Key Flux Concepts:

  • |> - Pipe operator: passes output of one function to input of next
  • from() - Specifies which bucket to query
  • range() - Defines the time window for data retrieval
  • filter() - Reduces data based on conditions
  • _measurement - The metric type (cpu, mem, disk, etc.)
  • _field - The actual metric value being measured
  • Tags - Metadata for grouping (hostname, cpu, interface, etc.)

💡 Grafana Variables in Flux

v.timeRangeStart - Dashboard time picker start (automatic)

v.timeRangeStop - Dashboard time picker end (automatic)

v.windowPeriod - Calculated aggregation interval based on time range

${variable} - Custom dashboard variable (e.g., ${hostname})

Step 4: Query Example 1: CPU Usage by Core

This query retrieves CPU usage for all cores and displays idle time inverted to show utilization percentage.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_idle")
  |> filter(fn: (r) => r["cpu"] != "cpu-total")                            // Exclude total, show per-core
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({ r with _value: 100.0 - r._value }))                 // Convert idle to usage
  |> yield(name: "cpu_usage")

Query Explanation:

  • Line 3: Filters for CPU measurement type
  • Line 4: Selects the idle percentage field (easier to invert than sum of all usage types)
  • Line 5: Excludes the cpu-total aggregate, showing individual cores (cpu0, cpu1, etc.)
  • Line 6: Aggregates data points into time windows using mean average
  • Line 7: Inverts idle to usage (100 - idle = used)
  • Line 8: Names the output stream for reference

Expected Output:

Time series showing CPU usage percentage for each core, typically resulting in multiple lines on a graph (one per CPU core).

Step 5: Query Example 2: Total CPU Usage with Alert Threshold

This query shows overall system CPU usage and can be used with Grafana alerts.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_idle")
  |> filter(fn: (r) => r["cpu"] == "cpu-total")                            // Only total CPU
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)                         // Dashboard variable
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({ r with _value: 100.0 - r._value }))
  |> yield(name: "total_cpu")

Query Features:

  • Uses cpu-total for system-wide CPU percentage
  • Includes hostname variable for multi-server dashboards
  • =~ operator allows regex matching for flexible filtering
  • Single line output perfect for gauge panels or alert rules
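
For a stat or gauge panel that only needs the current value rather than a full series, a variant of the same query can reduce the result to the latest point. This is a sketch using the same filters as above:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_idle")
  |> filter(fn: (r) => r["cpu"] == "cpu-total")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> map(fn: (r) => ({ r with _value: 100.0 - r._value }))   // Convert idle to usage
  |> last()                                                  // Keep only the most recent point per host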

💡 Creating the $hostname Variable

Dashboard Settings → Variables → Add variable:

Name: hostname

Type: Query

Query:

import "influxdata/influxdb/schema"
schema.tagValues(bucket: "telegraf", tag: "host")

Step 6: Query Example 3: Memory Usage Percentage

Displays memory utilization as a percentage of total available memory.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "mem")
  |> filter(fn: (r) => r["_field"] == "used_percent")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "memory_usage")

Alternative: Memory Usage in Bytes

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "mem")
  |> filter(fn: (r) => r["_field"] == "used" or r["_field"] == "available")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "memory_bytes")

Panel Configuration Tip:

For the bytes query, configure the Y-axis with Unit: Data → bytes (IEC) in the panel settings. This will automatically display values as KiB, MiB, or GiB for better readability.

Step 7: Query Example 4: Disk Usage by Mount Point

Monitor disk space across multiple filesystems with automatic mount point detection.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "disk")
  |> filter(fn: (r) => r["_field"] == "used_percent")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  // Exclude virtual filesystems
  |> filter(fn: (r) => r["fstype"] != "tmpfs" and r["fstype"] != "devtmpfs")
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> group(columns: ["path"])                                              // Group by mount point
  |> yield(name: "disk_usage")

Understanding Disk Metrics:

| Field        | Description                    | Use Case                           |
|--------------|--------------------------------|------------------------------------|
| used_percent | Percentage of disk space used  | Quick capacity check, alerting     |
| used         | Bytes used on filesystem       | Capacity planning, trend analysis  |
| free         | Bytes available on filesystem  | Free space monitoring              |
| total        | Total filesystem size in bytes | Inventory, capacity documentation  |
| inodes_used  | Number of inodes consumed      | File count monitoring (Linux/Unix) |

Common Tag Filters:

  • path: Mount point (/, /home, /var, etc.)
  • device: Physical device (/dev/sda1, etc.)
  • fstype: Filesystem type (ext4, xfs, ntfs, etc.)
  • mode: Mount mode (rw, ro)

Step 8: Query Example 5: Network Interface Traffic

Track inbound and outbound network traffic rates across all interfaces.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "net")
  |> filter(fn: (r) => r["_field"] == "bytes_sent" or r["_field"] == "bytes_recv")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  // Exclude loopback interface
  |> filter(fn: (r) => r["interface"] != "lo")
  |> derivative(unit: 1s, nonNegative: true)                               // Convert cumulative to rate per second
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "network_traffic")

Query Breakdown:

  • bytes_sent: Cumulative outbound bytes (TX)
  • bytes_recv: Cumulative inbound bytes (RX)
  • derivative(): Calculates rate of change (bytes per second)
  • nonNegative: true: Handles counter resets (e.g., interface restart)
  • unit: 1s: Normalizes to bytes per second

Panel Configuration:

Set the unit to Data rate → bytes/sec (IEC) to display values as KiB/s, MiB/s, and so on (choose bits/sec if you prefer Mbps/Gbps). Use Transform: Organize fields to rename "bytes_sent" to "TX" and "bytes_recv" to "RX" for clarity.

💡 Advanced: Show TX as Negative for Traffic Graph

// Apply this transform after the query for a symmetric traffic view:
  |> map(fn: (r) => ({ r with _value: if r._field == "bytes_sent" then -r._value else r._value }))

This creates the traditional up/down network traffic visualization.

Step 9: Query Example 6: System Load Average

Monitor system load averages to identify performance bottlenecks.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "system")
  |> filter(fn: (r) => r["_field"] == "load1" or r["_field"] == "load5" or r["_field"] == "load15")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "load_average")

Understanding Load Average:

  • load1: 1-minute load average (immediate activity)
  • load5: 5-minute load average (recent trend)
  • load15: 15-minute load average (long-term trend)

Interpretation: Load average represents the number of processes waiting for CPU time. Compare to CPU core count: load of 4.0 on a 4-core system = 100% utilization. Load of 8.0 on 4-core system = significant queuing and potential performance issues.
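
Because a "healthy" load depends on core count, it can also help to chart load per core by dividing load1 by the n_cpus field from the same system measurement. This is a sketch that assumes both fields share the same aggregation windows:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "system")
  |> filter(fn: (r) => r["_field"] == "load1" or r["_field"] == "n_cpus")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> group(columns: ["host"])                                              // Put both fields into one table per host
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> map(fn: (r) => ({ r with _value: r.load1 / r.n_cpus }))               // Load normalized by CPU count
  |> yield(name: "load_per_core")

A value that stays near or above 1.0 per core indicates sustained saturation regardless of how many cores the host has.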

Step 10: Query Example 7: Disk I/O Operations

Track read and write operations per second to identify I/O bottlenecks.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "diskio")
  |> filter(fn: (r) => r["_field"] == "reads" or r["_field"] == "writes")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  // Exclude loopback and ram devices
  |> filter(fn: (r) => r["name"] !~ /loop|ram/)
  |> derivative(unit: 1s, nonNegative: true)                               // Convert to IOPS
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "disk_iops")

Related Disk I/O Metrics:

| Field       | Description                 | Unit                                  |
|-------------|-----------------------------|---------------------------------------|
| reads       | Cumulative read operations  | Counter (use derivative for IOPS)     |
| writes      | Cumulative write operations | Counter (use derivative for IOPS)     |
| read_bytes  | Cumulative bytes read       | Bytes (use derivative for throughput) |
| write_bytes | Cumulative bytes written    | Bytes (use derivative for throughput) |
| io_time     | Time spent doing I/Os       | Milliseconds                          |
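
For throughput rather than IOPS, the same pattern applies to the byte counters listed above; for example:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "diskio")
  |> filter(fn: (r) => r["_field"] == "read_bytes" or r["_field"] == "write_bytes")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> filter(fn: (r) => r["name"] !~ /loop|ram/)
  |> derivative(unit: 1s, nonNegative: true)                               // Bytes per second
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "disk_throughput")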

Step 11: Query Example 8: Process Count by State

Monitor process states to identify zombie processes or system resource exhaustion.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "processes")
  |> filter(fn: (r) => r["_field"] =~ /running|sleeping|stopped|zombies|blocked/)
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "process_states")

Process States Explained:

  • running: Actively executing or waiting for CPU time
  • sleeping: Waiting for an event (normal state for most processes)
  • stopped: Halted by signal (debugging or job control)
  • zombies: Completed but parent hasn't acknowledged (should be 0)
  • blocked: Waiting for I/O to complete (high values indicate I/O bottleneck)

āš ļø Zombie Process Alert

If zombie count is consistently above 0, it indicates processes not properly cleaned up by their parent. While not immediately harmful, large numbers of zombies can exhaust the process table. Identify the parent process and restart it if necessary.
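
A minimal query you could attach to an alert rule for this case (the 5-minute window is illustrative):

from(bucket: "telegraf")
  |> range(start: -5m)
  |> filter(fn: (r) => r["_measurement"] == "processes")
  |> filter(fn: (r) => r["_field"] == "zombies")
  |> last()   // Latest zombie count per host; alert when it stays above 0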

🔧 InfluxDB 1.x Configuration (Legacy)

While InfluxDB 1.x is in maintenance mode, many production environments still run this version. The configuration process differs significantly from 2.x, using InfluxQL (SQL-like) queries instead of Flux.

Step 12: Create Read-Only User in InfluxDB 1.x

Connect to your InfluxDB 1.x instance and create a dedicated user for Grafana with read-only permissions.

# Connect to InfluxDB CLI
influx -username admin -password admin_password

# Create read-only user
CREATE USER "grafana_reader" WITH PASSWORD 'StrongPassword123!'

# Grant read permission on telegraf database
GRANT READ ON "telegraf" TO "grafana_reader"

# Verify user creation
SHOW USERS
# name            admin
# ----            -----
# admin           true
# grafana_reader  false

# Exit
exit

💡 InfluxDB 1.x Authentication

If authentication is not enabled in your InfluxDB 1.x instance, you can skip user creation and leave the username/password fields empty in Grafana. However, this is not recommended for production.

To enable authentication, edit /etc/influxdb/influxdb.conf:

[http]
auth-enabled = true

Then restart InfluxDB: systemctl restart influxdb

Step 13: Add InfluxDB 1.x Data Source in Grafana

Configure Grafana to connect using InfluxQL instead of Flux.

Navigation Path:

Connections → Data Sources → Add data source → InfluxDB

Configuration Settings:

# Basic Settings
Name: InfluxDB-Telegraf-1x
Default: [✓]  Toggle ON if primary metrics source

# Query Language
Query Language: InfluxQL   ← IMPORTANT: Select InfluxQL, not Flux

# HTTP Settings
URL: http://influxdb-server:8086
Access: Server (default)

# InfluxDB Details
Database: telegraf
  Note: In 1.x this is called "database", not "bucket"
User: grafana_reader
Password: StrongPassword123!

# HTTP Method: GET (default)
  POST is also supported but GET is more compatible

# Min time interval: Leave empty (use default)
  Or set to "10s" to prevent over-querying

# Advanced Settings
Max series: Leave empty unless you have very high cardinality
  Set to 1000 if experiencing performance issues

Testing the Connection:

Click Save & Test. You should see:

✓ Data source is working

• Database: telegraf

• InfluxDB version: 1.x.x

Step 14: Understanding InfluxQL Query Structure

InfluxQL is similar to SQL, making it familiar for those with relational database experience. It uses SELECT, FROM, WHERE, and GROUP BY clauses.

Basic InfluxQL Query Pattern:

SELECT mean("usage_idle")            -- 1. Select field and aggregation function
FROM "cpu"                           -- 2. From measurement
WHERE ("host" =~ /^$hostname$/)      -- 3. Filter by tags
  AND $timeFilter                    -- 4. Grafana time range filter
GROUP BY time($__interval), "cpu"    -- 5. Time aggregation and grouping

Key InfluxQL Concepts:

  • Aggregation Functions: mean(), sum(), min(), max(), count(), median(), stddev()
  • $timeFilter: Grafana variable for dashboard time range (required)
  • $__interval: Automatically calculated aggregation interval
  • =~ / !~ : Regex match and negated-match operators
  • ^$variable$: Grafana variable syntax in WHERE clause

💡 Grafana Variables in InfluxQL

$timeFilter - Required in WHERE clause for time filtering

$__interval - Auto-calculated GROUP BY time interval

$hostname - Dashboard variable (create in Variables settings; an InfluxQL query for it is shown after this list)

$__interval_ms - Interval in milliseconds
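
The $hostname variable used in the examples below is created the same way as in the Flux section (Dashboard Settings → Variables → Add variable, Type: Query), but with an InfluxQL query such as:

SHOW TAG VALUES FROM "cpu" WITH KEY = "host"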

Step 15: InfluxQL Query Example 1: CPU Usage

Calculate CPU usage percentage from idle time, similar to the Flux example.

SELECT 100 - mean("usage_idle") AS "CPU Usage %"
FROM "cpu"
WHERE ("host" =~ /^$hostname$/)
  AND ("cpu" = 'cpu-total')
  AND $timeFilter
GROUP BY time($__interval) FILL(null)

Query Components:

  • 100 - mean("usage_idle"): Converts idle to usage percentage
  • AS "CPU Usage %": Sets display name for the series
  • cpu-total: System-wide aggregate (use 'cpu0', 'cpu1', etc. for per-core)
  • FILL(null): Handles gaps in data (alternatives: FILL(0), FILL(previous))

Alternative: All CPU Cores

SELECT 100 - mean("usage_idle") AS "CPU Usage %"
FROM "cpu"
WHERE ("host" =~ /^$hostname$/)
  AND ("cpu" != 'cpu-total')               -- Exclude total, show individual cores
  AND $timeFilter
GROUP BY time($__interval), "cpu"          -- Group by time AND cpu tag
FILL(null)

Step 16: InfluxQL Query Example 2: Memory Usage

Display memory utilization with both percentage and absolute values.

SELECT mean("used_percent") AS "Memory Usage %" FROM "mem" WHERE ("host" =~ /^$hostname$/) AND $timeFilter GROUP BY time($__interval) FILL(null)

Alternative: Memory in Bytes with Labels

SELECT mean("used") AS "Used Memory", mean("available") AS "Available Memory", mean("total") AS "Total Memory" FROM "mem" WHERE ("host" =~ /^$hostname$/) AND $timeFilter GROUP BY time($__interval)

Configure the panel with Unit: bytes (IEC) for automatic KiB/MiB/GiB formatting.

Step 17: InfluxQL Query Example 3: Disk Usage by Partition

Monitor disk space across all mounted filesystems with mount point labels.

SELECT mean("used_percent") AS "Disk Usage %" FROM "disk" WHERE ("host" =~ /^$hostname$/) AND ("fstype" != 'tmpfs') -- Exclude temporary filesystems AND ("fstype" != 'devtmpfs') AND $timeFilter GROUP BY time($__interval), "path" -- Group by mount point FILL(null)

Free Space Query for Capacity Alerts:

SELECT mean("free") / 1024 / 1024 / 1024 AS "Free Space GB"
FROM "disk"
WHERE ("host" =~ /^$hostname$/)
  AND ("path" = '/')
  AND $timeFilter
GROUP BY time($__interval)

This query shows free space in GB for the root partition, useful for capacity alerts.

Step 18: InfluxQL Query Example 4: Network Traffic Rate

Calculate network throughput using the non_negative_derivative() function.

SELECT non_negative_derivative(mean("bytes_recv"), 1s) AS "RX",
       non_negative_derivative(mean("bytes_sent"), 1s) AS "TX"
FROM "net"
WHERE ("host" =~ /^$hostname$/)
  AND ("interface" != 'lo')                -- Exclude loopback
  AND $timeFilter
GROUP BY time($__interval), "interface"

Understanding DERIVATIVE():

  • non_negative_derivative(): Calculates rate of change, ignoring decreases (counter resets)
  • 1s parameter: Normalizes rate to per-second (use 1m for per-minute)
  • Combined with mean(): First aggregates, then calculates rate

💡 Panel Configuration for Traffic Graph

Unit: Data rate → bytes/sec (SI) or bytes/sec (IEC)

Transform: Add "Organize fields" to rename series

Override: For TX series, use negative Y-axis to create symmetrical graph

Step 19: InfluxQL Advanced: Using Subqueries

InfluxQL supports nested queries for complex calculations and filtering.

Example: Highest CPU Usage in Last Hour

SELECT max("cpu_usage")
FROM (
  SELECT 100 - mean("usage_idle") AS "cpu_usage"
  FROM "cpu"
  WHERE ("host" =~ /^$hostname$/)
    AND ("cpu" = 'cpu-total')
    AND $timeFilter
  GROUP BY time($__interval)
)

Example: Moving Average

SELECT moving_average(mean("used_percent"), 5) AS "5-point Moving Avg"
FROM "mem"
WHERE ("host" =~ /^$hostname$/)
  AND $timeFilter
GROUP BY time($__interval)

The moving average smooths out short-term fluctuations and highlights longer-term trends.

📊 Telegraf Measurements Reference

Telegraf organizes metrics into measurements, which are similar to database tables. Each measurement contains fields (actual values) and tags (metadata for grouping).

| Measurement | Key Fields | Common Tags | Use Case |
|-------------|------------|-------------|----------|
| cpu | usage_idle, usage_system, usage_user, usage_iowait | host, cpu (cpu0, cpu-total) | CPU performance monitoring, capacity planning |
| mem | used_percent, available_percent, used, available, total | host | Memory utilization tracking, OOM prevention |
| disk | used_percent, free, total, used | host, path, device, fstype | Disk capacity monitoring, storage planning |
| diskio | reads, writes, read_bytes, write_bytes, io_time | host, name (device name) | I/O performance analysis, bottleneck detection |
| net | bytes_sent, bytes_recv, packets_sent, packets_recv, err_in, err_out | host, interface | Network traffic monitoring, bandwidth analysis |
| netstat | tcp_established, tcp_listen, tcp_close_wait | host | Connection state tracking, troubleshooting |
| system | load1, load5, load15, n_cpus, uptime | host | System load monitoring, uptime tracking |
| processes | running, sleeping, stopped, zombies, blocked, total | host | Process state monitoring, zombie detection |
| swap | used_percent, used, free, total, in, out | host | Swap usage monitoring, memory pressure detection |
| kernel | context_switches, interrupts, processes_forked | host | Low-level system performance analysis |
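
If you are unsure which measurements, fields, or tag values your Telegraf agents actually report, the Flux schema package can list them. Run each query separately against the telegraf bucket used throughout this module:

import "influxdata/influxdb/schema"

// All measurements in the bucket
schema.measurements(bucket: "telegraf")

// Fields available for one measurement
schema.measurementFieldKeys(bucket: "telegraf", measurement: "cpu")

// Values of a tag key (for example, all reporting hosts)
schema.tagValues(bucket: "telegraf", tag: "host")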

⚡ Query Performance Optimization

Optimizing your InfluxDB queries is crucial for dashboard responsiveness, especially with high-cardinality data or long time ranges.

Step 20: Best Practices for Fast Queries

1. Use Appropriate Time Ranges

Avoid querying more data than necessary. For real-time monitoring, use shorter time ranges (last 1-6 hours). For trend analysis, use downsampling or continuous queries.

2. Filter Early and Often

// GOOD: Filters applied early reduce data volume
from(bucket: "telegraf")
  |> range(start: -1h)                              // Limit time range
  |> filter(fn: (r) => r["host"] == "web1")         // Filter by tag early
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> aggregateWindow(every: 30s, fn: mean)

// BAD: Processing all data before filtering
from(bucket: "telegraf")
  |> range(start: -1h)
  |> aggregateWindow(every: 30s, fn: mean)
  |> filter(fn: (r) => r["host"] == "web1")         // Filter after aggregation

3. Leverage Aggregation Windows

Use aggregateWindow() in Flux or GROUP BY time($__interval) in InfluxQL to reduce data points returned to Grafana. For a 24-hour view, 1-minute aggregation is usually sufficient.

4. Avoid High-Cardinality Grouping

Grouping by tags with many unique values (like process ID or transaction ID) can create thousands of series. Limit grouping to essential tags like hostname, interface, or mount point.
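
To get a rough sense of how many series a bucket currently holds before deciding what to group by, Flux provides a cardinality helper. A sketch (the one-day window is arbitrary):

import "influxdata/influxdb"

// Approximate series cardinality of the bucket over the last day
influxdb.cardinality(bucket: "telegraf", start: -1d)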

5. Use Retention Policies (InfluxDB 1.x) or Buckets (InfluxDB 2.x)

Configure automatic data downsampling for older data to maintain query performance:

-- InfluxDB 1.x: Create downsampled retention policies
CREATE RETENTION POLICY "one_month" ON "telegraf" DURATION 30d REPLICATION 1
CREATE RETENTION POLICY "one_year_downsampled" ON "telegraf" DURATION 365d REPLICATION 1

-- Create continuous query for downsampling
CREATE CONTINUOUS QUERY "cq_cpu_1h" ON "telegraf"
BEGIN
  SELECT mean("usage_idle") AS "usage_idle"
  INTO "one_year_downsampled"."cpu"
  FROM "cpu"
  GROUP BY time(1h), *
END

💡 InfluxDB 2.x Tasks for Downsampling

In InfluxDB 2.x, use Tasks instead of Continuous Queries:

option task = {name: "downsample-cpu", every: 1h}

from(bucket: "telegraf")
  |> range(start: -2h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> aggregateWindow(every: 1h, fn: mean)
  |> to(bucket: "telegraf-downsampled")

6. Monitor Query Execution Time

Enable query logging in InfluxDB to identify slow queries. In InfluxDB 2.x, check the Task runs page. In InfluxDB 1.x, enable query logs in the configuration file.

šŸ” Security Best Practices

Securing your InfluxDB connection and data is critical, especially in production environments.

āš ļø Critical Security Reminders

  • Never use admin credentials in Grafana - Always create read-only users/tokens
  • Enable TLS/SSL for production - Encrypt data in transit
  • Rotate API tokens regularly - Especially if they may have been exposed (see the CLI sketch after this list)
  • Use network segmentation - InfluxDB should not be directly exposed to the internet
  • Enable authentication - Never run InfluxDB without auth in production
  • Monitor access logs - Track who queries what data and when
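
Rotating a token with the influx CLI amounts to creating a replacement and deleting the old one once Grafana has been updated. A sketch (organization name and token ID are placeholders):

# List existing tokens and note the ID of the one to replace
influx auth list

# Create the replacement read-only token (same permissions as in Step 1)
influx auth create --org your-org-name --read-bucket telegraf --description "Grafana Read Token (rotated)"

# After updating the Grafana data source, delete the old token by ID
influx auth delete --id 0abc123def456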

Enabling TLS for InfluxDB 2.x

# InfluxDB 2.x: edit /etc/influxdb/config.yml
# (in 1.x the equivalents are https-enabled, https-certificate and https-private-key in influxdb.conf)
http-bind-address: ":8086"
tls-cert: "/etc/ssl/influxdb-cert.pem"
tls-key: "/etc/ssl/influxdb-key.pem"

# Restart InfluxDB
systemctl restart influxdb

Then update your Grafana data source URL to https://influxdb-server:8086 and configure TLS certificate validation.

šŸ› ļø Troubleshooting Common Issues

Issue: "No data" in Grafana Panels

Diagnosis Steps:

  1. Verify data exists in InfluxDB
    // InfluxDB 2.x (Flux)
    from(bucket: "telegraf")
      |> range(start: -1h)
      |> filter(fn: (r) => r["_measurement"] == "cpu")
      |> limit(n: 5)

    -- InfluxDB 1.x (InfluxQL)
    SELECT * FROM "cpu" LIMIT 5
  2. Check time range - Ensure dashboard time picker matches data availability
  3. Verify filters - Check that host/tag filters match actual tag values (case-sensitive)
  4. Review query syntax - Look for syntax errors in Query Inspector (panel menu → Inspect → Query)

Issue: Queries Running Slowly

Solutions:

  • Reduce time range or increase aggregation interval
  • Add more specific tag filters to reduce cardinality
  • Create continuous queries or tasks for pre-aggregated data
  • Check InfluxDB server resources (CPU, memory, disk I/O)
  • Consider upgrading InfluxDB or scaling horizontally (Enterprise)

Issue: "Unsupported protocol scheme" Error

Cause:

This error occurs when the URL is malformed or uses an incorrect protocol.

Solution:

  • Ensure URL starts with http:// or https://
  • Do not include trailing slashes: http://server:8086 (not http://server:8086/)
  • Verify port number (default is 8086)
  • Check that InfluxDB is actually running: systemctl status influxdb

📈 Next Steps

You've now configured InfluxDB as a data source and learned how to query system metrics collected by Telegraf. Here are some recommended next steps:

  • Create a comprehensive system dashboard - Combine CPU, memory, disk, and network panels
  • Set up alerts - Configure Grafana alerts for critical thresholds (CPU > 90%, disk > 85%)
  • Explore Telegraf plugins - Add custom input plugins for application-specific metrics
  • Implement data retention - Configure retention policies or tasks to manage storage
  • Build team-specific views - Create dashboards tailored for dev, ops, and management

💡 Pro Tips for Production

  • Use dashboard variables for multi-server views (create $hostname dropdown)
  • Set default time range to "Last 6 hours" for operational dashboards
  • Enable auto-refresh (30s or 1m) for NOC dashboards
  • Use alert annotations to mark incidents on graphs
  • Create separate dashboards for real-time monitoring vs. capacity planning

✅ Module 4 Complete!

You've successfully configured InfluxDB/Telegraf integration and created multiple query examples. You're now ready to visualize system metrics and move on to log aggregation with VictoriaLogs.