Module 4 of 8

📡 Telegraf/InfluxDB Integration

System Metrics Collection and Visualization

📚 Module Overview

InfluxDB is a time-series database purpose-built to handle the high write and query loads of time-stamped data. Paired with Telegraf (an agent that collects and reports metrics), it becomes a powerful system-monitoring solution. This module guides you through configuring InfluxDB as a Grafana data source and creating queries to visualize system metrics collected by Telegraf agents deployed across your infrastructure.

What you'll learn:

  • Differences between InfluxDB 1.x and 2.x architectures
  • Creating API tokens and authentication for both versions
  • Writing Flux queries for InfluxDB 2.x
  • Writing InfluxQL queries for InfluxDB 1.x
  • Understanding measurements, fields, and tags
  • Time aggregation and automatic interval adjustment
  • Creating reusable dashboard variables
  • Performance optimization and retention policies

Estimated completion time: 25-30 minutes

šŸ” InfluxDB Version Comparison

Before configuring your data source, it's essential to understand which version of InfluxDB you're working with. The two versions have significantly different query languages, authentication methods, and organizational structures.

InfluxDB 1.x (InfluxQL)

  • Query Language: InfluxQL (SQL-like)
  • Authentication: Username/Password
  • Organization: Database → Retention Policy
  • Port: 8086 (HTTP)
  • Best For: Legacy systems, simpler queries
  • Support Status: Maintenance mode

InfluxDB 2.x (Flux)

  • Query Language: Flux (functional)
  • Authentication: API Tokens
  • Organization: Organization → Bucket
  • Port: 8086 (HTTP)
  • Best For: Complex transformations, new deployments
  • Support Status: Active development

💡 How to Check Your Version

Run the influx CLI version command on your InfluxDB server:

influx -version   # 1.x CLI prints "InfluxDB shell version: 1.x.x"
influx version    # 2.x CLI prints "Influx CLI 2.x.x"

Or check the web UI: InfluxDB 2.x has a modern web interface at http://server:8086, while 1.x requires Chronograf for UI.
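
If the CLI is not available, the HTTP API offers another quick check: the /ping endpoint returns the server version in a response header on both 1.x and 2.x (the host and port below are placeholders).

# Returns 204 No Content plus an X-Influxdb-Version header
curl -sI http://your-influxdb-server:8086/ping | grep -i x-influxdb-version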

🔧 InfluxDB 2.x Configuration (Recommended)

InfluxDB 2.x represents the modern architecture with improved performance, better query capabilities, and a unified authentication system. We'll start with this version as it's the current standard for new deployments.

Step 1: Create an API Token in InfluxDB 2.x

Before configuring Grafana, you need to create an API token in InfluxDB. This token will authenticate Grafana's requests and define what data it can access.

Using the Web UI (Recommended):

1. Navigate to your InfluxDB 2.x web interface: http://your-influxdb-server:8086

2. Log in with your InfluxDB credentials

3. Click on Data (left sidebar) → API Tokens

4. Click + Generate API Token → Read/Write Token

5. Configure the token permissions:

  • Description: "Grafana Read Token"
  • Read Buckets: Select "telegraf" (or your metrics bucket)
  • Write Buckets: None (Grafana only needs read access)

6. Click Generate and immediately copy the token - you won't be able to see it again!

Using the CLI:

# Create a read-only token for Grafana
influx auth create \
  --org your-org-name \
  --read-bucket telegraf \
  --description "Grafana Read Token"

# Output will show your token - save it immediately:
# ID             Description         Token                 User Name  User ID           Permissions
# 0abc123def456  Grafana Read Token  rW1pT3k...FaS8xYnM=   admin      0123456789abcdef  [read:orgs/...]

āš ļø Token Security

Store your API token securely! If you lose it, you'll need to generate a new one. Never commit tokens to version control or share them in plain text. Consider using environment variables or secrets management systems in production.

Step 2: Add InfluxDB 2.x Data Source in Grafana

Now that you have your API token, let's configure Grafana to connect to InfluxDB 2.x.

Navigation Path:

Connections → Data Sources → Add data source → InfluxDB

Configuration Settings:

# Basic Settings
Name: InfluxDB-Telegraf
Default: [✓]  Toggle ON if this is your primary metrics source

# Query Language
Query Language: Flux

# HTTP Settings
URL: http://influxdb-server:8086
  Note: Use http:// for internal networks, https:// for production.
  Use localhost or 127.0.0.1 if InfluxDB is on the same server as Grafana.
Access: Server (default)
  - Server = Grafana server connects to InfluxDB
  - Browser = User's browser connects directly (rarely used)

# InfluxDB Details
Organization: your-org-name
  Note: Find this in InfluxDB UI → Settings → Organization Settings
Token: your-api-token-here
  Paste the token you created in Step 1
Default Bucket: telegraf
  This is where Telegraf writes system metrics by default

# Advanced HTTP Settings (Optional)
Timeout: 60s (increase if you have slow queries)
Custom HTTP Headers: Leave empty unless required

# TLS Settings (for HTTPS connections)
Skip TLS Verify: OFF (keep security on in production)
With CA Cert: Enable if using self-signed certificates

Testing the Connection:

After entering all settings, scroll down and click Save & Test. You should see:

✓ Connection Successful

Data source is working

• Found 1 bucket: telegraf

• InfluxDB version: 2.x.x

āš ļø Common Connection Errors

"Unauthorized" - Check your API token has read permissions for the bucket

"Not Found" - Verify organization name and bucket name are correct (case-sensitive)

"Connection Refused" - Check URL, port (8086), and firewall rules

"Timeout" - Increase timeout in Advanced settings or check network connectivity

Step 3: Understanding Flux Query Structure

Flux is a functional data scripting language designed for querying, analyzing, and acting on time-series data. Unlike SQL, Flux uses a pipeline model where data flows through a series of functions.

Basic Flux Query Pattern:

from(bucket: "telegraf") // 1. Select data source |> range(start: v.timeRangeStart, stop: v.timeRangeStop) // 2. Time range |> filter(fn: (r) => r["_measurement"] == "cpu") // 3. Filter measurement |> filter(fn: (r) => r["_field"] == "usage_idle") // 4. Filter field |> filter(fn: (r) => r["cpu"] == "cpu-total") // 5. Filter tag |> aggregateWindow(every: v.windowPeriod, fn: mean) // 6. Aggregate data

Key Flux Concepts:

  • |> - Pipe operator: passes output of one function to input of next
  • from() - Specifies which bucket to query
  • range() - Defines the time window for data retrieval
  • filter() - Reduces data based on conditions
  • _measurement - The metric type (cpu, mem, disk, etc.)
  • _field - The actual metric value being measured
  • Tags - Metadata for grouping (hostname, cpu, interface, etc.)

💡 Grafana Variables in Flux

v.timeRangeStart - Dashboard time picker start (automatic)

v.timeRangeStop - Dashboard time picker end (automatic)

v.windowPeriod - Calculated aggregation interval based on time range

${variable} - Custom dashboard variable (e.g., ${hostname})

Step 4: Query Example 1: CPU Usage by Core

This query retrieves CPU usage for all cores and displays idle time inverted to show utilization percentage.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_idle")
  |> filter(fn: (r) => r["cpu"] != "cpu-total")                            // Exclude total, show per-core
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({ r with _value: 100.0 - r._value }))                 // Convert idle to usage
  |> yield(name: "cpu_usage")

Query Explanation:

  • Line 3: Filters for CPU measurement type
  • Line 4: Selects the idle percentage field (easier to invert than sum of all usage types)
  • Line 5: Excludes the cpu-total aggregate, showing individual cores (cpu0, cpu1, etc.)
  • Line 6: Aggregates data points into time windows using mean average
  • Line 7: Inverts idle to usage (100 - idle = used)
  • Line 8: Names the output stream for reference

Expected Output:

Time series showing CPU usage percentage for each core, typically resulting in multiple lines on a graph (one per CPU core).

Step 5: Query Example 2: Total CPU Usage with Alert Threshold

This query shows overall system CPU usage and can be used with Grafana alerts.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_idle")
  |> filter(fn: (r) => r["cpu"] == "cpu-total")                            // Only total CPU
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)                         // Dashboard variable
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({ r with _value: 100.0 - r._value }))
  |> yield(name: "total_cpu")

Query Features:

  • Uses cpu-total for system-wide CPU percentage
  • Includes hostname variable for multi-server dashboards
  • =~ operator allows regex matching for flexible filtering
  • Single line output perfect for gauge panels or alert rules
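
For a stat or gauge panel that only needs the current value rather than a full series, a variant of the same query can reduce the result to the latest point. This is a sketch using the same filters as above:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_idle")
  |> filter(fn: (r) => r["cpu"] == "cpu-total")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> map(fn: (r) => ({ r with _value: 100.0 - r._value }))   // Convert idle to usage
  |> last()                                                  // Keep only the most recent point per host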

💡 Creating the $hostname Variable

Dashboard Settings → Variables → Add variable:

Name: hostname

Type: Query

Query:

import "influxdata/influxdb/schema"
schema.tagValues(bucket: "telegraf", tag: "host")

Step 6: Query Example 3: Memory Usage Percentage

Displays memory utilization as a percentage of total available memory.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "mem")
  |> filter(fn: (r) => r["_field"] == "used_percent")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "memory_usage")

Alternative: Memory Usage in Bytes

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "mem")
  |> filter(fn: (r) => r["_field"] == "used" or r["_field"] == "available")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "memory_bytes")

Panel Configuration Tip:

For the bytes query, configure the Y-axis with Unit: Data → bytes (IEC) in the panel settings. This will automatically display values as KiB, MiB, or GiB for better readability.

Step 7: Query Example 4: Disk Usage by Mount Point

Monitor disk space across multiple filesystems with automatic mount point detection.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "disk")
  |> filter(fn: (r) => r["_field"] == "used_percent")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  // Exclude virtual filesystems
  |> filter(fn: (r) => r["fstype"] != "tmpfs" and r["fstype"] != "devtmpfs")
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> group(columns: ["path"])                                              // Group by mount point
  |> yield(name: "disk_usage")

Understanding Disk Metrics:

| Field        | Description                    | Use Case                           |
|--------------|--------------------------------|------------------------------------|
| used_percent | Percentage of disk space used  | Quick capacity check, alerting     |
| used         | Bytes used on filesystem       | Capacity planning, trend analysis  |
| free         | Bytes available on filesystem  | Free space monitoring              |
| total        | Total filesystem size in bytes | Inventory, capacity documentation  |
| inodes_used  | Number of inodes consumed      | File count monitoring (Linux/Unix) |

Common Tag Filters:

  • path: Mount point (/, /home, /var, etc.)
  • device: Physical device (/dev/sda1, etc.)
  • fstype: Filesystem type (ext4, xfs, ntfs, etc.)
  • mode: Mount mode (rw, ro)

Step 8: Query Example 5: Network Interface Traffic

Track inbound and outbound network traffic rates across all interfaces.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "net")
  |> filter(fn: (r) => r["_field"] == "bytes_sent" or r["_field"] == "bytes_recv")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  // Exclude loopback interface
  |> filter(fn: (r) => r["interface"] != "lo")
  |> derivative(unit: 1s, nonNegative: true)                               // Convert cumulative to rate per second
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "network_traffic")

Query Breakdown:

  • bytes_sent: Cumulative outbound bytes (TX)
  • bytes_recv: Cumulative inbound bytes (RX)
  • derivative(): Calculates rate of change (bytes per second)
  • nonNegative: true: Handles counter resets (e.g., interface restart)
  • unit: 1s: Normalizes to bytes per second

Panel Configuration:

Set the unit to Data rate → bytes/sec (IEC) to display values as KiB/s, MiB/s, and so on (choose bits/sec if you prefer Mbps/Gbps). Use Transform: Organize fields to rename "bytes_sent" to "TX" and "bytes_recv" to "RX" for clarity.

💡 Advanced: Show TX as Negative for Traffic Graph

// Apply this transform after the query for a symmetric traffic view:
  |> map(fn: (r) => ({ r with _value: if r._field == "bytes_sent" then -r._value else r._value }))

This creates the traditional up/down network traffic visualization.

Step 9: Query Example 6: System Load Average

Monitor system load averages to identify performance bottlenecks.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "system")
  |> filter(fn: (r) => r["_field"] == "load1" or r["_field"] == "load5" or r["_field"] == "load15")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "load_average")

Understanding Load Average:

  • load1: 1-minute load average (immediate activity)
  • load5: 5-minute load average (recent trend)
  • load15: 15-minute load average (long-term trend)

Interpretation: Load average represents the number of processes waiting for CPU time. Compare to CPU core count: load of 4.0 on a 4-core system = 100% utilization. Load of 8.0 on 4-core system = significant queuing and potential performance issues.
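
Because a "healthy" load depends on core count, it can also help to chart load per core by dividing load1 by the n_cpus field from the same system measurement. This is a sketch that assumes both fields share the same aggregation windows:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "system")
  |> filter(fn: (r) => r["_field"] == "load1" or r["_field"] == "n_cpus")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> group(columns: ["host"])                                              // Put both fields into one table per host
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> map(fn: (r) => ({ r with _value: r.load1 / r.n_cpus }))               // Load normalized by CPU count
  |> yield(name: "load_per_core")

A value that stays near or above 1.0 per core indicates sustained saturation regardless of how many cores the host has.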

Step 10: Query Example 7: Disk I/O Operations

Track read and write operations per second to identify I/O bottlenecks.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "diskio")
  |> filter(fn: (r) => r["_field"] == "reads" or r["_field"] == "writes")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  // Exclude loopback and ram devices
  |> filter(fn: (r) => r["name"] !~ /loop|ram/)
  |> derivative(unit: 1s, nonNegative: true)                               // Convert to IOPS
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "disk_iops")

Related Disk I/O Metrics:

| Field       | Description                 | Unit                                  |
|-------------|-----------------------------|---------------------------------------|
| reads       | Cumulative read operations  | Counter (use derivative for IOPS)     |
| writes      | Cumulative write operations | Counter (use derivative for IOPS)     |
| read_bytes  | Cumulative bytes read       | Bytes (use derivative for throughput) |
| write_bytes | Cumulative bytes written    | Bytes (use derivative for throughput) |
| io_time     | Time spent doing I/Os       | Milliseconds                          |
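
For throughput rather than IOPS, the same pattern applies to the byte counters listed above; for example:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "diskio")
  |> filter(fn: (r) => r["_field"] == "read_bytes" or r["_field"] == "write_bytes")
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> filter(fn: (r) => r["name"] !~ /loop|ram/)
  |> derivative(unit: 1s, nonNegative: true)                               // Bytes per second
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "disk_throughput")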

Step 11: Query Example 8: Process Count by State

Monitor process states to identify zombie processes or system resource exhaustion.

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "processes")
  |> filter(fn: (r) => r["_field"] =~ /running|sleeping|stopped|zombies|blocked/)
  |> filter(fn: (r) => r["host"] =~ /${hostname}/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "process_states")

Process States Explained:

  • running: Actively executing or waiting for CPU time
  • sleeping: Waiting for an event (normal state for most processes)
  • stopped: Halted by signal (debugging or job control)
  • zombies: Completed but parent hasn't acknowledged (should be 0)
  • blocked: Waiting for I/O to complete (high values indicate I/O bottleneck)

āš ļø Zombie Process Alert

If zombie count is consistently above 0, it indicates processes not properly cleaned up by their parent. While not immediately harmful, large numbers of zombies can exhaust the process table. Identify the parent process and restart it if necessary.
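
A minimal query you could attach to an alert rule for this case (the 5-minute window is illustrative):

from(bucket: "telegraf")
  |> range(start: -5m)
  |> filter(fn: (r) => r["_measurement"] == "processes")
  |> filter(fn: (r) => r["_field"] == "zombies")
  |> last()   // Latest zombie count per host; alert when it stays above 0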

🔧 InfluxDB 1.x Configuration (Legacy)

While InfluxDB 1.x is in maintenance mode, many production environments still run this version. The configuration process differs significantly from 2.x, using InfluxQL (SQL-like) queries instead of Flux.

Step 12: Create Read-Only User in InfluxDB 1.x

Connect to your InfluxDB 1.x instance and create a dedicated user for Grafana with read-only permissions.

# Connect to InfluxDB CLI
influx -username admin -password admin_password

# Create read-only user
CREATE USER "grafana_reader" WITH PASSWORD 'StrongPassword123!'

# Grant read permission on telegraf database
GRANT READ ON "telegraf" TO "grafana_reader"

# Verify user creation
SHOW USERS
# name            admin
# ----            -----
# admin           true
# grafana_reader  false

# Exit
exit

💡 InfluxDB 1.x Authentication

If authentication is not enabled in your InfluxDB 1.x instance, you can skip user creation and leave the username/password fields empty in Grafana. However, this is not recommended for production.

To enable authentication, edit /etc/influxdb/influxdb.conf:

[http]
auth-enabled = true

Then restart InfluxDB: systemctl restart influxdb

Step 13: Add InfluxDB 1.x Data Source in Grafana

Configure Grafana to connect using InfluxQL instead of Flux.

Navigation Path:

Connections → Data Sources → Add data source → InfluxDB

Configuration Settings:

# Basic Settings
Name: InfluxDB-Telegraf-1x
Default: [✓]  Toggle ON if primary metrics source

# Query Language
Query Language: InfluxQL   ← IMPORTANT: Select InfluxQL, not Flux

# HTTP Settings
URL: http://influxdb-server:8086
Access: Server (default)

# InfluxDB Details
Database: telegraf
  Note: In 1.x this is called "database", not "bucket"
User: grafana_reader
Password: StrongPassword123!

# HTTP Method: GET (default)
  POST is also supported but GET is more compatible

# Min time interval: Leave empty (use default)
  Or set to "10s" to prevent over-querying

# Advanced Settings
Max series: Leave empty unless you have very high cardinality
  Set to 1000 if experiencing performance issues

Testing the Connection:

Click Save & Test. You should see:

✓ Data source is working

• Database: telegraf

• InfluxDB version: 1.x.x

Step 14: Understanding InfluxQL Query Structure

InfluxQL is similar to SQL, making it familiar for those with relational database experience. It uses SELECT, FROM, WHERE, and GROUP BY clauses.

Basic InfluxQL Query Pattern:

SELECT mean("usage_idle")            -- 1. Select field and aggregation function
FROM "cpu"                           -- 2. From measurement
WHERE ("host" =~ /^$hostname$/)      -- 3. Filter by tags
  AND $timeFilter                    -- 4. Grafana time range filter
GROUP BY time($__interval), "cpu"    -- 5. Time aggregation and grouping

Key InfluxQL Concepts:

  • Aggregation Functions: mean(), sum(), min(), max(), count(), median(), stddev()
  • $timeFilter: Grafana variable for dashboard time range (required)
  • $__interval: Automatically calculated aggregation interval
  • =~ / !~ : Regex match and negated-match operators
  • ^$variable$: Grafana variable syntax in WHERE clause

💡 Grafana Variables in InfluxQL

$timeFilter - Required in WHERE clause for time filtering

$__interval - Auto-calculated GROUP BY time interval

$hostname - Dashboard variable (create in Variables settings; an InfluxQL query for it is shown after this list)

$__interval_ms - Interval in milliseconds
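
The $hostname variable used in the examples below is created the same way as in the Flux section (Dashboard Settings → Variables → Add variable, Type: Query), but with an InfluxQL query such as:

SHOW TAG VALUES FROM "cpu" WITH KEY = "host"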

Step 15: InfluxQL Query Example 1: CPU Usage

Calculate CPU usage percentage from idle time, similar to the Flux example.

SELECT 100 - mean("usage_idle") AS "CPU Usage %"
FROM "cpu"
WHERE ("host" =~ /^$hostname$/)
  AND ("cpu" = 'cpu-total')
  AND $timeFilter
GROUP BY time($__interval) FILL(null)

Query Components:

  • 100 - mean("usage_idle"): Converts idle to usage percentage
  • AS "CPU Usage %": Sets display name for the series
  • cpu-total: System-wide aggregate (use 'cpu0', 'cpu1', etc. for per-core)
  • FILL(null): Handles gaps in data (alternatives: FILL(0), FILL(previous))

Alternative: All CPU Cores

SELECT 100 - mean("usage_idle") AS "CPU Usage %"
FROM "cpu"
WHERE ("host" =~ /^$hostname$/)
  AND ("cpu" != 'cpu-total')               -- Exclude total, show individual cores
  AND $timeFilter
GROUP BY time($__interval), "cpu"          -- Group by time AND cpu tag
FILL(null)

Step 16: InfluxQL Query Example 2: Memory Usage

Display memory utilization with both percentage and absolute values.

SELECT mean("used_percent") AS "Memory Usage %" FROM "mem" WHERE ("host" =~ /^$hostname$/) AND $timeFilter GROUP BY time($__interval) FILL(null)

Alternative: Memory in Bytes with Labels

SELECT mean("used") AS "Used Memory", mean("available") AS "Available Memory", mean("total") AS "Total Memory" FROM "mem" WHERE ("host" =~ /^$hostname$/) AND $timeFilter GROUP BY time($__interval)

Configure the panel with Unit: bytes (IEC) for automatic KiB/MiB/GiB formatting.

Step 17: InfluxQL Query Example 3: Disk Usage by Partition

Monitor disk space across all mounted filesystems with mount point labels.

SELECT mean("used_percent") AS "Disk Usage %" FROM "disk" WHERE ("host" =~ /^$hostname$/) AND ("fstype" != 'tmpfs') -- Exclude temporary filesystems AND ("fstype" != 'devtmpfs') AND $timeFilter GROUP BY time($__interval), "path" -- Group by mount point FILL(null)

Free Space Query for Capacity Alerts:

SELECT mean("free") / 1024 / 1024 / 1024 AS "Free Space GB"
FROM "disk"
WHERE ("host" =~ /^$hostname$/)
  AND ("path" = '/')
  AND $timeFilter
GROUP BY time($__interval)

This query shows free space in GB for the root partition, useful for capacity alerts.

Step 18: InfluxQL Query Example 4: Network Traffic Rate

Calculate network throughput using the non_negative_derivative() function.

SELECT non_negative_derivative(mean("bytes_recv"), 1s) AS "RX",
       non_negative_derivative(mean("bytes_sent"), 1s) AS "TX"
FROM "net"
WHERE ("host" =~ /^$hostname$/)
  AND ("interface" != 'lo')                -- Exclude loopback
  AND $timeFilter
GROUP BY time($__interval), "interface"

Understanding DERIVATIVE():

  • non_negative_derivative(): Calculates rate of change, ignoring decreases (counter resets)
  • 1s parameter: Normalizes rate to per-second (use 1m for per-minute)
  • Combined with mean(): First aggregates, then calculates rate

💡 Panel Configuration for Traffic Graph

Unit: Data rate → bytes/sec (SI) or bytes/sec (IEC)

Transform: Add "Organize fields" to rename series

Override: For TX series, use negative Y-axis to create symmetrical graph

Step 19: InfluxQL Advanced: Using Subqueries

InfluxQL supports nested queries for complex calculations and filtering.

Example: Highest CPU Usage in Last Hour

SELECT max("cpu_usage")
FROM (
  SELECT 100 - mean("usage_idle") AS "cpu_usage"
  FROM "cpu"
  WHERE ("host" =~ /^$hostname$/)
    AND ("cpu" = 'cpu-total')
    AND $timeFilter
  GROUP BY time($__interval)
)

Example: Moving Average

SELECT moving_average(mean("used_percent"), 5) AS "5-point Moving Avg"
FROM "mem"
WHERE ("host" =~ /^$hostname$/)
  AND $timeFilter
GROUP BY time($__interval)

The moving average smooths out short-term fluctuations and highlights longer-term trends.

📊 Telegraf Measurements Reference

Telegraf organizes metrics into measurements, which are similar to database tables. Each measurement contains fields (actual values) and tags (metadata for grouping).

| Measurement | Key Fields | Common Tags | Use Case |
|-------------|------------|-------------|----------|
| cpu | usage_idle, usage_system, usage_user, usage_iowait | host, cpu (cpu0, cpu-total) | CPU performance monitoring, capacity planning |
| mem | used_percent, available_percent, used, available, total | host | Memory utilization tracking, OOM prevention |
| disk | used_percent, free, total, used | host, path, device, fstype | Disk capacity monitoring, storage planning |
| diskio | reads, writes, read_bytes, write_bytes, io_time | host, name (device name) | I/O performance analysis, bottleneck detection |
| net | bytes_sent, bytes_recv, packets_sent, packets_recv, err_in, err_out | host, interface | Network traffic monitoring, bandwidth analysis |
| netstat | tcp_established, tcp_listen, tcp_close_wait | host | Connection state tracking, troubleshooting |
| system | load1, load5, load15, n_cpus, uptime | host | System load monitoring, uptime tracking |
| processes | running, sleeping, stopped, zombies, blocked, total | host | Process state monitoring, zombie detection |
| swap | used_percent, used, free, total, in, out | host | Swap usage monitoring, memory pressure detection |
| kernel | context_switches, interrupts, processes_forked | host | Low-level system performance analysis |
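
If you are unsure which measurements, fields, or tag values your Telegraf agents actually report, the Flux schema package can list them. Run each query separately against the telegraf bucket used throughout this module:

import "influxdata/influxdb/schema"

// All measurements in the bucket
schema.measurements(bucket: "telegraf")

// Fields available for one measurement
schema.measurementFieldKeys(bucket: "telegraf", measurement: "cpu")

// Values of a tag key (for example, all reporting hosts)
schema.tagValues(bucket: "telegraf", tag: "host")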

⚡ Query Performance Optimization

Optimizing your InfluxDB queries is crucial for dashboard responsiveness, especially with high-cardinality data or long time ranges.

Step 20: Best Practices for Fast Queries

1. Use Appropriate Time Ranges

Avoid querying more data than necessary. For real-time monitoring, use shorter time ranges (last 1-6 hours). For trend analysis, use downsampling or continuous queries.

2. Filter Early and Often

// GOOD: Filters applied early reduce data volume
from(bucket: "telegraf")
  |> range(start: -1h)                              // Limit time range
  |> filter(fn: (r) => r["host"] == "web1")         // Filter by tag early
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> aggregateWindow(every: 30s, fn: mean)

// BAD: Processing all data before filtering
from(bucket: "telegraf")
  |> range(start: -1h)
  |> aggregateWindow(every: 30s, fn: mean)
  |> filter(fn: (r) => r["host"] == "web1")         // Filter after aggregation

3. Leverage Aggregation Windows

Use aggregateWindow() in Flux or GROUP BY time($__interval) in InfluxQL to reduce data points returned to Grafana. For a 24-hour view, 1-minute aggregation is usually sufficient.

4. Avoid High-Cardinality Grouping

Grouping by tags with many unique values (like process ID or transaction ID) can create thousands of series. Limit grouping to essential tags like hostname, interface, or mount point.
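
To get a rough sense of how many series a bucket currently holds before deciding what to group by, Flux provides a cardinality helper. A sketch (the one-day window is arbitrary):

import "influxdata/influxdb"

// Approximate series cardinality of the bucket over the last day
influxdb.cardinality(bucket: "telegraf", start: -1d)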

5. Use Retention Policies (InfluxDB 1.x) or Buckets (InfluxDB 2.x)

Configure automatic data downsampling for older data to maintain query performance:

-- InfluxDB 1.x: Create downsampled retention policies
CREATE RETENTION POLICY "one_month" ON "telegraf" DURATION 30d REPLICATION 1
CREATE RETENTION POLICY "one_year_downsampled" ON "telegraf" DURATION 365d REPLICATION 1

-- Create continuous query for downsampling
CREATE CONTINUOUS QUERY "cq_cpu_1h" ON "telegraf"
BEGIN
  SELECT mean("usage_idle") AS "usage_idle"
  INTO "one_year_downsampled"."cpu"
  FROM "cpu"
  GROUP BY time(1h), *
END

💡 InfluxDB 2.x Tasks for Downsampling

In InfluxDB 2.x, use Tasks instead of Continuous Queries:

option task = {name: "downsample-cpu", every: 1h}

from(bucket: "telegraf")
  |> range(start: -2h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> aggregateWindow(every: 1h, fn: mean)
  |> to(bucket: "telegraf-downsampled")

6. Monitor Query Execution Time

Enable query logging in InfluxDB to identify slow queries. In InfluxDB 2.x, check the Task runs page. In InfluxDB 1.x, enable query logs in the configuration file.

šŸ” Security Best Practices

Securing your InfluxDB connection and data is critical, especially in production environments.

āš ļø Critical Security Reminders

  • Never use admin credentials in Grafana - Always create read-only users/tokens
  • Enable TLS/SSL for production - Encrypt data in transit
  • Rotate API tokens regularly - Especially if they may have been exposed (see the CLI sketch after this list)
  • Use network segmentation - InfluxDB should not be directly exposed to the internet
  • Enable authentication - Never run InfluxDB without auth in production
  • Monitor access logs - Track who queries what data and when
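
Rotating a token with the influx CLI amounts to creating a replacement and deleting the old one once Grafana has been updated. A sketch (organization name and token ID are placeholders):

# List existing tokens and note the ID of the one to replace
influx auth list

# Create the replacement read-only token (same permissions as in Step 1)
influx auth create --org your-org-name --read-bucket telegraf --description "Grafana Read Token (rotated)"

# After updating the Grafana data source, delete the old token by ID
influx auth delete --id 0abc123def456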

Enabling TLS for InfluxDB 2.x

# InfluxDB 2.x: edit /etc/influxdb/config.yml
# (in 1.x the equivalents are https-enabled, https-certificate and https-private-key in influxdb.conf)
http-bind-address: ":8086"
tls-cert: "/etc/ssl/influxdb-cert.pem"
tls-key: "/etc/ssl/influxdb-key.pem"

# Restart InfluxDB
systemctl restart influxdb

Then update your Grafana data source URL to https://influxdb-server:8086 and configure TLS certificate validation.

šŸ› ļø Troubleshooting Common Issues

Issue: "No data" in Grafana Panels

Diagnosis Steps:

  1. Verify data exists in InfluxDB
    // InfluxDB 2.x (Flux)
    from(bucket: "telegraf")
      |> range(start: -1h)
      |> filter(fn: (r) => r["_measurement"] == "cpu")
      |> limit(n: 5)

    -- InfluxDB 1.x (InfluxQL)
    SELECT * FROM "cpu" LIMIT 5
  2. Check time range - Ensure dashboard time picker matches data availability
  3. Verify filters - Check that host/tag filters match actual tag values (case-sensitive)
  4. Review query syntax - Look for syntax errors in Query Inspector (panel menu → Inspect → Query)

Issue: Queries Running Slowly

Solutions:

  • Reduce time range or increase aggregation interval
  • Add more specific tag filters to reduce cardinality
  • Create continuous queries or tasks for pre-aggregated data
  • Check InfluxDB server resources (CPU, memory, disk I/O)
  • Consider upgrading InfluxDB or scaling horizontally (Enterprise)

Issue: "Unsupported protocol scheme" Error

Cause:

This error occurs when the URL is malformed or uses an incorrect protocol.

Solution:

  • Ensure URL starts with http:// or https://
  • Do not include trailing slashes: http://server:8086 (not http://server:8086/)
  • Verify port number (default is 8086)
  • Check that InfluxDB is actually running: systemctl status influxdb

📈 Next Steps

You've now configured InfluxDB as a data source and learned how to query system metrics collected by Telegraf. Here are some recommended next steps:

  • Create a comprehensive system dashboard - Combine CPU, memory, disk, and network panels
  • Set up alerts - Configure Grafana alerts for critical thresholds (CPU > 90%, disk > 85%)
  • Explore Telegraf plugins - Add custom input plugins for application-specific metrics
  • Implement data retention - Configure retention policies or tasks to manage storage
  • Build team-specific views - Create dashboards tailored for dev, ops, and management

💡 Pro Tips for Production

  • Use dashboard variables for multi-server views (create $hostname dropdown)
  • Set default time range to "Last 6 hours" for operational dashboards
  • Enable auto-refresh (30s or 1m) for NOC dashboards
  • Use alert annotations to mark incidents on graphs
  • Create separate dashboards for real-time monitoring vs. capacity planning

✅ Module 4 Complete!

You've successfully configured InfluxDB/Telegraf integration and created multiple query examples. You're now ready to visualize system metrics and move on to log aggregation with VictoriaLogs.