Overview
Google Kubernetes Engine (GKE) is a managed, production-ready environment for running containerized applications. Telegraf is a plug-in driven server agent for collecting and sending metrics and events from databases, systems and IoT sensors.
To send your Prometheus-format Google Kubernetes Engine metrics to Logz.io, you need to add the inputs.stackdriver and outputs.http plug-ins to your Telegraf configuration file.
Configuring Telegraf to send your metrics data to Logz.io
Before you begin, you’ll need a GCP project.
Set relevant credentials in GCP
- Navigate to the Project selector and choose the project to send metrics from.
- Create a new service account: in the Service account details screen, give it a unique name and select Create and continue.
- In the Grant this service account access to project screen, add the following roles: Compute Viewer, Monitoring Viewer, and Cloud Asset Viewer.
- Select Done.
- Select your project in the Service accounts for project list.
- Select KEYS.
- Select Keys > Add Key > Create new key and choose JSON as the type.
- Select Create and Save.
You must be a Service Account Key Admin to select Compute Engine and Cloud Asset roles.
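If you prefer to work from the command line, here is a rough gcloud equivalent of the console steps above. The service account name telegraf-metrics and the key file path are illustrative assumptions; replace <<YOUR-PROJECT>> with your GCP project ID.
# Create the service account (the name is an example)
gcloud iam service-accounts create telegraf-metrics --project=<<YOUR-PROJECT>>
# Grant the Compute Viewer, Monitoring Viewer, and Cloud Asset Viewer roles
for role in roles/compute.viewer roles/monitoring.viewer roles/cloudasset.viewer; do
  gcloud projects add-iam-policy-binding <<YOUR-PROJECT>> \
    --member="serviceAccount:telegraf-metrics@<<YOUR-PROJECT>>.iam.gserviceaccount.com" \
    --role="$role"
done
# Create a JSON key file for the service account
gcloud iam service-accounts keys create ./gcp-key.json \
  --iam-account=telegraf-metrics@<<YOUR-PROJECT>>.iam.gserviceaccount.com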
Add an environment variable for the key
On your machine, run:
export GOOGLE_APPLICATION_CREDENTIALS=<<PATH-TO-YOUR-GCP-KEY>>
Replace <<PATH-TO-YOUR-GCP-KEY>> with the path to the JSON file created in the previous step.
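For example, if the key was saved to /home/user/gcp-key.json (a hypothetical path), the command would be:
export GOOGLE_APPLICATION_CREDENTIALS=/home/user/gcp-key.json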
Set up Telegraf v1.17 or later on a dedicated machine
For Windows:
wget https://dl.influxdata.com/telegraf/releases/telegraf-1.19.2_windows_amd64.zip
After downloading the archive, extract its contents into C:\Program Files\Logzio\telegraf\.
The configuration file is located in C:\Program Files\Logzio\telegraf\.
For macOS:
brew install telegraf
The configuration file is located at /usr/local/etc/telegraf.conf.
For Linux:
Ubuntu & Debian
sudo apt-get update && sudo apt-get install telegraf
The configuration file is located at /etc/telegraf/telegraf.conf.
RedHat and CentOS
sudo yum install telegraf
The configuration file is located at /etc/telegraf/telegraf.conf.
SLES & openSUSE
# add go repository
zypper ar -f obs://devel:languages:go/ go
# install latest telegraf
zypper in telegraf
The configuration file is located at /etc/telegraf/telegraf.conf.
FreeBSD/PC-BSD
sudo pkg install telegraf
The configuration file is located at /etc/telegraf/telegraf.conf.
Add the inputs.stackdriver plug-in
First, configure the input plug-in to enable Telegraf to scrape the GCP data from your hosts. To do this, add the following code to the configuration file:
[[inputs.stackdriver]]
  project = "<<YOUR-PROJECT>>"
  metric_type_prefix_include = [
    "kubernetes.io",
  ]
  interval = "1m"
- Replace <<YOUR-PROJECT>> with the name of your GCP project.
The full list of data scraping and configuration options can be found in the Telegraf documentation for the stackdriver input plug-in.
If you need to restrict the number of metrics you receive, narrow the metric_type_prefix_include value to your scope, for example kubernetes.io/anthos/APIService. For more information on the metric types, see the GCP documentation.
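For instance, a narrowed configuration that collects only the Anthos APIService metrics mentioned above could look like this (the prefix list is illustrative; adjust it to the metric types you actually need):
[[inputs.stackdriver]]
  project = "<<YOUR-PROJECT>>"
  metric_type_prefix_include = [
    "kubernetes.io/anthos/APIService",
  ]
  interval = "1m"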
Add the outputs.http plug-in
After you create the configuration file, configure the output plug-in to enable Telegraf to send your data to Logz.io in Prometheus format. To do this, add the following code to the configuration file:
[[outputs.http]]
  url = "https://<<LISTENER-HOST>>:8053"
  data_format = "prometheusremotewrite"

  [outputs.http.headers]
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
    X-Prometheus-Remote-Write-Version = "0.1.0"
    Authorization = "Bearer <<PROMETHEUS-METRICS-SHIPPING-TOKEN>>"
Replace the placeholders to match your specifics. (They are indicated by double angle brackets << >>):
- Replace <<PROMETHEUS-METRICS-SHIPPING-TOKEN>> with a token for the Metrics account you want to ship to. Here’s how to look up your Metrics token.
- Replace <<LISTENER-HOST>> with the Logz.io listener URL for your region, configured to use port 8052 for http traffic, or port 8053 for https traffic. For example, listener.logz.io if your account is hosted on AWS US East, or listener-nl.logz.io if hosted on Azure West Europe.
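For example, for an account hosted on AWS US East and shipping over https, the url line would be:
url = "https://listener.logz.io:8053"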
Start Telegraf
On Windows:
telegraf.exe --service start
On macOS:
telegraf --config telegraf.conf
On Linux:
Linux (sysvinit and upstart installations)
sudo service telegraf start
Linux (systemd installations)
systemctl start telegraf
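On systemd-based Linux installations, you can verify that the agent started and watch its output with:
systemctl status telegraf
journalctl -u telegraf -f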
Check Logz.io for your metrics
Give your data some time to get from your system to ours, then log in to your Logz.io Metrics account, and open the Logz.io Metrics tab.
Troubleshooting
This section contains guidelines for handling errors that you may encounter when trying to collect Kubernetes metrics.
- Overview
- Problem: Permanent error - context deadline exceeded
- Problem: Incorrect listener and/or token
- Problem: Windows nodes error
- Problem: Invalid helm chart version
- Problem: The prometheusremotewrite exporter timeout
- Problem: Permanent error - log state shows as waiting
- Problem: You have reached your pull rate limit
Problem: Permanent error - context deadline exceeded
The following error appears:
Permanent error: Post \"https://<<LISTENER-HOST>>:8053\": context deadline exceeded
This means that the POST request timed out.
Possible cause - Connectivity issue
A connectivity issue may be causing this error.
Suggested remedy
Check your shipper’s connectivity as follows.
For macOS and Linux, use telnet to make sure your log shipper can connect to Logz.io listeners.
As of macOS High Sierra (10.13), telnet is not installed by default. You can install telnet with Homebrew by running brew install telnet.
Run this command from the environment you’re shipping from, after adding the appropriate port number:
telnet listener.logz.io {port-number}
For Windows servers running Windows 8/Server 2012 and later, run the following command in PowerShell:
Test-NetConnection listener.logz.io -Port {port-number}
The port numbers are 8052 and 8053.
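For example, to check the https port against the US East listener:
telnet listener.logz.io 8053
or, in PowerShell:
Test-NetConnection listener.logz.io -Port 8053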
Possible cause - Service exposing the metrics needs more time
A service exposing the metrics may need more time to send the response to the OpenTelemetry collector.
Suggested remedy
Increase the OpenTelemetry collector timeout as follows.
In values.yaml, set:
config:
  receivers:
    prometheus:
      config:
        global:
          scrape_timeout: <<timeout time>>
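For instance, assuming a 30-second timeout is long enough for your services (adjust the value to your environment):
config:
  receivers:
    prometheus:
      config:
        global:
          scrape_timeout: 30s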
Problem: Incorrect listener and/or token
You may be using an incorrect listener and/or token.
You will need to look in the logs of a pod whose name contains otel-collector.
Possible cause - The token is not valid
In the logs, the error for the token will be:
"error": "Permanent error: remote write returned HTTP status 401 Unauthorized; err = <nil>: Shipping token is not valid"
Possible cause - The listener is not valid
For the URL, the error will be:
"error": "Permanent error: Post \"https://liener.logz.io:8053\": dial tcp: lookup <<provided listener>> on <<ip>>: no such host"
Suggested remedy
Check that the listener and token of your account are correct. You can view them in the Manage tokens section.
Problem: Windows nodes error
Possible cause - Incorrect username and/or password for Windows nodes
You may be using an incorrect username and/or password for Windows nodes.
You will need to look in the logs of the windows-exporter-installer pod. The error will look like this:
INFO:paramiko.transport:Authentication (password) failed.
ERROR:root:SSH connection to node aksnpwin000002 failed, please check username and password
Suggested remedy
Ensure that the username and password for the Windows nodes are correct.
Problem: Invalid helm chart version
Possible cause - The version of the helm chart is not up to date
The helm chart version that you are using may be outdated.
Suggested remedy
Update the helm chart by running:
helm repo update
Problem: The prometheusremotewrite exporter timeout
You don’t see any metrics, or only some of your metrics, in the Logz.io app, but when you check the logs of your otel-collector pod, there are no errors. This might indicate this issue.
Possible cause - The timeout in the prometheusremotewrite exporter is too short
The timeout setting in the prometheusremotewrite exporter is too short.
Suggested remedy
Increase the timeout setting in the prometheusremotewrite exporter.
For example, if your timeout setting is 5s:
endpoint: ${LISTENER_URL}
timeout: 5s
external_labels:
  p8s_logzio_name: ${P8S_LOGZIO_NAME}
headers:
  Authorization: "Bearer ${METRICS_TOKEN}"
You can increase it to 20s:
endpoint: ${LISTENER_URL}
timeout: 20s
external_labels:
  p8s_logzio_name: ${P8S_LOGZIO_NAME}
headers:
  Authorization: "Bearer ${METRICS_TOKEN}"
Problem: Permanent error - log state shows as waiting
The log shows the following:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Possible cause
Insufficient memory allocated to the pod.
Suggested remedy
In values.yaml, increase the memory of the standaloneCollector resources by approximately 100Mi.
For example, if you are using 512Mi:
standaloneCollector:
  enabled: true
containerLogs:
  enabled: false
resources:
  limits:
    cpu: 256m
    memory: 512Mi
You can increase it as much as needed. In this example, it’s 612Mi:
standaloneCollector:
  enabled: true
containerLogs:
  enabled: false
resources:
  limits:
    cpu: 256m
    memory: 612Mi
When running apps on Kubernetes
You need to make sure that the prometheus.io/scrape annotation is set to true:
prometheus.io/scrape: true
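For example, on a pod manifest (a minimal sketch with a hypothetical pod name; note that Kubernetes annotation values are strings, so the value is quoted in YAML):
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
spec:
  containers:
    - name: my-app
      image: my-app:latest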
Problem: You have reached your pull rate limit
In some cases (e.g., spot clusters) where the pods or nodes are replaced frequently, you might reach the pull rate limit for images pulled from Docker Hub, with the following error:
You have reached your pull rate limit. You may increase the limit by authenticating and upgrading:
https://www.docker.com/increase-rate-limits
Suggested remedy
You can use the following --set commands to use an alternative image repository:
For the monitoring chart and the Telemetry Collector Kubernetes installation:
--set logzio-k8s-telemetry.image.repository=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib
--set logzio-k8s-telemetry.prometheus-pushgateway.image.repository=public.ecr.aws/logzio/prom-pushgateway
For the telemetry chart:
--set image.repository=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib
--set prometheus-pushgateway.image.repository=public.ecr.aws/logzio/prom-pushgateway
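For example, a hypothetical installation of the monitoring chart with the alternative repositories might look like this (the release name logzio-monitoring and the chart reference logzio-helm/logzio-monitoring are assumptions; use the names from your own installation command):
helm install logzio-monitoring logzio-helm/logzio-monitoring \
  --set logzio-k8s-telemetry.image.repository=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib \
  --set logzio-k8s-telemetry.prometheus-pushgateway.image.repository=public.ecr.aws/logzio/prom-pushgateway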