Overview
Google Kubernetes Engine (GKE) is a managed, production-ready environment for running containerized applications. Telegraf is a plug-in driven server agent for collecting and sending metrics and events from databases, systems and IoT sensors.
To send your Prometheus-format Google Kubernetes Engine metrics to Logz.io, you need to add the inputs.stackdriver and outputs.http plug-ins to your Telegraf configuration file.
Configuring Telegraf to send your metrics data to Logz.io
Before you begin, you’ll need a GCP project.
Set relevant credentials in GCP
- Navigate to the Project selector and choose the project to send metrics from.
- Create a new service account: in the Service account details screen, give it a unique name and select Create and continue.
- In the Grant this service account access to project screen, add the following roles: Compute Viewer, Monitoring Viewer, and Cloud Asset Viewer.
- Select Done.
- Select your project in the Service accounts for project list.
- Select KEYS.
- Select Keys > Add Key > Create new key and choose JSON as the type.
- Select Create and Save.
You must be a Service Account Key Admin to select Compute Engine and Cloud Asset roles.
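If you prefer to work from the command line, here is a rough gcloud equivalent of the console steps above. The service account name telegraf-metrics and the key file path are illustrative assumptions; replace <<YOUR-PROJECT>> with your GCP project ID.
# Create the service account (the name is an example)
gcloud iam service-accounts create telegraf-metrics --project=<<YOUR-PROJECT>>
# Grant the Compute Viewer, Monitoring Viewer, and Cloud Asset Viewer roles
for role in roles/compute.viewer roles/monitoring.viewer roles/cloudasset.viewer; do
  gcloud projects add-iam-policy-binding <<YOUR-PROJECT>> \
    --member="serviceAccount:telegraf-metrics@<<YOUR-PROJECT>>.iam.gserviceaccount.com" \
    --role="$role"
done
# Create a JSON key file for the service account
gcloud iam service-accounts keys create ./gcp-key.json \
  --iam-account=telegraf-metrics@<<YOUR-PROJECT>>.iam.gserviceaccount.com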
Add an environment variable for the key
On your machine, run:
export GOOGLE_APPLICATION_CREDENTIALS=<<PATH-TO-YOUR-GCP-KEY>>
Replace <<PATH-TO-YOUR-GCP-KEY>> with the path to the JSON file created in the previous step.
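For example, if the key was saved to /home/user/gcp-key.json (a hypothetical path), the command would be:
export GOOGLE_APPLICATION_CREDENTIALS=/home/user/gcp-key.json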
Set up Telegraf v1.17 or later on a dedicated machine
For Windows:
wget https://dl.influxdata.com/telegraf/releases/telegraf-1.19.2_windows_amd64.zip
After downloading the archive, extract its contents into C:\Program Files\Logzio\telegraf\.
The configuration file is located in C:\Program Files\Logzio\telegraf\.
For macOS:
brew install telegraf
The configuration file is located at /usr/local/etc/telegraf.conf.
For Linux:
Ubuntu & Debian
sudo apt-get update && sudo apt-get install telegraf
The configuration file is located at /etc/telegraf/telegraf.conf.
RedHat and CentOS
sudo yum install telegraf
The configuration file is located at /etc/telegraf/telegraf.conf.
SLES & openSUSE
# add go repository
zypper ar -f obs://devel:languages:go/ go
# install latest telegraf
zypper in telegraf
The configuration file is located at /etc/telegraf/telegraf.conf.
FreeBSD/PC-BSD
sudo pkg install telegraf
The configuration file is located at /etc/telegraf/telegraf.conf.
Add the inputs.stackdriver plug-in
First, configure the input plug-in to enable Telegraf to scrape the GCP data from your hosts. To do this, add the following code to the configuration file:
[[inputs.stackdriver]]
  project = "<<YOUR-PROJECT>>"
  metric_type_prefix_include = [
    "kubernetes.io",
  ]
  interval = "1m"
- Replace <<YOUR-PROJECT>> with the name of your GCP project.
The full list of data scraping and configuration options can be found in the Telegraf documentation for the stackdriver input plug-in.
If you need to restrict the number of metrics you receive, narrow the metric_type_prefix_include value to your scope, for example kubernetes.io/anthos/APIService. For more information on the metric types, see the GCP documentation.
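For instance, a narrowed configuration that collects only the Anthos APIService metrics mentioned above could look like this (the prefix list is illustrative; adjust it to the metric types you actually need):
[[inputs.stackdriver]]
  project = "<<YOUR-PROJECT>>"
  metric_type_prefix_include = [
    "kubernetes.io/anthos/APIService",
  ]
  interval = "1m"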
Add the outputs.http plug-in
After you create the configuration file, configure the output plug-in to enable Telegraf to send your data to Logz.io in Prometheus format. To do this, add the following code to the configuration file:
[[outputs.http]]
  url = "https://<<LISTENER-HOST>>:8053"
  data_format = "prometheusremotewrite"

  [outputs.http.headers]
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
    X-Prometheus-Remote-Write-Version = "0.1.0"
    Authorization = "Bearer <<PROMETHEUS-METRICS-SHIPPING-TOKEN>>"
Replace the placeholders to match your specifics. (They are indicated by double angle brackets << >>):
- Replace <<PROMETHEUS-METRICS-SHIPPING-TOKEN>> with a token for the Metrics account you want to ship to. Here’s how to look up your Metrics token.
- Replace <<LISTENER-HOST>> with the Logz.io listener URL for your region, configured to use port 8052 for http traffic, or port 8053 for https traffic. For example, listener.logz.io if your account is hosted on AWS US East, or listener-nl.logz.io if hosted on Azure West Europe.
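For example, for an account hosted on AWS US East and shipping over https, the url line would be:
url = "https://listener.logz.io:8053"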
Start Telegraf
On Windows:
telegraf.exe --service start
On macOS:
telegraf --config telegraf.conf
On Linux:
Linux (sysvinit and upstart installations)
sudo service telegraf start
Linux (systemd installations)
systemctl start telegraf
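On systemd-based Linux installations, you can verify that the agent started and watch its output with:
systemctl status telegraf
journalctl -u telegraf -f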
Check Logz.io for your metrics
Give your data some time to get from your system to ours, then log in to your Logz.io Metrics account, and open the Logz.io Metrics tab.
Troubleshooting
This section contains guidelines for handling errors that you may encounter when trying to collect Kubernetes metrics.
- Overview
- Problem: Permanent error - context deadline exceeded
- Problem: Incorrect listener and/or token
- Problem: Windows nodes error
- Problem: Invalid helm chart version
- Problem: The prometheusremotewrite exporter timeout
- Problem: Permanent error - log state shows as waiting
- Problem: You have reached your pull rate limit
Problem: Permanent error - context deadline exceeded
The following error appears:
Permanent error: Post \"https://<<LISTENER-HOST>>:8053\": context deadline exceeded
This means that the POST request timed out.
Possible cause - Connectivity issue
A connectivity issue may be causing this error.
Suggested remedy
Check your shipper’s connectivity as follows.
For macOS and Linux, use telnet to make sure your log shipper can connect to Logz.io listeners.
As of macOS High Sierra (10.13), telnet is not installed by default. You can install telnet with Homebrew by running brew install telnet.
Run this command from the environment you’re shipping from, after adding the appropriate port number:
telnet listener.logz.io {port-number}
For Windows servers running Windows 8/Server 2012 and later, run the following command in PowerShell:
Test-NetConnection listener.logz.io -Port {port-number}
The port numbers are 8052 and 8053.
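For example, to check the https port against the US East listener:
telnet listener.logz.io 8053
or, in PowerShell:
Test-NetConnection listener.logz.io -Port 8053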
Possible cause - Service exposing the metrics needs more time
A service exposing the metrics may need more time to send the response to the OpenTelemetry collector.
Suggested remedy
Increase the OpenTelemetry collector timeout as follows.
In values.yaml, set:
config:
  receivers:
    prometheus:
      config:
        global:
          scrape_timeout: <<timeout time>>
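For instance, assuming a 30-second timeout is long enough for your services (adjust the value to your environment):
config:
  receivers:
    prometheus:
      config:
        global:
          scrape_timeout: 30s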
Problem: Incorrect listener and/or token
You may be using an incorrect listener and/or token.
You will need to look in the logs of a pod whose name contains otel-collector.
Possible cause - The token is not valid
In the logs, the error for the token will be:
"error": "Permanent error: remote write returned HTTP status 401 Unauthorized; err = <nil>: Shipping token is not valid"
Possible cause - The listener is not valid
For the URL, the error will be:
"error": "Permanent error: Post \"https://liener.logz.io:8053\": dial tcp: lookup <<provided listener>> on <<ip>>: no such host"
Suggested remedy
Check that the listener and token of your account are correct. You can view them in the Manage tokens section.
Problem: Windows nodes error
Possible cause - Incorrect username and/or password for Windows nodes
You may be using an incorrect username and/or password for Windows nodes.
You will need to look in the logs of the windows-exporter-installer pod. The error will look like this:
INFO:paramiko.transport:Authentication (password) failed.
ERROR:root:SSH connection to node aksnpwin000002 failed, please check username and password
Suggested remedy
Ensure that the username and password for the Windows nodes are correct.
Problem: Invalid helm chart version
Possible cause - The version of the helm chart is not up to date
The helm chart version that you are using may be outdated.
Suggested remedy
Update the helm chart by running:
helm repo update
Problem: The prometheusremotewrite exporter timeout
You don’t see any metrics, or only some of your metrics, in the Logz.io app, but when you check the logs of your otel-collector pod, there are no errors. This might indicate this issue.
Possible cause - The timeout in the prometheusremotewrite exporter is too short
The timeout setting in the prometheusremotewrite exporter is too short.
Suggested remedy
Increase the timeout setting in the prometheusremotewrite exporter.
For example, if your timeout setting is 5s:
endpoint: ${LISTENER_URL}
timeout: 5s
external_labels:
  p8s_logzio_name: ${P8S_LOGZIO_NAME}
headers:
  Authorization: "Bearer ${METRICS_TOKEN}"
You can increase it to 20s:
endpoint: ${LISTENER_URL}
timeout: 20s
external_labels:
  p8s_logzio_name: ${P8S_LOGZIO_NAME}
headers:
  Authorization: "Bearer ${METRICS_TOKEN}"
Problem: Permanent error - log state shows as waiting
The log shows the following:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Possible cause
Insufficient memory allocated to the pod.
Suggested remedy
In values.yaml, increase the memory of the standaloneCollector resources by approximately 100Mi.
For example, if you are using 512Mi:
standaloneCollector:
  enabled: true
containerLogs:
  enabled: false
resources:
  limits:
    cpu: 256m
    memory: 512Mi
You can increase it as much as needed. In this example, it’s 612Mi:
standaloneCollector:
  enabled: true
containerLogs:
  enabled: false
resources:
  limits:
    cpu: 256m
    memory: 612Mi
When running apps on Kubernetes
You need to make sure that the prometheus.io/scrape annotation is set to true:
prometheus.io/scrape: true
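For example, on a pod manifest (a minimal sketch with a hypothetical pod name; note that Kubernetes annotation values are strings, so the value is quoted in YAML):
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
spec:
  containers:
    - name: my-app
      image: my-app:latest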
Problem: You have reached your pull rate limit
In some cases (e.g., spot clusters) where the pods or nodes are replaced frequently, you might reach the pull rate limit for images pulled from Docker Hub, with the following error:
You have reached your pull rate limit. You may increase the limit by authenticating and upgrading:
https://www.docker.com/increase-rate-limits
Suggested remedy
You can use the following --set commands to use an alternative image repository:
For the monitoring chart and the Telemetry Collector Kubernetes installation:
--set logzio-k8s-telemetry.image.repository=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib
--set logzio-k8s-telemetry.prometheus-pushgateway.image.repository=public.ecr.aws/logzio/prom-pushgateway
For the telemetry chart:
--set image.repository=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib
--set prometheus-pushgateway.image.repository=public.ecr.aws/logzio/prom-pushgateway
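For example, a hypothetical installation of the monitoring chart with the alternative repositories might look like this (the release name logzio-monitoring and the chart reference logzio-helm/logzio-monitoring are assumptions; use the names from your own installation command):
helm install logzio-monitoring logzio-helm/logzio-monitoring \
  --set logzio-k8s-telemetry.image.repository=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib \
  --set logzio-k8s-telemetry.prometheus-pushgateway.image.repository=public.ecr.aws/logzio/prom-pushgateway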