Monitoring Jobs with Prometheus and Grafana
This guide will give you everything you need to start observing metrics for Immerok Jobs with Prometheus and Grafana.
If you haven't created a Job already, check out the tutorial to get going.
Interested in other ways to consume metrics from your Jobs? Hitting some limitations for your use case? Reach out to us on Slack, by email, or wherever you find your local Immerokers.
Configuring Prometheus
We'll assume our Org is named immerok, and we have a Job named window-aggregation-0.1 in the default Project.
Each Job exposes all its metrics for scraping via the Immerok API Server on the endpoint:
https://api.immerok.cloud/apis/core/v1alpha1/orgs/$ORG/projects/$PROJECT/jobs/$JOB/metrics
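As a small sketch, the endpoint can be assembled from the three names with a shell function. The function name metrics_url is only an illustrative choice; the URL pattern is the one shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the Org, Project, and Job names into the
# metrics endpoint pattern shown above.
metrics_url() {
  local org="$1" project="$2" job="$3"
  printf 'https://api.immerok.cloud/apis/core/v1alpha1/orgs/%s/projects/%s/jobs/%s/metrics\n' \
    "$org" "$project" "$job"
}

# Example names used throughout this guide.
metrics_url immerok default window-aggregation-0.1
```

Run as written, this prints the metrics URL for the window-aggregation-0.1 Job in the default Project of the immerok Org.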
We'll use the static_configs directive to specify the scrape endpoint.
We'll just need the Org, Project, and Job names as well as a token to authenticate.
Let's use the rok CLI to generate one:
    $ rok auth generate
    eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJhdXN0aW5jZSIsImV4cCI6MTY2ODQ3NjIwOCwibmJmIjoxNjY3ODcxNDA4LCJpYXQiOjE2Njc4NzE0MDgsInJvbGUiOiJvcmc6YWRtaW4iLCJvcmdzIjpbImltbWVyb2siXX0.U8j_V1K-hYIy0WnRjpmpVbBiooWLWqM_V7TXrYdLBRGd8Od8YUpeQr94QeAmkgQfCSQ_c6FpIt1G3FAzsBPOgQ
    # Optionally, specify an expiration time
    $ rok auth generate --expires-at="3mo"
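Before putting the token into a Prometheus config, it can be worth checking by hand that it is accepted. The sketch below is one way to do that with curl. It assumes the API accepts an "Authorization: Bearer <token>" header (which is what Prometheus sends by default for an authorization block) and that rok auth generate prints only the token; fetch_metrics is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: request a Job's metrics with a bearer token.
# Assumes the API accepts "Authorization: Bearer <token>", the default
# scheme Prometheus uses for an `authorization` block.
fetch_metrics() {
  local token="$1" url="$2"
  curl --fail --silent --show-error \
    --header "Authorization: Bearer ${token}" \
    "$url"
}

# Only hit the network when a token has been exported, for example with:
#   export TOKEN="$(rok auth generate)"   # assumes only the token is printed
if [ -n "${TOKEN:-}" ]; then
  fetch_metrics "$TOKEN" \
    "https://api.immerok.cloud/apis/core/v1alpha1/orgs/immerok/projects/default/jobs/window-aggregation-0.1/metrics" \
    | head -n 20
fi
```

If the token and names are right, the first lines of the Job's metrics should appear; with --fail, curl exits non-zero on an HTTP error instead of printing the error body.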
With a token in hand, the Prometheus configuration looks like this, replacing $TOKEN with the generated token:

    global:
      evaluation_interval: 1m
      scrape_interval: 10s
      scrape_timeout: 10s
    scrape_configs:
    - honor_labels: true
      job_name: 'immerok-job'
      scheme: https
      metrics_path: /apis/core/v1alpha1/orgs/immerok/projects/default/jobs/window-aggregation-0.1/metrics
      authorization:
        # Outside local testing, you should load the token via a file instead of leaving it in plaintext within the config
        credentials: $TOKEN
      static_configs:
      - targets:
        - api.immerok.cloud
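The comment in the config recommends loading the token from a file outside local testing. A minimal sketch of that setup: write the token to a file that only its owner can read, then reference that file from the authorization block with Prometheus's credentials_file setting in place of credentials. The helper name write_token_file and the ./immerok-token path are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: store a token in a file readable only by its owner.
write_token_file() {
  local token="$1" path="$2"
  # umask 077 makes a newly created file mode 0600; chmod covers the case
  # where the file already existed with looser permissions.
  ( umask 077 && printf '%s' "$token" > "$path" && chmod 600 "$path" )
}

# Placeholder path; pick a location your Prometheus process can read.
TOKEN_FILE="${TOKEN_FILE:-./immerok-token}"
if [ -n "${TOKEN:-}" ]; then
  write_token_file "$TOKEN" "$TOKEN_FILE"
fi
```

With the file in place, the authorization block would carry a credentials_file entry pointing at that path rather than a credentials entry holding the token itself.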
Immerok-Specific Labels
The following labels are attached to each metric to help identify workloads:
immerok_org: The Org the Job belongs to
immerok_project: The Project the Job resides in
immerok_zone: The Zone the Job runs in
immerok_job: The name of the Job itself
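These labels can be used directly in PromQL selectors. As a rough sketch, the query below counts the series ingested per Job and Zone within one Org and Project, which is a quick way to confirm that scraping works without knowing any metric names. The helper names and the Prometheus address (localhost on the default port 9090) are assumptions; the request goes through the Prometheus HTTP API's /api/v1/query endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a PromQL query that counts the series
# ingested per Job and Zone within one Org and Project.
series_per_job_query() {
  local org="$1" project="$2"
  printf 'count by (immerok_job, immerok_zone) ({immerok_org="%s", immerok_project="%s"})\n' \
    "$org" "$project"
}

# Assumption: Prometheus is reachable here (9090 is its default port).
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# Hypothetical helper: run an instant query through the Prometheus HTTP API.
run_query() {
  curl --fail --silent --show-error --get \
    --data-urlencode "query=$1" \
    "${PROMETHEUS_URL}/api/v1/query"
}

# Print the query for the example Org and Project used in this guide.
series_per_job_query immerok default
```

To send the query to a running Prometheus, pass it along: run_query "$(series_per_job_query immerok default)".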
Observing the Metrics
Once the metrics are ingested, you can use the rich set of metrics made available by Flink (see Flink's metrics documentation).
We do not expose deprecated metrics. Please consult Flink's documentation for replacements.
In the dashboard provided below, we sketch out possibilities to get you started exploring health, throughput, and state checkpointing.