Observability

This guide provides recommendations for monitoring your Self-Hosted Spacelift installation to ensure it's running correctly. Proper monitoring helps identify potential issues before they impact your operations and ensures the reliability of your Spacelift infrastructure.

Metrics to Monitor

Core Services

The following table shows the core metrics that you should monitor for each of your Spacelift services:

Service	Metric	Description
Server	CPU usage	Processor utilization
Server	Memory usage	RAM consumption
Load balancer	Response time	Time to process API requests
Load balancer	Error rate	Percentage of 5xx responses
Scheduler	CPU usage	Processor utilization
Scheduler	Memory usage	RAM consumption
Drain	CPU usage	Processor utilization
Drain	Memory usage	RAM consumption
Database	CPU usage	Processor utilization
Database	Memory usage	RAM consumption
Database	Connection count	Active DB connections

Message queues

The Drain service uses several message queues to process certain operations asynchronously. The main metric to monitor for these queues is the queue length. When the Drain service is operating correctly, messages are processed quickly, and you should not see large backlogs (hundreds of messages) persisting for long periods of time.

One caveat to this is the webhooks queue. Because webhook processing involves making many requests to your source control system, individual messages can sometimes take several minutes to process. It is not unusual to see small backlogs on the webhooks queue, or messages that take several minutes to process. This is fine as long as the messages are eventually processed and the queue length is not constantly increasing.
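The "not constantly increasing" condition can be checked mechanically. The sketch below is one possible heuristic, not something Spacelift ships: it takes queue-length samples collected at a fixed interval (oldest first) and flags the queue only when the length stayed at or above a threshold for the whole window and never decreased. The function name, the threshold of 100, and the sample values are illustrative assumptions.

```python
from typing import Sequence


def backlog_is_growing(samples: Sequence[int], threshold: int = 100) -> bool:
    """Return True when every sample is at or above `threshold`, the queue
    length never decreases across the window, and it ends higher than it began.

    `samples` are queue-length readings taken at a fixed interval, oldest
    first. The threshold and the window size are assumptions to tune.
    """
    if len(samples) < 2:
        return False  # not enough data to talk about a trend
    sustained = all(s >= threshold for s in samples)
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    grew = samples[-1] > samples[0]
    return sustained and non_decreasing and grew


# A webhooks queue that spikes and then drains is not flagged.
print(backlog_is_growing([120, 340, 210, 90, 15]))    # False
# A backlog that keeps climbing is flagged.
print(backlog_is_growing([150, 220, 400, 650, 900]))  # True
```

A stricter or looser rule (for example, allowing brief dips) may suit your workload better; the point is to alert on the trend rather than on a single high reading.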

If backlogs start to build up and require immediate attention, you have three mitigation options:

Adjust the message processing timeout - The MESSAGE_PROCESSING_TIMEOUT_SECONDS environment variable controls how long messages can process before timing out (default: 900 seconds). Reducing this value (e.g., to 300 seconds) causes long-running messages to return to the queue faster, allowing other messages to be processed. This variable is exposed via the .services.drain.message_processing_timeout_seconds configuration option for CloudFormation deployments, and can be directly passed as an environment variable for Terraform deployments.
Scale the Drain service - For CloudFormation deployments, increase the .services.drain.desired_count parameter to run more Drain instances in parallel. For Terraform deployments, you can configure autoscaling based on queue length metrics. This is straightforward for SQS queues because they automatically publish metrics through CloudWatch, but it requires some additional setup for Postgres-based queues, where you need to rely on telemetry (see below).
Increase per-queue Drain concurrency - By default, each Drain instance processes messages from each queue sequentially, one at a time. This means a single slow message can block every other message on the same queue until it finishes. The DRAIN_CONCURRENCY_* environment variables allow a single Drain instance to process multiple messages from the same queue in parallel. See Tuning Drain concurrency below.

Tuning Drain concurrency

Each Drain instance runs a configurable number of independent receivers per queue. The default is 1 for every queue, which preserves the sequential behavior described above. Raising the value for a queue allows each Drain instance to process that many messages from the queue in parallel.

To decide which parameter to raise, look at the per-queue depth metrics described below (SQS or Postgres): the queue that shows a growing backlog is the one to tune. For a description of what each queue handles, see the Queues reference.

Queue	Environment variable
`async-jobs`	`DRAIN_CONCURRENCY_ASYNC_JOBS`
`async-jobs.fifo`	`DRAIN_CONCURRENCY_ASYNC_JOBS_FIFO`
`cronjobs`	`DRAIN_CONCURRENCY_CRONJOBS`
`dlq`	`DRAIN_CONCURRENCY_DLQ`
`dlq.fifo`	`DRAIN_CONCURRENCY_DLQ_FIFO`
`events-inbox`	`DRAIN_CONCURRENCY_EVENTS`
`iot`	`DRAIN_CONCURRENCY_IOT`
`webhooks`	`DRAIN_CONCURRENCY_WEBHOOKS`

A good starting point is to double the value for the affected queue (for example, DRAIN_CONCURRENCY_WEBHOOKS=2), observe the queue depth, and increase further if the backlog persists.
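The doubling step can be written down as a small helper. This is a sketch only: the queue-to-variable mapping is copied from the table above, while the function name and the `ceiling` cap are hypothetical additions meant to keep increases gradual. The default of 1 matches the documented default of one receiver per queue.

```python
from typing import Mapping, Tuple

# Queue name -> environment variable, copied from the table above.
DRAIN_CONCURRENCY_VARS = {
    "async-jobs": "DRAIN_CONCURRENCY_ASYNC_JOBS",
    "async-jobs.fifo": "DRAIN_CONCURRENCY_ASYNC_JOBS_FIFO",
    "cronjobs": "DRAIN_CONCURRENCY_CRONJOBS",
    "dlq": "DRAIN_CONCURRENCY_DLQ",
    "dlq.fifo": "DRAIN_CONCURRENCY_DLQ_FIFO",
    "events-inbox": "DRAIN_CONCURRENCY_EVENTS",
    "iot": "DRAIN_CONCURRENCY_IOT",
    "webhooks": "DRAIN_CONCURRENCY_WEBHOOKS",
}


def next_concurrency(
    queue: str, env: Mapping[str, str], ceiling: int = 8
) -> Tuple[str, int]:
    """Return the variable to set and the doubled value for `queue`.

    `ceiling` is a hypothetical safety cap, not a Spacelift setting.
    """
    var = DRAIN_CONCURRENCY_VARS[queue]
    current = int(env.get(var, "1"))  # unset means the default of 1
    return var, min(max(current, 1) * 2, ceiling)


print(next_concurrency("webhooks", {}))
# ('DRAIN_CONCURRENCY_WEBHOOKS', 2)
print(next_concurrency("webhooks", {"DRAIN_CONCURRENCY_WEBHOOKS": "8"}))
# ('DRAIN_CONCURRENCY_WEBHOOKS', 8)
```

Apply the returned value to the Drain service's environment, then watch the queue depth before raising it again.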

Warning

Higher concurrency values increase resource usage: each additional receiver can hold a message in memory and use additional database connections. Very large values can lead to out-of-memory errors in the Drain service or exhaustion of the database connection pool. Increase the values gradually and monitor Drain CPU and memory usage as well as database connection counts.

SQS metrics

The message queue length is easy to monitor for SQS-based message queues - you can use the ApproximateNumberOfMessagesVisible metric provided by SQS.
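As an illustration, the following sketch reads that metric from CloudWatch with boto3. It assumes boto3 is installed and AWS credentials are configured; the function names, the 30-minute window, and the 5-minute period are illustrative choices, and the queue name you pass in depends on your installation.

```python
from datetime import datetime, timedelta, timezone


def queue_depth_request(queue_name: str, minutes: int = 30) -> dict:
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 300,  # seconds per datapoint
        "Statistics": ["Maximum"],
    }


def fetch_queue_depth(queue_name: str) -> list:
    """Return (timestamp, max visible messages) pairs, oldest first."""
    import boto3  # third-party; imported lazily so the helper above has no dependencies

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**queue_depth_request(queue_name))
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Maximum"]) for p in points]
```

Calling `fetch_queue_depth` with the name of one of your Spacelift SQS queues returns a short time series that you can inspect directly or feed into an alerting rule.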

Postgres metrics

For the Postgres-based message queue, however, you need telemetry enabled. When telemetry is enabled, we expose the following metrics:

postgres_queue.messages.sent (counter) - incremented when a message is sent to the queue.
postgres_queue.messages.received (counter) - incremented when a message is received from the queue.
postgres_queue.messages.changed_visibility (counter) - incremented when a message's visibility is changed.
postgres_queue.messages.deleted (counter) - incremented when a message is deleted from the queue.
postgres_queue.messages.total (gauge) - total number of messages in the queue.
postgres_queue.messages.visible (gauge) - number of visible messages in the queue.
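These metrics can be combined into a few derived values. The sketch below is an assumption-laden example rather than an official formula: it takes two readings of the counters and gauges, a known number of seconds apart, and computes send and delete rates plus the net growth rate. Treating `total - visible` as the number of messages that are currently hidden (for example, being processed) is an interpretation of the metric descriptions above, and counter resets are not handled. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """One reading of the postgres_queue.messages.* metrics for a queue."""
    sent: int      # postgres_queue.messages.sent (counter)
    deleted: int   # postgres_queue.messages.deleted (counter)
    total: int     # postgres_queue.messages.total (gauge)
    visible: int   # postgres_queue.messages.visible (gauge)


def summarize(prev: QueueSample, curr: QueueSample, interval_seconds: float) -> dict:
    """Derive per-second rates between two samples (no counter-reset handling)."""
    send_rate = (curr.sent - prev.sent) / interval_seconds
    delete_rate = (curr.deleted - prev.deleted) / interval_seconds
    return {
        "send_rate": send_rate,
        "delete_rate": delete_rate,
        # Positive means messages arrive faster than they are completed.
        "net_growth_rate": send_rate - delete_rate,
        # Assumption: messages that exist but are not visible are in flight.
        "not_visible": curr.total - curr.visible,
    }


prev = QueueSample(sent=1200, deleted=1190, total=10, visible=8)
curr = QueueSample(sent=1800, deleted=1490, total=310, visible=300)
print(summarize(prev, curr, interval_seconds=60))
# {'send_rate': 10.0, 'delete_rate': 5.0, 'net_growth_rate': 5.0, 'not_visible': 10}
```

A sustained positive `net_growth_rate` on a queue is the Postgres equivalent of a steadily rising SQS queue length.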

Worker Pool Controller (Kubernetes)

For Kubernetes worker pool deployments, you can monitor the worker pool controller using Prometheus metrics. These metrics are available in the spacelift_workerpool_controller namespace. See the Controller metrics section for more details.

Metric	Description
`spacelift_workerpool_controller_run_startup_duration_seconds` (histogram)	Time between when a job assignment is received and the worker container is started
`spacelift_workerpool_controller_worker_creation_errors_total` (counter)	Total number of worker creation errors
`spacelift_workerpool_controller_worker_idle_total` (gauge)	Number of idle workers
`spacelift_workerpool_controller_worker_total` (gauge)	Total number of workers
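As a rough example of using the two worker gauges together, the sketch below parses metrics in the Prometheus text exposition format and estimates worker utilization. It assumes that busy workers equal total workers minus idle workers, which is an interpretation of the descriptions above, and that sample lines carry no trailing timestamp. The parser is deliberately minimal (it sums samples across label sets) and the function names and sample values are hypothetical.

```python
def parse_samples(text: str) -> dict:
    """Sum sample values per metric name from Prometheus text format.

    Minimal on purpose: skips comments, ignores label contents, and
    assumes lines end with the value (no timestamp).
    """
    totals: dict = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0].split(" ", 1)[0]
        value = float(line.rsplit(" ", 1)[1])
        totals[name] = totals.get(name, 0.0) + value
    return totals


def worker_utilization(samples: dict) -> float:
    """Fraction of workers that are not idle (0.0 when there are no workers)."""
    total = samples.get("spacelift_workerpool_controller_worker_total", 0.0)
    idle = samples.get("spacelift_workerpool_controller_worker_idle_total", 0.0)
    if total <= 0:
        return 0.0
    return (total - idle) / total


scrape = """
# TYPE spacelift_workerpool_controller_worker_total gauge
spacelift_workerpool_controller_worker_total 10
# TYPE spacelift_workerpool_controller_worker_idle_total gauge
spacelift_workerpool_controller_worker_idle_total 2
"""
print(worker_utilization(parse_samples(scrape)))  # 0.8
```

In practice you would usually express the same ratio as a query in your Prometheus-compatible backend; the Python version only spells out the arithmetic.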

Telemetry

Telemetry and tracing can help diagnose complex issues but are not required for basic monitoring. If you decide to implement tracing:

Configure an appropriate backend (Datadog, AWS X-Ray, or OpenTelemetry).
Focus on high-value traces (API requests, run execution, etc.).
Use sampling in production to reduce overhead.

Refer to the Telemetry reference for configuration options.

Logging

Setting up proper log collection is strongly recommended - it’s a key part of running a healthy self-hosted installation. Without it, identifying and fixing issues becomes much harder and more time-consuming.

The sections below outline what logs are available and how to collect them across the different components of your Spacelift setup.

Core services

All three core services (server, scheduler, and drain) log to stdout and stderr. We at Spacelift primarily use traces for debugging, so you won't find many "info" level logs. Errors and terminal failures, however, are logged.

Docker-based worker pools

Our Docker-based worker pools log to the /var/log/spacelift/error.log and /var/log/spacelift/info.log files.

Note that if startup fails, the worker terminates immediately, so you won't have a chance to see the logs. We provide an option to not terminate on failure for the following two deployment types:

CloudFormation - the worker pool deployment stack has a PowerOffOnError variable. If set to false, the worker pool will not terminate on startup failure.
terraform-aws-spacelift-workerpool-on-ec2 Terraform module - this module has a selfhosted_configuration variable that must be provided for self-hosted installations. The variable has an embedded power_off_on_error field.

Kubernetes-based worker pools

The Kubernetes-based worker pools log to stdout and stderr. The documentation has a dedicated section on troubleshooting that provides more details on how to retrieve logs. You can use any Kubernetes log collection tool (e.g., Fluentd, Fluent Bit, Loki) to collect and aggregate these logs.