The Threat Stack Agent is engineered to provide maximum functionality with the lowest possible resource consumption and system impact. We’re continuously optimizing the agent’s performance and resource consumption to ensure as small a footprint as possible. This document outlines various agent performance scenarios using Threat Stack’s production infrastructure.
Agent Performance Overview:
Our agent’s components are designed to be as efficient with system resources as possible. We include additional, dynamic behaviors that allow us to adapt to various system load and configuration profiles.
For instance, on multi-core systems, the agent will avoid binding to core 0, which is typically where network and file I/O are bound on Linux systems, to minimize the chance of introducing blocking.
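As an illustration of the idea of steering a process away from core 0, here is a minimal Python sketch. It is not the agent's implementation; the helper names `cores_excluding_zero` and `pin_current_process` are hypothetical, and the code assumes a Linux host, where `os.sched_getaffinity` and `os.sched_setaffinity` are available.

```python
import os


def cores_excluding_zero(available):
    """Return the cores to pin to: everything except core 0 when there is a choice."""
    available = set(available)
    preferred = available - {0}
    # On a single-core system (or when only core 0 is allowed) there is nothing to avoid.
    return preferred if preferred else available


def pin_current_process():
    """Pin the calling process away from core 0 (Linux only)."""
    allowed = os.sched_getaffinity(0)   # cores this process may currently use
    target = cores_excluding_zero(allowed)
    os.sched_setaffinity(0, target)     # pid 0 means "the calling process"
    return target
```

Falling back to the full set when core 0 is the only option keeps the sketch safe on single-core systems, which matches the "on multi-core systems" qualifier above.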
We use adaptive CPU profiling to throttle our activities when load levels are high, for instance when the kernel audit framework begins to emit a large number of events at high frequency. There are configuration parameters that can be set to adjust these levels to allow better tuning for your particular system.
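To make the throttling idea concrete, the sketch below shows one simple way a worker could hold itself under a CPU cap: measure how much CPU time a batch consumed, then sleep long enough that the busy fraction stays at or below the cap. This is a fixed duty-cycle throttle written for illustration only, not the agent's adaptive algorithm. The names `sleep_needed`, `run_throttled`, and `handle` are hypothetical, and the default of 0.40 simply mirrors the 40% default cap mentioned in this document.

```python
import time


def sleep_needed(busy_seconds, cap):
    """Idle time required so that busy / (busy + idle) does not exceed cap."""
    if not 0 < cap <= 1:
        raise ValueError("cap must be in (0, 1]")
    return max(0.0, busy_seconds * (1.0 - cap) / cap)


def run_throttled(batches, handle, cap=0.40):
    """Process batches, sleeping after each one to stay under the CPU cap."""
    for batch in batches:
        start = time.process_time()          # CPU time used by this process so far
        handle(batch)                        # hypothetical per-batch event handler
        busy = time.process_time() - start
        time.sleep(sleep_needed(busy, cap))  # yield the core for the rest of the cycle
```

With a cap of 0.40, a batch that uses 0.4 seconds of CPU time is followed by roughly 0.6 seconds of sleep, so the process averages about 40% of one core.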
Processes, CPU, and Memory Usage:
The Threat Stack Agent is a composite entity, made up of several components that you can view in your process viewer of choice:
- cloudsight-main: the main process supervisor and command-and-control component
- cloudsight-worker: worker process spawned by cloudsight-main to manage inputs from various sensors
- tsauditd: our replacement for auditd that consumes, processes, and transforms raw kernel audit events with high performance, low latency, and minimal compute resource usage
- tsfim: sensor to perform targeted Filesystem Integrity Monitoring (FIM) based on Rules created in the Cloud Security Platform
- tscontainersd: sensor to gather events from Docker containers
At any given time, depending on your feature plan level and configuration, you will see some or all of these processes running on your server.
When evaluating our agent’s CPU and memory consumption profile, keep the following scenarios in mind.
- High number of kernel audit messages: Certain workloads, especially those that fork or execve subprocesses at extremely high rates, will generate an increased number of kernel audit messages. If these subprocesses are short-lived, as with a forking process manager, this can generate many events per second. Our agent will attempt to decode, process, and transform all of these events and keep up with the output rate of the audit framework, up to the CPU limits we’ve placed (by default, we cap at 40% CPU utilization of the core we’re running on).
In addition to increased CPU load, you may notice increased memory load as our internal caches and batching of events will grow to accommodate the increased flow of events.
- Broad FIM rules: Customers can create FIM rules in the Cloud Security Platform, which are then delivered to agents. We place very few restrictions on these watches, so there exists a potential for a customer to create a FIM rule to monitor an extremely busy filesystem location or one that may be too broad in scope. We leverage built-in Linux filesystem APIs (inotify and fanotify) and each file, directory, or combination will generate additional events, consuming CPU and memory resources.
- Agent connectivity issues: Our agent maintains a persistent connection to our data ingestion services to ensure rapid event delivery. If you experience network connectivity issues, our agent will attempt to cache events to a local file. When connectivity resumes, we continue to ingest and send new events, while sending up the cached events. This can cause increased CPU and memory consumption as we process the additional load. After the event stream returns to normal, we’ll start flushing internal caches, and memory usage should decrease. In some cases, the agent must be restarted to resolve connectivity problems due to underlying networking issues.
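For the broad FIM rules scenario above, one rough way to gauge how wide a candidate path is before creating a rule is to count the directories beneath it. inotify watches are not recursive, so monitoring a tree with inotify generally needs one watch per directory. The sketch below is an estimate only and does not reflect how tsfim actually registers watches; `count_directories` and `max_user_watches` are hypothetical helper names. The kernel's per-user inotify limit is read from `/proc/sys/fs/inotify/max_user_watches` for comparison.

```python
import os


def count_directories(root):
    """Count root plus every directory beneath it (symlinked directories are not followed)."""
    total = 0
    for _dirpath, _dirnames, _filenames in os.walk(root, followlinks=False):
        total += 1
    return total


def max_user_watches(path="/proc/sys/fs/inotify/max_user_watches"):
    """Return the per-user inotify watch limit, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None
```

A path whose directory count is a large fraction of the watch limit, or that sits on a very busy filesystem location, is a good candidate for narrowing before it becomes a rule.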
When reviewing our memory consumption, it’s important to focus on the resident (RES) memory of our agent’s components rather than their virtual memory usage. cloudsight-main and cloudsight-worker are the components that will generally see increases in memory usage as event counts increase. Those processes garbage collect periodically, so you may see periods of increased memory consumption followed by a steep drop back to more “normal” levels.
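To compare resident and virtual memory for a single process on Linux, you can read the `VmRSS` and `VmSize` fields from `/proc/<pid>/status`. The sketch below is illustrative only; `parse_memory_kib` and `read_memory_kib` are hypothetical helper names, and you supply the PID of the agent component you want to inspect.

```python
def parse_memory_kib(status_text):
    """Extract VmRSS and VmSize (both reported in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmRSS", "VmSize"):
            fields[key] = int(value.split()[0])  # first token is the number, second is "kB"
    return fields


def read_memory_kib(pid):
    """Read resident (VmRSS) and virtual (VmSize) memory for a running process."""
    with open(f"/proc/{pid}/status") as f:
        return parse_memory_kib(f.read())
```

`VmRSS` corresponds to the RES column in top, which is the figure to track over time. `VmSize` is the virtual size and is usually much larger.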
The Threat Stack Cloud Security Platform (CSP) gives you insight into process execution behavior at two levels. Globally, the Events view aggregates information on all processes executing across your fleet. Locally, Servers > Server Details shows process execution for an individual server, including, at a granular level, which processes are generating the most events. With that information, we can offer additional tuning to reduce the event flow (for example, by filtering out events that are unimportant to you).
Agent Performance at Threat Stack:
At Threat Stack, we run the current production version of the agent on our production infrastructure — every system we run to provide the Threat Stack platform is running our agent. Our development environment runs a mixture of production and development agents. For the purposes of this document, we’ll be using statistics from our production environment to describe our agent’s performance. We do no special tuning of the agent for our platform and run the exact same code as customers.