Reducing Kafka connections by 10x with a sidecar pattern

November 1, 2025


Kafka clients open a TCP connection to each broker that hosts a partition they must write to. If an app uses multiprocessing, the connection count multiplies quickly: our Kafka library isn’t fork-safe, so each subprocess maintains its own set of connections.

At a large scale, with many partitions across many brokers, plus dozens of worker processes per pod and hundreds of pods, this can easily balloon into hundreds of thousands of live TCP connections during peak hours. As a toy example, 1000 pods * 10 subprocesses * 10 brokers is 100,000 total connections. 
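For illustration, here is a minimal sketch of the pattern that produces that fan-out, assuming a confluent-kafka-style Python producer (our actual library isn’t named in this post): because the producer can’t be shared across a fork, every worker process builds its own, and each of those opens its own connections to the brokers it writes to.

from multiprocessing import Process
from confluent_kafka import Producer  # illustrative; any non-fork-safe client behaves similarly

def worker(events):
    # Re-created in every subprocess because the parent's producer can't be shared across fork()
    producer = Producer({"bootstrap.servers": "broker-1:9092,broker-2:9092"})
    for topic, key, value in events:
        producer.produce(topic, key=key, value=value)
    producer.flush()

if __name__ == "__main__":
    batches = [[("events", b"key", b"value")] for _ in range(10)]
    # 10 workers per pod => roughly 10x the broker connections of a single shared producer
    workers = [Process(target=worker, args=(batch,)) for batch in batches]
    for p in workers:
        p.start()
    for p in workers:
        p.join()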

Besides socket buffers, brokers maintain per-connection metadata. Together, that increases heap usage and, under pressure, triggers aggressive garbage collection (GC). GC spikes CPU, and CPU spikes hurt availability, leading to timeouts and potential data loss.

What didn’t work (and why)

“Just add more brokers?”

Adding brokers increases total cluster resources and often reduces per‑broker load. However, producers keep persistent TCP connections to the brokers that host the leader partitions they write to. 

As partitions are rebalanced and spread across more brokers, each producer ends up connecting to more of them. Total connection cardinality across the cluster is therefore likely to increase, and adding brokers does nothing to fix the multiplicative effect from many processes per pod.

“Just use bigger brokers?”

Kafka brokers typically run with a relatively small JVM heap and rely on the operating system page cache for throughput. Industry guidance commonly keeps the broker heap around 4-8 GB. 

Adding more machine RAM mainly increases page cache, not heap, unless you explicitly raise -Xmx (which usually isn’t beneficial beyond roughly 6-8 GB because of GC tradeoffs).

“Just add a proxy?”

Database pools like PgBouncer work because SQL sessions can be pooled and reused at transaction boundaries, allowing many clients to share a smaller set of database connections. 

Kafka’s protocol, by contrast, is stateful across the entire client session and assumes a single long-lived client per broker connection for correctness and throughput, so a generic pooling proxy can’t transparently multiplex many producers onto fewer connections.

“Just tune the client?”

Tuning linger, idle connection timeouts, batching, and network buffers helped a bit, but didn’t address the core multiplicative effect from multiprocessing.
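For reference, the knobs in question look roughly like this (standard producer settings with illustrative values; the exact configuration we used isn’t in this post):

from confluent_kafka import Producer  # illustrative client, as above

producer = Producer({
    "bootstrap.servers": "broker-1:9092",
    "linger.ms": 50,                    # wait up to 50 ms to build larger batches
    "batch.size": 131072,               # larger batches mean fewer requests per connection
    "compression.type": "lz4",          # shrink payloads on the wire
    "connections.max.idle.ms": 60000,   # drop idle broker connections sooner
})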

The solution: A producer sidecar per pod

We introduced a gRPC sidecar that runs a single Kafka producer for a pod. Application processes (including forked workers) send emit RPCs to the sidecar instead of opening broker connections.

Why it helped

  • Fewer connections: Consolidates “many producers per pod” into a single producer per pod. In multiprocessing workloads, this can be an order-of-magnitude reduction.

  • Preserves ordering: Within a pod, all writes to a partition flow through the same producer, preserving Kafka’s partition-level ordering guarantees.

  • Scales horizontally: Each pod has its own sidecar. Scaling the deployment scales sidecars linearly without reintroducing per-process fan-out.

  • Safe fallback: If the sidecar is unavailable, the client can fall back to direct emission to Kafka to avoid data loss. This is rare in practice and limited to the affected pod.

  • Lightweight: CPU and memory overhead of the sidecar are tiny for our producer workloads.

  • Minimal refactoring: We needed the solution to be as “drop-in” as possible. We didn’t want to risk introducing bugs by refactoring existing services, many of which are mission-critical, to support some new pattern. Adopting the sidecar only required adjustments to configuration, not code.

[Figure: before and after connection topology with a producer sidecar]

Results

After migrating our largest deployments to the sidecar:

  • Peak producer connections: Down about 10x cluster-wide.

  • Broker heap usage: Down about 70 percentage points at peak.

  • CPU spikes: The recurring spikes we saw during peak traffic windows disappeared post-rollout.

Implementation notes to adapt to your stack

  • Client toggle: Add a “sidecar mode” flag to your Kafka client wrapper. In sidecar mode, the wrapper uses gRPC to emit events to the local sidecar; a sketch of such a wrapper follows this list.

  • Backpressure and retries: The sidecar producer buffers events and enforces per-partition ordering; when its buffer fills, the ack signals the client to back off and retry (the retry_after_ms field below).

  • Metrics: Record metrics from both client wrapper and sidecar: request counts, queue depth, batch size, latency percentiles, and error rates.

  • SLOs: Define SLOs on sidecar availability and tail latency. Alert on sustained fallback to direct Kafka emits.

  • Compression and batching: Consolidation can increase average batch size, which improves compression and reduces network bandwidth.

  • Rollout: Start with the highest connection offenders (multiprocessing producers). Roll out pod by pod with a feature flag. Monitor per-broker connections, GC, and client success rates.
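As a concrete example of the client toggle, here is a minimal sketch of a wrapper in Python. The environment variables, port, and generated gRPC module names (producer_sidecar_pb2, ProducerSidecarStub) are illustrative and assume stubs generated from the interface sketched in the next section.

import os
import grpc
from confluent_kafka import Producer
# Assumed to be generated from the ProducerSidecar service below
import producer_sidecar_pb2
import producer_sidecar_pb2_grpc

SIDECAR_MODE = os.environ.get("KAFKA_SIDECAR_MODE") == "true"

class EventEmitter:
    def __init__(self):
        if SIDECAR_MODE:
            # Talk to the sidecar over loopback; no broker connections in this process.
            channel = grpc.insecure_channel("localhost:50051")
            self._stub = producer_sidecar_pb2_grpc.ProducerSidecarStub(channel)
        else:
            # Fallback path: direct emission, one producer (and its connections) per process.
            self._producer = Producer({"bootstrap.servers": os.environ["KAFKA_BROKERS"]})

    def emit(self, topic, key, event_bytes):
        if SIDECAR_MODE:
            request = producer_sidecar_pb2.ProduceRequest(topic=topic, key=key, event_bytes=event_bytes)
            return self._stub.Emit(request)
        self._producer.produce(topic, key=key, value=event_bytes)
        self._producer.poll(0)  # serve any pending delivery callbacks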

Minimal interface (pseudocode)

service ProducerSidecar {
  rpc Emit(ProduceRequest) returns (ProduceAck);
}

ProduceRequest: topic, key (optional), headers, event_bytes
ProduceAck/Result: received=true/false, error_code, retry_after_ms
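And a sketch of the sidecar side under the same assumptions: a single producer owned by the gRPC servicer, shared by every worker process in the pod. Field and message names mirror the pseudocode above and are illustrative; error handling is simplified.

from concurrent import futures
import grpc
from confluent_kafka import Producer
import producer_sidecar_pb2
import producer_sidecar_pb2_grpc

class ProducerSidecar(producer_sidecar_pb2_grpc.ProducerSidecarServicer):
    def __init__(self):
        # The only Kafka producer in the pod: one set of broker connections in total.
        self._producer = Producer({"bootstrap.servers": "broker-1:9092"})

    def Emit(self, request, context):
        try:
            self._producer.produce(request.topic, key=request.key or None, value=request.event_bytes)
            self._producer.poll(0)  # make progress on delivery callbacks without blocking
            return producer_sidecar_pb2.ProduceAck(received=True)
        except BufferError:
            # Local queue is full: ask the caller to back off and retry.
            return producer_sidecar_pb2.ProduceAck(received=False, error_code="QUEUE_FULL", retry_after_ms=100)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    producer_sidecar_pb2_grpc.add_ProducerSidecarServicer_to_server(ProducerSidecar(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()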

What we built along the way (and kept)

  • Client and sidecar metrics: Visibility into end-to-end emit latency, success rates, and queueing. These metrics enable us to set up alerts and understand how our system performs over time. 

  • Emission quality tracking: Record an attempt when emitting an event and a success when the broker acks. Use a lightweight store to aggregate by app and topic. We enabled this for direct Kafka emits to get a baseline “success rate” for asynchronous event emission. We could then compare the old way to our sidecar’s “success rate” as we rolled out to ensure no degradation.

  • Native delivery reports: While the sidecar pattern is mostly drop-in, one lost feature is delivery reports, which tell us whether the broker successfully received the emission. We use them to trigger fallbacks to a secondary event store, log errors, or increment our Emission Quality Tracking success counter.

    But since the callbacks are now tied to the sidecar Producer, we do not have access to the callback on the worker side. We implemented a separate streaming endpoint on our gRPC server to solve this. This endpoint sends delivery reports from the sidecar back to the primary worker container, allowing clients to leverage custom delivery callbacks written in whatever language the worker container uses. The sidecar maintains a ring buffer, continuously sending delivery reports over the gRPC stream. At the same time, the worker runs a separate thread to handle processing these reports and running the callbacks.
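A rough sketch of the worker-side report loop, with the streaming RPC and report fields invented for illustration (the post doesn’t spell them out): a daemon thread consumes the stream and dispatches each report to a registered callback.

import threading
import grpc
import producer_sidecar_pb2
import producer_sidecar_pb2_grpc

def delivery_report_loop(channel, callbacks):
    # Hypothetical server-streaming RPC: the sidecar drains its ring buffer into this stream.
    stub = producer_sidecar_pb2_grpc.ProducerSidecarStub(channel)
    for report in stub.StreamDeliveryReports(producer_sidecar_pb2.DeliveryReportRequest()):
        callback = callbacks.get(report.topic)
        if callback is not None:
            # e.g. log an error, fall back to a secondary store, or bump a success counter
            callback(report)

channel = grpc.insecure_channel("localhost:50051")
callbacks = {"events": lambda report: None}  # placeholder callback
threading.Thread(target=delivery_report_loop, args=(channel, callbacks), daemon=True).start()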

Takeaways

  1. Count connections, not just throughput. Multiprocessing can silently multiply broker connections, increasing the broker’s memory usage.

  2. Operationally, boring is better than clever. A per-pod sidecar keeps blast radius small and failure semantics clear. Additionally, adopting the sidecar only requires configuration changes, avoiding extra complexity from refactoring.

  3. Metrics first. Measuring attempt-to-ack success and tail latency before and after adopting the sidecar pattern gives you confidence as you roll out.

Credits: Robinhood’s talk on their consumer-side Kafkaproxy sidecar was a major inspiration to us. Additionally, a big thanks to all members of Team Events for their contributions to this project: Mahsa Khoshab, Sugat Mahanti, and Sarah Story.


