Skip to content
Back to Interview Guides
Interview Guide

30 Advanced Apache Kafka Interview Questions for Experienced Developers

· 17 min read
Apache Kafka Q&A Component

Jump to Category

⚙️ Core Architecture & Internals Producers & Consumers
✔️ Delivery Semantics & Transactions Kafka Streams & ksqlDB
️ Operations & Ecosystem

Core Architecture & Internals

1. What is the role of the Controller broker in a Kafka cluster?

The **Controller** is one of the brokers in the cluster that is elected to be responsible for cluster-wide administrative tasks. Its primary responsibility is managing the state of partitions and replicas and performing leader elections.

When a broker goes down, the controller detects this (via ZooKeeper or KRaft) and elects a new leader from the in-sync replicas for all partitions that had their leader on the failed broker. It then communicates this change to all other brokers. There is only one active controller in a cluster at any given time.

Learn more about the Kafka Controller Broker.

2. Explain log compaction. What is its use case?

Log compaction** is a retention policy that ensures Kafka retains at least the *last known value* for each message key within a topic partition. Instead of deleting old log segments based on time or size, compaction periodically runs in the background, removing records if a more recent record with the same key has appeared in the log.

Its primary use case is for maintaining a full snapshot of state. For example, it’s ideal for storing the latest state of database records (change data capture), user profiles, or entity snapshots, as you can always replay the topic to restore the latest state for every key.

Read the official documentation on Log Compaction.

3. How does Kafka store data on disk? What is a log segment?

Kafka stores all messages for a topic partition in a directory on disk. This directory contains a set of **log segments**. A log segment is a file that contains a chunk of the partition’s messages. Once a segment reaches a certain size (e.g., 1 GB) or age, it is closed, and a new active segment is created. This approach allows Kafka to easily delete old data by simply deleting the entire segment file, which is much more efficient than deleting individual records.

Each data segment file has a corresponding index file to quickly locate messages by their offset.

4. What is the difference between ZooKeeper-based and KRaft-based Kafka clusters?

ZooKeeper-based clusters rely on an external Apache ZooKeeper ensemble to store cluster metadata, such as broker information, topic configurations, and ACLs. The controller broker is elected via ZooKeeper. This adds operational complexity as you have to manage a separate distributed system.

KRaft (Kafka Raft) mode** removes the ZooKeeper dependency. The cluster metadata is stored internally within a Kafka topic itself, and the brokers use the Raft consensus protocol to elect a controller and maintain consistent metadata. This simplifies Kafka’s architecture, reduces operational overhead, and allows for faster controller failover.

Read the blog post on Kafka without ZooKeeper.

5. What is the role of an In-Sync Replica (ISR)?

An **In-Sync Replica** is a replica of a partition that is fully caught up with the leader. To be in the ISR, a replica must have fetched all messages from the leader within a configured time limit (`replica.lag.time.max.ms`).

The ISR list is crucial for data durability and availability. When a producer uses `acks=all`, the leader will only acknowledge the write after all replicas in the ISR have confirmed they have received the message. Furthermore, only replicas in the ISR are eligible to be elected as the new leader if the current leader fails.

6. What is the Page Cache and why is it important for Kafka’s performance?

The **Page Cache** is memory managed by the operating system that caches recently accessed data from disk. Kafka leverages the OS page cache heavily instead of maintaining its own cache in the JVM heap. When data is read or written, it’s served from or written to this highly efficient cache.

This approach simplifies Kafka’s design, reduces the burden on the JVM’s garbage collector, and allows Kafka to efficiently use all available free memory on a server for caching, leading to extremely fast read/write performance for “hot” data.

Producers & Consumers

7. Explain the producer’s `acks` configuration.

The `acks` setting controls the level of acknowledgment the producer requires from the broker before considering a request complete. This is the primary lever for tuning durability.

  • `acks=0`: The producer does not wait for any acknowledgment. This offers the highest throughput but risks data loss.
  • `acks=1` (default): The producer waits for an acknowledgment from the partition’s leader broker. The write is considered successful if the leader receives it, but data could be lost if the leader fails before replicating it.
  • `acks=all` (or `-1`): The producer waits for an acknowledgment from the leader *and* all in-sync replicas. This provides the strongest durability guarantee but has the highest latency.
Learn more about Producer Acknowledgements.

8. What is an idempotent producer? How do you enable it?

An **idempotent producer** ensures that messages are written to a partition exactly once, even if the producer retries sending a message due to a network error. It prevents duplicate messages from being written during retries.

This is achieved by assigning a Producer ID (PID) and a sequence number to each message. The broker keeps track of the latest sequence number for each PID and will discard any message with a number less than or equal to the one it has already seen. You enable it by setting `enable.idempotence=true` in the producer configuration.

9. How does a consumer group rebalance work?

A **rebalance** is the process where the partitions of a topic are reassigned among the consumers in a consumer group. A rebalance is triggered when:

  • A new consumer joins the group.
  • An existing consumer leaves the group (either cleanly or by crashing).
  • The topic’s partition count changes.

During a rebalance, all consumers stop processing, revoke their currently assigned partitions, and the **group coordinator** (a broker) reassigns the partitions among all current members according to a partition assignment strategy. This can cause a short pause in consumption, often called a “stop-the-world” event for the consumer group.

Read a blog post on the Rebalance Protocol.

10. What is the role of `max.poll.interval.ms` and `max.poll.records`?

  • `max.poll.records`: The maximum number of records returned in a single call to `poll()`. It helps control how much data your application processes in each polling loop.
  • `max.poll.interval.ms`: The maximum amount of time allowed between calls to `poll()`. If a consumer fails to call `poll()` within this interval (e.g., because its processing logic is stuck or too slow), it is considered to have failed. The group coordinator will kick it out of the group and trigger a rebalance.

Tuning these is a trade-off between throughput and responsiveness to rebalancing.

11. What is a custom partitioner and when would you need one?

A custom partitioner is a class you can provide to a producer to control how messages are assigned to partitions. By default, if a message has a key, Kafka uses a hash of the key to choose a partition (guaranteeing messages with the same key go to the same partition). If the key is null, it uses a sticky round-robin strategy.

You would need a custom partitioner if you have specific business logic for partitioning that isn’t based on the message key. For example, you might want to partition based on a value in the message’s header or payload to ensure locality for a specific data attribute.

12. Explain how a consumer commits offsets, comparing automatic and manual commits.

Committing an offset marks a message as having been processed by a consumer group. This is the position where the consumer will resume from after a restart or rebalance.

  • Automatic Commit (`enable.auto.commit=true`): The consumer client automatically commits the latest offset returned by the `poll()` method at a regular interval (`auto.commit.interval.ms`). This is convenient but can lead to missed or duplicate messages if the consumer crashes after committing but before processing.
  • Manual Commit: The developer is responsible for committing offsets by calling `commitSync()` or `commitAsync()`. This provides much better control over delivery semantics, allowing you to commit only after your processing logic is complete.

Delivery Semantics & Transactions

13. Explain the three message delivery semantics: at-most-once, at-least-once, and exactly-once.

  • At-most-once: Messages may be lost but are never redelivered. This can be achieved by having the producer use `acks=0` or by having a consumer commit offsets before processing messages.
  • At-least-once (default): Messages are never lost but may be redelivered. This is achieved by having the producer use `acks=1` or `acks=all` and having the consumer commit offsets *after* processing messages. If the consumer crashes after processing but before committing, it will re-process the message upon restart.
  • Exactly-once (EOS): Messages are delivered and processed exactly one time. This is the strongest guarantee.

14. How does Kafka achieve Exactly-Once Semantics (EOS)?

EOS in Kafka is achieved through a combination of two key features:

  1. Idempotent Producer: As described earlier, this prevents duplicate messages from being written by the producer during retries (`enable.idempotence=true`).
  2. Transactional API: For read-process-write patterns (like in Kafka Streams), the producer can wrap a series of operations in a transaction. It can consume messages from a source topic, produce messages to a destination topic, and commit the consumer offsets all as a single, atomic operation. If any part fails, the entire transaction is aborted. Consumers configured with `isolation.level=read_committed` will only see messages from committed transactions.
Read the definitive guide on Exactly-Once Semantics.

15. What is the role of the transaction coordinator?

The transaction coordinator is a broker in the cluster responsible for managing transactions for a given transactional producer. When a producer initiates a transaction, it registers with the coordinator. The coordinator keeps track of the state of the transaction (ongoing, preparing to commit, committed, aborted) and writes this state to an internal transaction log. It orchestrates the two-phase commit protocol across the brokers involved in the transaction to ensure atomicity.

Kafka Streams & ksqlDB

16. What is the difference between a KStream and a KTable? Explain their duality.

  • A **KStream** is an unbounded, append-only stream of records representing a sequence of events. Each record is an independent piece of data. Think of it as a history of “what happened.”
  • A **KTable** is a representation of a stream of updates. It is interpreted as a changelog where records with the same key are considered updates to the previous record. It represents the latest state for each key. Think of it as a table of “what is the current value.”

The **duality** means you can convert between them: you can aggregate a KStream to create a KTable (e.g., counting events), and you can convert a KTable into a KStream to see its history of changes.

Read about the Duality of Streams and Tables.

17. What is a state store in Kafka Streams?

A state store is a database used by Kafka Streams applications to store and query state required for stateful operations like joins, aggregations, and windowing. By default, Kafka Streams uses an embedded **RocksDB** instance for persistent state stores. These state stores are backed up by a changelog topic in Kafka, which provides fault tolerance. If an application instance fails, a new instance can restore its state by replaying the changelog topic.

18. Explain the different types of windowing available in Kafka Streams.

  • Tumbling Windows: Fixed-size, non-overlapping, gap-less windows. A record belongs to exactly one window. Useful for fixed-interval analysis (e.g., counts per minute).
  • Hopping Windows: Fixed-size, overlapping windows. A record can belong to multiple windows. Defined by a size and an advance interval. Useful for moving averages.
  • Sliding Windows: Fixed-size windows that “slide” over the data. A new window is created for each new record, and it compares records within a fixed time difference. Useful for joining streams on a time-based condition.
  • Session Windows: Key-based windows with a dynamic size. They group records for a key that occur within a defined “inactivity gap.” Useful for analyzing user sessions.

19. How does Kafka Streams handle joins between two streams?

A `KStream-KStream` join requires a **join window**. When a record arrives on one stream, it is held in a state store and compared against any records that have arrived on the other stream within the defined time window. This is necessary because, unlike in a database, there is no guarantee that matching records will arrive at the same time.

20. What is ksqlDB?

ksqlDB is an event streaming database built on top of Kafka and Kafka Streams. It allows you to build stream processing applications using a familiar SQL-like syntax. You can run continuous queries that process, filter, aggregate, and join streams of data in real-time, without writing Java or Scala code. It provides a powerful, high-level abstraction for stream processing.

Operations & Ecosystem

21. What is Schema Registry and why is it important?

The **Schema Registry** is a centralized repository for managing and validating schemas (like Avro, Protobuf, or JSON Schema) for your Kafka topics. It provides a way to enforce data governance and ensure that producers and consumers are compatible.

It’s important because it allows for **schema evolution**. You can evolve your schemas over time (e.g., by adding a new optional field) in a safe, backward- or forward-compatible way. The producer will serialize data with a specific schema version ID, and the consumer can fetch that schema from the registry to deserialize the data correctly, even if it was written with an older schema.

Read the Confluent Schema Registry documentation.

22. What is Kafka Connect?

Kafka Connect is a framework for reliably streaming data between Apache Kafka and other systems. It is a separate, scalable service that runs “connectors.”

  • Source Connectors import data from an external system (like a database, S3, or MQTT broker) into Kafka topics.
  • Sink Connectors export data from Kafka topics to an external system (like Elasticsearch, a data warehouse, or HDFS).

It simplifies data integration by providing a fault-tolerant, declarative way to move data without writing custom producer/consumer code.

23. How would you increase the number of partitions for a topic? What are the implications?

You can increase the number of partitions for an existing topic using the `kafka-topics.sh –alter` command. This is a common way to increase the parallelism and throughput of a topic.

Implications:

  • This operation is irreversible; you cannot decrease the number of partitions.
  • It can break the ordering guarantees for messages with the same key. A key that was previously mapped to partition 1 might now be mapped to a new partition (e.g., partition 5). This means you cannot assume that all messages for a given key will be in the same partition if you increase the partition count.

24. What are some key metrics to monitor in a Kafka cluster?

  • Broker Metrics: Under-replicated partitions, CPU/Memory usage, disk usage, network throughput.
  • Producer Metrics: Record send rate, request latency, batch size average.
  • Consumer Metrics: Consumer lag (the most critical one), records-consumed-rate, join/leave rate (indicates rebalancing).
  • Topic Metrics: Bytes in/out per second, total log size.

25. What is the difference between message compression and batching?

Both are producer-side optimizations to improve throughput.

  • Batching (`batch.size`, `linger.ms`): The producer groups multiple messages destined for the same partition into a single “batch” before sending them. This reduces the number of network requests and improves efficiency.
  • Compression (`compression.type`): The producer compresses an entire batch of messages before sending it to the broker. This reduces the size of the data sent over the network and stored on disk, at the cost of some CPU overhead for compression/decompression.

They work together: batching creates larger chunks of data, which makes compression much more effective.

26. How do you handle poison pill messages?

A “poison pill” is a message that a consumer cannot process, causing it to crash or get stuck in a consumption loop. A common strategy is to implement a Dead Letter Queue (DLQ).

In your consumer logic, you wrap the processing in a `try-catch` block. If an unrecoverable error occurs, instead of crashing, you catch the exception, send the problematic message to a separate DLQ topic for later analysis, and then manually commit the offset for the original message to move on.

27. What is a “zombie” instance in Kafka Streams?

A “zombie” instance is a Kafka Streams application instance that has been fenced off by the group coordinator but continues to run. This can happen if the instance experienced a long GC pause, exceeding the session timeout. The coordinator assumes it’s dead and triggers a rebalance. If the original instance then resumes, it might try to write to its state stores, which can lead to corruption. This is prevented by a “fencing” mechanism where each new generation of an instance gets a new ID, and brokers will reject any requests from an older, fenced-off generation.

28. What is the difference between a broker and a cluster?

A **broker** is a single Kafka server instance. It is responsible for receiving messages from producers, storing them, and serving them to consumers. A **cluster** is a group of one or more brokers working together. The cluster provides fault tolerance and scalability. Data is partitioned and replicated across the brokers in the cluster.

29. What are leader and follower epochs?

A leader epoch is a counter that is incremented every time a new leader is elected for a partition. This mechanism is used to protect against “zombie leaders” — a former leader that was temporarily disconnected from the cluster and is unaware that a new leader has been elected. When the old leader reconnects, replicas will reject its requests because its leader epoch is stale (lower than the current epoch), preventing data inconsistency.

30. How would you design a topic for high throughput vs. low latency?

  • For High Throughput: You would optimize for sending large batches of data.
    • Increase the producer’s `linger.ms` and `batch.size` to encourage larger batches.
    • Enable compression (`gzip` or `zstd`) to reduce network and disk I/O.
    • Increase the number of partitions to allow for greater parallelism.
  • For Low Latency: You would optimize for sending messages immediately.
    • Set the producer’s `linger.ms` to 0 to disable any artificial delay.
    • Disable compression, as it adds a small amount of CPU latency.
    • Use a lower `acks` setting if some risk of data loss is acceptable.

Skip the interview marathon.

We pre-vet senior engineers across Asia using these exact questions and more. Get matched in 24 hours, $0 upfront.

Get Pre-Vetted Talent
WhatsApp