Core Observability Principles
1. What is the difference between Monitoring and Observability?
**Monitoring** is the practice of collecting and analyzing data about a system to watch for pre-defined failure modes. It’s about asking known questions, like “Is the CPU usage over 90%?”. You set up dashboards and alerts for things you already know can go wrong.
**Observability** is a property of a system that allows you to understand its internal state from its external outputs. It’s about being able to ask *new* questions about your system to debug unknown problems (“unknown unknowns”). A system is observable if you can effectively troubleshoot novel issues using its logs, metrics, and traces without having to ship new code to get more information.
2. What are the three pillars of observability? How do they complement each other?
The three pillars are **Logs**, **Metrics**, and **Traces**.
- Metrics: Aggregated, numerical data over time (e.g., requests per second, error rate). They are great for dashboards and alerting, telling you *that* a problem is occurring.
- Traces: Represent the end-to-end journey of a single request as it flows through a distributed system. They are essential for pinpointing latency bottlenecks and understanding which service in a chain is failing, telling you *where* a problem is.
- Logs: Detailed, timestamped records of discrete events. Once a trace has shown you where a problem is, logs provide the high-granularity context to understand *why* it happened.
They work together: Metrics alert you to a problem, traces show you the location, and logs give you the detailed cause.
Read the Observability Primer from OpenTelemetry.
3. What is high-cardinality data and what challenge does it pose for observability systems?
**Cardinality** refers to the number of unique values for a given dimension or label. **High-cardinality** data is data with a very large number of unique values, such as user IDs, request IDs, or container IDs.
The challenge is that traditional metrics systems (time-series databases like Prometheus) are designed for low-cardinality data. Storing a unique time series for every single user ID would cause a “cardinality explosion,” overwhelming the database and making it extremely expensive and slow. Modern observability tools are being designed to handle high-cardinality data better, but it remains a primary challenge in balancing cost and granularity.
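As a rough illustration of why this explodes, the number of time series for one metric is bounded by the product of the unique-value counts of its labels. The label names and counts below are hypothetical, and the function gives an upper bound rather than what any particular database would actually store.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of unique values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single metric such as http_requests_total.
low = {"method": 5, "status_code": 10, "service": 20}
high = {**low, "user_id": 100_000}  # adding one high-cardinality label

print(max_series(low))
print(max_series(high))
```

Adding the single `user_id` label multiplies the bound by the number of users, which is why identifiers like this usually belong in logs or traces rather than in metric labels.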
4. What is the RED method for monitoring microservices?
The RED method is a framework for choosing the most important metrics for any request-driven service; it is closely related to the “golden signals” from Google’s SRE practice. It stands for:
- Rate: The number of requests the service is handling per second.
- Errors: The number of failing requests per second.
- Duration: The amount of time it takes to process a request, typically measured in distributions (e.g., 95th/99th percentiles).
By monitoring these three metrics for every service, you can get a consistent, high-level view of its health and performance.
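The three RED values can be sketched from a window of request records. This is a minimal illustration, assuming a hypothetical `Request` record and a nearest-rank 95th percentile; a real service would normally export these through a metrics library rather than compute them by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    failed: bool


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration (p95) for one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "errors_per_s": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": sum(r.failed for r in requests) / window_seconds,
        "p95_ms": durations[rank - 1],
    }


# 100 requests over a 10-second window; only the slowest one failed.
sample = [Request(duration_ms=float(d), failed=(d == 100)) for d in range(1, 101)]
summary = red_summary(sample, window_seconds=10.0)
print(summary)
```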
Learn about the RED Method.
Logging
5. Why is structured logging essential for a modern backend system?
**Structured logging** is the practice of writing logs in a consistent, machine-readable format like JSON, instead of plain text strings. It’s essential because:
- Search and Filtering: It allows you to reliably search and filter logs based on specific fields (e.g., `userId`, `traceId`, `orderId`). With unstructured text this requires brittle pattern matching that breaks whenever the message wording changes.
- Analysis: You can perform powerful analytics on your log data, such as calculating metrics or creating dashboards based on log content.
- Automation: Machines can easily parse and act upon structured logs, enabling automated alerting and analysis.
In a distributed system with thousands of log streams, it’s the only practical way to make sense of the data.
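A minimal sketch of structured logging with Python's standard `logging` module follows. The `JsonFormatter` class, the `fields` key passed through `extra=`, and the logger name are illustrative choices, not a standard API; dedicated libraries offer richer versions of the same idea.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes record.fields.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("orders")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("order placed", extra={"fields": {"userId": "u-42", "orderId": "o-1001"}})

entry = json.loads(buffer.getvalue())
print(entry["userId"], entry["orderId"])
```

Because every line is a JSON object, a log backend can filter on `userId` or `orderId` directly instead of parsing free text.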
Read an introduction to Structured Logging.
6. What is a correlation ID and how is it used?
A **correlation ID** (often the same as a `traceId`) is a unique identifier that is attached to a request at the very beginning of its lifecycle (e.g., at the API Gateway). This ID is then passed along in the headers of every subsequent downstream request made as part of that operation.
Every log line produced by every service for that request should include this correlation ID. This allows you to filter your centralized logging system for that one ID and instantly see the entire, ordered sequence of logs for a single request as it traversed through your entire microservices architecture.
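The flow described above can be sketched with `contextvars` and a logging filter. The header name `X-Correlation-ID`, the `handle_request` function, and the logger name are assumptions made for the example; real frameworks usually do this in middleware.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID for whichever request is currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def handle_request(headers: dict[str, str]) -> dict[str, str]:
    """Reuse the incoming ID or mint a new one, then return headers for downstream calls."""
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    correlation_id.set(cid)
    logging.getLogger("gateway").info("request received")
    return {"X-Correlation-ID": cid}


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
gateway_log = logging.getLogger("gateway")
gateway_log.addHandler(handler)
gateway_log.setLevel(logging.INFO)
gateway_log.propagate = False

outgoing = handle_request({"X-Correlation-ID": "abc123"})
print(stream.getvalue().strip())
```

The returned headers are what the service would attach to any downstream request, so the next service logs the same ID.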
7. What information should you avoid putting in logs?
You should be extremely careful to avoid logging sensitive information, as logs are often less secure than primary databases. This includes:
- Personally Identifiable Information (PII): Social security numbers, credit card numbers, home addresses, etc.
- Credentials: Passwords, API keys, session tokens, and raw JWTs.
- Security-related data: Raw encryption keys or financial data.
This data should be redacted or masked before it is ever written to a log file.
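One way to apply this is a redaction step that runs on every event before it reaches the logger. The key list and the card-number pattern below are illustrative assumptions and far from exhaustive; a production setup would need a reviewed, broader rule set.

```python
import re

# Hypothetical deny-list of field names whose values must never be logged.
SENSITIVE_KEYS = {"password", "api_key", "token", "authorization", "ssn"}

# Loose match for 13 to 16 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(event: dict) -> dict:
    """Return a copy of the event with sensitive values masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # handle nested objects
        elif isinstance(value, str):
            clean[key] = CARD_PATTERN.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean


safe = redact({
    "user": "alice",
    "password": "hunter2",
    "note": "paid with 4111 1111 1111 1111 today",
    "nested": {"api_key": "k-123"},
})
print(safe)
```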
8. What are log-based metrics?
Log-based metrics are metrics that are generated by aggregating and counting occurrences of specific patterns or fields in your structured logs. For example, a logging system can scan the log stream and create a metric for `http_requests_total` with a label `status_code=500` by counting every log line that has those attributes.
This can be a powerful way to generate detailed business or application metrics without needing to add custom metrics instrumentation directly in your code.
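The counting step can be sketched as follows, assuming JSON log lines with hypothetical `event` and `status_code` fields. A real logging platform does this continuously over the stream; the example only shows the aggregation idea.

```python
import json
from collections import Counter


def count_requests(log_lines: list[str]) -> Counter:
    """Derive an http_requests_total-style count per status code from JSON logs."""
    counts: Counter = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured
        if not isinstance(event, dict):
            continue
        if event.get("event") == "http_request":
            counts[str(event.get("status_code"))] += 1
    return counts


lines = [
    '{"event": "http_request", "status_code": 500}',
    '{"event": "http_request", "status_code": 200}',
    "plain text line that is not JSON",
    '{"event": "http_request", "status_code": 500}',
    '{"event": "cache_miss"}',
]
totals = count_requests(lines)
print(dict(totals))
```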
Metrics
9. Compare the push vs. pull model for metrics collection.
- Push Model (e.g., StatsD, Graphite): The application is responsible for actively “pushing” its metrics to a central monitoring service at regular intervals. This is simpler for ephemeral jobs or serverless functions that may not be around long enough to be scraped.
- Pull Model (e.g., Prometheus): The central monitoring server is responsible for “pulling” (or “scraping”) metrics from an HTTP endpoint exposed by each application instance. This is generally more robust, as the monitoring system controls the collection interval, can automatically discover new targets, and can easily tell if a target is down because the scrape will fail.
10. What are the four main metric types used by systems like Prometheus?
- Counter: A cumulative metric that only ever goes up (or resets to zero on a restart). Used for things like the total number of requests or errors. You use the `rate()` function to see its per-second value.
- Gauge: A value that can go up and down. Used for things like current memory usage, temperature, or the number of items in a queue.
- Histogram: Samples observations (like request durations) and counts them in configurable buckets. It also provides a `_sum` and `_count` of all observations. This allows for estimating percentiles on the server side (e.g., `histogram_quantile(0.95, …)`); the accuracy of the estimate depends on how well the bucket boundaries fit the data.
- Summary: Similar to a histogram, it samples observations but calculates configurable quantiles on the client side and exposes them directly. It is less flexible for aggregation than a histogram.
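To illustrate why a counter is read through a rate rather than as a raw value, here is a simplified per-second rate that tolerates counter resets. It is a sketch of the idea only: Prometheus's own `rate()` also extrapolates to the edges of the query window, which this version does not attempt.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter.

    samples: (timestamp_seconds, counter_value) pairs sorted by time.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # A drop means the process restarted and the counter began again at zero.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# The counter resets between t=15 and t=30 (130 -> 10).
scraped = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(simple_rate(scraped))
```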
11. Why are histograms often preferred over summaries for measuring latency?
Histograms are generally preferred because they allow for more flexible aggregation. A histogram stores the raw bucket counts. This means you can sum the bucket counts from multiple application instances and then estimate a meaningful global percentile across the entire service.
A summary pre-calculates the percentiles on the client side. You cannot average percentiles from different instances and get a meaningful result. Therefore, histograms are better suited for modern, distributed systems.
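The difference can be shown with two hypothetical instances that share the same cumulative buckets. The bucket bounds and counts are made up, and `estimate_quantile` uses linear interpolation inside a bucket, which is similar in spirit to `histogram_quantile` but not a reimplementation of it.

```python
BOUNDS = [0.1, 0.5, 1.0, float("inf")]  # hypothetical upper bounds, in seconds


def merge(*histograms: list[int]) -> list[int]:
    """Sum cumulative bucket counts from several instances."""
    return [sum(counts) for counts in zip(*histograms)]


def estimate_quantile(q: float, cumulative: list[int], bounds=BOUNDS) -> float:
    """Estimate a quantile by interpolating within the bucket that contains it."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if bounds[i] == float("inf"):
                return bounds[i - 1]  # fall back to the highest finite bound
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            return lower + (bounds[i] - lower) * (rank - prev) / (count - prev)
    return bounds[-2]


fast_instance = [90, 100, 100, 100]  # most requests under 0.1 s
slow_instance = [0, 10, 100, 100]    # most requests between 0.5 s and 1 s

merged = merge(fast_instance, slow_instance)
global_p95 = estimate_quantile(0.95, merged)
averaged_p95 = (estimate_quantile(0.95, fast_instance)
                + estimate_quantile(0.95, slow_instance)) / 2
print(global_p95, averaged_p95)
```

The merged estimate lands near 0.94 s while the average of the two per-instance values is near 0.64 s, which shows how averaging percentiles understates tail latency here.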
Distributed Tracing
12. What are the core components of a distributed trace? (Trace, Span)
- A **Trace** represents the entire end-to-end journey of a single request through a distributed system. It is identified by a globally unique `TraceId`.
- A **Span** represents a single, named, timed unit of work within a trace. Each service call or significant operation in the request’s lifecycle creates a span. Spans have a start and end time, and can have parent-child relationships, forming a tree structure that shows the flow of execution.
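The parent-child structure can be sketched with a small data class. The `Span` fields and the span names are illustrative, and times are in milliseconds for readability; real tracing SDKs carry more attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None  # None marks the root span


def render_tree(spans: list[Span], parent_id: Optional[str] = None, depth: int = 0) -> list[str]:
    """Return indented lines showing each span under its parent, ordered by start time."""
    lines = []
    children = sorted((s for s in spans if s.parent_id == parent_id), key=lambda s: s.start_ms)
    for span in children:
        lines.append(f"{'  ' * depth}{span.name} ({span.end_ms - span.start_ms:.0f} ms)")
        lines.extend(render_tree(spans, span.span_id, depth + 1))
    return lines


trace = [
    Span("t1", "a", "GET /checkout", 0, 120),
    Span("t1", "b", "auth", 5, 25, parent_id="a"),
    Span("t1", "c", "orders-db", 30, 110, parent_id="a"),
    Span("t1", "d", "SELECT", 40, 100, parent_id="c"),
]
tree = render_tree(trace)
print("\n".join(tree))
```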
13. What is OpenTelemetry and what problem does it solve?
**OpenTelemetry (OTel)** is an open-source observability framework from the CNCF. It provides a single, vendor-neutral standard for instrumenting, generating, collecting, and exporting telemetry data (logs, metrics, and traces).
It solves the problem of **vendor lock-in**. Before OTel, if you wanted to switch your observability backend (e.g., from Jaeger to Datadog), you would have to completely re-instrument your entire application with the new vendor’s proprietary agents and SDKs. With OTel, you instrument your code once with the standard OTel APIs, and you can then configure it to export your data to any OTel-compatible backend without changing your application code.
Visit the OpenTelemetry official website.
14. What is trace context propagation?
Trace context propagation is the mechanism for passing the trace identifier (`traceId`) and the parent span identifier (`spanId`) from one service to another as a request flows through the system. This is typically done by injecting a set of standard HTTP headers (like the W3C `traceparent` header) into the outgoing request. The receiving service then extracts this context from the incoming headers and uses it to create a new child span, linking it to the overall trace. This is what allows a tracing backend to stitch all the individual spans together into a single, coherent trace.
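The extract-then-inject cycle can be sketched for the W3C `traceparent` header, whose layout is `version-traceid-parentid-flags` in lowercase hex. The function names are hypothetical and the validation is simplified (for example, it does not reject all-zero IDs); in practice an OpenTelemetry propagator handles this.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers: dict[str, str]) -> Optional[dict[str, str]]:
    """Parse the incoming traceparent header, or return None if absent or malformed."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_id, "flags": flags}


def start_child_span(incoming: dict[str, str]) -> tuple[dict[str, Optional[str]], dict[str, str]]:
    """Create a child span from the incoming context and build outgoing headers."""
    ctx = extract(incoming)
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)  # new trace if no context
    flags = ctx["flags"] if ctx else "01"
    span = {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_span_id": ctx["parent_span_id"] if ctx else None,
    }
    outgoing = {"traceparent": f"00-{trace_id}-{span['span_id']}-{flags}"}
    return span, outgoing


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
span, outgoing = start_child_span(incoming)
print(outgoing["traceparent"])
```

The outgoing header keeps the same trace ID but carries the new span's ID, so the next service records this span as its parent.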
15. Compare head-based vs. tail-based sampling for traces.
- Head-based Sampling: The decision to keep or drop a trace is made at the very beginning, when the first span is created. For example, you might decide to sample 10% of all incoming requests. This is simple and efficient but means you might drop traces that later turn out to be interesting (e.g., a trace that contains an error).
- Tail-based Sampling: The decision to keep or drop a trace is made *after* all the spans in the trace have been completed. The system collects all spans for a trace and then decides whether to persist it based on its overall characteristics (e.g., did it have an error? Was it unusually slow?). This is much more powerful for capturing interesting data but requires a more complex and stateful collection infrastructure.
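The two decision points can be contrasted in a short sketch. Both functions are hypothetical simplifications: the head sampler derives its decision from the trace ID so every service agrees, and the tail sampler assumes all spans of a trace have already been buffered as dicts with `start_ms`, `end_ms`, and an optional `error` flag.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the first span, using only the trace ID (hex string)."""
    # Map the first 8 hex digits to [0, 1) and keep the trace if below the ratio.
    return int(trace_id[:8], 16) / 0x100000000 < ratio


def tail_sample(spans: list[dict], slow_ms: float = 1000.0) -> bool:
    """Decide after the trace is complete: keep it if it failed or was slow."""
    if not spans:
        return False
    has_error = any(s.get("error") for s in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration >= slow_ms


failed_trace = [
    {"start_ms": 0, "end_ms": 80},
    {"start_ms": 10, "end_ms": 60, "error": True},
]
healthy_trace = [{"start_ms": 0, "end_ms": 80}]

print(head_sample("ffffffff" + "0" * 24, 0.1))  # decided before any error is known
print(tail_sample(failed_trace), tail_sample(healthy_trace))
```

The head sampler would drop the failed trace if its ID falls outside the sampled range, whereas the tail sampler keeps it because the error is visible once the spans are complete.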
Alerting & Service Level Objectives (SLOs)
16. What is the difference between symptom-based alerting and cause-based alerting? Which is preferred?
- Cause-based Alerting: Alerts on a low-level cause, like “CPU utilization is over 90%”.
- Symptom-based Alerting: Alerts on a high-level symptom that directly impacts users, like “the error rate for the login service has exceeded 1%”.
**Symptom-based alerting is strongly preferred**. An alert should only fire if users are being measurably affected. High CPU usage is not necessarily a problem if users are still getting fast, successful responses. Alerting on causes leads to “alert fatigue” where operators are woken up for non-issues. Symptom-based alerts are tied to SLOs and focus on what truly matters: the user experience.
17. Explain SLOs, SLIs, and SLAs.
- SLI (Service Level Indicator): A quantitative measure of some aspect of your service. Example: “the percentage of successful HTTP requests”.
- SLO (Service Level Objective): A target value or range of values for an SLI. Example: “99.9% of HTTP requests will be successful over a 28-day period”. This is an internal goal for the team.
- SLA (Service Level Agreement): A formal contract between a service provider and a customer that defines the expected level of service and includes consequences (e.g., financial penalties) for failing to meet those expectations. The SLA target is typically looser than the internal SLO, so the team has a margin before contractual penalties apply.
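The relationship between the three terms can be made concrete with a few lines of arithmetic. The request counts and the 99.5% SLA target are hypothetical; the 99.9% SLO matches the example above.

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    return 1.0 if total == 0 else successful / total


SLO_TARGET = 0.999  # internal objective from the example above
SLA_TARGET = 0.995  # hypothetical contractual target, looser than the SLO

sli = availability_sli(successful=999_250, total=1_000_000)
meets_slo = sli >= SLO_TARGET
meets_sla = sli >= SLA_TARGET
print(f"SLI={sli:.5f} meets_slo={meets_slo} meets_sla={meets_sla}")
```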
18. What is an error budget?
An error budget is derived from your SLO. If your availability SLO is 99.9%, your error budget is the remaining 0.1% of the time that your service is allowed to be unavailable without breaching the objective. For a 30-day month, this is about 43 minutes.
The error budget is a powerful tool for making data-driven decisions. If the team has plenty of error budget left, they can feel confident in launching new features. If the error budget is nearly exhausted, the team’s priority must shift to reliability and stability work, and all new feature releases should be frozen.
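The arithmetic behind the 43-minute figure, and a simple view of how much budget is left, can be sketched like this. The function names and the 21.6 minutes of downtime are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed unavailability for the window: (1 - SLO) * total minutes."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(0.999, 30)        # 0.1% of 43,200 minutes
remaining = budget_remaining(0.999, 30, 21.6)   # after 21.6 minutes of downtime
print(f"{budget:.1f} minutes total, {remaining:.0%} remaining")
```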
19. What is alert fatigue and how do you combat it?
Alert fatigue is a state of desensitization that occurs when operators are exposed to a large number of frequent, non-actionable, or irrelevant alerts. It leads to alerts being ignored, which can cause real incidents to be missed.
To combat it:
- Make Alerts Actionable: Every alert should have a clear, documented runbook explaining what to do. If there’s nothing to do, it’s not an alert, it’s a log.
- Use Symptom-based Alerting: Only alert on user-facing problems (based on your SLOs).
- Tune and Refine: Ruthlessly tune or delete noisy alerts that are not providing value.
- Use Appropriate Severity: Differentiate between critical pages that require immediate action and lower-priority warnings that can be handled during business hours.
20. What is eBPF and what is its role in modern observability?
**eBPF (extended Berkeley Packet Filter)** is a revolutionary technology in the Linux kernel that allows you to run sandboxed programs directly in the kernel without changing kernel source code or loading kernel modules. For observability, this is incredibly powerful because it allows tools to collect detailed telemetry data (like network traffic, system calls, and application performance metrics) directly from the kernel with extremely low overhead.
It enables a new generation of observability tools that can provide deep insights into application and system performance without requiring code instrumentation.
Read the introduction to eBPF.
21. What is continuous profiling?
Continuous profiling is the practice of running a low-overhead performance profiler against your application in production, all the time. It continuously collects CPU and memory usage data across your entire fleet of servers. This allows you to analyze performance data for any past event, compare performance between releases, and identify the root cause of performance regressions down to the specific line of code, even for issues that are difficult to reproduce in a test environment.


