Compute (Serverless & Containers)
1. What is a Lambda cold start, and what are some strategies to mitigate it?
A **cold start** occurs when you invoke a Lambda function that has not been used recently. AWS needs to provision a new execution environment, which involves downloading your code, starting the runtime, and running your initialization code. This adds latency to the first request.
Strategies to mitigate it include:
- Provisioned Concurrency: Pre-warms a specified number of execution environments, keeping them ready to serve requests instantly. This is the most effective solution but has a cost.
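- Lambda Layers: Moving large, rarely changing dependencies into a layer keeps the function’s own deployment package small. Layers still count toward the function’s total unzipped size limit, so trimming unused dependencies is what actually reduces the amount of code that has to be loaded.
- Choosing an appropriate language/runtime: Python and Node.js generally have faster cold starts than the JVM and .NET runtimes (Java, C#). The gap comes from runtime startup cost rather than from compilation as such; compiled languages like Go and Rust also start quickly.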
- Optimizing initialization code: Keep your init logic outside the main handler function lean and efficient.
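As a minimal sketch of the last point, the handler below does its setup once at module load so that warm invocations reuse it. `load_config` and `INIT_COUNT` are hypothetical names used only for illustration; real init code would typically create SDK clients or open connections here.

```python
import time

INIT_COUNT = 0  # illustrative counter: how many times init ran in this environment


def load_config():
    """Hypothetical stand-in for setup work (SDK clients, config, connections)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"table": "orders", "loaded_at": time.time()}


# Module-level code runs once per execution environment, during the cold start.
CONFIG = load_config()


def handler(event, context):
    # Warm invocations reuse CONFIG instead of rebuilding it on every request.
    return {"table": CONFIG["table"], "init_count": INIT_COUNT}
```

Calling `handler` repeatedly in the same environment leaves `init_count` at 1, which is the behavior to aim for: the less work `load_config` does, the shorter the cold start.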
2. Compare AWS Fargate, ECS on EC2, and EKS. When would you choose each?
- ECS on EC2: You manage a cluster of EC2 instances, and ECS places your Docker containers onto those instances. You have full control over the underlying instances but are responsible for patching, scaling, and managing them. Choose this for maximum control or if you have specific instance requirements (e.g., GPUs).
- AWS Fargate: A serverless compute engine for containers. You define your container requirements, and Fargate launches and manages the underlying infrastructure for you. Choose this for simplicity, to avoid managing servers, and for applications with spiky workloads. It works with both ECS and EKS.
- EKS (Elastic Kubernetes Service): A managed Kubernetes service. It provides the full power and flexibility of the Kubernetes ecosystem but comes with a steeper learning curve. Choose this if your organization is already standardized on Kubernetes or if you need the advanced orchestration and portability features that Kubernetes offers.
3. What is the difference between Provisioned Concurrency and Reserved Concurrency for Lambda?
- Provisioned Concurrency: You pay to keep a specified number of execution environments “warm” and ready to respond instantly. Its purpose is to eliminate cold start latency for predictable, high-traffic workloads.
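- Reserved Concurrency: You set aside a fixed number of concurrent executions for a function. That number is both guaranteed to the function and the maximum it can use. This does *not* keep functions warm. Its purpose is to act as a safety throttle, preventing a single function from consuming all available concurrency in your account and impacting other functions, while also ensuring that the function itself is never starved by others.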
4. What are AWS Step Functions and what problem do they solve?
AWS Step Functions is a serverless workflow orchestration service. It allows you to coordinate multiple AWS services (like Lambda functions, ECS tasks, and SNS topics) into a visual workflow defined by a state machine.
It solves the problem of complex, multi-step business processes. Instead of writing complex retry logic, error handling, and state management code inside a single Lambda function, you can define the entire process as a series of steps. This makes the workflow more resilient, easier to debug, and simpler to maintain.
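To make this concrete, here is a rough sketch of a small state machine written as a Python dictionary in Amazon States Language form. The workflow, state names, and Lambda ARNs are placeholders invented for illustration; the point is that retries and error handling live in the definition rather than in function code. The helper `undefined_targets` is likewise hypothetical, a quick sanity check and not part of any AWS SDK.

```python
import json

# Hypothetical workflow; the ARNs below are placeholders, not real resources.
definition = {
    "Comment": "Charge a card, then send a receipt",
    "StartAt": "ChargeCard",
    "States": {
        "ChargeCard": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
            # Retry with exponential backoff instead of hand-written retry loops.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Anything still failing after the retries is routed to a failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "SendReceipt",
        },
        "SendReceipt": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-receipt",
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Fail",
            "Error": "PaymentFailed",
            "Cause": "Card could not be charged",
        },
    },
}


def undefined_targets(defn):
    """Return transition targets that do not name a state in the definition."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(t for t in targets if t not in states)


asl_json = json.dumps(definition, indent=2)  # the JSON document a state machine would be created from
```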
5. What are EC2 Spot Instances and what is a good use case for them?
Spot Instances are spare EC2 computing capacity that AWS offers at a significant discount (up to 90%) compared to On-Demand prices. The drawback is that AWS can reclaim these instances with only a two-minute warning if it needs the capacity back.
They are ideal for fault-tolerant, stateless, or batch processing workloads that can be interrupted and resumed. Good use cases include big data analysis, CI/CD build farms, and high-performance computing tasks.
Databases & Caching
6. Explain what Amazon Aurora is and how it differs from a standard RDS instance.
Amazon Aurora is a cloud-native relational database compatible with MySQL and PostgreSQL. Its key difference is the separation of compute and storage.
- Storage: Aurora uses a unique, log-structured storage volume that is distributed across multiple Availability Zones. Data is written to six copies across three AZs, providing extremely high durability and availability.
- Performance: It offers significantly higher throughput than standard RDS instances due to optimizations like offloading redo logging to the storage layer.
- Features: It offers features not available in standard RDS, like Global Databases, fast database cloning, and Aurora Serverless.
7. What is Aurora Serverless v2 and what problem does it solve?
Aurora Serverless v2 is an on-demand, auto-scaling configuration for Aurora. It automatically starts up, shuts down, and scales capacity up or down based on your application’s workload, with very granular, near-instantaneous scaling.
It solves the problem of over-provisioning or under-provisioning database capacity for applications with intermittent or unpredictable traffic. You pay only for the capacity you consume, making it a cost-effective choice for development environments, multi-tenant applications, and applications with spiky workloads.
8. What is the purpose of RDS Proxy?
RDS Proxy is a fully managed, highly available database proxy for Amazon RDS. Its main purposes are:
- Connection Pooling: It maintains a pool of established connections to your database, allowing applications (especially serverless functions like Lambda) to handle a large number of concurrent connections without exhausting the database’s connection limits.
- Improved Failover: It can reduce failover times by up to 66% for Multi-AZ databases by maintaining client connections while it connects to a new database instance.
- Enhanced Security: It can enforce IAM authentication and store credentials in AWS Secrets Manager.
9. Describe a common caching pattern like cache-aside.
The **cache-aside** (or lazy loading) pattern is a common caching strategy used with a cache like ElastiCache.
- The application first tries to get the data from the cache.
- If the data is in the cache (a “cache hit”), it’s returned to the application.
- If the data is not in the cache (a “cache miss”), the application reads the data from the primary database.
- The application then writes this data into the cache before returning it.
This pattern loads data into the cache on-demand, ensuring that only data that is actually requested gets cached.
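The four steps above can be sketched in a few lines. This is a minimal illustration, assuming a plain dictionary in place of ElastiCache and a hypothetical `load_from_db` callable in place of the primary database; it leaves out TTLs and invalidation, which a real cache would need.

```python
class CacheAside:
    """Read-through helper: check the cache first, fall back to the database."""

    def __init__(self, cache, load_from_db):
        self.cache = cache                # stand-in for an ElastiCache client
        self.load_from_db = load_from_db  # hypothetical loader for the primary database
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:             # cache hit: return straight from the cache
            self.hits += 1
            return self.cache[key]
        self.misses += 1                  # cache miss: read from the database...
        value = self.load_from_db(key)
        self.cache[key] = value           # ...and populate the cache before returning
        return value


DATABASE = {"user:1": {"name": "Ada"}}    # illustrative data only
store = CacheAside(cache={}, load_from_db=DATABASE.__getitem__)
first = store.get("user:1")               # miss: loaded from DATABASE
second = store.get("user:1")              # hit: served from the cache
```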
10. What is DynamoDB Accelerator (DAX)?
DAX is a fully managed, in-memory cache for DynamoDB. It sits in front of your DynamoDB table and provides microsecond read latency. It’s API-compatible with DynamoDB, so using it requires minimal code changes. It’s ideal for read-heavy applications that require extremely low latency. The cache serves eventually consistent reads only; strongly consistent reads are passed through to DynamoDB and are not cached, so DAX offers little benefit to applications that require strongly consistent reads for every operation.
Networking & API Gateway
11. What is the difference between a Security Group and a Network ACL (NACL)?
- Security Groups (SGs): Act as a virtual firewall for your EC2 instances (at the instance level). They are **stateful**—if you allow an inbound request, the corresponding outbound response is automatically allowed, regardless of outbound rules. All rules are “allow” rules; you cannot create “deny” rules.
- Network ACLs (NACLs): Act as a firewall for a subnet (at the subnet level). They are **stateless**: you must explicitly define rules for both inbound and outbound traffic. They support both “allow” and “deny” rules, which are evaluated in ascending rule-number order; the first matching rule is applied, and traffic that matches no rule is denied.
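The ordered, first-match evaluation of NACL rules is easy to get wrong, so here is a small simulation of it. This is only a model for reasoning about rule order, not an AWS API; `NaclRule` and `nacl_decision` are hypothetical names, and the rules match on source CIDR and a single port for simplicity.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int   # lower numbers are evaluated first
    cidr: str     # source range the rule applies to
    port: int
    action: str   # "allow" or "deny"


def nacl_decision(rules, source_ip, port):
    """Return the action of the first matching rule, or "deny" if none match."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        if ip in ipaddress.ip_network(rule.cidr) and port == rule.port:
            return rule.action
    return "deny"  # models the implicit final deny


inbound = [
    NaclRule(100, "203.0.113.0/24", 443, "deny"),   # block one range first
    NaclRule(200, "0.0.0.0/0", 443, "allow"),       # then allow everyone else
]
```

With these rules, a request from `203.0.113.7` on port 443 is denied by rule 100 before rule 200 is ever consulted. Swapping the two rule numbers would let that traffic through, which is something a security group cannot express at all because it has no deny rules.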
12. What is a VPC Endpoint and why would you use one?
A VPC Endpoint enables you to privately connect your VPC to supported AWS services (like S3, DynamoDB, or SQS) without requiring an Internet Gateway, NAT Gateway, or VPN connection. Traffic between your VPC and the AWS service does not leave the Amazon network.
You use them for:
- Improved Security: It reduces the exposure of your resources to the public internet.
- Better Performance: Provides a more reliable and lower-latency connection to AWS services.
- Cost Savings: Can reduce data transfer costs compared to going over a NAT Gateway.
13. Compare API Gateway REST APIs vs. HTTP APIs.
- REST APIs: The original, feature-rich offering. They provide advanced features like API keys, per-client throttling, request validation, request/response transformation, and built-in caching. They have higher latency and are more expensive.
- HTTP APIs: A newer, more streamlined offering. They are designed for lower latency and are significantly cheaper (up to 70% less) than REST APIs. They support core features like OIDC/JWT authorizers, Lambda authorizers, and CORS but lack the more advanced features of REST APIs.
Choose HTTP APIs for general-purpose, high-performance APIs. Choose REST APIs when you need advanced features like request transformation or built-in caching.
14. What is a Lambda authorizer in API Gateway?
A Lambda authorizer (formerly known as a custom authorizer) is an API Gateway feature that uses a Lambda function to control access to your API methods. When a client makes a request, API Gateway invokes your authorizer Lambda function. The function receives the request context (like headers or tokens), implements your custom authentication and authorization logic, and returns an IAM policy. API Gateway then uses this policy to determine if the request should be allowed or denied.
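A minimal sketch of a token-based authorizer for a REST API might look like the following. The `VALID_TOKENS` lookup is a hypothetical placeholder; real code would verify a JWT or call an identity provider instead. The event fields `authorizationToken` and `methodArn` and the response shape follow the token authorizer format.

```python
# Hypothetical token table; a real authorizer would validate a signed token instead.
VALID_TOKENS = {"secret-token-123": "user-42"}


def build_policy(principal_id, effect, resource):
    """Build the authorizer response: a principal plus an IAM policy document."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,       # "Allow" or "Deny"
                    "Resource": resource,   # the method ARN being called
                }
            ],
        },
    }


def handler(event, context):
    token = event.get("authorizationToken", "")
    principal = VALID_TOKENS.get(token)
    if principal is None:
        return build_policy("anonymous", "Deny", event["methodArn"])
    return build_policy(principal, "Allow", event["methodArn"])
```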
15. How can you use CloudFront to improve the performance of a dynamic API?
While CloudFront is known for caching static content, it also improves dynamic API performance by:
- Terminating SSL/TLS at the Edge: The initial SSL handshake happens at a CloudFront edge location close to the user, reducing latency.
- Using Persistent Connections: CloudFront maintains optimized, persistent connections back to your origin server (e.g., API Gateway), reducing connection setup overhead for subsequent requests.
- Caching API Responses: For idempotent `GET` requests, you can configure CloudFront to cache responses based on headers, query strings, or cookies, serving subsequent identical requests directly from the edge.
Messaging & Event-Driven Architecture
16. Compare SQS, SNS, and EventBridge.
- SQS (Simple Queue Service): A fully managed message queue service. It’s used for decoupling components. A message is sent to a queue and processed by a single consumer. The consumer polls the queue for messages.
- SNS (Simple Notification Service): A fully managed pub/sub messaging service. A message is published to a topic, and it is pushed to *all* subscribers of that topic (e.g., SQS queues, Lambda functions, email addresses). This is a fan-out pattern.
- EventBridge: A serverless event bus that evolved from CloudWatch Events. It allows you to route events from various sources (AWS services, SaaS partners, your own applications) to targets. Its key feature is advanced filtering and routing rules, allowing you to send specific events to specific targets based on the event’s content, without needing extra code.
17. What is the difference between an SQS standard queue and a FIFO queue?
- Standard Queue (default): Offers maximum throughput and “at-least-once” delivery. It makes a “best effort” to preserve ordering, but ordering is not guaranteed. Duplicate messages can sometimes be delivered.
- FIFO (First-In, First-Out) Queue: Designed to guarantee that messages are processed exactly once, in the exact order that they are sent. It has lower throughput limits than standard queues and requires a message group ID for ordering within logical groups.
18. What is the purpose of an SQS Dead-Letter Queue (DLQ)?
A Dead-Letter Queue is a separate queue used to store messages that a consumer has failed to process successfully. You configure a source queue with a “redrive policy” that specifies a maximum number of times a message can be received. If a message exceeds this receive count, SQS automatically moves it to the configured DLQ. This prevents “poison pill” messages from blocking the queue and allows developers to inspect and analyze failed messages separately.
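As a sketch of what the redrive policy looks like in practice, the helper below builds the queue attribute that attaches a DLQ to a source queue. `redrive_attributes` is a hypothetical helper name and the ARN is a placeholder; the `RedrivePolicy` attribute with `deadLetterTargetArn` and `maxReceiveCount` is the part that SQS defines.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the queue attributes that attach a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects the RedrivePolicy attribute as a JSON string, not a nested object.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN for illustration.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 3)
```

With boto3, a dictionary like `attrs` could be passed as `Attributes=` to `set_queue_attributes` on the source queue. After the third failed receive, the message would move to the DLQ.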
19. Explain the SQS visibility timeout.
When a consumer receives a message from an SQS queue, that message remains in the queue but is made temporarily invisible to other consumers for a period called the **visibility timeout**. This prevents other consumers from processing the same message. If the consumer successfully processes and deletes the message within this timeout, everything is fine. If the consumer crashes or fails to delete the message before the timeout expires, the message becomes visible again and another consumer can pick it up. This is a key part of SQS’s at-least-once delivery guarantee.
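The behavior can be modeled with a toy in-memory queue. This is only a simulation to illustrate the timing rules, not how SQS is implemented; `InMemoryQueue` is a hypothetical class, and the current time is passed in explicitly so the example stays deterministic.

```python
class InMemoryQueue:
    """Toy model of a single queue with a visibility timeout (in seconds)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}          # message id -> body
        self._invisible_until = {}   # message id -> time it becomes visible again

    def send(self, message_id, body):
        self._messages[message_id] = body
        self._invisible_until[message_id] = 0.0

    def receive(self, now):
        """Return the first visible message and hide it for the timeout period."""
        for message_id, body in self._messages.items():
            if self._invisible_until[message_id] <= now:
                self._invisible_until[message_id] = now + self.visibility_timeout
                return message_id, body
        return None

    def delete(self, message_id):
        """Called by a consumer after it has processed the message successfully."""
        self._messages.pop(message_id, None)
        self._invisible_until.pop(message_id, None)


queue = InMemoryQueue(visibility_timeout=30)
queue.send("m1", "hello")
first_receive = queue.receive(now=0)     # consumer A gets the message
during_timeout = queue.receive(now=10)   # consumer B sees nothing while it is hidden
after_timeout = queue.receive(now=30)    # A never deleted it, so it is delivered again
```

The redelivery at `now=30` is why consumers should be idempotent: the same message can legitimately be processed more than once.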
20. What is the fan-out pattern and how would you implement it on AWS?
The fan-out pattern is where a single message or event is pushed to multiple consumers to be processed in parallel. The classic way to implement this on AWS is using **SNS and SQS**. You publish a single message to an SNS topic. You then create multiple SQS queues, each for a different downstream service, and subscribe each of these queues to the SNS topic. When a message is published to the topic, SNS delivers a copy of it to every subscribed queue, allowing each service to process the event independently.
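A rough sketch of the wiring step is shown below. It assumes a client object with the same `subscribe` call that boto3's SNS client exposes; `FakeSns` is a stand-in so the example runs without AWS access, and the topic and queue ARNs are placeholders.

```python
class FakeSns:
    """Stand-in for an SNS client so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def subscribe(self, **kwargs):
        self.calls.append(kwargs)
        return {"SubscriptionArn": f"{kwargs['TopicArn']}:sub-{len(self.calls)}"}


def fan_out(sns_client, topic_arn, queue_arns):
    """Subscribe each queue to the topic and return the subscription ARNs."""
    subscription_arns = []
    for queue_arn in queue_arns:
        response = sns_client.subscribe(
            TopicArn=topic_arn,
            Protocol="sqs",
            Endpoint=queue_arn,
            # Deliver the published body as-is instead of wrapping it in an SNS envelope.
            Attributes={"RawMessageDelivery": "true"},
        )
        subscription_arns.append(response["SubscriptionArn"])
    return subscription_arns


TOPIC = "arn:aws:sns:us-east-1:123456789012:order-events"   # placeholder ARN
QUEUES = [
    "arn:aws:sqs:us-east-1:123456789012:billing",           # placeholder ARNs
    "arn:aws:sqs:us-east-1:123456789012:shipping",
]
fake = FakeSns()
subscriptions = fan_out(fake, TOPIC, QUEUES)
```

In a real account, each queue also needs an access policy that allows the topic to send messages to it; that step is omitted here.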
Security, Identity & Access Management
21. Explain the difference between an IAM User, Group, Role, and Policy.
- Policy: A JSON document that defines a set of permissions (e.g., allow `s3:GetObject`).
- User: An entity (a person or application) with long-term credentials (password or access keys).
- Group: A collection of IAM users. You can attach policies to a group to grant permissions to all users in that group.
- Role: An identity with temporary credentials that can be assumed by trusted entities (like an EC2 instance, a Lambda function, or another user). Roles are the most secure way to grant permissions to AWS services, as they avoid the need to store long-term access keys in your code.
22. What is the `sts:AssumeRole` action and why is it important?
The `sts:AssumeRole` action is the core of how IAM Roles work. It allows an entity (the “principal”) to request temporary security credentials from AWS to assume a specific role. When the action is called, the Security Token Service (STS) returns a temporary access key, secret key, and session token. The principal can then use these temporary credentials to make API calls with the permissions defined in the assumed role’s policy.
This is crucial for cross-account access and for securely granting permissions to AWS services without hard-coding credentials.
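As an illustration, the helper below exchanges a role ARN for temporary credentials. It assumes a client with the `assume_role` call that boto3's STS client provides; `FakeSts` is a stand-in that returns dummy values so the sketch runs offline, and the role ARN is a placeholder.

```python
class FakeSts:
    """Stand-in for an STS client; returns dummy values in the real response shape."""

    def assume_role(self, RoleArn, RoleSessionName):
        return {
            "Credentials": {
                "AccessKeyId": "ASIAEXAMPLE",
                "SecretAccessKey": "dummy-secret",
                "SessionToken": "dummy-token",
            },
            "AssumedRoleUser": {"Arn": f"{RoleArn}/{RoleSessionName}"},
        }


def assume_role_credentials(sts_client, role_arn, session_name):
    """Return the temporary credentials as keyword arguments for a new client."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# Placeholder role ARN in a second account.
ROLE = "arn:aws:iam::210987654321:role/ReadOnlyAuditor"
temporary = assume_role_credentials(FakeSts(), ROLE, "audit-session")
```

With boto3, `FakeSts()` would be replaced by `boto3.client("sts")`, and the returned dictionary could be unpacked into another `boto3.client(...)` call to act with the role's permissions.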
23. Compare AWS Secrets Manager and Systems Manager Parameter Store.
- Parameter Store: A service for managing configuration data and secrets. It offers a standard tier (free) and an advanced tier. It’s great for storing general configuration data like API endpoints or feature flags.
- Secrets Manager: A service specifically designed for managing secrets. It is more expensive but provides advanced features like automatic secret rotation (e.g., automatically changing database passwords), integration with RDS, and the ability to generate random secrets.
Use Secrets Manager for credentials that require rotation and lifecycle management. Use Parameter Store for other configuration data and simpler secrets.
24. What is envelope encryption and how does KMS use it?
Envelope encryption is a practice where you encrypt your data with a data key, and then encrypt that data key itself with another key (a key-encryption key). AWS KMS (Key Management Service) is built around this pattern. A typical flow is:
- You call `GenerateDataKey` against a KMS key that you control (formerly called a Customer Master Key, or CMK).
- KMS returns the data key twice: once in plaintext and once encrypted under the KMS key.
- You (or an AWS service or SDK acting on your behalf) encrypt your plaintext data locally with the plaintext data key, then discard that key from memory.
- You store the encrypted data key alongside the encrypted data.
To decrypt, you send the encrypted data key to KMS, receive the plaintext data key, and decrypt the data locally. This is secure and efficient because the KMS key never leaves KMS, large payloads never have to be sent to KMS, and re-encrypting under a new KMS key only requires re-encrypting the small data key.
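The structure of the pattern can be shown with a self-contained toy. Everything here is an assumption made for illustration: `ToyKms` is an in-process stand-in for KMS, and `toy_cipher` is a deliberately simple XOR keystream that is NOT secure and only takes the place of a real cipher such as AES-GCM. What the sketch does show accurately is which key encrypts what, and where each key lives.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. NOT secure; stands in for AES-GCM."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))  # XOR is its own inverse


class ToyKms:
    """Hypothetical stand-in for KMS: the master key never leaves this object."""

    def __init__(self):
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self):
        plaintext_key = secrets.token_bytes(32)
        return plaintext_key, toy_cipher(self._master_key, plaintext_key)

    def decrypt(self, encrypted_key):
        return toy_cipher(self._master_key, encrypted_key)


def encrypt(kms, plaintext: bytes):
    data_key, encrypted_key = kms.generate_data_key()
    ciphertext = toy_cipher(data_key, plaintext)   # data is encrypted locally
    # Only the encrypted data key is stored; the plaintext key is dropped here.
    return {"ciphertext": ciphertext, "encrypted_key": encrypted_key}


def decrypt(kms, envelope):
    data_key = kms.decrypt(envelope["encrypted_key"])  # only the small key goes to KMS
    return toy_cipher(data_key, envelope["ciphertext"])


kms = ToyKms()
envelope = encrypt(kms, b"card=4111-1111-1111-1111;holder=Ada Lovelace")
recovered = decrypt(kms, envelope)
```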
25. What is the principle of least privilege?
The principle of least privilege is a fundamental security concept which dictates that an entity (a user, service, or application) should only have the exact permissions required to perform its authorized tasks, and no more. In AWS, this means crafting fine-grained IAM policies that grant access only to the specific actions and resources needed, rather than using overly permissive policies such as `"Action": "*"` on `"Resource": "*"`.
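For example, a policy that only allows reading objects under one prefix of one bucket could be built as follows. The bucket name, prefix, and both helper functions are hypothetical and exist only for illustration; `has_bare_wildcard` is a simple check, not a substitute for a proper policy analysis tool.

```python
import json


def read_only_object_policy(bucket, prefix):
    """Allow s3:GetObject on a single prefix of a single bucket, and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        ],
    }


def has_bare_wildcard(policy):
    """Flag statements whose Action or Resource is just "*"."""
    for statement in policy["Statement"]:
        for field in ("Action", "Resource"):
            values = statement[field]
            if isinstance(values, str):
                values = [values]
            if "*" in values:
                return True
    return False


scoped = read_only_object_policy("example-reports", "2024/")   # placeholder names
too_broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}
policy_json = json.dumps(scoped, indent=2)  # the document that would be attached to a role
```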
Architecture, Operations & Cost
26. What are the pillars of the AWS Well-Architected Framework?
The AWS Well-Architected Framework is a guide for building secure, high-performing, resilient, and efficient infrastructure for applications. It is based on six pillars:
- Operational Excellence: Running and monitoring systems to deliver business value and continually improving processes.
- Security: Protecting information, systems, and assets while delivering business value through risk assessments and mitigation strategies.
- Reliability: Ensuring a workload performs its intended function correctly and consistently, including the ability to recover from failures.
- Performance Efficiency: Using computing resources efficiently to meet system requirements and maintaining that efficiency as demand changes.
- Cost Optimization: Avoiding unnecessary costs by understanding and controlling where money is being spent.
- Sustainability: Minimizing the environmental impacts of running cloud workloads.
27. What is Infrastructure as Code (IaC) and what are some AWS tools for it?
Infrastructure as Code is the practice of managing and provisioning infrastructure through machine-readable definition files (code), rather than through physical hardware configuration or interactive configuration tools.
AWS tools for IaC include:
- AWS CloudFormation: A declarative service where you define your infrastructure in YAML or JSON templates. CloudFormation interprets the template and provisions the resources in the correct order.
- AWS CDK (Cloud Development Kit): An imperative framework where you define your cloud infrastructure using familiar programming languages like TypeScript, Python, or Java. The CDK code synthesizes into a CloudFormation template. It offers a higher level of abstraction and better tooling than writing raw CloudFormation.
28. How would you design a system for cost optimization on AWS?
Cost optimization is an ongoing process involving several strategies:
- Right Sizing: Choose the appropriate instance types and sizes for your workload. Monitor utilization and downsize over-provisioned resources.
- Elasticity: Use auto-scaling to match capacity closely to demand, shutting down resources when they are not needed.
- Choosing the right pricing model: Use Savings Plans or Reserved Instances for predictable workloads to get significant discounts over On-Demand pricing. Use Spot Instances for fault-tolerant workloads.
- Leveraging serverless: Use services like Lambda, Fargate, and Aurora Serverless for intermittent workloads to pay only for what you use.
- Data Lifecycle Management: Use S3 lifecycle policies to automatically move infrequently accessed data to cheaper storage tiers like S3 Glacier.
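As a small example of the last item, the rule below transitions objects under a prefix to S3 Glacier after a number of days. The helper name, prefix, and day count are assumptions chosen for illustration; the dictionary follows the lifecycle configuration shape that boto3 accepts.

```python
def archive_rule(prefix, days_to_glacier):
    """Build one lifecycle rule that moves objects under a prefix to Glacier."""
    if days_to_glacier < 1:
        raise ValueError("days_to_glacier must be a positive number of days")
    return {
        "ID": f"archive-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days_to_glacier, "StorageClass": "GLACIER"}],
    }


# Illustrative values: archive anything under logs/ after 90 days.
lifecycle = {"Rules": [archive_rule("logs/", 90)]}
```

With boto3, `lifecycle` could be passed as `LifecycleConfiguration=` to `put_bucket_lifecycle_configuration` for the target bucket.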
29. What is the difference between latency-based routing and geolocation routing in Route 53?
- Latency-based Routing: Routes traffic to the AWS region that provides the best latency for the end user, regardless of their geographic location. Route 53 determines this by checking latency from the user’s DNS resolver to various AWS regions.
- Geolocation Routing: Routes traffic based on the geographic location of the user (e.g., by country or continent). This is used when you need to serve localized content or have data residency requirements.
30. What is AWS X-Ray used for?
AWS X-Ray is a distributed tracing service. It helps developers analyze and debug production, distributed applications, such as those built using a microservices architecture. It collects data about requests as they travel through your entire application stack. It provides a “service map” that visualizes the connections between your services and can pinpoint performance bottlenecks, errors, and the root cause of issues in your system.
31. What is the purpose of a NAT Gateway?
A NAT (Network Address Translation) Gateway is a managed AWS service that allows instances in a private subnet to connect to the internet or other AWS services, but prevents the internet from initiating a connection with those instances. Instances in a private subnet can route their outbound traffic through the NAT Gateway, which has an Elastic IP address and resides in a public subnet. This is the standard way to allow private instances (like a backend application server) to download updates or access external APIs without exposing them directly to the internet.
32. How can you share data between steps in an AWS Step Functions execution?
Data is passed between states as a JSON object. Each state receives a JSON object as input and can pass a JSON object as output. You can use the `InputPath`, `ResultPath`, and `OutputPath` fields within a state’s definition to control how this JSON data is manipulated and passed along. For example, `"ResultPath": "$.MyLambdaResult"` inserts the output of a Lambda task into the state’s input at the specified path, so subsequent states can access both the original input and the result.
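The effect of `ResultPath` can be mimicked with a short function. This is a simplified model written for illustration, not the Step Functions implementation: `apply_result_path` is a hypothetical name, and it handles only `"$"`, `None` (JSON `null`), and simple dotted paths such as `$.a.b`.

```python
import copy


def apply_result_path(state_input, result, result_path="$"):
    """Combine a task result with the state input the way ResultPath does."""
    if result_path is None:
        return copy.deepcopy(state_input)   # null: discard the result, keep the input
    if result_path == "$":
        return result                       # default: the result replaces the input
    if not result_path.startswith("$."):
        raise ValueError("only simple $.a.b paths are supported in this sketch")
    output = copy.deepcopy(state_input)     # never mutate the caller's input
    node = output
    *parents, leaf = result_path[2:].split(".")
    for key in parents:
        node = node.setdefault(key, {})     # create intermediate objects as needed
    node[leaf] = result
    return output


state_input = {"orderId": 7}
lambda_result = {"status": "PAID"}
merged = apply_result_path(state_input, lambda_result, "$.MyLambdaResult")
replaced = apply_result_path(state_input, lambda_result)
kept = apply_result_path(state_input, lambda_result, None)
```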
33. What is Amazon Kinesis Data Streams and how does it differ from SQS?
Kinesis Data Streams is a massively scalable and durable real-time data streaming service. It’s designed for high-throughput, real-time ingestion of data from thousands of sources.
Key differences from SQS:
- Consumption Model: In SQS, a message is deleted after being processed by one consumer. In Kinesis, multiple consumers can read the same data from the stream independently. Data persists in the stream for a configurable retention period (e.g., 24 hours).
- Ordering: Kinesis guarantees ordering of records within a single shard.
- Use Case: SQS is for decoupling applications (task processing). Kinesis is for real-time data processing, analytics, and feeding data into data lakes or warehouses.
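The per-shard ordering guarantee matters because of how records are assigned to shards, a detail not covered above: Kinesis hashes each record's partition key with MD5 into a 128-bit value, and each shard owns a range of that hash space. The sketch below assumes the shards split the hash space evenly, which is a simplification; `shard_for` is a hypothetical helper, not an SDK call.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# Every event for the same device hashes to the same shard, so they stay in order.
device_shards = {shard_for("device-17", 4) for _ in range(5)}
```

Choosing a partition key with many distinct values (such as a device or customer ID) spreads load across shards while keeping each key's records ordered.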
34. What are Lambda Layers?
Lambda Layers are a way to centrally manage and share common code and dependencies across multiple Lambda functions. A layer is a ZIP archive containing libraries, a custom runtime, or other dependencies. When you include a layer in your function, its contents are extracted to the `/opt` directory in the execution environment. This helps keep your deployment packages small and simplifies dependency management, as you can publish a dependency once and share it across functions. Layers are versioned and immutable, so a function picks up an updated dependency only after its configuration is pointed at the new layer version.
35. How would you handle a secret (like a database password) in your application’s CI/CD pipeline?
You should never store secrets directly in your source code or CI/CD configuration files. The best practice is to store the secret in AWS Secrets Manager or Systems Manager Parameter Store (SecureString type).
The CI/CD pipeline’s role (e.g., a CodePipeline role) would have a limited IAM policy granting it permission to read *only* the specific secrets it needs during deployment. The deployment script would then fetch the secret value from the service and inject it into the target environment (e.g., an ECS task definition or a Lambda function’s environment configuration). This keeps the secret out of source control. Take care not to echo the value in build logs, and remember that a value injected as a plain environment variable is visible to anyone who can read that configuration; where the platform supports it, reference the secret by name or ARN so it is resolved at runtime instead.
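A minimal sketch of the fetch step is shown below. It assumes a client with the `get_secret_value` call that boto3's Secrets Manager client provides, and it assumes the secret is stored as a JSON string with a `password` field. `FakeSecretsManager` and the secret name are stand-ins so the example runs without AWS access.

```python
import json


class FakeSecretsManager:
    """Stand-in for a Secrets Manager client; holds one dummy secret."""

    def __init__(self, secrets_by_id):
        self._secrets = secrets_by_id

    def get_secret_value(self, SecretId):
        return {"Name": SecretId, "SecretString": self._secrets[SecretId]}


def fetch_db_password(secrets_client, secret_id):
    """Read a secret at deploy time instead of keeping it in source control."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])  # assumes a JSON-formatted secret
    return secret["password"]                      # assumes a "password" field


# Dummy data for illustration only.
client = FakeSecretsManager(
    {"prod/orders/db": json.dumps({"username": "app", "password": "dummy-value"})}
)
password = fetch_db_password(client, "prod/orders/db")
```

With boto3, the fake would be replaced by `boto3.client("secretsmanager")`, and the pipeline role's IAM policy would allow `secretsmanager:GetSecretValue` on that one secret only.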


