Most AI engineering hires fail because the interview tests the wrong skills. Candidates who memorize “what is gradient descent” can struggle to ship a working RAG pipeline. Candidates who built half a dozen LLM apps may freeze when asked about variance and bias. The fix is to test the right things, in the right order. The Stack Overflow Developer Survey 2025 shows 70% of teams now use AI tools daily but only 24% have a structured interview process for AI skills.
This guide gives you 20 questions across 5 skill areas: foundations and ML math, model building and training, LLMs and RAG, MLOps and production, and AI engineering judgement. Each question has a “what good looks like” answer, common red flags, and a follow-up to push deeper. Use them as a starter rubric for senior AI hires (5+ years) or scale them down for mid-level roles.
The questions reflect what we actually ask candidates at Second Talent when we shortlist for AI roles across 9 Asian markets. We have run this loop on 800+ candidates over the past 24 months, and the patterns below predict on-the-job performance better than any single coding test.
How to Use These Questions
Pick 6-8 questions across at least 3 categories. A 60-minute interview cannot cover all 20. Mix one foundations question (to check rigor), 2-3 model-building or LLM questions (to check practice), and 1-2 MLOps or judgement questions (to check production readiness).
Score each answer on a 1-5 scale. 5 means the candidate gave the “good” answer plus context. 3 means they hit the main points but missed nuance. 1 means red flags. Take notes on the exact phrases they use. Pattern-matching on phrasing matters more than people think. A candidate who says “we tuned alpha and beta” without knowing why is different from one who says “we increased dropout because the validation loss diverged after epoch 12.”
Below is the suggested weighting for a senior AI engineering loop.
| Skill Area | Weight | Best Question Type | Time |
|---|---|---|---|
| Foundations & ML Math | 15% | Conceptual + edge case | 10 min |
| Model Building & Training | 25% | Practical + debugging | 15 min |
| LLMs, Prompting & RAG | 30% | System + tradeoff | 15 min |
| MLOps & Production | 20% | Open-ended scenario | 10 min |
| AI Engineering Judgement | 10% | “What would you do” probe | 10 min |
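One way to turn the weights in the table into a single number is a weighted average of the per-area scores. The sketch below is illustrative only: the area keys are hypothetical short names, and renormalizing when an area was not covered is an assumption made for this example, not part of the rubric itself.

```python
# Weights copied from the table above; the keys are hypothetical short names.
WEIGHTS = {
    "foundations": 0.15,
    "model_building": 0.25,
    "llm_rag": 0.30,
    "mlops": 0.20,
    "judgement": 0.10,
}


def weighted_score(area_scores: dict) -> float:
    """Combine per-area 1-5 scores into one weighted 1-5 score.

    Assumption: areas that were not covered in the interview are skipped
    and the remaining weights are renormalized so they still sum to 1.
    """
    covered = {a: s for a, s in area_scores.items() if a in WEIGHTS}
    if not covered:
        raise ValueError("no scored areas match the rubric")
    for area, score in covered.items():
        if not 1 <= score <= 5:
            raise ValueError(f"score for {area} must be between 1 and 5")
    total_weight = sum(WEIGHTS[a] for a in covered)
    return sum(WEIGHTS[a] * s for a, s in covered.items()) / total_weight


example = weighted_score(
    {"foundations": 4, "model_building": 3, "llm_rag": 5, "mlops": 3, "judgement": 4}
)
```

A single number is convenient for ranking, but the per-area scores and the notes on phrasing are still what you should read before a hiring decision.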
1. Foundations and ML Math (Questions 1-4)
This bucket is the floor. A senior AI engineer who cannot explain the bias-variance tradeoff in plain language probably cannot debug a model that overfits in production. The questions are short. The follow-ups matter more than the headline answer.
Q1. Explain the bias-variance tradeoff using a real model you shipped.
What good looks like: The candidate names a specific model and a specific symptom. “Our churn predictor had high training accuracy (94%) but only 78% on holdout. We had high variance. We added L2 regularization and dropped 4 noisy features. Validation score jumped to 87%.” They can also describe the opposite case (high bias, underfitting) with a separate example.
Red flags: Pure textbook answer with no project. Confusing bias (statistical) with bias (fairness). Cannot say what they would do to address either case.
Follow-up: “What if you cannot add more data and the model still overfits?”
Q2. When would you NOT use cross-validation?
What good looks like: Time-series data (use rolling-window or expanding-window). Very large datasets where compute cost dominates and a single train/val/test split is fine. Production A/B tests where the test set is your live traffic. They can name at least 2 of these.
Red flags: “You always use cross-validation.” Not knowing time-series leakage exists.
Follow-up: “Walk me through why naive k-fold leaks information on time series.”
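For Q2, it helps to have a concrete picture of the expanding-window alternative. The sketch below is a minimal, stdlib-only version that assumes rows are already sorted oldest first; the function name and parameters are illustrative. In practice, scikit-learn's `TimeSeriesSplit` covers the same idea.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where training rows always precede test rows.

    Assumes rows are sorted by time, oldest first.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))  # everything before the test window
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
```

Every training index is earlier than every test index in each split. Naive k-fold breaks exactly that property: some folds train on rows that come after the rows they are evaluated on.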
Q3. Your training accuracy is 99% and your test accuracy is 60%. What are the top 3 things you check?
What good looks like: Three specific checks in order of likelihood. (1) Train/test leakage (same rows in both, or label leakage from a target-encoded feature). (2) Distribution shift between train and test (covariate shift, label shift). (3) Overfitting from model capacity vs data volume. Bonus if they mention checking for duplicate rows or near-duplicates first.
Red flags: Jumping to “use a smaller model” without checking the data. Not mentioning leakage.
Follow-up: “How do you detect a label leak that is not obvious?”
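The duplicate-row check in Q3 is cheap enough to run before anything else. Here is a rough sketch, assuming rows are plain dictionaries; the column names and the normalization (trimming whitespace, lowercasing) are hypothetical choices for illustration, and real near-duplicate detection usually needs more than this.

```python
import hashlib


def row_fingerprint(row: dict, ignore: tuple = ()) -> str:
    """Hash a row's values, skipping columns such as IDs or the label."""
    items = sorted(
        (k, str(v).strip().lower()) for k, v in row.items() if k not in ignore
    )
    return hashlib.sha256(repr(items).encode("utf-8")).hexdigest()


def overlap_rate(train: list, test: list, ignore: tuple = ()) -> float:
    """Fraction of test rows whose (normalized) values also appear in train."""
    if not test:
        return 0.0
    seen = {row_fingerprint(r, ignore) for r in train}
    hits = sum(row_fingerprint(r, ignore) in seen for r in test)
    return hits / len(test)


# Hypothetical toy data: the first test row repeats a training row under a new ID.
train = [
    {"id": 1, "age": 34, "plan": "Pro", "label": 1},
    {"id": 2, "age": 51, "plan": "Free", "label": 0},
]
test = [
    {"id": 9, "age": 34, "plan": "pro ", "label": 1},
    {"id": 10, "age": 29, "plan": "Free", "label": 0},
]
leak = overlap_rate(train, test, ignore=("id", "label"))
```

Comparing rows without ignoring the ID column hides the overlap, which is one reason exact-match checks on raw rows miss this kind of leak.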
Q4. Explain regularization to a smart business stakeholder in 60 seconds.
What good looks like: A short analogy plus the math. “We add a penalty when the model uses a feature too aggressively. It is like telling a salesperson they can use any pitch but if they overuse one it costs them. The model picks simpler patterns that generalize better.” Bonus if they distinguish L1 (drops features) from L2 (shrinks weights).
Red flags: Math-only answer. Cannot explain to a non-technical audience.
Follow-up: “When would dropout be wrong, even for a deep network?”
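The L1 versus L2 distinction in Q4 can be shown on a single weight. The sketch below uses a one-weight toy problem (exact only in simplified settings such as orthonormal features, so treat it as intuition rather than a full solver): L1 applies soft-thresholding and sets small weights to exactly zero, while L2 scales every weight down without zeroing any.

```python
def l1_shrink(w: float, lam: float) -> float:
    """Minimizer of 0.5*(x - w)**2 + lam*|x| (soft-thresholding)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small weights are dropped entirely


def l2_shrink(w: float, lam: float) -> float:
    """Minimizer of 0.5*(x - w)**2 + 0.5*lam*x**2 (proportional shrinkage)."""
    return w / (1.0 + lam)


# Hypothetical unregularized weights and penalty strength.
unregularized = [3.0, -0.4, 0.1, -2.0]
lam = 0.5
l1_weights = [l1_shrink(w, lam) for w in unregularized]
l2_weights = [l2_shrink(w, lam) for w in unregularized]
```

With these numbers, L1 zeroes the two small weights and keeps the two large ones, while L2 keeps all four but makes each smaller.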
2. Model Building and Training (Questions 5-8)
This bucket is where most candidates either prove or disprove they have shipped real models. Hands-on people answer fast. Theorists struggle.
Q5. Walk me through the last model you trained, end to end.
What good looks like: The candidate names the business problem first (not the model). They describe the data (volume, source, labels), the baseline they tried, the model they ended on, and why. They can quote specific metrics and the absolute number that mattered to the business. “We got recall up from 0.62 to 0.79 which translated to $4M of recovered fraud per quarter.”
Red flags: Cannot name the business metric. Names the model first (“we used XGBoost”) without the problem context.
Follow-up: “What did you wish you had done differently?”
Q6. How do you decide between a transformer, a tree-based model, and a logistic regression for a tabular classification problem?
What good looks like: Default to gradient-boosted trees (XGBoost, LightGBM, CatBoost) for most tabular problems. Use logistic regression when interpretability or scoring latency matters. Use transformers only when you have very high-dimensional sparse features, sequential structure, or 100M+ rows. They can cite a 2025 benchmark comparing GBDTs with deep learning that shows trees still win on tabular data.
Red flags: “Always use deep learning.” Not knowing GBDTs are the tabular default.
Follow-up: “What if you only have 5,000 rows?”
Q7. You trained a model. The validation loss bounces around with no clear trend. What is happening?
What good looks like: Three to five specific causes. Learning rate too high. Batch size too small for the noise level. Validation set is too small (high variance in the metric). Wrong loss function for the problem (e.g. MSE on a heavy-tailed regression target). Bad initialization. The candidate gives one fix per cause.
Red flags: “Just train longer.” Not mentioning learning rate or batch size.
Follow-up: “Show me how you would set the initial learning rate.”
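The "validation set is too small" cause in Q7 is easy to quantify. As a rough sketch, using accuracy as a stand-in for the validation metric and assuming independent validation examples, the binomial standard error shows how much a metric can move from sampling noise alone.

```python
import math


def accuracy_standard_error(accuracy: float, n_val: int) -> float:
    """Binomial standard error of an accuracy estimate on n_val independent examples."""
    if n_val <= 0 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("need n_val > 0 and accuracy in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_val)


small = accuracy_standard_error(0.8, 100)     # about 4 points of noise
large = accuracy_standard_error(0.8, 10_000)  # about 0.4 points of noise
```

On a 100-example validation set, swings of several points between epochs are within noise, so a candidate who checks the validation set size before touching the learning rate is reasoning in the right order.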
Q8. How do you handle a 95% / 5% class imbalance?
What good looks like: Multiple options ranked. Use the right metric (precision-recall AUC, not accuracy). Try class weights first (cheapest), or a reweighting loss such as focal loss. Then resampling (SMOTE, undersampling). Then threshold tuning at inference time. The candidate distinguishes “imbalanced training data” from “imbalanced production traffic”; the two may not need the same solution.
Red flags: Defaults to oversampling without checking metrics. Reports accuracy on imbalanced data.
Follow-up: “What if the class imbalance shifts week-to-week in production?”
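To make Q8 concrete, the sketch below shows why accuracy misleads on a 95% / 5% split and how inverse-frequency class weights are computed. The weight formula is the common "balanced" heuristic, n_samples / (n_classes * class_count); the data is synthetic and the function names are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels: list) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def precision_recall(y_true: list, y_pred: list, positive: int = 1) -> tuple:
    """Precision and recall for the positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic 95/5 labels and a "model" that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall = precision_recall(y_true, always_negative)
weights = balanced_class_weights(y_true)
```

The do-nothing predictor scores 95% accuracy with zero recall, and the minority class ends up weighted roughly 19 times more heavily than the majority class.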
3. LLMs, Prompting and RAG (Questions 9-12)
This is the highest-weighted category in 2026 because almost every AI role now touches LLMs. Test for practical experience, not vibes. The Hugging Face State of Open Source AI 2025 shows 60% of new AI projects in production are LLM-based.
Q9. Walk me through a RAG pipeline you built. Where did it break and how did you fix it?
What good looks like: They name the chunking strategy and why. They describe the embedding model and why they chose it (cost, latency, recall on their domain). They name the vector store. They describe a real failure: bad chunk boundaries cutting off code blocks, retrieval missing rare terms, the LLM ignoring retrieved context. They can quote latency (P50 and P95) and an actual quality metric (e.g. RAGAS scores or human eval).
Red flags: Generic answer that could come from a tutorial. Cannot name a chunking decision. No latency numbers.
Follow-up: “What would you change if your corpus was 10x larger?”
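A chunking decision, as asked for in Q9, can be as simple as window size and overlap. The sketch below is a deliberately naive word-based chunker with overlap, written as an assumption of what a baseline might look like. It does not respect code blocks or headings, which is exactly the kind of boundary failure a strong candidate will describe fixing.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into chunks of up to chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is still partly visible to retrieval from both sides.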
Q10. When would you fine-tune instead of using few-shot prompting?
What good looks like: Default to prompting. Fine-tune when (1) the task has a strict output format that few-shot keeps breaking, (2) latency or cost matters and you can run a smaller fine-tuned model, (3) the domain is so specific that no foundation model has seen it. They mention LoRA or QLoRA as the cheap path. They know fine-tuning rarely teaches new facts (that is what RAG is for).
Red flags: “Always fine-tune for best results.” Not knowing fine-tuning does not add knowledge.
Follow-up: “How do you decide between LoRA and full fine-tuning?”
Q11. Your LLM hallucinates 8% of the time on your benchmark. How do you cut it to under 2%?
What good looks like: Layered approach. Better prompting (chain of thought, structured output, “say I don’t know if unsure”). Retrieval augmentation if facts are the issue. Constrained decoding (function calling, JSON mode, regex). Output validation with a second LLM call. They mention the tradeoff: every layer adds latency and cost.
Red flags: “Just use GPT-5.” Treating hallucination as a model bug rather than a system property.
Follow-up: “How do you measure hallucination rate without humans-in-the-loop?”
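The output-validation layer in Q11 does not always need a second LLM call. As a rough sketch, a rule-based check can reject answers that are malformed or that cite a source the retriever never returned. The schema (`answer`, `source_id`) and the abstention message are hypothetical and stand in for whatever your system actually emits.

```python
import json

REQUIRED_KEYS = {"answer", "source_id"}  # hypothetical output schema


def validate_llm_output(raw: str, known_source_ids: set) -> dict:
    """Return the parsed answer, or an explicit abstention if any check fails."""
    abstain = {"answer": "I don't know", "source_id": None}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return abstain  # not valid JSON at all
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return abstain  # wrong shape or missing fields
    source_id = parsed["source_id"]
    if not isinstance(source_id, str) or source_id not in known_source_ids:
        return abstain  # cites a document that was never retrieved
    return parsed


ok = validate_llm_output('{"answer": "30 days", "source_id": "doc-7"}', {"doc-7"})
```

A check like this only catches structural and citation errors. It says nothing about whether the answer matches the cited text, which is where the more expensive layers come in.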
Q12. Design a simple agent that books meetings. What are the failure modes?
What good looks like: They name the loop (perceive, plan, act, verify). They mention tool use for calendar and email APIs. They describe at least 3 failure modes: wrong attendee disambiguation, infinite loops on tool errors, prompt injection from email content, hallucinated meeting times, double-booking. They mention guardrails: max iterations, confirmation steps for destructive actions, output validation.
Red flags: “Just give it tools and a goal.” No mention of guardrails or failure modes.
Follow-up: “How would you test this in CI without spamming a real calendar?”
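The guardrails in Q12 and the CI follow-up can be illustrated together. The sketch below is a toy, not an agent framework: `FakeCalendar`, `book_meeting`, and the `confirm` callback are hypothetical names, and the candidate slots stand in for whatever a planner (LLM or otherwise) proposes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCalendar:
    """In-memory stand-in for a calendar API, so tests never touch a real one."""

    booked: set = field(default_factory=set)

    def book(self, slot: str) -> bool:
        if slot in self.booked:
            return False  # refuse to double-book
        self.booked.add(slot)
        return True


def book_meeting(candidate_slots, calendar, confirm, max_iterations=3):
    """Try proposed slots in order; stop on success or at the iteration cap."""
    for attempt, slot in enumerate(candidate_slots):
        if attempt >= max_iterations:
            break  # guardrail: never loop forever on failing tool calls
        if not confirm(slot):
            continue  # guardrail: confirmation before any side effect
        if calendar.book(slot):
            return slot  # verify step: only report success the tool confirmed
    return None
```

Because the calendar is injected, the same loop runs against the fake in CI and against a real client in production, which is one reasonable answer to the follow-up.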
4. MLOps and Production AI (Questions 13-16)
This bucket separates engineers who have shipped from those who have only experimented. Google’s MLOps maturity model, which runs from level 0 (manual process) to level 2 (automated CI/CD pipelines), is a useful reference. Most candidates self-rate at level 1-2. For senior hires, you want level 2.
Q13. How do you monitor a model in production?
What good looks like: Three layers of monitoring. (1) Service-level: latency, throughput, error rate. (2) Data-level: input distribution drift (PSI, KL divergence), missing features, schema changes. (3) Model-level: prediction distribution drift, label shift after labels arrive, calibration. They mention specific tools (Evidently, Arize, WhyLabs, custom Prometheus + Grafana). They describe what triggers a retrain.
Red flags: Only mentions latency and uptime. Treats monitoring as something the SRE team owns.
Follow-up: “How do you monitor an LLM where ground truth comes weeks later or never?”
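Of the drift measures named in Q13, PSI is simple enough to write by hand. The sketch below is one possible implementation, assuming equal-width bins over the reference sample's range and a small epsilon to avoid dividing by zero; both choices are assumptions, and tools such as Evidently make their own binning decisions.

```python
import math


def psi(expected: list, actual: list, n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic example: a uniform reference feature and the same feature shifted by 0.5.
reference = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in reference]
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold that should trigger a retrain depends on the feature and the model, so treat that number as a starting point.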
Q14. Describe your model deployment pipeline.
What good looks like: A specific story. They train in a notebook or script. They evaluate against a held-out set with a fixed metric. They package as a Docker image or registered MLflow model. They deploy via Kubernetes, SageMaker, or Vertex AI. They run shadow traffic before promoting. They use canary or blue-green deployment. They have a rollback path.
Red flags: “We deploy by uploading to S3 and the API picks it up.” No mention of testing, validation, or rollback.
Follow-up: “What is your worst deployment story?”
Q15. How do you handle versioning of data, code, and models together?
What good looks like: They distinguish three things. Code lives in Git. Data lives in DVC, LakeFS, or a versioned data warehouse. Models live in a registry (MLflow, Weights & Biases, SageMaker Model Registry). They tie all three together with a run ID or experiment hash. They can reproduce any past production model from the registry.
Red flags: “We just use Git tags.” Cannot reproduce a model from 6 months ago.
Follow-up: “Walk me through reproducing a model trained 9 months ago by someone who left the team.”
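The "run ID or experiment hash" in Q15 can be derived rather than assigned. As a minimal sketch, assuming you already have the Git commit and a content hash of the training data, a deterministic ID over code, data, and hyperparameters looks like this. The function name and the 12-character truncation are illustrative choices.

```python
import hashlib
import json


def run_id(git_commit: str, data_sha256: str, params: dict) -> str:
    """Deterministic ID tying together code version, data version, and hyperparameters."""
    payload = json.dumps(
        {"code": git_commit, "data": data_sha256, "params": params},
        sort_keys=True,  # key order must not change the ID
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical inputs: same code and params, two different data versions.
a = run_id("abc123", "d" * 64, {"lr": 0.1, "depth": 6})
b = run_id("abc123", "d" * 64, {"depth": 6, "lr": 0.1})
c = run_id("abc123", "e" * 64, {"lr": 0.1, "depth": 6})
```

The same inputs always give the same ID, and changing any one of the three gives a new one, so storing the ID with the registered model is enough to look up everything needed to rebuild it.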
Q16. Your model serves 10,000 requests per second at P99 latency 200ms. The CPO wants P99 under 80ms. What do you do?
What good looks like: Profile first. Is it the model, the IO, or the pre/post processing? Specific levers in order of impact: model distillation or quantization (FP32 -> FP16 -> INT8). Batch dynamically. Move embeddings to GPU. Cache hot inputs. Compile with TensorRT, or export to ONNX and serve with ONNX Runtime. Move to a smaller architecture if quality permits. They mention measuring quality after every step.
Red flags: “Just add more servers.” Not mentioning quantization or batching.
Follow-up: “Where does quantization break, and how do you catch it before users do?”
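One answer to the Q16 follow-up is outliers. The sketch below uses symmetric per-tensor INT8 quantization on a handful of made-up weights to show how a single large value stretches the scale and wipes out the small ones. It is a toy illustration of the failure mode, not how TensorRT or any specific runtime quantizes.

```python
def quantize_int8(values: list) -> tuple:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q: list, scale: float) -> list:
    """Map int8 codes back to floats."""
    return [x * scale for x in q]


def max_abs_error(values: list) -> float:
    """Largest round-trip error after quantizing and dequantizing."""
    q, scale = quantize_int8(values)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale)))


# Made-up weights: four small values, then the same values plus one outlier.
weights = [0.02, -0.015, 0.01, -0.03]
with_outlier = weights + [8.0]
codes_with_outlier, _ = quantize_int8(with_outlier)
```

With the outlier present, all four small weights quantize to zero. Catching this before users do means comparing quality metrics on a held-out set after quantization, not only checking latency.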
5. AI Engineering Judgement (Questions 17-20)
The last bucket is the cheapest to test and the most predictive. These are open-ended scenarios that reveal how a candidate thinks about real tradeoffs.
Q17. The CEO asks why you cannot just use ChatGPT for the customer support bot. What do you say?
What good looks like: They acknowledge ChatGPT might be the right answer for a starting point, then list the actual reasons it might not. Domain knowledge gaps that need RAG. Compliance and data residency. Cost at scale. Latency requirements. Hallucinations on policy questions. Need for fine-grained access control. They give a 30-day plan to test ChatGPT first and decide.
Red flags: Defensive answer (“we are AI engineers, we should build it”). Pure tech-snobbery without the cost case.
Follow-up: “If ChatGPT does work, what is your role on the team?”
Q18. Walk me through a model that worked in offline eval and failed in production.
What good looks like: Specific story. The candidate names the offline metric, the production gap, and the root cause. Common true root causes: train-serving skew, distribution shift, feedback loops in production data, label leakage masked at training, post-deployment changes to upstream features. They share what they did differently next time.
Red flags: Cannot remember an example. Blames the data team.
Follow-up: “How would you have caught this before deploying?”
Q19. Your team is 3 people. You can build a churn predictor or a recommendation engine. The CEO says both are critical. How do you decide?
What good looks like: They get the business numbers first. Current churn rate, revenue impact. Recommendation lift potential, current click-through. They estimate effort for each in time-to-launch terms (often a churn baseline is 3-4 weeks; a recommendation engine is 2-3 months for anything that beats popularity). They propose shipping the cheaper one first as a wedge while planning the bigger one. They mention how they would measure each in production.
Red flags: Gives a tech answer without business numbers. Cannot estimate effort within 2x.
Follow-up: “What if both are launched and neither moves the metric?”
Q20. Describe a time you pushed back on a stakeholder about an AI request.
What good looks like: A real story with a real stakeholder. Common patterns: pushed back on a feature that needed labels they did not have. Said no to a deadline because the model would be unsafe. Reframed a “build an AI” request into a smaller non-AI fix that solved the underlying problem. They describe how they handled the conversation, not just the tech decision.
Red flags: Cannot name a real example. Says they always say yes. Says they always say no.
Follow-up: “How did the relationship with that stakeholder evolve afterwards?”
What Top Interviewers Actually Listen For
Across 800+ interviews, three signals predict on-the-job performance better than the headline answer.
First: specific numbers. Strong candidates quote precision, recall, latency, and dollar impact in the same sentence. Weak candidates speak in adjectives (“it worked well”, “performance was good”).
Second: failure stories. Senior engineers tell you what broke and why. Mid-level engineers describe successes only. The willingness to talk about a model that failed in production is a stronger signal than any technical answer.
Third: tradeoff naming. Strong candidates frame every decision as a tradeoff. “We picked LightGBM because we needed sub-100ms latency. It cost us 1.5 points of AUC versus a deep model. The latency mattered more for the use case.”
2026 AI Hiring Context
Demand for AI engineers is up 4-5x in 2025-2026 versus 2023. The WEF Future of Jobs 2025 ranks AI specialists as the second-fastest-growing role globally. Salaries in the US for senior AI engineers run $180-250K base. The same talent in Vietnam or the Philippines runs $40-90K all-in.
The bottleneck for most teams is not budget. It is interview throughput. Hiring managers we work with are screening 200+ candidates per opening. Standardizing on questions like the 20 above cuts interview time by 40-60% and improves offer-acceptance rates because candidates rate the loop as more rigorous and respectful of their time.
Hiring AI Engineers Without the Marathon
If you do not want to run 50 interviews to hire 1 senior AI engineer, we can shortcut the loop. Second Talent pre-vets AI engineers across 9 Asian markets using the exact questions in this guide. We rate candidates on the same 5-area rubric. You see the rubric scores before you ever interview them.
Specifically, you can hire AI and ML Engineers for production model work, AI Agent Developers for autonomous workflows and tool use, and Data Scientists for the analysis-heavy roles. All come with a 7-day trial and 30-day replacement guarantee.
Conclusion
Twenty questions is more than any single loop should run. Pick the 6-8 that match the role and the seniority you are hiring for. Score on a 1-5 scale. Take notes on phrasing. Watch for specific numbers, failure stories, and tradeoff naming.
The pattern that predicts on-the-job success most reliably is not knowing every answer. It is showing the work behind the answer. A candidate who says “I do not know” and then walks through how they would find out is almost always stronger than the one who confidently quotes the wrong textbook answer.


