Senior GenAI Developer · Remote

GenAI Interview Saviour

A pressure-tested system for resume defense, GenAI depth, production debugging, coding, system design, mocks, and remote-company interviews. Every main question has a complete answer, hidden initially and revealed on click.

Audited June 14, 2026
158 complete click-to-reveal answers
Every resume claim challenged
35+ code samples
10 remote-company playbooks
Open Interview Saviour and complete the evidence-based readiness gate. Reading alone does not count.


No sheet can guarantee every interview. This one is designed to make gaps visible and train the repeatable reasoning that strong GenAI interviews reward.

Truth gate before memorising: your resume proves the headline metrics (35%, 40%, 60%, 68%, 70%, sub-second, awards, and 50+ interviews), but it does not prove every implementation detail in the sample answers below. Only state baselines, datasets, architectures, client facts, traffic numbers, timelines, and personal ownership that you can defend with real evidence.
2026 GenAI Interview Radar
Current platform vocabulary and durable reasoning, checked against official sources in June 2026
June 14, 2026 · 10 current-topic packs
2026 rule: model names, prices, quotas, and regional availability change quickly. In interviews, state the evaluated capability tier and deployment constraints first, then name a currently available model/service after confirming the role’s cloud and region.
26.1
What changed in senior GenAI interviews by 2026?

Expected Now

  1. Agents as production systems: runtime, identity, memory, tools, observability, evals, and recovery.
  2. Context engineering beyond prompt wording.
  3. Multimodal RAG over layout, tables, charts, images, and provenance.
  4. Continuous evaluation and production incident diagnosis.
  5. Security against injection, excessive agency, leakage, and unbounded consumption.

What No Longer Impresses Alone

  1. Framework-name memorization without trade-offs.
  2. A demo chatbot with no evaluation or failure handling.
  3. Claiming a model is “best” without a workload benchmark.
  4. Using an agent where a workflow is sufficient.
  5. Hard-coded old model names and prices.
26.2
AWS 2026 radar: which GenAI services and design choices matter?
  • Amazon Bedrock: managed foundation-model access, Knowledge Bases, Agents, Guardrails, model evaluation/customization, and multimodal RAG capabilities.
  • Amazon Bedrock AgentCore: production agent platform concerns such as secure runtime, enterprise tool/data connectivity, identity, tracing, debugging, evaluation, and AgentCore Policy for agent-to-tool controls.
  • Inference optimization: evaluate prompt caching and intelligent prompt routing where supported; confirm current model, region, and cost behavior.
  • Inference choice: Bedrock for managed models; SageMaker AI endpoints/HyperPod or EC2/EKS/ECS when deeper model/runtime control is required.
  • Architecture: apply the AWS Generative AI Lens and security reference architecture, not only service wiring.
26.3
GCP 2026 radar: which GenAI services and design choices matter?
  • Gemini Enterprise Agent Platform / Vertex AI: build, deploy, govern, and optimize enterprise agents and GenAI workloads.
  • Vertex AI Agent Engine: managed agent runtime with sessions, Memory Bank, code execution, observability, private networking, CMEK, resource/concurrency controls, and enterprise compliance support.
  • Google Agent Development Kit: an open-source, model-agnostic framework with current SDKs across Python, TypeScript, Go, Java, and Kotlin; evaluate it alongside A2A for interoperable agents.
  • RAG choices: RAG Engine, Vector Search, Agent Search, Feature Store vector retrieval, or custom AlloyDB/Cloud SQL patterns depending on scale and control.
  • Serving: Vertex AI managed endpoints/Model Garden or Cloud Run/GKE for application and custom-runtime control.
26.4
How should I discuss models in 2026 without becoming outdated?
Capability: Reasoning, coding, multimodal, tool use, structured output, context, language.
Operations: TTFT, throughput, availability, rate limits, regional support, observability.
Governance: Data use, retention, residency, safety, audit, vendor risk.
Economics: Input/output tokens, cache/batch discounts, serving utilization, operational cost.

Strong answer: “I maintain a task-specific evaluation and route to the smallest model tier that passes quality, safety, latency, and cost gates. I avoid choosing from benchmark headlines alone.”
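
A minimal sketch of that routing rule, assuming you already have per-tier results from your own evaluation suite. The `TierResult` and `Gates` names, their fields, and every number in the usage note are hypothetical illustrations, not provider data.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TierResult:
    name: str
    relative_cost: float   # cost per task relative to the cheapest tier (assumed unit)
    quality: float         # task-success rate on your eval suite, 0..1
    safety: float          # pass rate on safety cases, 0..1
    p95_latency_s: float   # measured P95 latency in seconds


@dataclass(frozen=True)
class Gates:
    min_quality: float
    min_safety: float
    max_p95_latency_s: float


def passes(result: TierResult, gates: Gates) -> bool:
    """A tier qualifies only if it clears every gate, not just quality."""
    return (
        result.quality >= gates.min_quality
        and result.safety >= gates.min_safety
        and result.p95_latency_s <= gates.max_p95_latency_s
    )


def cheapest_passing_tier(results: Iterable[TierResult], gates: Gates) -> Optional[TierResult]:
    """Return the lowest-cost tier that clears all gates, or None if none does."""
    candidates = [r for r in results if passes(r, gates)]
    return min(candidates, key=lambda r: r.relative_cost, default=None)
```

With three made-up tiers (small, medium, large) and a 0.90 quality gate, the function picks the medium tier when the small one misses on quality. A `None` result is the signal to revisit the gates or the design rather than to ship the largest model by default.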

26.5
Managed agent platform vs custom LangGraph/SDK runtime in 2026.
Dimension | Managed Platform | Custom Runtime
Time to production | Faster; runtime/identity/memory/observability integrated | Slower; assemble and operate components
Control | Platform constraints and regional feature availability | Full orchestration/runtime/model control
Portability | Potential cloud/platform coupling | Potentially portable with more ownership
Best fit | Enterprise team prioritizing managed governance and speed | Differentiated orchestration, unusual runtime, or multi-cloud need
26.6
2026 freshness checklist before every interview.
26.7
What is the current API and model-selection landscape as of June 14, 2026?

Interview-safe answer: use each provider's current unified API and capability discovery, but select models with an evaluation suite rather than a memorized leaderboard. For OpenAI, the Responses API is the primary agentic build path and works with tools, structured outputs, and the Agents SDK; its latest-model guide currently recommends GPT-5.5 for complex reasoning and coding. On AWS and GCP, confirm the model, feature, region, quota, and data-governance combination before committing to an architecture.

Quality gate: Task success, groundedness, tool correctness, safety, and schema validity.
Operational gate: TTFT, throughput, rate limits, context behavior, availability, and observability.
Governance gate: Residency, retention, data use, IAM, audit, and provider risk.
Cost gate: Measure end-to-end cost per successful task, including retries, tools, retrieval, and judges.
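
The cost gate reduces to one division. The sketch below assumes you can attribute spend per component; the component names and all figures are made up for illustration.

```python
def cost_per_successful_task(component_costs, successful_tasks):
    """Total spend across every component divided by tasks that actually succeeded."""
    if successful_tasks <= 0:
        raise ValueError("no successful tasks: cost per success is undefined")
    return sum(component_costs.values()) / successful_tasks


# Hypothetical spend for one evaluation window, in any single currency unit.
spend = {"model": 40.0, "retries": 6.0, "tools": 4.0, "retrieval": 5.0, "judges": 5.0}
print(cost_per_successful_task(spend, successful_tasks=800))
```

Dividing by successes rather than attempts is the point: if those 800 successes came from 1,000 attempts, cost per attempt understates what each useful outcome really costs.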
26.8
What are the latest MCP and A2A interoperability details?

Exact-date answer: the current stable MCP specification is 2025-11-25. A 2026-07-28 release candidate was announced on May 21, 2026, but its final release is scheduled for July 28, 2026, so do not call it the current stable spec yet. The RC adds a stateless core, Extensions, MCP Apps, Tasks, and authorization hardening.

Interface | Purpose | Production concern
Function/tool calling | A model emits a typed request to an application-owned tool | Schema validation, authorization, idempotency
MCP | An application connects models/agents to reusable tools, resources, and prompts | Server trust, OAuth/authz, data boundaries, approvals
A2A | Agents discover and communicate with other agents across systems | Identity, capability trust, task lifecycle, inter-agent security
26.9
What is the 2026 security baseline for LLM and agentic systems?

Threat-model against the OWASP Top 10 for LLM Applications 2025 and the OWASP Top 10 for Agentic Applications 2026. The agentic list expands the focus to goal hijacking, tool misuse, identity and privilege abuse, memory poisoning, insecure inter-agent communication, cascading failures, and rogue agents.

  1. Assume model output is untrusted: never use it as authorization; validate every argument and output.
  2. Bound agency: least-privilege identities, egress allowlists, step/time/token budgets, idempotency, and human approval for high-impact actions.
  3. Protect state: isolate tenants, provenance-tag memory, expire/delete it, and prevent untrusted content from becoming durable instructions.
  4. Verify continuously: adversarial evals, red-team tests, audit trails, kill switches, and incident exercises.
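
The first two rules can be sketched as a gate that sits between the model and the tools. This is an illustration under stated assumptions: the tool names, the schema table, and `check_tool_call` are hypothetical, and a real system would typically use a schema library and a policy engine rather than a hand-rolled check.

```python
# Hypothetical tool registry: tool name -> {argument name: required Python type}
TOOL_SCHEMAS = {
    "get_invoice": {"invoice_id": str},
    "issue_refund": {"invoice_id": str, "amount_cents": int},
}
HIGH_IMPACT_TOOLS = {"issue_refund"}


def check_tool_call(call, caller_permissions, human_approved=False):
    """Return (allowed, reason).

    caller_permissions must come from the authenticated caller's identity,
    never from anything the model wrote.
    """
    name = call.get("name") if isinstance(call, dict) else None
    schema = TOOL_SCHEMAS.get(name) if isinstance(name, str) else None
    if schema is None:
        return False, "unknown tool"
    if name not in caller_permissions:
        return False, "caller lacks permission"
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(schema):
        return False, "unexpected or missing arguments"
    for key, expected_type in schema.items():
        # Exact type match so that True/False is not accepted where an int is required.
        if type(args[key]) is not expected_type:
            return False, f"bad type for {key}"
    if name in HIGH_IMPACT_TOOLS and not human_approved:
        return False, "human approval required"
    return True, "ok"
```

The ordering is deliberate: identity-based permission is checked before arguments are inspected, and approval for high-impact tools is a separate input that the model cannot set.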
26.10
Which facts are durable, and which must I re-check before an interview?
Freshness class | Examples | Revision rule
Durable foundations | Attention, embeddings, RAG failure modes, distributed-systems trade-offs, least privilege | Learn deeply; explain from first principles
Review monthly | Agent frameworks, cloud capabilities, protocol specs, security guidance | Check official docs and release notes
Verify before interview | Model names, prices, quotas, context limits, regions, benchmark claims | Use the provider's current documentation
Verify with recruiter | Remote eligibility, interview steps, role scope, team stack | Current job posting and recruiter always win
This guide was audited on June 14, 2026. Treat every volatile product fact as an example of a decision, not as permanent truth.
Interview Saviour Operating System
Readiness gate, answer frameworks, decision matrices, current GenAI radar, and rescue scripts
Start Here · 10 survival packs
S1
Evidence-based readiness gate: am I actually ready?

Check an item only after doing it without reading the answer. The score persists in this browser.

Target: 80% means interview-ready; 100% means well prepared, not guaranteed. Unknown questions are inevitable. Your advantage is a strong reasoning process.
S2
How should I structure answers at 30 seconds, 2 minutes, and 10 minutes?
Answer Type | Structure | What Good Sounds Like
30 sec | Definition → distinction → when to use | Answer directly, distinguish the nearest concept, then give one decision rule.
2 min | Problem → options → choice → trade-off → metric | Show judgment, one rejected option, and evidence.
10 min | Requirements → design → deep dive → failure → scale → security → eval | Drive the conversation and invite the interviewer to choose a deep dive.
Behavioral | Situation → stakes → your action → result → lesson | Keep “we” for context and “I” for your contribution.
Debugging | Scope → hypotheses → instrumentation → isolate → mitigate → prevent | Do not jump directly to a favorite fix.
S3
Universal checklist for any GenAI system-design interview.
1 · Clarify: Users, job, risk, latency/QPS, freshness, modalities, tenancy, compliance.
2 · Measure: Quality definition, golden set, business metric, SLO, cost, human review.
3 · Design: Deterministic baseline; model, retrieval, tools, data flow, APIs, storage.
4 · Operate: Failures, fallback, observability, security, rollout, feedback, rollback.

Before Drawing

  1. Which error is unacceptable: false answer, missed answer, unsafe action, or slow response?
  2. Does knowledge change, or does behavior need adaptation?
  3. Is output advisory or does it trigger actions?
  4. What does success mean offline and online?
  5. Which data may enter prompts, logs, vendors, and eval sets?

Close Every Design With

  1. Top three failures and mitigations.
  2. Quality, latency, cost, and safety release gates.
  3. Progressive rollout and rollback plan.
  4. One limitation and next iteration.
  5. How it changes at 10× traffic or corpus.
S4
The decision matrix: RAG, fine-tuning, agents, models, and infrastructure.
Decision | Choose A When | Choose B When | Senior Caveat
RAG vs fine-tune | Facts change; citations/access control matter | Behavior, style, format, or task skill must change | They are complementary
Workflow vs agent | Path is known; auditability matters | Open-ended planning adds measured value | Start deterministic
Hosted vs open model | Fast iteration and managed capability | Control, privacy, specialization, scale economics | Include serving/ops cost
Large vs small model | Hard reasoning and broad tasks | Routing, extraction, classification, high volume | Use eval-driven routing
Managed vs custom RAG | Simple requirements and small team | Custom parsing, ranking, security, evaluation | Speed can outweigh flexibility
Lambda vs containers | Bursty short-lived orchestration | Long-running, streaming, custom runtime, GPU | Know execution limits
S5
Current senior GenAI interview radar: context, agents, MCP, multimodal, inference, security.

Agent & Context Engineering

  1. Context engineering: optimize instructions, tools, memory, evidence, and state inside a limited token budget.
  2. Agent harness: checkpointing, progress, budgets, permissions, isolation, and recovery across long tasks.
  3. Tool ergonomics: clear schemas, bounded outputs, actionable errors, evaluation-driven improvement.
  4. Agent evals: outcome, path efficiency, tools, safety, and recovery across nondeterministic runs.

Platform & Models

  1. MCP: hosts, clients, servers, tools/resources/prompts, transports, authorization, confused-deputy risk.
  2. Multimodal: OCR/layout/table/chart understanding, provenance, modality-specific evaluation.
  3. Inference: prefill/decode, KV cache, continuous batching, quantization, routing, TTFT.
  4. Security: constrain permissions, tools, outputs, consumption, and autonomy.
S6
How do I design evaluations that interviewers trust?
Task success: Did it complete the user’s job?
Quality: Correctness, relevance, faithfulness.
Process: Retrieval, tools, efficiency, recovery.
Safety: Injection, leakage, unauthorized action.
Operations: Latency, cost, errors, stability.
  1. Representative tasks: normal, edge, adversarial, multilingual, long-context, and no-answer cases.
  2. Layered graders: deterministic checks, calibrated human labels, then scalable LLM judges.
  3. Track slices: document type, user, query complexity, tool, model, and route.
  4. Gate releases: block critical safety, quality, latency, or cost regressions.
  5. Close loop: production traces and corrections become new eval cases.
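
Step 4 above can be sketched as a comparison of candidate metrics against a baseline with per-metric tolerances. The metric names, the `LOWER_IS_BETTER` set, and the numbers are hypothetical; your own gate would read them from your eval harness.

```python
# Metrics where a larger value is a regression (assumed names).
LOWER_IS_BETTER = {"p95_latency_s", "cost_per_task"}


def release_blockers(baseline, candidate, max_regression):
    """Return the metrics whose regression exceeds the allowed tolerance."""
    blockers = []
    for metric, tolerance in max_regression.items():
        delta = candidate[metric] - baseline[metric]
        # Normalize direction so that a positive number always means "got worse".
        regression = delta if metric in LOWER_IS_BETTER else -delta
        if regression > tolerance:
            blockers.append(metric)
    return blockers


baseline = {"faithfulness": 0.90, "safety_pass_rate": 1.00, "p95_latency_s": 1.20}
candidate = {"faithfulness": 0.93, "safety_pass_rate": 0.98, "p95_latency_s": 1.25}
tolerances = {"faithfulness": 0.01, "safety_pass_rate": 0.0, "p95_latency_s": 0.10}
print(release_blockers(baseline, candidate, tolerances))
```

In this made-up run the candidate improves faithfulness and stays within the latency tolerance, yet the release is still blocked because the safety pass rate dropped and its tolerance is zero. That is the behavior interviewers usually want to hear: an aggregate gain does not buy back a critical regression.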
S7
How do I threat-model an agent or RAG system?
Threat | Failure | Controls
Prompt injection | Untrusted content changes behavior | Trust zones, policy layer, validation; treat retrieved/tool content as data
Excessive agency | Too much autonomy/permission | Least privilege, scoped tools, budgets, approval, reversible actions
Sensitive disclosure | PII/secrets leak via prompts/output/logs | Minimize data, DLP, encryption, retention, isolation
Improper output handling | Model output becomes executable input | Schema validation, escaping, parameterization, sandboxing
Unbounded consumption | Loops cause denial of wallet/service | Token/tool/time budgets, quotas, circuit breakers
MCP authorization | Token theft/confused deputy | Explicit consent, scoped short-lived tokens, trusted redirects
Trap: a system prompt does not solve injection. Assume the model can be manipulated and constrain the surrounding system.
S8
Inference and cost deep-dive: answer beyond “use caching.”

Diagnose

  1. Network/queue → retrieval/rerank → prompt → prefill/TTFT → decode → post-process.
  2. Measure P50/P95/P99 by model, route, context, output, concurrency, warm/cold.
  3. Long input raises prefill; long output raises decode. Optimize the correct stage.
  4. Separate perceived streaming latency from total completion time.

Optimize Safely

  1. Model routing, context compression, retrieval precision, parallel independent work.
  2. Semantic/prefix cache with freshness and tenant-safe keys.
  3. Continuous batching, KV cache, quantization, speculative decoding.
  4. Bound outputs, tools, retries, loops; verify quality after every change.
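
One way to make "freshness and tenant-safe keys" concrete is to put the tenant and the index version into the cache key itself. This sketch covers an exact-match response cache only; a semantic cache would need the same tenant and version scoping applied to its lookup. The function and field names are hypothetical.

```python
import hashlib
import json


def cache_key(tenant_id, model, prompt, index_version):
    """Build a key that can never collide across tenants, models, or index versions."""
    payload = json.dumps(
        {"tenant": tenant_id, "model": model, "prompt": prompt, "index": index_version},
        sort_keys=True,  # stable serialization so equal inputs give equal keys
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because `index_version` is part of the key, promoting a new index makes old entries unreachable without an explicit purge, and a cached answer for one tenant cannot be served to another even when the prompts are identical.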
S9
What high-signal questions should I ask the interviewer?

Engineering Maturity

  1. What are the hardest production failure modes?
  2. How do you evaluate changes and what blocks release?
  3. Where do you intentionally avoid LLMs or agents?
  4. How are quality, latency, cost, and safety trade-offs decided?
  5. What does observability cover across retrieval, models, and tools?

Role & Team

  1. What does excellent impact look like in 90 days?
  2. Which decisions would this role own?
  3. What technical disagreement is the team working through?
  4. How are incidents and async handoffs handled?
  5. Why have strong engineers struggled in this role?
S10
Interview rescue scripts: what do I say when I am stuck?
Situation | Useful Script
I do not know | “I have not implemented that directly. My understanding is X. I would verify Y; the closest system I built was Z.”
Ambiguous | “May I clarify failure cost, traffic, freshness, and whether output triggers actions?”
Wrong assumption | “That conflicts with the new constraint. I would revise X; the trade-off becomes Y.”
Confidential | “I cannot share identifiers or exact volumes, but I can explain architecture, measurement, relative scale, and my contribution.”
Coding stuck | “I will state a brute-force baseline, test it on a small example, then optimize the bottleneck.”
Metric challenged | “Fair point. The metric measured X, not Y. The limitation is Z, and I would measure A next.”
Timed Mocks & Production War Room
Answer aloud before opening each card; score reasoning, not memorized keywords
Active Recall · 8 pressure drills
Mock rule: set a timer, speak before opening the card, then score one point each for requirements, trade-offs, metrics, failures, and security. Strong answers score at least 4/5.
M1
45 min Full senior GenAI engineer mock loop.
  1. 8m · Resume: career walkthrough; defend sub-second latency and one percentage claim.
  2. 8m · Depth: hybrid retrieval, reranking, evaluation, and when RAG is wrong.
  3. 15m · Design: secure multi-tenant financial research assistant with citations at 500 QPS.
  4. 8m · Incident: P95 doubles and faithfulness drops after an index refresh.
  5. 6m · Behavioral: disagreement, failure, remote collaboration, why this role.
Requirements: Clarified assumptions
Judgment: Compared options
Evidence: Used metrics
Reliability: Handled failure
Communication: Clear and concise
M2
10 min RAG quality drops after adding 500K documents. Diagnose it.
  1. Scope: slice by source, parser, document/query type, index version; verify eval set.
  2. Separate stages: inspect Recall@K/MRR and retrieved chunks before blaming generation.
  3. Hypotheses: parsing, duplicates/noise, metadata, embedding mismatch, top-k dilution, stale index.
  4. Mitigate: roll back index alias, isolate bad source, restore configuration.
  5. Prevent: ingestion gates, canary index, regression suite, versioned promotion.
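
Step 2 depends on computing retrieval metrics independently of generation. A minimal sketch of Recall@K and MRR follows, assuming each golden query has a set of labeled relevant document IDs; the IDs in the usage note are placeholders.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("a golden query needs at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs is a list of (retrieved, relevant) pairs, one per golden query."""
    if not runs:
        raise ValueError("no queries to score")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

For a query with relevant documents `d1` and `d2` and results `["d3", "d1", "d7", "d2"]`, Recall@3 is 0.5 and the reciprocal rank is 0.5. Running the same golden queries against the old and new index versions shows whether the regression is in retrieval before you touch prompts or models.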
M3
10 min Agent loops, repeats tool calls, and creates duplicate actions.

Immediate

  1. Restrict destructive tool; stop affected runs.
  2. Use idempotency keys and deduplication.
  3. Inspect state transitions, tool results, retries, model output.
  4. Replay safely with the same state in a sandbox.

Prevent

  1. Max steps/time/token/tool budgets and repeated-state detection.
  2. Explicit transitions and completion criteria.
  3. Retry only transient failures; give tools actionable errors.
  4. Human approval and regression evals.
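
The idempotency and budget controls above can be sketched as a thin wrapper around tool execution. This is an in-memory illustration with hypothetical names (`GuardedToolRunner`, `BudgetExceeded`); a production version would need a durable store and a downstream API that honors the key.

```python
import hashlib
import json


class BudgetExceeded(RuntimeError):
    """Raised when a run uses up its step budget."""


class GuardedToolRunner:
    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.steps = 0
        self._results = {}  # idempotency key -> recorded result

    @staticmethod
    def idempotency_key(run_id, tool_name, arguments):
        """Same run, tool, and arguments always map to the same key."""
        payload = json.dumps([run_id, tool_name, arguments], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, run_id, tool_name, arguments, tool_fn):
        # Every request counts against the budget, including repeats,
        # so a looping agent is stopped even if each call is deduplicated.
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"run {run_id} exceeded {self.max_steps} steps")
        self.steps += 1
        key = self.idempotency_key(run_id, tool_name, arguments)
        if key in self._results:
            return self._results[key]  # replay the recorded result; no second side effect
        result = tool_fn(**arguments)
        self._results[key] = result
        return result
```

Two separate protections are at work: deduplication prevents the duplicate action, and the step budget ends the loop. In an interview it is worth saying which of the two would have caught the incident first.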
M4
10 min Offline eval improved, but users say the model is worse.
  1. Check eval representativeness and recent production failures.
  2. Slice by user, language, task, context length, tool, and route.
  3. Validate graders for bias, leakage, weak rubrics, and poor agreement.
  4. Compare paired traces and UX/latency changes, not only answer scores.
  5. Roll back/reduce traffic, add failures to evals, recalibrate gates.
M5
10 min LLM cost doubled and P95 breached the SLO.

Instrument

  1. Cost/request by model, route, tenant, tokens, tools, retries.
  2. Latency waterfall: queue, retrieval, rerank, prefill, decode, tools.
  3. Compare deployment, traffic mix, prompt/context, provider changes.

Act Safely

  1. Stop runaway loops/retries and enforce budgets.
  2. Route simple work to small models; compress irrelevant context.
  3. Parallelize independent work; cache with safe freshness/tenant keys.
  4. Canary and prove quality remains acceptable.
M6
12 min Design multimodal financial-document intelligence.
Ingest: Layout-aware parsing, OCR confidence, tables, charts, provenance.
Represent: Text/table/image representations linked to page hierarchy.
Retrieve: Query routing, hybrid retrieval, security filters, reranking.
Evaluate: Table QA, chart reasoning, citations, OCR slices, numeric correctness.
Critical risks: wrong units, split tables, OCR digit errors, stale documents, inaccessible citations, confident numeric hallucinations.
M7
10 min Design a secure enterprise MCP platform with hundreds of tools.

Architecture

  1. Host controls UX, consent, policy, model, and clients.
  2. Focused servers expose bounded tools/resources with schemas.
  3. Discover/load tools on demand; avoid flooding context.
  4. Registry, health, versioning, audit, evaluation, revocation.

Security

  1. Per-user scoped auth; no token passthrough.
  2. Trusted servers, explicit consent, short-lived tokens.
  3. Sandbox execution; minimize data entering model context.
  4. Budgets, validation, approval, and kill switch.
M8
12 min Coding: implement a resilient bounded LLM request pipeline.

Clarify: async concurrency, timeout, retryable errors, backoff/jitter, rate limits, idempotency, validation, budget, cancellation, metrics, fallback.

Interview skeleton
async def bounded_llm_call(request, *, timeout_s, max_attempts, budget):
    # validate + reserve budget; acquire concurrency permit
    # call with timeout/idempotency; retry transient errors only
    # validate output; emit trace/metrics; release resources
    ...
State which errors you will not retry and how you prevent retry storms.
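
One possible expansion of the skeleton, kept small enough to write in an interview. It is a sketch under stated assumptions: `TransientError` and `PermanentError` are hypothetical markers you would map from your provider's real exceptions, `call` is any async function you supply, and budget reservation, idempotency keys, and metrics are left out for brevity.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as throttling or 5xx."""


class PermanentError(Exception):
    """Hypothetical marker for failures that must not be retried."""


async def bounded_llm_call(call, request, *, semaphore, timeout_s, max_attempts,
                           base_delay_s=0.5, max_delay_s=8.0, validate=None):
    """Call an async LLM function with bounded concurrency, timeout, and retries."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            async with semaphore:  # concurrency permit is held only during the call
                response = await asyncio.wait_for(call(request), timeout=timeout_s)
            if validate is not None and not validate(response):
                # Treated as non-retryable here; retrying invalid output is a design choice.
                raise PermanentError("response failed validation")
            return response
        except (TransientError, asyncio.TimeoutError) as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            # Exponential backoff with full jitter spreads retries to avoid retry storms.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0, delay))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

`PermanentError` and any other exception propagate immediately, which answers the "what will you not retry" question. The permit is released before the backoff sleep, so a waiting retry does not block other requests.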
Resume Cross-Examination
Defend every role, metric, technology, award, and transition without bluffing
Highest Priority · 12 drill packs · 70+ cross-questions
Metric defense rule: For every percentage, prepare the exact baseline, final value, dataset/sample size, measurement window, your personal contribution, and one limitation. If any element is confidential, say what you can disclose and explain the method.
R1
Walk me through your resume, career transitions, and “why now?”
Present: Senior GenAI Engineer at EPAM, building production Hybrid RAG and LLM systems for finance.
Pattern: Backend/data foundation → applied ML → enterprise GenAI ownership.
Proof: 5.8+ years, production impact across finance, healthcare, retail, and repeated awards.
Next: A senior remote role with deeper ownership of scalable AI product architecture.

Likely Cross-Questions

  1. Why did you move from full-stack to data engineering, then data science, then GenAI?
  2. Why are you considering a move after joining EPAM in August 2025?
  3. What makes you “senior” beyond years of experience?
  4. Which role changed your technical judgment most?
  5. Why remote, and what evidence shows you succeed remotely?
  6. Which parts of your resume are hands-on versus team-level outcomes?

Strong Answer Anchors

  1. Make the progression intentional: software → data platforms → models → LLM products.
  2. For a current-job move, stay positive and name the specific scope you seek.
  3. Define seniority through ambiguity handling, trade-offs, reliability, mentoring, and hiring.
  4. Keep the walkthrough to 90 seconds; let the interviewer choose the deep dive.
R2
Defend the EPAM Hybrid RAG and multimodal document-intelligence claim.
Problem: Name document types, users, failure cost, corpus scale, and why search was insufficient.
Design: Ingestion → parsing/OCR → chunking → embeddings/BM25 → fusion/rerank → generation → citations.
Trade-off: Explain why Bedrock, OpenSearch, and Lambda fit, plus where they do not.
Evidence: Bring retrieval, generation, latency, reliability, and cost metrics separately.

Architecture Pressure Test

  1. What made it hybrid? How did you fuse sparse and dense results?
  2. What was multimodal: scanned PDFs, tables, charts, images, or all four?
  3. How did you parse tables without losing row/column relationships?
  4. Why OpenSearch instead of Pinecone, pgvector, or Bedrock Knowledge Bases?
  5. Why Lambda; what happens with cold starts, long jobs, or GPU inference?
  6. How did you enforce tenant isolation and finance-domain access control?
  7. How were citations, abstention, and hallucination controls implemented?
  8. What would break first at 10x corpus size or QPS?

Answer Checklist

  1. Draw the full request path in under two minutes.
  2. Separate decisions you owned from platform constraints you inherited.
  3. Give one rejected design and the reason it lost.
  4. Name the golden dataset and the retrieval metric used to tune hybrid weights.
  5. Explain failure modes: parser errors, stale index, no relevant context, model timeout.
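
If asked how sparse and dense results were fused, be ready to write one fusion method from memory. Reciprocal rank fusion (RRF) is shown here as one common option, not as a statement of what your project used; substitute the method you actually implemented. The constant `k=60` is the value commonly used in the literature and is an assumption to tune.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    so only ranks are needed and the raw BM25/vector scores never have to be
    put on a common scale.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse = ["a", "b", "c"]   # placeholder BM25 ranking
dense = ["b", "d", "a"]    # placeholder vector ranking
print(reciprocal_rank_fusion([sparse, dense]))  # ['b', 'a', 'd', 'c']
```

Documents that appear in both lists rise to the top, which is the behavior to contrast with weighted score fusion, where the sparse and dense scores must first be normalized and the weights tuned against the golden dataset.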
R3
You claim sub-second latency. What exactly was sub-second, and how was it measured?
Do not say only “sub-second.” State whether it was time-to-first-token, retrieval latency, model latency, or end-to-end latency; then give percentile, workload, payload size, and measurement window.

Cross-Questions

  1. Was sub-second measured at P50, P95, or P99?
  2. Did streaming make perceived latency lower than total latency?
  3. What were the before/after values and number of sampled requests?
  4. Which change contributed most: caching, parallelism, smaller model, connection reuse, or retrieval tuning?
  5. How did you handle Lambda cold starts and Bedrock throttling?
  6. What accuracy or cost trade-off did latency optimization introduce?

Measurement Template

  1. Metric: TTFT / retrieval / end-to-end.
  2. Distribution: P50, P95, P99, not average alone.
  3. Conditions: concurrency, prompt/context tokens, model, warm/cold.
  4. Instrumentation: trace spans around every stage.
  5. Guardrail: verify quality and error rate did not regress.
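
The "not average alone" point is easy to demonstrate. The sketch below uses the nearest-rank percentile definition, which is one of several conventions, and the latency samples are invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented end-to-end latencies in milliseconds for ten requests.
latencies_ms = [120, 80, 950, 100, 110, 90, 105, 95, 115, 1500]
mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms, percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

Here the median is 105 ms, the mean is 326.5 ms, and P95 is 1500 ms, so "sub-second" would be true at P50 and false at P95. Ten samples are far too few for a real P95; state the sample count alongside the percentile.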
R4
Defend your PEFT and LoRA work on LLaMA and Gemma.

Must-Answer Technical Questions

  1. What task required fine-tuning rather than RAG or prompting?
  2. Which LLaMA/Gemma sizes and base versus instruct variants?
  3. What were the dataset size, schema, train/validation split, and cleaning rules?
  4. Which target modules, rank r, alpha, dropout, learning rate, and epochs?
  5. Did you use QLoRA? Explain NF4, double quantization, and compute dtype.
  6. How did you detect overfitting and catastrophic forgetting?
  7. What baseline and evaluation metric proved improvement?
  8. How were adapters merged, versioned, and served?

Decision Defense

  1. Fine-tune for behavior/style/task adaptation; use RAG for changing factual knowledge.
  2. Explain parameter and memory savings, not only “cheaper.”
  3. Discuss data licensing, PII removal, memorization, and rollback.
  4. Compare LoRA with full fine-tuning, prompt tuning, and continued pretraining.
  5. Bring one failed experiment and what it taught you.
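
To explain parameter savings with numbers rather than "cheaper," it helps to know the count by heart. For one weight matrix of shape d_out × d_in, LoRA trains two low-rank factors instead of the full matrix. The dimensions below are illustrative, not a claim about any specific LLaMA or Gemma variant.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """LoRA trains A (rank x d_in) and B (d_out x rank) and freezes the original weights."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # illustrative projection size
rank = 8
ratio = lora_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(lora_params(d_in, d_out, rank), full_params(d_in, d_out), f"{ratio:.4%}")
```

For this single 4096 × 4096 projection at rank 8, LoRA trains 65,536 parameters instead of 16,777,216, about 0.39%. The total depends on which target modules and how many layers you adapt, and the memory saving comes mainly from not storing gradients and optimizer state for the frozen weights.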
R5
How did the PwC multi-agent healthcare platform improve processing efficiency by 60%?
Baseline: Define the old process, human/system effort, throughput, and bottleneck.
Mechanism: Explain agents, tools, orchestration, state, and parallelizable work.
Metric: Say how “processing efficiency” was calculated and over what period.
Risk: Cover PHI/PII, auditability, clinical review, and failure escalation.

Cross-Questions

  1. Why multiple agents instead of one workflow with tools?
  2. What did each agent own, and how did they communicate?
  3. How did you prevent duplicate actions and non-deterministic loops?
  4. What state was persisted, and how did retries/checkpointing work?
  5. How were tools authorized and outputs validated?
  6. Where was human-in-the-loop mandatory?
  7. Why microservices, and what operational cost did that add?
  8. How did you evaluate agent-level versus end-to-end success?

Senior-Level Trade-Offs

  1. Agents increase flexibility but reduce predictability and observability.
  2. Use deterministic nodes for rules, calculations, and compliance gates.
  3. Bound execution with budgets, timeouts, allowed transitions, and circuit breakers.
  4. Expose the exact portion you personally architected or implemented.
R6
Defend PwC’s 40% retrieval accuracy, 70% assessment-time, and 68% execution-time improvements.
Resume Claim | Interviewer Will Ask | Your Required Evidence | Common Trap
40% retrieval accuracy | Accuracy means Recall@K, MRR, NDCG, or answer correctness? | Golden queries, labels, baseline, final metric, confidence/error analysis | Calling semantic similarity “accuracy”
70% assessment time | Whose time and from how long to how long? | Workflow timing, sample count, review rate, quality guardrail | Ignoring time shifted to reviewers
68% execution time | Which stages were parallelized and why safe? | Trace waterfall, dependency DAG, before/after percentiles | Comparing unmatched workloads

Follow-Ups

  1. What is chain parallelism, and how does it differ from batching?
  2. How did you handle partial failure when parallel calls diverged?
  3. How did you tune hybrid retrieval and reranking?
  4. What did the risk-intelligence agents produce, and how was correctness verified?
  5. What changed after production feedback?

Answer Pattern

  1. Give the formula before the percentage.
  2. State absolute values as well as relative improvement.
  3. Name the quality metric held constant during speed optimization.
  4. Acknowledge limitations and what you would measure next.
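
"Give the formula before the percentage" is simple to rehearse in code. The three helpers below are the usual definitions; every number in the usage note is invented to show the arithmetic and is not the baseline behind any resume claim.

```python
def relative_reduction(before, after):
    """Fraction by which a cost-like quantity (time, latency) dropped."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before


def relative_gain(before, after):
    """Fractional increase of a quality-like metric relative to its baseline."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before


def point_gain(before, after):
    """Absolute change, for example percentage points when inputs are rates."""
    return after - before
```

With invented values, a job that drops from 200 s to 50 s is a 75% reduction. A Recall@5 that moves from 0.50 to 0.70 is about a 40% relative gain but only 20 percentage points, so say which of the two your percentage means and quote the absolute values next to it.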
R7
Deep-dive the Cognizant semantic search, multi-LLM workflows, metadata linking, and security work.

Semantic Search / RAG

  1. Why Pinecone, which index/metric, and how did namespaces/metadata work?
  2. What did the 35% accuracy improvement measure?
  3. The resume says latency reduced but gives no number; what can you honestly claim?
  4. How did you select embeddings, chunk size, top-k, and reranker?
  5. What was required to move an R&D prototype into production?

Other Cognizant Claims

  1. Why Azure OpenAI versus Gemini Pro for each workflow step?
  2. What is metadata linking, and how was link quality evaluated?
  3. How did Checkmarx, Sonar, and Black Duck differ?
  4. What does “100% security compliance” specifically mean?
  5. How did you handle secrets, prompt injection, PII, and dependency risk?
R8
Defend your TCS data-engineering foundation and the move into data science.

Data Engineering Questions

  1. Draw one GCP ETL pipeline end to end: sources, orchestration, transforms, storage, consumers.
  2. Airflow versus Cloud Functions: what belonged where?
  3. How did you reduce cost by 40%: slots, partitioning, scheduling, storage, or serverless changes?
  4. How was manual effort reduced by 80%, and what remained manual?
  5. How did you ensure idempotency, backfills, schema evolution, and data quality?
  6. Explain BigQuery partitioning, clustering, and query-cost control.

Transition Questions

  1. Why leave data engineering for data science?
  2. How does your data-engineering background make you better at GenAI?
  3. Which parts of MLOps/LLMOps are direct extensions of data-platform work?
  4. What ML knowledge did you have to build deliberately?
  5. Would you still be comfortable owning a production data pipeline today?
R9
What did you personally build in your Django role and Python internship?

Oneness Tech Solutions

  1. What prediction did the analytics app make, using which model and features?
  2. How did Django, REST APIs, database, model inference, and UI connect?
  3. What changed to improve organic traffic by 65%, and which analytics source measured it?
  4. What production incident or performance bottleneck did you solve?
  5. What does “end-to-end product lifecycle” mean in concrete deliverables?

CodeSpeedy Internship

  1. How did the weather app handle API errors, caching, and rate limits?
  2. How was certificate generation automated?
  3. Explain the multithreaded socket system and race-condition risks.
  4. Threads versus processes versus asyncio in Python?
  5. How did technical writing improve your engineering communication?
R10
Your skills section is very broad. Which technologies can you genuinely defend?
Never answer “expert in all.” Classify each listed skill as production ownership, working proficiency, or informed exposure. Interviewers often choose the least-supported item.
Cluster | Likely Deep Probe | Have Ready
LangChain / LangGraph / LlamaIndex | When would you avoid each framework? | One production example, one limitation, one plain-Python alternative
AWS / GCP | Map equivalent services and explain operational trade-offs | One architecture on each cloud, IAM/networking/cost story
PyTorch / TensorFlow / MLflow | Show training/evaluation/registry experience | Real model, experiment setup, artifact/version flow
FastAPI / Flask / Django | Async, validation, middleware, deployment, testing | Why FastAPI for LLM serving; when Django still wins
Docker / Kubernetes / CI/CD | Health checks, secrets, autoscaling, rollback | A deployment you operated, not only packaged
R11
Defend your awards, top-1% rating, and experience interviewing 50+ candidates.

Leadership / Awards Questions

  1. Why did you receive nine PwC awards and seven Cognizant awards? Give two distinct stories.
  2. How was the Cognizant top-1% / 5-of-5 rating determined?
  3. What did the EPAM Delivery Head Award recognize?
  4. How did you influence without formal authority?
  5. Tell me about a teammate you mentored and the observable result.

Interviewer Questions

  1. What was your A3-level interview rubric?
  2. How did you distinguish memorization from real depth?
  3. Tell me about a close hire/no-hire decision.
  4. How did you reduce bias and calibrate with other interviewers?
  5. What common GenAI weakness did you observe across 50+ candidates?
R12
Final resume-defense questions: failures, gaps, ownership, and why hire you?

Behavioral Cross-Questions

  1. What is your biggest technical failure in production?
  2. When did you disagree with an architect or client?
  3. What metric did you improve but later realize was incomplete?
  4. What work are you least proud of, and what changed afterward?
  5. When did a GenAI approach fail and a simpler approach win?
  6. What is the hardest feedback you received?

Closing Questions

  1. Why should we hire you over a stronger ML researcher?
  2. Why should we hire you over a stronger backend engineer?
  3. What are your two biggest gaps for this role?
  4. What would your 30/60/90-day plan be?
  5. Which architecture decision would you revisit first in our product?
  6. What questions do you have for us that reveal engineering maturity?
Your positioning: You are strongest at turning GenAI ideas into measurable enterprise systems because you combine data pipelines, backend engineering, applied ML, cloud delivery, and stakeholder ownership. Do not position yourself as a frontier-model researcher unless the evidence supports it.
Remote Company & Interview Playbooks
Target roles, typical rounds, difficulty, and preparation emphasis
Role-dependent 10 companies
Process note · audited June 14, 2026: Exact rounds, remote eligibility, and location restrictions change by team and opening. Atlassian and Databricks publish current guides; where a company does not publish a role-specific sequence, the steps below are a preparation hypothesis, not a guaranteed loop. The current job posting and recruiter always win.
C1
Which remote/distributed companies best match this resume?
Company | Remote Signal | Best-Fit Role | Typical Emphasis | Difficulty
Twilio | Remote-first; remote India roles appear | Senior AI/ML, backend platform, applied AI | APIs, distributed systems, coding, ownership | High
Atlassian | Team Anywhere; distributed-first with entity/timezone rules | Senior/Principal ML or platform engineer | DSA, code design, system design, values | High
Databricks | Remote/hybrid varies strongly by role/location | ML platform, GenAI, solutions/field engineering | DSA, distributed data, ML systems, depth | Very high
GitLab | All-remote | AI-powered features, backend, MLOps | Async writing, values, practical technical depth | High
Elastic | Distributed company | Search/GenAI, relevance, platform engineering | Search internals, distributed systems, collaboration | High
Canonical | Remote-first | ML platform, Python/backend, cloud | Written application, academics, technical breadth | High / long
Automattic | Fully remote/global | Applied AI, experienced software engineer | Async communication, paid trial, product judgment | High / practical
Zapier | Fully remote | Applied AI, automation platform, backend | Written evidence, skills assessment, values | Medium-high
Remote | Remote-first, globally async | AI/data/backend platform | Async ownership, product engineering, 4–5 video rounds | Medium-high
Deel | Work-from-anywhere, global | AI automation, data, backend | Speed, ownership, automation, product impact | High
C2
Twilio: likely interview steps, difficulty, and targeted preparation.
Typical Loop: Recruiter → hiring manager → coding → system/ML design → behavioral/values → team match.
Difficulty: High. Expect production engineering depth, not GenAI vocabulary alone.
Your Angle: FastAPI/microservices, remote delivery, AWS, LLM reliability, measurable customer impact.
Risk: Your resume does not show CPaaS/telecom; bridge with API reliability and event-driven systems.

Prepare

  1. Design a globally reliable notification or conversational-AI API.
  2. Idempotency keys, retries, rate limits, queues, webhook security, observability.
  3. Medium DSA in Python with clean tests and complexity analysis.
  4. Explain how you would add safe LLM features to customer communications.

Questions to Expect

  1. How would you prevent duplicate message delivery?
  2. Design a multi-tenant AI support platform with strict latency SLOs.
  3. How do you operate an API during provider/model failure?
  4. Tell me about a time you improved reliability across teams.
C3
Atlassian: official engineering loop and how to prepare.
Typical Loop: Recruiter → coding (data structures + code design) → system design → manager → values.
Difficulty: High. Clean reasoning, maintainable design, and values are heavily weighted.
Your Angle: Distributed teamwork, enterprise platforms, interviewer experience, written design decisions.
Risk: Do not over-focus on LLMs; show broad distributed-software fundamentals.

Prepare

  1. DSA medium problems while narrating trade-offs and edge cases.
  2. Code design: extensible classes/APIs, tests, readability, change requests.
  3. System design: multi-tenant collaboration, permissions, search, scale, reliability.
  4. Five values stories with conflict, customer impact, and learning.

Likely Questions

  1. Design AI search across Jira and Confluence with permissions preserved.
  2. How would you evaluate and roll out an AI feature safely?
  3. Tell me when you changed your mind after strong disagreement.
  4. Write code that remains easy to extend after a new requirement.
C4
Databricks: likely interview steps, difficulty, and the gap-closing plan.
Official Shape: Talent acquisition → skill assessments → interviews → references → decision/offer.
Engineering Shape: Hiring-manager screen → technical screen → virtual panel → references → hiring committee.
Difficulty: Very high. Strong DSA plus distributed data/ML systems and deep project defense.
Your Gap: Resume shows BigQuery/Dataflow, not Spark/Delta depth; prepare this explicitly.

Prepare

  1. Spark execution: partitions, shuffle, joins, skew, caching, AQE, failure recovery.
  2. Lakehouse concepts: Delta transaction log, ACID, schema evolution, streaming.
  3. ML/LLM platform design: experiment tracking, feature/data lineage, serving, evaluation.
  4. Medium-hard DSA and rigorous complexity analysis.

Likely Questions

  1. Design an enterprise RAG/evaluation platform for thousands of teams.
  2. Why does a distributed join become slow, and how do you fix skew?
  3. How would you make LLM evaluation reproducible and governed?
  4. Compare warehouse, lake, and lakehouse architectures.
C5
GitLab: all-remote interview strategy.

Typical emphasis: recruiter/hiring-manager conversations, role-specific technical assessment or interview, cross-functional conversations, values, and strong written async communication. Difficulty is high because remote effectiveness is evaluated as an engineering competency.

Prepare

  1. Write a concise design proposal with assumptions, alternatives, risks, and rollout.
  2. Show documentation-first remote habits and transparent decision-making.
  3. Prepare product-aware answers for AI features across the DevSecOps lifecycle.

Expect

  1. How do you unblock yourself asynchronously?
  2. How would you ship and evaluate an AI coding feature?
  3. Tell me about a documented decision that prevented rework.
C6
Elastic: distributed search and GenAI interview strategy.

Typical loop: recruiter screen followed by role-specific interviews with technical and team stakeholders. Your strongest bridge is OpenSearch, hybrid retrieval, vector search, observability, and distributed remote delivery.

Prepare

  1. BM25, inverted indexes, HNSW, shards/replicas, refresh, mappings, filters.
  2. Hybrid retrieval evaluation and relevance tuning.
  3. Distributed failure, consistency, hot shards, and capacity planning.

Expect

  1. Why do vector and keyword search fail differently?
  2. How would you debug poor relevance?
  3. Design search for 10M changing documents.
C7
Canonical: remote-first but unusually writing-heavy and long.

Typical pattern: detailed written application/questions, domain review, assessments, and multiple interviews shown in a personalized candidate dashboard. Difficulty comes from breadth, written precision, and process length.

Prepare

  1. Polished written evidence for every claim; remove vague adjectives.
  2. Python/Linux/cloud fundamentals and open-source awareness.
  3. Academic and career choices, motivation, remote collaboration.

Expect

  1. Why Canonical and open source?
  2. What have you operated on Linux?
  3. Explain a technical decision clearly in writing.
C8
Automattic: fully remote, async, and paid-trial oriented.

Published process: application → interview → paid trial → final interview → offer. Developer hiring is conducted through the tools the company uses, so communication and practical execution matter as much as discussion.

Prepare

  1. A strong written project narrative with personal ownership and trade-offs.
  2. Practical coding, tests, incremental delivery, and async updates.
  3. A truthful explanation of how AI has changed your workflow.

Expect

  1. Why remote and why Automattic?
  2. Complete a realistic paid work trial.
  3. Show how you communicate progress without meetings.
C9
Zapier: remote automation company with evidence-first applications.

Published early stages: application review → recruiter interview → 45–60 minute hiring-manager interview, followed by role-specific skills assessment/final stages. Applications focus strongly on evidence of what you can do.

Prepare

  1. Automation/product thinking and user impact.
  2. Clear written answers using specific outcomes.
  3. Reliable integrations: retries, idempotency, rate limits, webhooks.

Expect

  1. How would you add AI to an automation safely?
  2. How do you prioritize speed versus reliability?
  3. Show an async ownership example.
C10
Remote and Deel: globally distributed product-engineering targets.

Remote

  1. Published average: roughly four weeks and 4–5 video interviews.
  2. Emphasize async ownership, documentation, product judgment, and compliant global systems.
  3. Prepare AI/data use cases for global HR, payroll, and support.

Deel

  1. Work-from-anywhere with high speed and accountability.
  2. Emphasize automation, measurable impact, and comfort across time zones.
  3. Prepare for product/system questions around global employment and compliance.
Concise 14-Day Study Plan
Prioritize defense and active recall instead of passively reading everything
Execution · 2–3 hours/day
P1
What should I study each day?
Days | Primary Work | Output You Must Produce
1–2 | Resume defense and metric evidence | 90-second pitch; one evidence sheet for every percentage
3–4 | RAG, vector DB, evaluation, hallucination | Draw your EPAM/PwC architecture from memory; answer 15 follow-ups aloud
5 | Agents, LangGraph, reliability | Defend why agents were needed; design failure controls
6 | PEFT/LoRA, Transformers, LLM inference | Explain LoRA math and one real experiment with hyperparameters
7 | ML/DL/NLP/statistics fundamentals | Rapid-fire recall without notes
8 | AWS/GCP, backend, security, LLMOps | One production design and one incident/debug story
9–10 | Python, SQL, DSA | 6 timed mediums; explain tests and complexity
11 | System and ML system design | Two 45-minute mocks: RAG platform and AI support system
12 | Company-specific prep | One-page Twilio, Atlassian, or Databricks brief
13 | Behavioral and remote stories | Eight STAR stories with metrics and lessons
14 | Full mock and gap repair | Record, review, shorten, and correct weak answers
ML, DL, NLP & LLM Foundations
Compact senior-level rapid fire for everything listed in the resume
Foundation Check · 8 packs · 55+ prompts
F1
Classical ML: what can be asked from the resume?

Core Questions

  1. Bias versus variance; diagnose each from train/validation curves.
  2. L1 versus L2 regularization; effect on coefficients and feature selection.
  3. Bagging versus boosting; Random Forest versus XGBoost.
  4. How trees choose splits; entropy versus Gini.
  5. When scaling matters and when it does not.
  6. How to handle missing values, outliers, high cardinality, and leakage.
  7. Why cross-validation can still be wrong for time series or grouped data.

Senior Answer Signals

  1. Start with the business cost of false positives/negatives.
  2. Choose splits that mimic production.
  3. Keep preprocessing inside the fitted pipeline.
  4. Explain calibration when probabilities drive decisions.
  5. Compare simple baseline before complex model.
F2
Statistics and evaluation: metrics, experiments, and uncertainty.
Question | Concise Answer Anchor
Precision vs recall? | Precision controls false positives; recall controls false negatives; choose from business cost.
ROC-AUC vs PR-AUC? | PR-AUC is more informative for rare positives; ROC-AUC can look optimistic.
Data drift vs concept drift? | Input distribution changes versus relationship between input and target changes.
Statistical vs practical significance? | A small effect can be statistically real but not worth shipping.
Offline vs online metric? | Offline enables fast iteration; online validates real behavior and system effects.

Rapid Fire

  1. Confidence interval and bootstrap.
  2. Type I/II error and power.
  3. A/B-test sample ratio mismatch.
  4. Class imbalance and threshold tuning.
  5. Model calibration and Brier score.

LLM Evaluation Bridge

  1. Why LLM-as-judge can be biased.
  2. How to create a golden set.
  3. Pairwise versus pointwise judging.
  4. How to measure inter-rater agreement.
  5. How to gate releases on quality and latency.
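The precision/recall trade-off, threshold tuning, and Brier score named above can be made concrete with a few lines of plain Python. This is a minimal sketch: the labels and probabilities are invented for illustration, and the function names are hypothetical rather than taken from any library.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall after turning probabilities into decisions."""
    preds = [p >= threshold for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Illustrative data only: 1 = positive class.
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.6, 0.4, 0.2, 0.1, 0.7]

strict = precision_recall(y_true, y_prob, threshold=0.5)   # fewer positives flagged
lenient = precision_recall(y_true, y_prob, threshold=0.3)  # more positives flagged
calibration_error = brier_score(y_true, y_prob)
```

Lowering the threshold from 0.5 to 0.3 raises recall here because the missed positive at 0.4 is now caught; which operating point is right depends on the relative cost of false positives and false negatives.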
F3
Deep learning fundamentals: training, optimization, and debugging.

Core Questions

  1. Backpropagation and the chain rule.
  2. Vanishing/exploding gradients and mitigations.
  3. ReLU, GELU, sigmoid, softmax: where and why.
  4. SGD versus Adam/AdamW; what weight decay changes.
  5. Batch norm versus layer norm.
  6. Dropout behavior in training versus inference.
  7. Learning-rate warmup, schedules, gradient clipping, accumulation.

Debugging Questions

  1. Loss becomes NaN: what do you inspect first?
  2. Train loss falls but validation worsens: what now?
  3. GPU out-of-memory: which levers preserve quality?
  4. Training is slow: data loader, precision, kernels, batching, profiling.
  5. How do mixed precision and loss scaling work?
F4
NLP and embeddings: before and beyond Transformers.

Questions

  1. TF-IDF/BM25 versus dense embeddings.
  2. Word2Vec CBOW versus skip-gram; static versus contextual embeddings.
  3. Cosine, dot product, and Euclidean distance.
  4. Tokenization: BPE, WordPiece, SentencePiece, unknown/rare words.
  5. Why embedding anisotropy matters.
  6. Bi-encoder versus cross-encoder.
  7. NER, classification, sequence labeling, and generation metrics.

Applied Follow-Ups

  1. How do you pick an embedding model for finance?
  2. How do multilingual embeddings change evaluation?
  3. Why can cosine similarity retrieve irrelevant text?
  4. How do you detect embedding/index drift?
  5. When is BM25 the correct answer?
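For the cosine / dot product / Euclidean question, a small worked example helps show why the metrics can rank documents differently until vectors are normalised. The sketch below uses toy two-dimensional vectors chosen only for illustration; real embeddings have hundreds of dimensions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


query = [1.0, 1.0]
same_direction = [3.0, 3.0]   # points the same way as the query
long_vector = [10.0, 0.0]     # different direction, large magnitude

# Raw dot product rewards magnitude, so the long vector wins.
dot_prefers_long = dot(query, long_vector) > dot(query, same_direction)
# Cosine ignores magnitude, so the aligned vector wins.
cosine_prefers_aligned = cosine(query, same_direction) > cosine(query, long_vector)

# After L2 normalisation, dot product equals cosine,
# and squared Euclidean distance equals 2 - 2 * cosine.
u, v = normalize(query), normalize(long_vector)
```

This is why an index's distance metric should match how the embedding model was trained, and why many pipelines normalise vectors before indexing.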
F5
Transformers: explain the architecture and its scaling costs.

Architecture Questions

  1. Derive scaled dot-product attention: softmax(QKᵀ/√d)V.
  2. Why divide by √d?
  3. Why multiple attention heads?
  4. Encoder-only versus decoder-only versus encoder-decoder.
  5. Causal masks, padding masks, and cross-attention.
  6. Absolute, sinusoidal, RoPE, and ALiBi positions.
  7. Residual connections, layer norm, and feed-forward blocks.

Scaling / Inference

  1. Why attention is quadratic in sequence length.
  2. KV cache: what it stores and its memory cost.
  3. Prefill versus decode; why decode is memory-bound.
  4. Continuous batching and paged attention.
  5. Quantization trade-offs: weights, activations, KV cache.
  6. Mixture-of-Experts benefits and routing challenges.
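Two of the items above lend themselves to a short worked sketch: scaled dot-product attention, softmax(QKᵀ/√d)V, with an optional causal mask, and the arithmetic behind KV-cache memory. The code is deliberately framework-free, and the model dimensions in the cache example are hypothetical, not those of any specific model.

```python
import math


def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Keys and values (factor 2) stored per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
full = attention(Q, K, V)
masked = attention(Q, K, V, causal=True)

# Hypothetical configuration: 32 layers, 8 KV heads, head_dim 128, 4096 tokens, fp16.
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
```

With the causal mask, the first position can only see itself, so its output is exactly the first value vector. The cache figure grows linearly with sequence length and batch size, which is the usual starting point for explaining why decode is memory-bound.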
F6
LLM training and alignment: pretraining through preference optimization.
Stage | Purpose | Data / Objective | Main Risk
Pretraining | Learn language/world patterns | Large corpus; next-token prediction | Cost, contamination, unsafe knowledge
Continued pretraining | Adapt domain language | Unlabeled domain corpus | Forgetting and domain overfit
SFT | Teach instruction behavior | Prompt-response examples | Data quality and imitation limits
RLHF | Optimize human preferences | Preference/reward model + RL | Complexity and reward hacking
DPO | Preference alignment directly | Chosen/rejected pairs | Preference-data bias

Rapid Fire

  1. Temperature, top-k, top-p, repetition penalty.
  2. Distillation versus quantization versus pruning.
  3. Catastrophic forgetting.
  4. Data contamination and benchmark leakage.
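For the temperature / top-k / top-p item, it can help to see how each knob reshapes the next-token distribution before any sampling happens. The following is a simplified sketch over a toy three-token vocabulary; the function name and logits are hypothetical, and repetition penalty is left out.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the renormalised token distribution after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(value - m) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda item: -item[1])
    if top_k is not None:
        probs = probs[:top_k]             # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in probs:              # keep the smallest prefix reaching top_p mass
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        probs = kept
    z = sum(p for _, p in probs)
    return {tok: p / z for tok, p in probs}


logits = {"a": 2.0, "b": 1.0, "c": 0.0}   # toy values
baseline = sampling_distribution(logits)
sharper = sampling_distribution(logits, temperature=0.5)
top2 = sampling_distribution(logits, top_k=2)
nucleus = sampling_distribution(logits, top_p=0.5)
```

Lower temperature concentrates mass on the most likely token, top-k truncates by count, and top-p truncates by cumulative probability, so the number of surviving tokens adapts to how peaked the distribution is.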

Decision Questions

  1. Prompting vs RAG vs fine-tuning.
  2. Open model vs hosted API.
  3. Small specialized model vs large general model.
  4. When not to use an LLM.
F7
SQL, data pipelines, and MLOps questions implied by the resume.

SQL / Data

  1. Window functions, CTEs, joins, deduplication, top-N per group.
  2. Query plan, indexes, partition pruning, clustering.
  3. Batch versus streaming and exactly-once semantics.
  4. Idempotency, backfills, late data, schema evolution.
  5. Data quality checks and lineage.
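Deduplication with a window function is a common way to make a reload or backfill idempotent: keep only the latest row per business key. The sketch below uses Python's built-in sqlite3 module so it runs without a warehouse; it assumes a SQLite build with window-function support (3.25 or later), and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (event_id TEXT, user_id TEXT, amount REAL, loaded_at TEXT);
    INSERT INTO events VALUES
        ('e1', 'u1', 10.0, '2024-01-01'),
        ('e1', 'u1', 12.0, '2024-01-02'),  -- re-delivered event with a later load time
        ('e2', 'u2',  5.0, '2024-01-01');
    """
)

# Rank rows within each event_id by recency, then keep only the newest one.
rows = conn.execute(
    """
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY event_id ORDER BY loaded_at DESC
               ) AS rn
        FROM events
    )
    SELECT event_id, user_id, amount
    FROM ranked
    WHERE rn = 1
    ORDER BY event_id
    """
).fetchall()
conn.close()
```

The same ROW_NUMBER pattern generalises to top-N per group by changing the filter to `rn <= N`; in BigQuery the equivalent query can also use QUALIFY.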

MLOps / LLMOps

  1. Experiment tracking, registry, reproducibility, and rollback.
  2. Feature/data/model/prompt version compatibility.
  3. Shadow, canary, and A/B deployment.
  4. Monitoring drift, quality, cost, latency, and safety.
  5. How LLMOps differs from classical MLOps.
F8
Security, privacy, and responsible-AI questions for enterprise GenAI.

Threats

  1. Direct and indirect prompt injection.
  2. Data exfiltration through tools or retrieved context.
  3. Over-privileged agents and confused-deputy attacks.
  4. Training-data poisoning and unsafe dependencies.
  5. PII/PHI leakage, retention, and regional compliance.

Controls

  1. Least privilege, scoped credentials, allowlisted tools, policy checks.
  2. Separate instructions from untrusted data; sanitize and label context.
  3. Input/output validation, DLP, encryption, audit logs.
  4. Human approval for consequential actions.
  5. Red-team evals, incident response, and kill switches.
RAG Architecture
Retrieval-Augmented Generation — ingestion, retrieval, generation, evaluation
High Priority 8 questions
Offline ingestion
Build and refresh the searchable knowledge index
  1. Sources: Documents (PDF · API · S3 · tables)
  2. Understand: Parse & chunk (OCR · hierarchy · metadata)
  3. Represent: Embed (dense vectors + text fields)
  4. Index: OpenSearch (HNSW vector index + BM25)
Online query path
Retrieve evidence first, then generate a grounded answer
  1. Request: User query (intent + access context)
  2. Recall: Hybrid retrieval (dense + BM25 · RRF fusion)
  3. Precision: Rerank (cross-encoder top-k)
  4. Generate: Grounded LLM (context + refusal rules)
  5. Response: Cited answer (answer · sources · confidence)
[Figure: Ingestion pipeline (offline): Documents (PDF / API / S3) → Chunking (hierarchical) → Embed (text-emb-3) → Vector DB (OpenSearch). Query pipeline (online): Retriever (hybrid BM25 + ANN) → LLM generator (Claude / Bedrock) → Answer.]
Production RAG separates asynchronous indexing from the latency-sensitive query path.
Q1
Explain Hybrid RAG vs naive (dense-only) RAG. When and why do you use hybrid?

Naive RAG uses only dense vector search (bi-encoder embeddings + ANN). It captures semantic similarity well but fails on exact keywords, entity names, numeric IDs, or ticker symbols.

Hybrid RAG combines dense vector search with sparse BM25 keyword matching (a TF-IDF-style lexical scorer). The two result lists are merged using Reciprocal Rank Fusion (RRF): each document gets a score of Σ 1/(k + rank_i) across all rankers, where rank_i is its 1-based position in ranker i's list and k is a smoothing constant (commonly 60); documents are then re-sorted by this fused score.

Dimension | Naive (Dense) | Hybrid (Dense + BM25)
Semantic queries | Excellent | Excellent
Exact keyword/ID | Misses | Catches
Financial codes | Poor | Good
Setup complexity | Low | Medium
Recall improvement | Baseline | +15–40% typical
Your real result: At PwC, switching to Hybrid RAG on AWS OpenSearch gave a 40% accuracy improvement on a US Financial client's document Q&A system — because financial product codes and ISIN numbers needed exact matching that dense search missed.
Python · LangChain EnsembleRetriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import OpenSearchVectorSearch

# Sparse retriever
bm25 = BM25Retriever.from_documents(docs, k=10)

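# Dense retriever (opensearch_url is required; the value below is a placeholder)
vs = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="http://localhost:9200"
)
dense = vs.as_retriever(search_kwargs={"k": 10})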

# Hybrid: 40% BM25 + 60% semantic, merged via RRF
hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.4, 0.6]   # tune per domain
)
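The RRF formula described above is small enough to implement without a framework, which is useful when an interviewer asks what the ensemble step actually does. This is a minimal sketch: the document IDs are hypothetical, ranks are 1-based, and k=60 is the commonly used smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["isin-123", "doc-a", "doc-b"]    # hypothetical keyword results
dense_hits = ["doc-a", "doc-c", "isin-123"]   # hypothetical vector results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that appear in both lists rise to the top even if neither ranker placed them first, which is the behaviour that helps exact-match identifiers survive alongside semantic matches.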
Q2
What chunking strategies exist? Which do you choose and why?

Chunking directly controls what the retriever can find. The wrong chunk size either splits context across boundaries or retrieves too much noise.

Strategy | How it works | Best for | Tradeoff
Fixed-size | Splits every N tokens with optional overlap | Simple docs, fast prototype | Splits mid-sentence
Sentence splitter | Splits at sentence boundaries | General prose | Uneven chunk sizes
Recursive character | Tries \n\n → \n → " " → "" in order (LangChain's default separators) | Most text documents | Slight overhead
Semantic chunking | Starts a new chunk where cosine similarity between adjacent sentences drops | Dense research docs | Slow; needs embedding at ingest
Parent-child | Small child chunks indexed; parent context served to LLM | Long reports, contracts | Larger index size
Late chunking | Embed full doc, then chunk, which preserves document context in each chunk embedding | Context-sensitive passages (jina-embeddings-v3) | Newer, less tooling
Production recommendation: Use parent-child (LangChain ParentDocumentRetriever) with child chunk 256 tokens, parent 1024 tokens. Retrieve small chunks for high precision; inject the parent window into the LLM for full context.

Always add chunk overlap (10–20% of chunk size) so sentences at boundaries aren't orphaned. Metadata-enrich every chunk: source_file, page_number, created_at, entity_type — these become filterable fields in the vector DB.
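The overlap and metadata advice above can be sketched as a small fixed-size chunker. This is an illustration under stated assumptions: whitespace-separated words stand in for real tokenizer tokens, and the function name and metadata values are hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size=256, overlap=32, metadata=None):
    """Split a token list into overlapping windows and attach metadata to each."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "start_token": start,
            "end_token": start + len(window),
            **(metadata or {}),            # becomes filterable fields in the index
        })
        if start + chunk_size >= len(tokens):
            break                          # the last window already reaches the end
    return chunks


words = [f"w{i}" for i in range(600)]      # stand-in for a tokenised document
chunks = chunk_with_overlap(
    words,
    chunk_size=256,
    overlap=32,                            # 12.5% of the chunk size
    metadata={"source_file": "report.pdf", "page_number": 1},
)
```

Each chunk repeats the last 32 tokens of its predecessor, so a sentence straddling a boundary appears whole in at least one chunk, at the cost of a slightly larger index.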

Q3
What is re-ranking? When is the extra latency justified?

Initial retrieval uses a bi-encoder (query and document embedded separately) because it's fast — you embed the query once and do ANN lookup. But bi-encoders miss subtle relevance nuances.

A cross-encoder re-ranker takes (query, document) pairs and scores them jointly with full attention — far more accurate. You run it on just the top-20 retrieved results (not the full index), making the cost manageable.

Stage | Model type | Speed | Accuracy | Scale
ANN retrieval | Bi-encoder | ~5ms | Good | Millions of docs
Re-ranking | Cross-encoder | ~80–200ms | Excellent | Top 20–50 only

Use re-ranking when: precision matters over throughput (finance, healthcare, legal), corpus has many near-duplicate docs, or faithfulness is critical.

Python · Cohere Rerank
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever  # retrieve top-20 first
)
Q4
What is RAG Fusion and how does it differ from standard hybrid RAG?

RAG Fusion uses the LLM itself to generate multiple alternative query reformulations from the user's original query, runs each through the retriever independently, then fuses all result lists with RRF before generating the answer.

Standard hybrid RAG uses one query, two retrieval methods (dense + sparse), then fuses. RAG Fusion uses N query variations, one or more retrieval methods, then fuses — it solves vocabulary mismatch and query ambiguity at the query level.

Python · RAG Fusion pattern
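def rag_fusion(query: str, k: int = 5, rrf_k: int = 60) -> list[str]:
    """Return the top-k document IDs fused across several query rewrites.

    Assumes `llm` is a LangChain chat model and `retriever` a LangChain retriever.
    """
    # 1. Generate query variations via LLM, one per line, and parse them into a list
    response = llm.invoke(
        f"Generate 4 alternative versions of this query, one per line:\n{query}"
    )
    variations = [line.strip() for line in response.content.splitlines() if line.strip()]

    # 2. Retrieve for each variation and accumulate RRF scores (ranks are 1-based)
    scores: dict[str, float] = {}
    for q in [query, *variations]:
        for rank, doc in enumerate(retriever.invoke(q), start=1):
            doc_id = doc.metadata["id"]
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rrf_k + rank)

    # 3. Sort by fused score and return the top-k document IDs
    return sorted(scores, key=scores.get, reverse=True)[:k]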
Q5
How do you evaluate a RAG pipeline? What metrics and tools do you use?

RAG evaluation has two orthogonal dimensions: retrieval quality (did we get the right chunks?) and generation quality (did the LLM use them faithfully?).

Metric | Formula (simplified) | Target | Tool
Context Precision | Relevant retrieved / Total retrieved | >0.8 | RAGAS
Context Recall | Facts in context / Total needed facts | >0.75 | RAGAS
Faithfulness | Claims supported by context / Total claims | >0.85 | RAGAS / LangSmith
Answer Relevancy | Mean cosine similarity between the original question and questions regenerated from the answer | >0.7 | RAGAS
Latency P95 | 95th-percentile end-to-end latency (ms) | <2000 ms | CloudWatch
Token cost / query | Input + output tokens × price | Track trend | LangSmith
Python · RAGAS evaluation
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall
)
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_chunks_per_q,   # List[List[str]]
    "ground_truth": reference_answers
})

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results.to_pandas())
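Before reaching for an LLM-judged framework, the simplified formulas in the table can be computed directly from labelled chunk IDs. The sketch below is a label-based approximation, not the RAGAS implementation (which is LLM-judged and rank-aware); the function names and chunk IDs are hypothetical.

```python
import math


def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are labelled relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of labelled-relevant chunks that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    return len(set(retrieved_ids) & relevant) / len(relevant)


def p95(latencies_ms):
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


retrieved = ["c1", "c2", "c3", "c4"]       # what the retriever returned
relevant = {"c1", "c3", "c9"}              # golden-set labels for this question
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
latency_p95 = p95(list(range(100, 2100, 100)))   # 20 synthetic samples, 100..2000 ms
```

Keeping a cheap deterministic check like this alongside LLM-judged metrics makes regressions in retrieval easier to separate from judge noise.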
Q6
How do you prevent hallucinations in RAG systems?
  1. Grounding instructions in system prompt: Explicitly instruct the LLM: "Answer only from the provided context. If the answer is not in the context, say 'I don't have enough information.'" This is usually the cheapest first control; quantify its effect on your own golden set rather than quoting a fixed percentage.
  2. Citations at generation time: Ask the LLM to cite the source chunk for every claim. You can then programmatically verify that each cited chunk exists in the retrieved context.
  3. Faithfulness guardrail: Run a RAGAS faithfulness check on a sample of production traffic. Alert if the score drops below a threshold.
  4. Confidence thresholding: If retrieval similarity scores are all below a threshold (e.g., cosine < 0.6), return "no relevant information found" rather than hallucinating.
  5. NeMo Guardrails / Bedrock Guardrails: Add a post-generation layer that checks the answer doesn't reference topics outside the retrieved context.
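Two of the controls above, confidence thresholding and citation verification, are deterministic and can be sketched in a few lines. The bracketed `[chunk-id]` citation format, the 0.6 threshold, and the function names are assumptions for illustration; a production system would tune the threshold on evaluation data.

```python
import re

NO_ANSWER = "I don't have enough information."


def should_answer(similarities, threshold=0.6):
    """Only generate when at least one retrieved chunk clears the threshold."""
    return any(score >= threshold for score in similarities)


def invalid_citations(answer, retrieved_ids):
    """Return cited chunk IDs that were never retrieved (assumes [chunk-id] markers)."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return sorted(cited - set(retrieved_ids))


def guarded_answer(answer, similarities, retrieved_ids):
    """Fall back to a refusal if retrieval was weak or a citation cannot be verified."""
    if not should_answer(similarities):
        return NO_ANSWER
    if invalid_citations(answer, retrieved_ids):
        return NO_ANSWER
    return answer


draft = "The fee is 2% [chunk-7]. The cap is 5% [chunk-9]."
retrieved_ids = {"chunk-7", "chunk-3"}
bad = invalid_citations(draft, retrieved_ids)
final = guarded_answer(draft, [0.82, 0.64], retrieved_ids)
```

Here the draft cites a chunk that was never retrieved, so the guard replaces it with the refusal message; a softer variant could instead drop the unverifiable sentence or trigger a retry.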
Q7
What is GraphRAG and when would you use it over standard RAG?

GraphRAG (Microsoft, 2024) builds a knowledge graph from documents — entities become nodes, relationships become edges. Retrieval traverses the graph rather than (or in addition to) doing vector search.

It excels when the question requires connecting multiple entities across many documents ("What are all the risk factors shared by companies X and Y?") — a pattern that naive RAG misses because no single chunk contains the answer.

Use case | Standard RAG | GraphRAG
Factual Q&A on single docs | ✅ Best | Overkill
Multi-hop reasoning across entities | Struggles | ✅ Best
Relationship queries ("who reports to X") | Misses | ✅ Best
Real-time index updates | ✅ Easy | Slow (graph rebuild)
Q8
How do you handle multi-turn conversation memory in RAG?

Multi-turn RAG needs to resolve references ("it", "the same document", "as you mentioned") to prior turns before retrieval — otherwise the query is incomplete.

  • Query condensation: Pass last N messages + current query to LLM with instruction "Rewrite the final question as a standalone query." Then retrieve with the condensed query.
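  • ConversationalRetrievalChain: LangChain's legacy built-in chain handles this automatically; recent releases deprecate it in favour of create_history_aware_retriever combined with create_retrieval_chain.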
  • Context window management: Summarise old turns with a summarisation LLM when conversation exceeds 20 turns, to keep token cost bounded.
  • External memory stores: For long-running sessions, persist summaries to Redis or a DB keyed by session_id.
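The condensation and summarisation steps above can be sketched as a small memory class. This is a simplified illustration: `summarize` is a caller-supplied function standing in for a summarisation LLM call, the class and parameter names are hypothetical, and persistence to Redis or a database is omitted.

```python
class ConversationMemory:
    """Keep recent turns verbatim and fold older turns into a running summary."""

    def __init__(self, summarize, max_turns=20, keep_recent=6):
        self.summarize = summarize         # callable(previous_summary, old_turns) -> str
        self.max_turns = max_turns
        self.keep_recent = keep_recent
        self.summary = ""
        self.turns = []                    # list of (role, text) tuples

    def add_turn(self, role, text):
        self.turns.append((role, text))
        if len(self.turns) > self.max_turns:
            cut = len(self.turns) - self.keep_recent
            old, self.turns = self.turns[:cut], self.turns[cut:]
            self.summary = self.summarize(self.summary, old)

    def condense_prompt(self, question):
        """Build the prompt that asks an LLM for a standalone retrieval query."""
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return (
            f"Summary of earlier conversation: {self.summary or 'none'}\n"
            f"Recent turns:\n{history}\n"
            f"Rewrite the final question as a standalone query: {question}"
        )


def fake_summarize(previous, old_turns):
    # Stand-in for an LLM summariser: just records how many turns were folded in.
    note = f"{len(old_turns)} earlier turns"
    return f"{previous} {note}".strip()


memory = ConversationMemory(fake_summarize, max_turns=4, keep_recent=2)
for i in range(5):
    memory.add_turn("user", f"message {i}")
prompt = memory.condense_prompt("What about it?")
```

Keeping the summary separate from the recent turns bounds token cost while still giving the rewriter enough context to resolve references such as "it".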
Multi-Agent Systems
LangGraph, orchestration patterns, tool use, agent safety
High Priority 8 questions
Controlled autonomy: Orchestrator / Supervisor. Classifies intent, plans work, routes tasks, and owns the shared LangGraph state.
Risk gate: Human approval. Required for consequential actions.
  • Evidence: RAG agent. Retrieves grounded knowledge and citations.
  • Action: Tool agent. Calls approved APIs and databases.
  • Compute: Code agent. Runs bounded Python or SQL tasks.
  • Response: Summary agent. Synthesizes, validates, and formats output.
Shared state with checkpointing · Conditional edges for routing · Budgets for tools, tokens, and retries
[Figure: Supervisor pattern, a LangGraph StateGraph with conditional routing. Orchestrator agent (plans, routes, and delegates tasks) → RAG agent (knowledge retrieval), Tool agent (API / DB calls), Code agent (Python / SQL exec), Summary agent (synthesise + format); human-loop approval gate.]
Multi-agent supervisor pattern with human-in-the-loop approval gate
Q1
Agent vs Workflow — what's the fundamental difference? When do you choose each?
Dimension | Workflow (DAG) | Agent
Routing logic | Code decides | LLM decides
Predictability | High (deterministic) | Low (emergent)
Auditability | Full trace known ahead | Trace varies per run
Flexibility | Fixed paths only | Open-ended tasks
Failure modes | Step failure, timeout | Loops, hallucinated tool calls
Best for | Finance compliance, ETL, pipelines | Research, assistant, open-ended
Production pattern: Use a fixed workflow topology (which nodes exist and how they connect) with LLM-driven tool selection inside each node. This is what you built at PwC — deterministic orchestration, intelligent execution.
Q2
How does LangGraph work? Explain StateGraph, nodes, edges, and checkpointing.

LangGraph is a low-level orchestration runtime for long-running, stateful agents. A StateGraph models work as nodes and edges; typed shared state flows through the graph and nodes return partial updates. Persistence enables durable execution, streaming, human-in-the-loop interrupts, and recovery. LangChain agents are a higher-level interface built on LangGraph; use LangGraph directly when you need explicit state and control.

Python · LangGraph StateGraph
import operator
from typing import Annotated, List, TypedDict

from langchain_core.messages import SystemMessage
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END

# Assumes `retriever` (a LangChain retriever) and `llm` (a chat model) are defined elsewhere.

class AgentState(TypedDict):
    messages: Annotated[List, operator.add]  # reducer: append
    retrieved: List[str]
    answer: str
    iteration: int

def retrieve_node(state: AgentState) -> dict:
    docs = retriever.invoke(state["messages"][-1].content)
    return {"retrieved": [d.page_content for d in docs]}

def generate_node(state: AgentState) -> dict:
    # Pass retrieved text as a system message instead of mixing raw strings into the message list
    context = SystemMessage(content="Context:\n" + "\n\n".join(state["retrieved"]))
    answer = llm.invoke([context] + state["messages"]).content
    return {"answer": answer, "iteration": state.get("iteration", 0) + 1}

def should_continue(state: AgentState) -> str:
    if state["iteration"] >= 3: return "end"  # hard loop budget
    if "insufficient" in state["answer"].lower(): return "retrieve"
    return "end"

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", should_continue,
    {"retrieve": "retrieve", "end": END})

# Checkpointing = persistence across turns (enables human-in-the-loop interrupts).
# MemorySaver is in-process only; use a SQLite or Postgres checkpointer for durable state.
memory = MemorySaver()
app = graph.compile(checkpointer=memory, interrupt_before=["generate"])
# Every invocation needs a thread_id so the checkpointer can locate the saved state:
# app.invoke({"messages": [...]}, config={"configurable": {"thread_id": "session-1"}})
Q3
What is the ReAct pattern? How does it differ from function-calling agents?

ReAct interleaves reasoning and action so a model can use observations from tools before choosing the next step. In production, do not depend on exposing the model's private reasoning trace. Log auditable state transitions, tool calls, arguments, results, policy decisions, and concise model-provided rationales instead.

Function/tool calling is the typed application interface: the model emits a structured tool request, the application validates and authorizes it, executes the tool, then returns the result. ReAct is a reasoning-and-action pattern; tool calling is an execution contract, and they can be used together.

Pattern | Transparency | Latency | Best for
ReAct | Observable actions and observations | Usually multi-step | Adaptive tool-using tasks
Function calling | Typed tool request and result | Often lower | Reliable application/tool integration
Plan-and-Execute | Explicit plan and task state | Varies | Long, decomposable tasks
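The "execution contract" side of tool calling can be sketched in a few lines. This is a minimal illustration, not a provider API: the request shape ({"name": ..., "arguments": ...}), the `TOOLS` registry, the `ALLOWED_TOOLS_BY_ROLE` map, and the stubbed `get_weather` tool are all hypothetical names chosen for the example.

```python
import json

# Hypothetical tool: a stub standing in for a real integration.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny"}

# Hypothetical registry: tool name -> (handler, required argument types).
TOOLS = {"get_weather": (get_weather, {"city": str})}
# Hypothetical authorization map: caller role -> tools it may invoke.
ALLOWED_TOOLS_BY_ROLE = {"analyst": {"get_weather"}, "guest": set()}

def execute_tool_request(raw_request: str, role: str) -> dict:
    """Validate, authorize, and run one model-emitted tool request."""
    try:
        request = json.loads(raw_request)
    except json.JSONDecodeError:
        return {"ok": False, "error": "malformed JSON"}
    if not isinstance(request, dict) or not isinstance(request.get("name"), str):
        return {"ok": False, "error": "invalid request shape"}
    name, args = request["name"], request.get("arguments", {})
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if name not in ALLOWED_TOOLS_BY_ROLE.get(role, set()):
        return {"ok": False, "error": "not authorized"}
    handler, schema = TOOLS[name]
    valid = (
        isinstance(args, dict)
        and set(args) == set(schema)
        and all(isinstance(args[key], typ) for key, typ in schema.items())
    )
    if not valid:
        return {"ok": False, "error": "invalid arguments"}
    # The result goes back to the model as an observation.
    return {"ok": True, "result": handler(**args)}
```

The model only proposes the request; every check that matters runs in application code, so a hallucinated tool name or argument becomes a structured error the model can observe rather than an executed action.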
Q4
How do you prevent agents from looping, hallucinating tool calls, or going off-rails?
  1. Explicit budgets: Set graph-step, per-tool retry, wall-clock, token, and cost limits. Force a safe terminal state when any budget is exhausted.
  2. Tool output validation: Validate every tool's output against a Pydantic model. If the tool returns an unexpected schema, raise a ToolException so the agent sees the error and can self-correct (max 1 retry per tool).
  3. Structured outputs for routing: Force the routing decision via JSON schema (with_structured_output). This eliminates free-text routing that the LLM might misformat.
  4. Human-in-the-loop for side-effecting actions: Any tool that writes to a DB, calls an API with side effects, or sends a message gets an interrupt_before checkpoint in LangGraph, which requires explicit human approval.
  5. Deterministic policy layer: Treat all model and tool output as untrusted. Enforce identity, authorization, egress, argument validation, effect receipts, and audit outside the model.
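The explicit-budget item can be sketched as a small guard that the agent loop consults before every step. This is a framework-independent illustration; `RunBudget`, `run_agent`, the default limits, and the `step_fn` contract (returns `(done, tokens_used)`) are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

@dataclass
class RunBudget:
    max_steps: int = 8
    max_tokens: int = 20_000
    max_seconds: float = 30.0
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens_used: int) -> None:
        self.steps += 1
        self.tokens += tokens_used

    def exhausted(self) -> Optional[str]:
        """Return the name of the first exhausted budget, or None."""
        if self.steps >= self.max_steps:
            return "step budget"
        if self.tokens >= self.max_tokens:
            return "token budget"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall-clock budget"
        return None

def run_agent(step_fn: Callable[[], Tuple[bool, int]], budget: RunBudget) -> dict:
    """Run steps until the task is done or any budget forces a safe stop."""
    while True:
        reason = budget.exhausted()
        if reason:
            return {"status": "stopped", "reason": reason}
        done, tokens_used = step_fn()
        budget.charge(tokens_used)
        if done:
            return {"status": "completed", "steps": budget.steps}
```

Checking the budget before each step, rather than after, means a run that is already over budget never starts another model or tool call.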
Q5
How do you design a multi-agent system for high concurrency? (your PwC 60% uplift)
  1. Microservice isolation: Each agent runs as a separate AWS Lambda / ECS container. Agents communicate through an SQS queue, decoupling their lifecycles and enabling independent scaling.
  2. Async fan-out: The orchestrator dispatches sub-tasks via asyncio.gather so all sub-agents run concurrently. Only the final merge step waits.
  3. Result caching: Sub-agent results for repeated sub-tasks are cached in Redis (TTL: 5 min for volatile data, 1 hr for reference data).
  4. Circuit breaker: If a sub-agent fails 3× in 60s, the orchestrator marks it unavailable and falls back to a degraded path rather than propagating failure.
This is the architecture that delivered the 60% efficiency uplift at PwC healthcare. The key insight: sequential sub-agent calls were the bottleneck. Async fan-out was the fix.
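A minimal sketch of the async fan-out and circuit-breaker items, assuming each sub-agent is an async callable that takes a task string. `CircuitBreaker`, `call_agent`, `fan_out`, and the `{"degraded": True}` fallback shape are illustrative names, not the production implementation.

```python
import asyncio
import time

class CircuitBreaker:
    """Opens after `max_failures` failures within `window_s` seconds."""

    def __init__(self, max_failures: int = 3, window_s: float = 60.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.failures = []  # monotonic timestamps of recent failures

    def record_failure(self) -> None:
        self.failures.append(time.monotonic())

    def is_open(self) -> bool:
        cutoff = time.monotonic() - self.window_s
        self.failures = [t for t in self.failures if t >= cutoff]
        return len(self.failures) >= self.max_failures

async def call_agent(name, agent, task, breaker):
    """Call one sub-agent, falling back to a degraded result on failure."""
    if breaker.is_open():
        return name, {"degraded": True}
    try:
        return name, await agent(task)
    except Exception:
        breaker.record_failure()
        return name, {"degraded": True}

async def fan_out(agents: dict, task: str, breakers: dict) -> dict:
    """Run all sub-agents concurrently; only the merge waits for everyone."""
    pairs = await asyncio.gather(
        *(call_agent(name, agent, task, breakers[name]) for name, agent in agents.items())
    )
    return dict(pairs)
```

Because each failure is converted to a degraded result inside `call_agent`, one failing sub-agent cannot cancel the whole `gather`.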
Q6
What is Model Context Protocol (MCP) and how does it affect agent tool integration?

MCP (Model Context Protocol) is an open client-server protocol for connecting AI applications to reusable tools, resources, and prompts. It reduces one-off integration code, but it does not remove the need to trust the server, authenticate users, authorize every action, validate outputs, and obtain approval for high-impact effects.

As of June 14, 2026: the stable specification is 2025-11-25. The 2026-07-28 release candidate is available for testing, with final release scheduled for July 28, 2026. MCP connects applications to context/tools; A2A focuses on agent-to-agent discovery and communication.

Q7
MCP vs A2A vs tool calling: which layer solves which problem?
Layer | Use it for | Do not confuse it with
Tool calling | A model requests a typed application function | Cross-vendor integration discovery
MCP | A host connects to reusable tool/resource/prompt servers | Autonomous agent-to-agent delegation
A2A | Agents advertise capabilities and coordinate tasks across systems | Direct database/tool access

Security answer: protocol compatibility is not authorization. Authenticate identities, apply least privilege, validate schemas, constrain egress, record effect receipts, and require approval based on risk.

Q8
How do you choose an agent framework, runtime, harness, or managed platform in 2026?
Need | Candidate | Decision signal
Fast high-level agent | LangChain agents, OpenAI Agents SDK, Google ADK | Provider/tool fit, tracing, evals, team familiarity
Explicit durable orchestration | LangGraph or custom state machine | Persistence, interrupts, recovery, deterministic control
Agent harness | Deep Agents or an internal harness | Planning, filesystem/subagents, long-running task ergonomics
Managed production runtime | Bedrock AgentCore, Vertex AI Agent Engine, or equivalent | IAM, networking, observability, compliance, regional support

Strong answer: start from the task and operating constraints. Prototype the simplest bounded workflow, add model-driven autonomy only where it improves measured task success, and choose the platform that minimizes undifferentiated operations without trapping critical business logic.

LLMOps & Evaluation
Deployment, monitoring, cost control, drift detection, CI/CD for LLMs
Core Topic 6 questions
Q1
How do you reduce LLM inference latency in production? (How did you achieve sub-second at EPAM?)
  1. Semantic caching (often the biggest win): Cache (query_embedding, response) pairs in Redis. On each new query, compute embedding similarity; if cosine > 0.92 with a cached query, return the cached response. With repetitive traffic this can serve 30–50% of requests from cache at near-zero latency; measure both hit rate and false-hit rate on your own queries.
  2. Streaming responses: Use SSE/streaming so the user sees the first token in ~400ms instead of waiting for the full response. Perceived latency drops sharply even though total generation time does not.
  3. Async concurrent retrieval: Run BM25 and dense retrieval concurrently with asyncio.gather. Where retrieval and the LLM call can overlap, pipeline them.
  4. Prompt compression: Use LLMLingua to compress long retrieved contexts before sending them to the LLM. A 20–40% reduction in input tokens lowers prefill latency directly; confirm that quality holds on the eval set.
  5. Model routing: Route classification and simple extraction to the smallest evaluated low-latency model tier; reserve a stronger reasoning tier for complex generation. Re-benchmark by provider and region because model names, pricing, and latency change.
  6. Connection pooling: Reuse HTTP connections to the Bedrock endpoint. SDK connection pooling with keep-alive avoids repeated TCP/TLS setup, which can cost on the order of 80–120ms per cold connection.
At EPAM, the combination of semantic caching + async retrieval + streaming brought P95 latency from ~3.5s → sub-1s.
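The semantic-caching step can be sketched in pure Python. This is an in-memory illustration only: `SemanticCache` is a hypothetical class, `embed` is whatever embedding function the system already uses, and a production version would keep entries in Redis or a vector index rather than a list.

```python
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Return a cached response when a new query is close to a cached one."""

    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # callable: str -> list[float]
        self.threshold = threshold  # example value from the text; tune per workload
        self.entries = []           # (embedding, response) pairs

    def get(self, query: str):
        q = self.embed(query)
        best = max(self.entries, key=lambda entry: cosine(q, entry[0]), default=None)
        if best is not None and cosine(q, best[0]) > self.threshold:
            return best[1]
        return None  # cache miss: caller runs the full pipeline, then calls put()

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

The linear scan is fine for a sketch; at scale the lookup becomes an approximate nearest-neighbour query, and the threshold is set by checking false hits on real traffic.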
Q2
What do you monitor in a production LLM system? What are your alert thresholds?
Metric | Tool | Alert threshold | Why it matters
Latency P95 / P99 | CloudWatch | P95 > 2s | Direct UX impact
Error rate (5xx) | CloudWatch Alarms | > 1% | System stability
Token cost / hour | LangSmith + custom | > 30% spike | Runaway costs
Faithfulness score | RAGAS online eval | < 0.80 | Hallucination proxy
Context precision | RAGAS | < 0.70 | Retrieval degradation
Cache hit rate | Redis metrics | < 20% (unexpected drop) | Cost/latency efficiency
Prompt injection attempts | Custom classifier | Any detection | Security
Embedding drift | Scheduled job | Cosine shift > 0.15 vs baseline | Model / data drift
Treat the thresholds above as examples, not universal targets. Calibrate metrics, alert levels, sample rate, and evaluation cadence against user impact, traffic, risk, cost, and human labels. Monitor rolling trends and slice by tenant, intent, language, and model/version.
Q3
What is prompt versioning and how do you implement CI/CD for prompts?

Prompts are code. Changing a prompt is a deployment. Without versioning, you can't roll back a bad prompt change that's causing hallucinations.

  1. Store prompts in LangSmith Hub or a config store (SSM Parameter Store / Git): Every prompt has a version hash. Production always pins a specific version.
  2. Evaluation gate in CI: A PR that changes a prompt triggers an eval pipeline. RAGAS scores are computed on a golden test set. Merge only if faithfulness ≥ 0.82 and answer_relevancy ≥ 0.72.
  3. Shadow, then canary: Run the new prompt in shadow mode on a mirrored sample (for example 10% of traffic) without serving its output, and compare metrics against the current prompt. Promote through a small canary if it is better.
  4. Instant rollback: Update the SSM parameter to the prior version hash. No redeploy required.
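The evaluation gate reduces to a threshold check that CI can run after the eval job. A minimal sketch, assuming the eval pipeline has already produced a dict of metric scores; `THRESHOLDS` reuses the example values from the list above and `gate` is a hypothetical helper name.

```python
# Example thresholds taken from the text; calibrate them per task.
THRESHOLDS = {"faithfulness": 0.82, "answer_relevancy": 0.72}

def gate(scores: dict, thresholds: dict = THRESHOLDS):
    """Return (passed, failures). A missing metric counts as a failure."""
    failures = [
        f"{metric}: {scores.get(metric, 0.0):.2f} < {minimum:.2f}"
        for metric, minimum in thresholds.items()
        if scores.get(metric, 0.0) < minimum
    ]
    return (not failures, failures)
```

In CI the job would exit non-zero when `passed` is false and print `failures` so the PR author sees which metric regressed.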
Q4
How do you reduce LLM API costs in production by 50%+?
Technique | Cost reduction | Implementation effort
Semantic caching | 30–50% fewer LLM calls | Medium (Redis + GPTCache)
Model routing (smart/cheap) | 40–70% on routing tasks | Medium (LLM router)
Bedrock batch inference | 50% discount vs on-demand | Low (async jobs only)
Prompt compression (LLMLingua) | 20–40% input token reduction | Medium
Reduce top-K retrieval | 5–15% input token reduction | Low (tune k from 20 → 5)
Context window right-sizing | Variable | Low
Q5
What is LLM observability and how does it differ from traditional APM?

Traditional APM (Datadog, New Relic) tracks: latency, error rate, CPU, memory — all mechanical metrics. LLM observability adds a semantic layer: does the system say the right things?

  • Traces: Full chain trace — input query → retrieval results → LLM prompt sent → response. LangSmith captures all of this per-run.
  • Semantic metrics: Faithfulness, relevancy, and hallucination rate cannot be read from mechanical telemetry. They are usually estimated with an LLM evaluator (judge model) that is calibrated against human labels.
  • Token-level cost attribution: Which component of your chain uses the most tokens? LangSmith shows cost breakdown per node.
  • User feedback loops: Thumbs up/down signals feed back into your golden test set for future eval runs.
Q6
How do you detect and handle model drift or knowledge cutoff issues in production?

Model drift in LLMs shows up as: answer quality degradation (faithfulness drops), output format changes (model update broke JSON parsing), or latency shifts (new model version).

  1. Golden test set regression: Run a versioned, representative test set on every deployment. Gate release using task-specific, statistically meaningful tolerances calibrated to business risk.
  2. Embedding and traffic drift: Track query/topic distributions, retrieval success, score distributions, and labelled outcomes by version. Investigate material changes relative to calibrated baselines rather than one universal cosine threshold.
  3. Knowledge freshness: For time-sensitive domains, add a "freshness" metadata filter. Prefer recently indexed documents. Surface "as of [date]" disclaimers in the response.
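The knowledge-freshness item can be sketched as a metadata filter plus an "as of" suffix. The chunk shape (a dict with an `indexed_on` date), the 90-day default, and the helper names are assumptions for illustration.

```python
from datetime import date, timedelta

def filter_fresh(chunks: list, today: date, max_age_days: int = 90) -> list:
    """Keep chunks indexed within the window, newest first."""
    cutoff = today - timedelta(days=max_age_days)
    fresh = [chunk for chunk in chunks if chunk["indexed_on"] >= cutoff]
    return sorted(fresh, key=lambda chunk: chunk["indexed_on"], reverse=True)

def with_as_of(answer: str, chunks: list) -> str:
    """Append the newest source date so the reader can judge staleness."""
    if not chunks:
        return answer + " (no recent sources found)"
    newest = max(chunk["indexed_on"] for chunk in chunks)
    return f"{answer} (as of {newest.isoformat()})"
```

In a real system the filter would usually be pushed into the vector-store query rather than applied after retrieval, so stale chunks never compete for the top-k slots.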
Fine-tuning / PEFT (LoRA, QLoRA)
Efficient adaptation of LLaMA, Gemma — your EPAM work
Strong Signal 5 questions
Q1
Explain LoRA mathematically. What problem does it solve?

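Problem: Full fine-tuning of a 7B model updates all ~7 billion weights. With mixed-precision Adam that costs roughly 16 bytes per parameter for weights, gradients, and optimiser states, or about 112GB of GPU memory before activations, which is infeasible without a multi-GPU setup.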

LoRA insight: The weight update ΔW learned during fine-tuning is hypothesised to have low intrinsic rank. Instead of updating W directly, factor the update as ΔW = A × B, where A ∈ ℝ^(d×r) and B ∈ ℝ^(r×k), with r ≪ min(d, k). Only A and B are trained; W stays frozen.

Figure: LoRA decomposition. Frozen pre-trained W (4096 × 4096, 16.7M parameters) + trainable A (4096 × 16, 65K parameters) × trainable B (16 × 4096, 65K parameters) = adapted W + ΔW, giving domain-specific behavior without replacing W. Rank r = 16 vs full rank 4096; only A and B receive gradients; ~99.2% fewer trainable parameters.
Python · PEFT LoRA config
from peft import LoraConfig, get_peft_model, TaskType

config = LoraConfig(
    r=16,                              # rank
    lora_alpha=32,                     # scaling = alpha/r = 2
    target_modules=["q_proj", "v_proj"],  # attention layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
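# `base_model` is assumed to be an already-loaded causal LM (e.g. a 7B LLaMA-style model).
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# For a 32-layer model with hidden size 4096, r=16 on q_proj and v_proj gives
# 2 modules × 32 layers × 2 × (4096 × 16) = 8,388,608 trainable parameters (~0.12% of ~6.7B).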
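The parameter arithmetic behind the rank-16 example can be checked directly. This sketch only counts parameters; the 32-layer, two-module model shape is an assumption matching a typical 7B LLaMA-style configuration, not a value read from any specific checkpoint.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one adapted matrix: A is d×r, B is r×k."""
    return d * r + r * k

d = k = 4096
r = 16

full = d * k                      # parameters in the frozen W
adapter = lora_params(d, k, r)    # parameters in A and B together
reduction = 1 - adapter / full    # fraction of trainable parameters saved

# Assumed model shape: 32 transformer layers, adapters on q_proj and v_proj.
layers, modules_per_layer = 32, 2
total_trainable = layers * modules_per_layer * adapter

print(f"per matrix: {adapter:,} of {full:,} ({reduction:.1%} fewer)")
print(f"whole model: {total_trainable:,} trainable parameters")
```

Doubling r doubles the adapter size, so rank is the main knob trading capacity against memory.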
Q2
What is QLoRA? How does it enable fine-tuning on consumer hardware?

QLoRA = Quantised LoRA. It quantises the base model to 4-bit NF4 (NormalFloat4) while keeping LoRA adapter weights in BF16. A 13B model at 4-bit takes ~6.5GB instead of ~26GB.

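Method | Base model precision | Approx. GPU memory (13B) | Quality vs full FT
Full fine-tuning | BF16 | ~208GB (weights, gradients, and Adam states at ~16 bytes/param) | 100% (baseline)
LoRA | BF16 | ~26GB frozen base weights, plus adapters and activations | ~98% (indicative; task-dependent)
QLoRA | 4-bit NF4 | ~6.5GB frozen base weights, plus adapters and activations | ~97% (indicative; task-dependent)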

QLoRA also introduces double quantisation (quantise the quantisation constants) and paged optimisers (offload optimiser states to CPU RAM on memory spikes). The QLoRA paper reports that these techniques together allow fine-tuning a 65B model on a single 48GB GPU.
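The weight-memory figures follow from bits per parameter. A minimal sketch of that arithmetic; it covers base-model weights only and deliberately ignores quantisation constants, adapters, activations, and optimiser state, so real usage is higher.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for the base weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

bf16_13b = weight_memory_gb(13, 16)  # 16-bit base weights
nf4_13b = weight_memory_gb(13, 4)    # 4-bit NF4 base weights
nf4_65b = weight_memory_gb(65, 4)    # why a 65B base can fit on one large GPU

print(f"13B @ BF16: {bf16_13b:.1f} GB, 13B @ 4-bit: {nf4_13b:.1f} GB, 65B @ 4-bit: {nf4_65b:.1f} GB")
```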

Q3
When do you fine-tune vs use RAG? Give a decision framework.
Criteria | Use RAG | Use Fine-tuning
Knowledge updates | Frequent (daily/weekly) | Stable domain knowledge
Data availability | Large unstructured corpus | Representative, high-quality labelled examples; required volume is task-dependent
Output style/format | Generic format OK | Specific tone, JSON schema, brand voice
Explainability | Source-grounded citations | Opaque; no citations
Latency | Adds retrieval overhead | No retrieval step
Cost | Low upfront, per-query retrieval cost | GPU training cost upfront
Common production pattern: fine-tune for repeated behaviour and style; use RAG for current, source-grounded facts. Combine them only when the evaluated gain justifies added complexity. Fine-tuning support varies by provider and model, so verify current availability.
Q4
What is DPO and how does it compare to RLHF for alignment?

RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model from human preference pairs, then uses PPO to optimise the LLM against that reward model. It is complex to operate, can be unstable to train, and needs substantial infrastructure.

DPO (Direct Preference Optimisation) bypasses the explicit reward-model-plus-PPO pipeline. Given (prompt, chosen_response, rejected_response) pairs, it directly updates the model to prefer the chosen response. It is operationally simpler than classic RLHF, but quality and stability still depend on data, objective, hyperparameters, and evaluation.

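Method | Reward model | Stability | Data needed
RLHF / PPO | Required | Often fragile; sensitive to the reward model and hyperparameters | Large
DPO | Not needed | Usually more stable, but still depends on data and hyperparameters | Moderate
ORPO | Not needed | Evaluate for the task | Task-dependent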
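The DPO objective for a single preference pair can be written in a few lines. This sketch takes sequence log-probabilities as plain floats; in training they would come from the policy and a frozen reference model, and `beta=0.1` is only a commonly used illustrative default.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * (chosen margin - rejected margin))) for one pair."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Numerically stable softplus(-logits), which equals -log(sigmoid(logits)).
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one, which is how DPO avoids a separate reward model.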
Q5
How do you prepare fine-tuning data? What quality signals matter most?
  1. Format: Use instruction-following format, {"instruction": "...", "input": "...", "output": "..."} (Alpaca format), or multi-turn chat format for conversational models.
  2. Quality before scale: Start with a small, curated, representative dataset; measure the gain on a held-out eval set, inspect failures, then add targeted examples. Deduplicate and use calibrated human/model review rather than trusting one judge.
  3. Diversity: Cover all intended use cases. An uneven distribution makes the model overfit the majority class.
  4. Contamination check: Ensure the test set has no overlap with training data. Use exact and near-duplicate detection.
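One way to implement the contamination check is Jaccard similarity over word shingles. Shingling is a technique swapped in for illustration, not one the text prescribes; the 3-word shingle size and the 0.8 threshold are assumptions to tune on real data.

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of lowercase n-word windows; short texts become one shingle."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_contamination(train: list, test: list, threshold: float = 0.8) -> list:
    """Return (train_index, test_index) for test items that near-duplicate training items."""
    train_sets = [shingles(text) for text in train]
    hits = []
    for j, text in enumerate(test):
        candidate = shingles(text)
        for i, train_set in enumerate(train_sets):
            if jaccard(candidate, train_set) >= threshold:
                hits.append((i, j))
                break  # one match is enough to flag this test item
    return hits
```

The pairwise loop is quadratic; for large datasets a MinHash/LSH index gives the same kind of signal at lower cost.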
Prompt Engineering
CoT, few-shot, context engineering, structured outputs, adversarial robustness
Foundational 5 questions
Q1
Explain Chain-of-Thought prompting. What variants exist and when do you use each?

Chain-of-Thought (CoT) prompting encourages intermediate reasoning on multi-step tasks. It can improve performance, but production systems should not rely on exposing a model's private reasoning. Ask for a concise rationale, assumptions, calculations, citations, or verification artifacts that can be audited.

Variant | How to trigger | Best for
Zero-shot reasoning | Ask for careful analysis plus a concise, checkable answer | Baseline to evaluate, not a guaranteed win
Few-shot reasoning | Provide representative worked examples and expected artifacts | Domain-specific consistency
Tree of Thought (ToT) | Prompt to explore N paths, evaluate each, select best | Complex planning, search problems
ReAct | Thought/Action/Observation loop | Tool-using agents
Self-consistency | Sample multiple solutions and aggregate/verify | Cases where added cost is justified; never the only high-stakes control
Prompt · CoT system instruction
system = """You are a financial analyst. Before answering any question:
1. State the key facts given in the context.
2. Identify any missing information.
3. Show the calculations, citations, or checks needed to verify the result.
4. State the final answer and concise rationale clearly.

If you are uncertain, say so explicitly."""
Q2
What is context engineering? How is it different from prompt engineering?

Prompt engineering = crafting the instruction/template text. It's about what you say.

Context engineering = dynamically constructing the entire model input — which retrieved documents to include, how to summarise prior conversation, which tool outputs to pass, how to structure memory, and where to place instructions. It is the broader discipline of managing a model's available context budget intelligently; the exact window varies by model and provider.

  • Primacy + recency: Many LLMs use content near the start and end of the context more reliably than content in the middle. Place the most critical instructions in both positions.
  • Structured delimiters: Use XML tags (<context>, <instructions>, <examples>) to separate sections cleanly; this tends to reduce instruction-following errors.
  • Conversation compression: Summarise turns older than the most recent 10 with a cheap LLM. This keeps token cost bounded without losing the thread.
  • Lost in the middle: Relevant information buried in the middle of a long (for example 100K-token) context is often underweighted. Front-load the most important chunks.
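The placement and delimiter ideas above can be combined into a small context builder. This is a sketch under stated assumptions: `count_tokens` is a crude whitespace stand-in for a real tokenizer, the tag names are illustrative, and the budget covers instructions, question, and chunks but not the few wrapper tags around the context.

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def build_context(instructions: str, chunks: list, question: str, budget: int) -> str:
    """chunks: list of (relevance_score, text). Higher score means more relevant."""
    header = f"<instructions>\n{instructions}\n</instructions>"
    # Repeat the instructions at the end to use both primacy and recency.
    footer = f"<question>\n{question}\n</question>\n<reminder>\n{instructions}\n</reminder>"
    remaining = budget - count_tokens(header) - count_tokens(footer)
    selected = []
    # Most relevant chunks first, so they sit at the front of the context.
    for _score, text in sorted(chunks, key=lambda chunk: chunk[0], reverse=True):
        cost = count_tokens(text)
        if cost <= remaining:
            selected.append(text)
            remaining -= cost
    context = "<context>\n" + "\n---\n".join(selected) + "\n</context>"
    return "\n".join([header, context, footer])
```

Chunks that do not fit are skipped rather than truncated, which keeps every included chunk intact and citeable.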
Q3
How do you get reliable structured output (JSON) from an LLM?
  1. Provider-native schema-constrained output: Prefer the current API's Structured Outputs / JSON Schema or typed tool-calling feature when supported. This constrains shape more strongly than plain JSON mode.
  2. Application validation: Parse into Pydantic or an equivalent schema, enforce semantic/business constraints, reject unknown fields, and retry only bounded, repairable failures.
  3. Fallbacks: Plain JSON mode guarantees syntax, not business correctness. Grammar/constrained-decoding libraries can help with self-hosted models, but still validate after generation.
Python · structured output via Pydantic
from pydantic import BaseModel, Field
from langchain_anthropic import ChatAnthropic

class RiskAssessment(BaseModel):
    risk_level: str = Field(description="low | medium | high | critical")
    key_factors: list[str] = Field(description="Top 3 risk factors")
    recommendation: str

llm = ChatAnthropic(model="current_evaluated_model")
structured_llm = llm.with_structured_output(RiskAssessment)
result = structured_llm.invoke("Assess the credit risk for...")
# Then apply business validation and bounded error handling.
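The closing comment about business validation and bounded error handling can be sketched without any provider SDK. This example uses only the standard library in place of Pydantic; `validate_assessment`, `generate_validated`, and the `call_model` callable (prompt in, raw JSON string out) are hypothetical names.

```python
import json

ALLOWED_LEVELS = {"low", "medium", "high", "critical"}
EXPECTED_FIELDS = {"risk_level", "key_factors", "recommendation"}

def validate_assessment(raw: str) -> dict:
    """Parse and enforce shape plus business rules; raise ValueError on failure."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        raise ValueError("unknown or missing fields")
    if not isinstance(data["risk_level"], str) or data["risk_level"] not in ALLOWED_LEVELS:
        raise ValueError("risk_level must be one of low|medium|high|critical")
    factors = data["key_factors"]
    if not (isinstance(factors, list) and 1 <= len(factors) <= 3
            and all(isinstance(item, str) for item in factors)):
        raise ValueError("key_factors must list 1 to 3 strings")
    if not isinstance(data["recommendation"], str) or not data["recommendation"].strip():
        raise ValueError("recommendation must be non-empty text")
    return data

def generate_validated(call_model, prompt: str, max_attempts: int = 2) -> dict:
    """Retry a bounded number of times, feeding the rejection reason back."""
    error = None
    for _ in range(max_attempts):
        hint = f"\nPrevious output was rejected: {error}" if error else ""
        try:
            return validate_assessment(call_model(prompt + hint))
        except ValueError as exc:
            error = exc
    raise ValueError(f"no valid output after {max_attempts} attempts: {error}")
```

The retry is capped so a model that keeps producing invalid output fails fast instead of looping.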
Q4
How do you defend against prompt injection attacks?

Prompt injection occurs when untrusted content influences model behavior against the application's intent. It cannot be solved reliably by one prompt, delimiter, classifier, or blocklist; design the surrounding system so a successful injection still cannot perform an unauthorized action.

  • Separate instructions from untrusted data: mark provenance and prevent retrieved/user/tool content from becoming durable policy or memory without review.
  • Deterministic authorization: the model may propose an action, but code validates identity, permission, arguments, data scope, and business rules.
  • Least privilege and approvals: use scoped credentials, egress allowlists, read-only defaults, idempotency, and human confirmation for high-impact actions.
  • Defense in depth: treat input/output classifiers and guardrails as useful signals, not guarantees; add adversarial evals, logging, anomaly detection, kill switches, and incident response.
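The deterministic-authorization and approval bullets can be sketched as a pure function that decides what happens to a model-proposed action. The `ProposedAction` and `Caller` shapes, the scope names, and the three outcomes are assumptions for illustration; the point is that nothing in the decision depends on model output beyond the proposal itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProposedAction:
    tool: str
    tenant_id: str      # tenant whose data the action would touch
    side_effect: bool   # True for writes, sends, payments, and similar

@dataclass(frozen=True)
class Caller:
    user_id: str
    tenant_id: str
    scopes: frozenset   # permissions taken from the authenticated session

# Hypothetical mapping from tool name to the scope it requires.
REQUIRED_SCOPE = {"search_docs": "docs:read", "send_email": "email:send"}

def decide(action: ProposedAction, caller: Caller) -> str:
    """Return 'deny', 'needs_approval', or 'allow' using code-enforced rules only."""
    scope = REQUIRED_SCOPE.get(action.tool)
    if scope is None or scope not in caller.scopes:
        return "deny"             # unknown tool or missing permission
    if action.tenant_id != caller.tenant_id:
        return "deny"             # cross-tenant access is never allowed
    if action.side_effect:
        return "needs_approval"   # a human confirms high-impact actions
    return "allow"
```

Even if an injected instruction convinces the model to propose `send_email`, the action still stops at `needs_approval` or `deny`.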
Q5
What is the difference between system prompt, few-shot examples, and RAG context?
Component | Purpose | Changes per request? | Token budget
System prompt | Role, rules, output format, persona | No (static) | 500–2000 tokens
Few-shot examples | Demonstrate expected format/style | Rarely | 500–3000 tokens
RAG context | Current factual knowledge for this query | Yes (per query) | 2000–8000 tokens
Conversation history | Prior turns context | Yes (grows) | 500–10000 tokens
User message | The actual query | Yes | 50–500 tokens
AWS GenAI Stack — 2026
Bedrock, AgentCore, SageMaker AI, OpenSearch, serverless and containers
Your Stack 10 questions
Q1
Design a production Hybrid RAG system on AWS from scratch.
Ingestion plane: event-driven, asynchronous, and independently scalable
Store | S3 documents | Raw files + versioned source
Buffer | SQS events | Retries + dead-letter queue
Process | Lambda ingest | Parse · chunk · metadata
Represent | Evaluated embedding model | Bedrock model invocation
Index | OpenSearch | kNN vectors + BM25 text
Serving plane: latency-sensitive request path with guardrails and citations
Edge | API Gateway | Auth · quota · user query
Orchestrate | Lambda query | Parallel retrieval + policy
Retrieve | OpenSearch | Hybrid search + filters
Generate | Evaluated Bedrock model | Guardrails + grounded prompt
Return | Cited response | Streamed answer + sources
Cross-cutting: CloudWatch traces & alarms · IAM least privilege · KMS encryption · Cost & quality evaluation
Figure: AWS production RAG architecture (serverless, auto-scaling).
  1. Ingestion: S3 PUT event → SQS → Lambda or container worker (parse, normalize, chunk, attach provenance) → an evaluated Bedrock embedding model → OpenSearch Serverless or the selected supported vector store. Pin model/version and plan re-indexing.
  2. Query path: API Gateway → orchestrator → embed query → hybrid retrieval with tenant filters → evaluated reranker when it improves quality enough to justify latency/cost → top evidence into an evaluated Bedrock generation model → validate, cite, and stream.
  3. Security: IAM roles per service (least privilege). Bedrock Guardrails for PII redaction. Document-level access control so users only retrieve their organisation's documents (for example, OpenSearch document-level security or mandatory tenant filters applied by the orchestrator). All queries logged to S3 for audit.
  4. Monitoring: CloudWatch custom metrics for faithfulness, latency P95, token cost. Alarm on any metric breaching threshold.
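The hybrid-retrieval step in the query path needs a way to merge BM25 and vector rankings. The text does not say which fusion method is used; reciprocal rank fusion (RRF) is shown here as one common choice, with `k=60` as the conventional constant and document IDs as plain strings.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60, top_n: int = 5) -> list:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]

bm25_hits = ["d1", "d2", "d3"]    # lexical ranking, best first
dense_hits = ["d3", "d1", "d4"]   # vector ranking, best first
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF uses ranks rather than raw scores, so BM25 and cosine scores never need to be normalised onto a shared scale.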
Q2
What is AWS Bedrock Knowledge Bases? When do you use it vs custom RAG?

Bedrock Knowledge Bases is AWS's managed RAG capability for connecting supported data sources, parsing/chunking content, creating or using supported vector stores, retrieving evidence, reranking where supported, and integrating retrieval with generation. Exact sources, stores, models, and features vary by region and configuration.

Dimension | Bedrock KB (managed) | Custom RAG
Delivery effort | Usually faster; managed integrations and operations | More engineering and operational ownership
Customisation | Configurable within currently supported sources, stores, models, and workflows | Full control over every retrieval and serving stage
Hybrid search | Yes; verify region and data-store support | Yes (full control)
Re-ranking | Built-in (Amazon Rerank) | Any reranker
Cost at scale | Benchmark total managed-service cost | Can optimize deeply, but include engineering and operations
Choose Bedrock Knowledge Bases when its current feature set meets quality, control, governance, and cost requirements. Choose custom RAG when differentiated parsing, retrieval, data stores, orchestration, or operations create measurable value.
Q3
How do AWS Lambda and ECS differ for GenAI workloads? When do you use which?
Dimension | Lambda | ECS / Fargate
Execution model | Event/request-driven functions with platform limits | Long-running container services and tasks
Startup | Cold starts depend on runtime, package, networking, and configuration | Keep desired tasks warm; task startup still matters during scaling
Duration/resources | Bounded by current Lambda quotas | Broader task sizing and duration control; verify Fargate/EC2 limits
Streaming | Supported patterns, with integration-specific constraints | Native application HTTP/WebSocket patterns
Cost model | Usage-based function execution | Provisioned task resources while running

Rule of thumb: use Lambda for bounded, event-driven orchestration and transformation when its current quotas fit. Use ECS/Fargate or EKS for long-running agents, custom runtimes, connection-heavy streaming, background workers, or sustained workloads. Verify today's quotas and benchmark both.

Q4
What is Bedrock Guardrails and how do you implement content safety?

Bedrock Guardrails is a managed safety layer that intercepts LLM inputs and outputs. It operates independently of the model — wraps any Bedrock-hosted model.

  • Topic blocking: Define topics the model must not discuss (e.g., competitor products, investment advice). Guardrail intercepts and returns a pre-defined safe response.
  • PII redaction: Automatically detects and redacts (or masks) PII in both input and output — names, SSNs, credit card numbers.
  • Contextual grounding check: Compares the generated response against the retrieved context. Flags responses that introduce facts not present in the context (a hallucination signal).
  • Word filters: Block specific words, phrases, or regex patterns in input/output.
Q5
How does AWS Step Functions integrate with multi-agent GenAI workflows?

Step Functions provides a managed, visual workflow orchestrator. For deterministic multi-step GenAI pipelines, it's a strong alternative to LangGraph — especially when you need:

  • Built-in retry logic with exponential backoff on each step.
  • Long-running workflows: a single Lambda invocation is capped at 15 minutes, while Standard workflows can run for up to a year.
  • Human approval steps via waitForTaskToken: the workflow pauses until a human approves, then resumes.
  • Parallel fan-out via Map state: process 1000 documents in parallel, each calling a Lambda (use Distributed Map mode for large fan-outs).
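The retry, approval, and fan-out features above map onto Amazon States Language fields. The sketch below builds a small definition as a Python dict and serialises it; the state names, the Lambda function names (`summarise-doc`, `request-approval`), the concurrency limit, and the timeout are hypothetical values, and a real definition would be validated against the current Step Functions documentation.

```python
import json

definition = {
    "StartAt": "ProcessDocuments",
    "States": {
        "ProcessDocuments": {
            "Type": "Map",                      # parallel fan-out over input items
            "ItemsPath": "$.documents",
            "MaxConcurrency": 20,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "Summarise",
                "States": {
                    "Summarise": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {"FunctionName": "summarise-doc", "Payload.$": "$"},
                        # Built-in retry with exponential backoff.
                        "Retry": [{
                            "ErrorEquals": ["States.TaskFailed"],
                            "IntervalSeconds": 2,
                            "MaxAttempts": 3,
                            "BackoffRate": 2.0,
                        }],
                        "End": True,
                    }
                },
            },
            "Next": "HumanApproval",
        },
        "HumanApproval": {
            "Type": "Task",
            # Pauses until something calls SendTaskSuccess/SendTaskFailure with the token.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "request-approval",
                "Payload": {"taskToken.$": "$$.Task.Token"},
            },
            "TimeoutSeconds": 86400,            # give the approver one day
            "End": True,
        },
    },
}

asl_json = json.dumps(definition, indent=2)
```

Keeping the definition as data makes it easy to lint in CI before deploying it with infrastructure-as-code tooling.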
Q6
Amazon Bedrock vs SageMaker AI endpoints vs EKS/EC2 for GenAI inference.
Option | Choose When | You Own | Interview Trade-off
Bedrock | Managed access to supported foundation/custom models; fastest product delivery | Application, prompts, evals, governance configuration | Less runtime/inference-engine control
SageMaker AI endpoint | Custom model serving, MLOps integration, managed endpoints | Model packaging/configuration and endpoint operations | More control and more operational work
HyperPod / EKS / EC2 | Specialized training/serving, custom kernels/runtime, maximum control | Capacity, orchestration, scaling, resilience, cost utilization | Highest flexibility and operational burden
Q7
What is Amazon Bedrock AgentCore, and when would you use it?

2026 positioning: AgentCore is the AWS platform layer for securely deploying and operating agents built with different frameworks and models. Discuss runtime isolation, enterprise tool/data connectivity, identity/access controls, tracing, debugging, and evaluation rather than treating an agent as only an LLM loop.

Use It When

  1. You need managed production agent runtime and governance.
  2. Agents connect to enterprise systems with controlled authentication.
  3. Tracing, debugging, evaluation, and operational consistency matter.

Custom Runtime When

  1. Runtime behavior is highly differentiated.
  2. You need unusual isolation, portability, or multi-cloud control.
  3. Platform feature/region constraints do not meet requirements.
Q8
Design AWS security and networking for a regulated enterprise GenAI platform.

Identity & Data

  1. Separate accounts/environments; least-privilege IAM roles and scoped agent/tool permissions.
  2. KMS encryption, Secrets Manager, tenant-level authorization, data classification and retention.
  3. Guardrails/DLP, immutable audit trail, prompt/response logging policy.

Network & Operations

  1. Private subnets and VPC endpoints/PrivateLink where supported; restrict egress.
  2. WAF/API Gateway throttling, CloudTrail, Config, Security Hub, CloudWatch/X-Ray.
  3. Threat-model model/vendor calls, tools, RAG data, agents, and human approvals.
Q9
How do 2026 Bedrock Guardrails and automated reasoning checks fit into safety?

Guardrails can be applied consistently around model interactions to filter harmful content, protect sensitive information, assess grounding, and enforce use-case policies. Automated Reasoning checks can detect logical issues and unstated assumptions against formalized policies, but return findings in detect mode; your application decides whether to serve, revise, clarify, or escalate.

Do not oversell: guardrails are one layer. Authorization, tool restrictions, output validation, monitoring, red-teaming, and human approval remain necessary.
Q10
How do you apply the AWS Well-Architected GenAI Lens?
Operational excellence | Version prompts/models/data, automate evals, trace and debug.
Security | Identity, data protection, injection/tool threats, approvals.
Reliability | Quotas, multi-AZ/regional strategy, fallback, idempotency, recovery.
Performance & cost | Model routing, context budget, caching, batching, capacity/utilization.

Also discuss sustainability and business impact: avoid unnecessary large-model calls, measure value, and retire low-value workloads.

GCP GenAI Stack — 2026
Vertex AI, Agent Engine, Gemini Enterprise, RAG, data, security and operations
Cloud Depth 10 questions
Q1
Design a production enterprise RAG platform on GCP.
  1. Ingest: Cloud Storage events → Pub/Sub/Eventarc → Cloud Run or Dataflow workers; use Document AI where layout/OCR matters. Make jobs idempotent and dead-letter failures.
  2. Retrieve: Choose Vertex AI RAG Engine for managed orchestration, Vector Search for high-scale ANN, or AlloyDB/pgvector when relational joins and transactions dominate. Store tenant and ACL metadata with every chunk.
  3. Generate: Call an evaluated Vertex AI model, enforce grounded citations and structured output, then validate before serving through Cloud Run/API Gateway.
  4. Operate: IAM service accounts, VPC Service Controls/Private Service Connect where applicable, CMEK, Cloud Logging/Monitoring/Trace, offline golden sets and sampled online evaluation.
Cross-question answers: Use Agent Search when managed enterprise discovery, connectors and relevance controls are the product; use custom RAG when pipeline control and application-specific retrieval matter. Propagate deletes with tombstone events across source, chunks, index and caches, then audit completion. Enforce ACLs as retrieval-time pre-filters derived from authenticated identity, never only after retrieval. A dimension/model change requires a versioned new index, dual-write or backfill, evaluation, then an alias cutover.
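The ACL pre-filter and tombstone points can be shown with an in-memory index. This is a cloud-agnostic sketch: the chunk fields (`tenant_id`, `allowed_groups`, `deleted`), the keyword scoring, and `retrieve_with_acl` are illustrative stand-ins for a vector store's metadata filter.

```python
def retrieve_with_acl(index: list, query_terms: list, caller_groups: set,
                      tenant_id: str, top_k: int = 3) -> list:
    """Filter by tenant, ACL, and tombstone BEFORE scoring, then rank what is left."""
    allowed = [
        chunk for chunk in index
        if chunk["tenant_id"] == tenant_id
        and caller_groups & chunk["allowed_groups"]   # caller shares at least one group
        and not chunk.get("deleted", False)           # tombstoned chunks never surface
    ]
    scored = []
    for chunk in allowed:
        # Toy relevance score: number of query terms present in the chunk text.
        score = sum(term in chunk["text"].lower() for term in query_terms)
        if score > 0:
            scored.append((score, chunk["chunk_id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _score, chunk_id in scored[:top_k]]
```

Filtering first means an unauthorized chunk can never occupy a top-k slot or leak through a reranker, which is the weakness of post-retrieval filtering.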
Q2
Vertex AI Agent Engine vs Gemini Enterprise Agent Platform vs a custom agent runtime.
Choice | Best Fit | Trade-off
Agent Engine | Managed deployment and operations for custom agents; sessions, memory, code execution and governance | Platform constraints and regional feature checks
Gemini Enterprise Agent Platform | Enterprise discovery, connected agents and governed employee workflows | Less freedom than a fully custom product runtime
Cloud Run/GKE custom | Maximum framework, portability and runtime control | You own isolation, scaling, tracing, state and upgrades

Interview answer: start with compliance, integration, latency, portability and operating-model requirements; then choose. Do not select an agent platform merely because the workflow calls an LLM.

Q3
RAG Engine vs Vector Search vs AlloyDB/pgvector vs Agent Search — how do you choose?
Service | Choose When | Probe
RAG Engine | You want managed ingestion/retrieval integration with Vertex AI | Supported sources, regions, quotas and customization
Vector Search | Large-scale, low-latency ANN is central | Index update strategy, filtering and cost
AlloyDB/pgvector | Vectors must live beside relational data and transactions | Scale ceiling and query plan behavior
Agent Search | Enterprise search/discovery is the product | Connectors, relevance controls and permissions

Cross-question: Prove the choice with corpus size, update rate, filter selectivity, recall@k, P95 latency, data residency, team skills and total cost.

Q4
Cloud Run vs Cloud Run functions vs GKE for GenAI workloads.
Runtime | Use It For | Watch-outs
Cloud Run | Stateless APIs, streaming gateways, workers and portable containers | Cold starts, request limits, concurrency and downstream quotas
Cloud Run functions | Small event handlers and glue logic | Keep complex orchestration out of tiny handlers
GKE | Custom serving, GPUs, sidecars, service mesh and deep scheduling control | Cluster operations and utilization

Separate the latency-sensitive serving path from asynchronous ingestion/evaluation. Scale each independently and put Pub/Sub between bursty producers and workers.

Q5
Vertex AI managed endpoints and Model Garden vs custom model serving.
  • Managed endpoint: prefer when managed autoscaling, model lifecycle, monitoring and IAM integration outweigh runtime customization.
  • Cloud Run/GKE: prefer for custom inference engines, unusual dependencies, portability or tight hardware control.
  • Decision evidence: quality benchmark, tokens/sec, first-token latency, concurrency, accelerator utilization, availability, region, safety and cost per successful task.
Never answer with only a model name. In 2026, explain the evaluated capability tier and the routing/fallback policy.
Q6
Design security and networking for a regulated GCP GenAI platform.

Prevent

  1. Dedicated service accounts and least-privilege IAM.
  2. VPC Service Controls, Private Service Connect/private access and restricted egress where supported.
  3. CMEK, Secret Manager, DLP, tenant authorization and tool allowlists.

Detect & Respond

  1. Cloud Audit Logs, Security Command Center, Cloud Logging and alerting.
  2. Trace prompts, retrieval, tools and policy decisions with redaction.
  3. Human approval for high-impact actions; incident kill switch and credential rotation.

Threat model: prompt injection, data exfiltration, poisoned retrieval, excessive agency, insecure tool arguments, cross-tenant leakage and model/vendor outages.

Q7
How would you use Pub/Sub, Dataflow and BigQuery for GenAI data and evaluation pipelines?
  1. Publish immutable ingestion/evaluation events with schema version, tenant, correlation ID and idempotency key.
  2. Use Dataflow when transformations are high-volume, streaming, windowed or need replay; use Cloud Run workers for simpler task queues.
  3. Land sanitized traces and evaluation facts in BigQuery; partition by event date, cluster by tenant/model/version, and enforce retention/access policies.
  4. Monitor backlog age, dead letters, duplicate rate, processing latency, quality drift and evaluation cost.
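
The event envelope and duplicate handling from the first step can be sketched with the standard library only. This is a hedged illustration: `EvalEvent`, `make_event` and `DedupingConsumer` are hypothetical names, the `gs://` URI is made up, and a real pipeline would publish through the Pub/Sub client and keep seen keys in a durable store rather than in memory.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalEvent:
    """Immutable event envelope (hypothetical field names)."""
    schema_version: str
    tenant: str
    correlation_id: str
    idempotency_key: str
    payload: str  # JSON-encoded body, assumed to be sanitized already


def make_event(tenant: str, correlation_id: str, body: dict) -> EvalEvent:
    # Canonical JSON so the same logical body always hashes to the same key.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    key = hashlib.sha256(f"{tenant}|{payload}".encode()).hexdigest()
    return EvalEvent("1", tenant, correlation_id, key, payload)


class DedupingConsumer:
    """Processes each idempotency key once; duplicates are counted, not replayed."""

    def __init__(self) -> None:
        self._seen: set[str] = set()  # assumption: stand-in for a durable key store
        self.processed: list[EvalEvent] = []
        self.duplicates = 0

    def handle(self, event: EvalEvent) -> bool:
        if event.idempotency_key in self._seen:
            self.duplicates += 1
            return False
        self._seen.add(event.idempotency_key)
        self.processed.append(event)
        return True


consumer = DedupingConsumer()
first = make_event("tenant-a", "req-1", {"doc": "gs://bucket/a.pdf", "step": "parse"})
# Same tenant and body (key order differs), delivered under another correlation ID.
redelivered = make_event("tenant-a", "req-2", {"step": "parse", "doc": "gs://bucket/a.pdf"})
consumer.handle(first)
consumer.handle(redelivered)
```

The duplicate counter is the raw signal behind the "duplicate rate" metric in step 4.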
Q8
BigQuery vs AlloyDB vs Spanner for GenAI application data.
Store | Primary Role | Example
BigQuery | Analytical warehouse | Trace analytics, offline evals, cohorts and cost reporting
AlloyDB | Relational operational data with PostgreSQL compatibility | Conversation metadata, app transactions and vector-relational queries
Spanner | Globally consistent, horizontally scalable operational data | Multi-region agent state requiring strong consistency

State your access patterns, consistency needs, region topology, transaction boundaries, retention and expected growth before naming a database.

Q9
Describe CI/CD, observability and evaluation for a GCP GenAI service.
  • Build/deploy: Cloud Build → Artifact Registry → Cloud Deploy or controlled infrastructure pipeline; promote immutable application, prompt, model-policy and dataset versions.
  • Pre-production gates: unit/contract/security tests plus golden-set quality, latency and cost thresholds; canary before full rollout.
  • Runtime: Cloud Logging/Monitoring/Trace with correlation IDs across retrieval, model and tools. Alert on SLOs, quotas, safety, groundedness, tool failures and spend.
  • Rollback: independently roll back prompt, model route, index alias and application version.
Q10
Map an AWS GenAI architecture to GCP without pretending the services are identical.
Concern | AWS Example | GCP Example
Managed foundation models | Amazon Bedrock | Vertex AI
Managed agent operations | Bedrock AgentCore | Vertex AI Agent Engine / Gemini Enterprise Agent Platform
Serverless containers/functions | ECS/Fargate, Lambda | Cloud Run, Cloud Run functions
Messaging/streaming | SQS/SNS/Kinesis | Pub/Sub
Vector/search | OpenSearch, Bedrock Knowledge Bases | Vector Search, RAG Engine, Agent Search
Warehouse/analytics | Redshift/Athena | BigQuery

Senior answer: map capabilities, then re-evaluate IAM, networking, regional availability, quotas, reliability, operating skills and cost. A service-name translation is not an architecture migration.

Vector Databases
HNSW, ANN algorithms, Pinecone vs OpenSearch, metadata filtering
Core Topic 5 questions
Q1
How does HNSW work? Why is it the dominant ANN algorithm?

HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph. Higher layers have few nodes with long-range "highway" connections; lower layers get progressively denser with short-range connections — mimicking how road networks work (motorways → main roads → local streets).

Search: Starts at an entry point in the top layer, greedily walks toward the query vector, then descends to lower layers for precision. Gives O(log n) average search time with high recall.

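Algorithm | Speed | Recall | Memory | Best for
HNSW | Very fast | High (tunable via ef_search) | High (vectors plus graph links) | Low-latency production
IVF-Flat | Fast | Medium (depends on probes) | Low (vectors plus centroids) | Large-scale, budget memory
IVF-PQ | Fast | Medium | Very low (compressed codes) | Billion-scale (with compression)
Flat (exact) | Slow, O(n) | Perfect | Low (raw vectors, no index overhead) | Eval benchmarks, <100K docs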

Key HNSW parameters: ef_construction (build quality, more = slower build but better graph), M (connections per node, more = better recall but more memory), ef_search (search time recall trade-off).
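
The layered descent can be illustrated on a toy graph. This is a simplified sketch, not a real HNSW implementation: the points, the two hand-built layers and the function names are assumptions, and the walk keeps only one candidate (equivalent to ef_search = 1), whereas real HNSW keeps a beam of ef_search candidates on the bottom layer.

```python
import math

# Toy 1-D dataset: node id -> coordinate (hypothetical values).
POINTS = {i: (float(i),) for i in range(8)}

# Layer 0 is dense with short links; layer 1 is sparse with long-range links.
LAYERS = [
    {i: [j for j in (i - 1, i + 1) if j in POINTS] for i in POINTS},  # layer 0
    {0: [4], 4: [0, 7], 7: [4]},                                       # layer 1
]


def greedy_layer_search(graph, entry, query):
    """Move to the neighbour closest to the query until no neighbour improves."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(POINTS[n], query))
        if math.dist(POINTS[best], query) >= math.dist(POINTS[current], query):
            return current
        current = best


def search(query, entry=0):
    node = entry
    for graph in reversed(LAYERS):  # top layer first, then descend
        node = greedy_layer_search(graph, node, query)
    return node


nearest = search((5.2,))  # layer 1 jumps 0 -> 4, layer 0 refines 4 -> 5
```

The top layer covers most of the distance in one hop and the bottom layer does the fine-grained refinement, which is the intuition behind the motorway analogy.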

Q2
How do metadata filters work in vector search? What are the performance tradeoffs?

Metadata filtering restricts ANN search to a subset of vectors matching a filter condition (e.g., doc_type == "earnings_report" AND date >= 2026-01-01).

Three approaches with different tradeoffs:

  • Pre-filtering (filter then search): Narrow to matching documents first, then ANN search over that subset. High precision but if the subset is small, HNSW degrades toward brute force.
  • Post-filtering (search then filter): Run ANN over full index, then discard non-matching results. Fast but can return fewer than k results if many are filtered out.
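  • ACORN / filtered HNSW: Filter-aware graph traversal that evaluates the predicate while walking the graph, so the search can still return close to k matching results without scanning the whole subset. Weaviate offers an ACORN-based filter strategy; Pinecone and Qdrant describe their own filter-aware approaches, so check the engine's documentation and benchmark on your filters.
Always index your most common filter fields. In OpenSearch, declare metadata fields as keyword type (not text) for exact-match filtering. Choose by selectivity rather than cardinality: pre-filter when the filter keeps only a small fraction of the corpus (e.g., tenant_id in a multi-tenant index), and post-filter with over-fetching when the filter keeps most of it.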
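
The difference between pre- and post-filtering is easy to show with exact search standing in for ANN. The documents, vectors and function names below are made up for illustration.

```python
# (id, 2-D embedding, doc_type): toy data.
DOCS = [
    ("a", (1.0, 0.0), "news"),
    ("b", (0.9, 0.1), "news"),
    ("c", (0.8, 0.2), "news"),
    ("d", (0.0, 1.0), "report"),
    ("e", (0.1, 0.9), "report"),
    ("f", (0.7, 0.3), "report"),
]


def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def top_k(candidates, query, k):
    # Exact nearest neighbours; a real system would use an ANN index here.
    return sorted(candidates, key=lambda d: sq_dist(d[1], query))[:k]


def pre_filter_search(query, k, doc_type):
    subset = [d for d in DOCS if d[2] == doc_type]   # filter, then search
    return [d[0] for d in top_k(subset, query, k)]


def post_filter_search(query, k, doc_type):
    hits = top_k(DOCS, query, k)                     # search, then filter
    return [d[0] for d in hits if d[2] == doc_type]


query = (1.0, 0.0)
pre = pre_filter_search(query, k=2, doc_type="report")
post = post_filter_search(query, k=2, doc_type="report")
```

Here the two nearest vectors overall are both "news", so post-filtering returns nothing while pre-filtering still returns two reports. Over-fetching (searching for more than k before filtering) reduces this risk but does not remove it.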
Q3
Pinecone vs OpenSearch vs Qdrant vs pgvector — how do you choose?
DB | Best for | Hybrid search | Self-host? | Notes
OpenSearch | Search-heavy workloads, AWS integration, BM25+vector | Yes | Yes | Strong search feature set; benchmark relevance and operations
Pinecone | Managed vector search with minimal operations | Yes (sparse+dense) | No | Benchmark cost, filters, tenancy, and region fit
Qdrant | Vector-first workloads and payload filtering | Yes | Yes | Managed and self-hosted choices; benchmark your workload
Weaviate | Multi-tenancy, schema-rich | Yes (BM25+vector) | Yes | Strong for SaaS multi-tenant products
pgvector | Postgres-native joins, transactions, and vector search | Combine with Postgres FTS | Yes | Can scale significantly with tuning/partitioning; benchmark recall, latency, and ops
Chroma | Local, single-node, or distributed/Cloud vector workloads | Product-dependent | Yes | Choose deployment mode by scale and operational needs
Decision rule: benchmark your corpus, filters, update rate, QPS, recall target, p95 latency, tenancy, joins/transactions, regions, recovery, team skill, and total cost. There is no universally fastest or cheapest vector database.
Q4
What is quantisation in vector search? How does Product Quantisation (PQ) work?

Storing a 1536-dim float32 embedding takes 6KB. For 10M documents that's 60GB. Quantisation compresses vectors to reduce memory at the cost of some recall accuracy.

Product Quantisation (PQ) splits the 1536-dim vector into M sub-vectors (e.g., M=64, each 24-dim). Each sub-vector is mapped to its nearest centroid in a codebook of 256 centroids. Result: 64 bytes instead of 6144 bytes — 96× compression. Distance computation uses a precomputed lookup table, keeping it fast.

Binary quantisation: Convert each float to a single bit (positive=1, negative=0). 32× compression. Works surprisingly well for high-dimensional embeddings with cosine similarity.
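
The compression figures above can be checked with a few lines of arithmetic. The variable names are illustrative, and the codebook size is an extra detail not stated above: it is shared across all vectors, so it barely affects the per-vector ratio.

```python
DIM = 1536
FLOAT32_BYTES = 4
NUM_VECTORS = 10_000_000

raw_bytes = DIM * FLOAT32_BYTES                # 6144 bytes per vector
corpus_gb = raw_bytes * NUM_VECTORS / 1e9      # decimal gigabytes for the corpus

# Product quantisation: M sub-vectors, each replaced by one centroid id.
M = 64
CENTROIDS = 256
sub_dim = DIM // M                             # dimensions per sub-vector
bits_per_code = (CENTROIDS - 1).bit_length()   # 8 bits address 256 centroids
pq_bytes = M * bits_per_code // 8              # bytes per compressed vector
codebook_bytes = M * CENTROIDS * sub_dim * FLOAT32_BYTES  # shared, stored once

# Binary quantisation: one sign bit per dimension.
binary_bytes = DIM // 8

print(f"raw={raw_bytes} B, corpus={corpus_gb:.2f} GB")
print(f"PQ={pq_bytes} B ({raw_bytes // pq_bytes}x), codebook={codebook_bytes} B")
print(f"binary={binary_bytes} B ({raw_bytes // binary_bytes}x)")
```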

Q5
How do you handle vector index updates when documents change frequently?
  • Upsert by document ID: Every document has a stable ID. Reprocess and upsert when the document changes. All major vector DBs support upsert (Pinecone, OpenSearch, Qdrant).
  • Chunk deletion on update: When a document is updated, delete all chunks with parent_doc_id == document.id, then re-ingest. Track chunk-to-document mapping in a separate metadata store (DynamoDB/RDS).
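  • Event-driven ingestion: S3 Event Notifications → SQS → Lambda ingestion pipeline, or DynamoDB Streams consumed by a Lambda trigger when the source of truth is a table. Near-real-time index updates without polling. (S3 Object Lambda transforms objects on read; it is not an event source.)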
  • Versioned indexes: For zero-downtime updates on large corpora, build the new index alongside the old one, then do an atomic alias swap (OpenSearch index aliases).
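
The "delete chunks by parent, then re-ingest" bullet can be sketched with an in-memory stand-in. `ChunkIndex` and its fields are hypothetical; a real system would issue delete-by-filter and upsert calls against the vector store and keep the mapping in the metadata store.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkIndex:
    """In-memory stand-in for a vector index plus a chunk-to-document mapping."""
    chunks: dict[str, dict] = field(default_factory=dict)          # chunk_id -> record
    by_parent: dict[str, set[str]] = field(default_factory=dict)   # doc_id -> chunk ids

    def replace_document(self, doc_id: str, version: int, texts: list[str]) -> None:
        # Remove every chunk of the previous version so no stale chunk survives
        # when the new version has fewer chunks than the old one.
        for chunk_id in self.by_parent.pop(doc_id, set()):
            del self.chunks[chunk_id]
        ids = set()
        for ordinal, text in enumerate(texts):
            chunk_id = f"{doc_id}:{version}:{ordinal}"
            self.chunks[chunk_id] = {"parent_doc_id": doc_id, "text": text}
            ids.add(chunk_id)
        self.by_parent[doc_id] = ids


index = ChunkIndex()
index.replace_document("doc-1", 1, ["intro", "pricing", "appendix"])
index.replace_document("doc-1", 2, ["intro", "new pricing"])
remaining = sorted(index.chunks)
```

Delete-then-insert leaves a short window in which the document is missing from the index; the versioned-index bullet avoids that window at the cost of building a second index.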
HLD System Design (AI Systems)
End-to-end architecture, capacity, reliability, security, evaluation and cost
Must-ace 9 designs
Framework for every AI system design answer: Ingestion → Data model → Embedding/Retrieval → Generation → Evaluation → Monitoring → Security. Always mention latency targets, failure modes, and cost tradeoffs.
D1
Design an enterprise document Q&A for 10M documents (finance use case). Handle 500 QPS.
  1. I
    Ingestion: S3 → SQS (decouple) → Lambda fleet (parse PDF/DOCX/XLSX, hierarchical chunk: parent 1024t / child 256t, 50t overlap) → an evaluated Bedrock embedding model → OpenSearch Serverless (kNN + BM25 fields). Metadata: doc_type, entity_id, created_at, source. Row-level security by entity_id.
  2. D
    Data model: OpenSearch index with both embedding: knn_vector(dimension) and content: text (for BM25). Store the evaluated embedding dimension and model version in the index schema. Separate metadata fields as keyword type. DynamoDB tracks chunk-to-document mapping for updates/deletions.
  3. E
    Retrieval: Hybrid BM25 + kNN with evaluated fusion → an evaluated reranker when useful → evidence budget sent to the generation model. Add a version-aware semantic cache only after measuring correctness, staleness, latency, and hit rate.
  4. G
    Generation: An evaluated Bedrock reasoning/generation model. System prompt enforces: cite source for every claim, flag if answer not in context. Bedrock Guardrails for PII + topic blocking. Response streaming via Lambda Response Streaming.
  5. Ev
    Evaluation: versioned offline set plus risk-based production sampling. Calibrate retrieval, groundedness, task success, safety, latency, and cost gates against human review and business impact; do not present generic RAGAS thresholds as universal.
  6. M
    Monitoring: CloudWatch: latency P95 < 1.5s, error rate < 0.5%, cost per query. LangSmith: full trace per request. X-Ray for distributed tracing across Lambda chains.
  7. S
    Security: IAM roles per service. Bedrock Guardrails PII redaction. Entity-level index partitioning (user A cannot retrieve user B's documents). All queries audit-logged to S3.
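Tradeoffs to mention: Re-ranking adds latency (often on the order of 100ms; measure it on your stack) and mainly improves top-k precision and ordering rather than recall, which is capped by the first-stage candidates. For finance queries that is usually worth it. Semantic cache saves cost but can serve stale answers: set TTLs by data volatility (for example, about 1 hour for fast-moving financial data and 24 hours for regulatory documents) and invalidate on index version change.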
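
The retrieval step says "evaluated fusion" without naming a method. One common candidate to evaluate is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as the design's chosen method; the document IDs and the conventional constant k=60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d1", "d2", "d3"]   # lexical ranking
knn_hits = ["d3", "d1", "d4"]    # vector ranking
fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and vector similarities live on different scales. Whether it beats weighted score fusion is something to measure on the golden set.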
D2
Design a multi-agent customer support bot with escalation to human agents.
  1. Architecture: Supervisor pattern in LangGraph: IntentClassifier node (routes query type) → either KnowledgeAgent (RAG over support docs), ActionAgent (CRM/order system APIs), or EscalationAgent (human handoff).
  2. Intent classification: An evaluated low-latency model tier classifies intent: information_request | account_action | complaint | escalation. Select it against an intent test set and explicit latency/cost SLO rather than relying on a model-name assumption.
  3. Escalation triggers: (a) User explicitly asks for human, (b) agent confidence < threshold for 2 turns, (c) complaint category detected, (d) max turns (10) reached. LangGraph uses interrupt_before at escalation node.
  4. Context handoff: Full conversation summary + entity extraction (customer ID, issue category, sentiment score) passed to human agent dashboard via Amazon Connect + DynamoDB.
  5. Memory: Short-term in LangGraph state. Long-term in DynamoDB keyed by customer_id (preferences, past issues), injected as context on next session.
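
The escalation triggers are deterministic rules, so they can live outside the model. A minimal sketch follows; `TurnState`, `should_escalate` and the 0.6 confidence threshold are assumptions, and the threshold would need calibration against real conversations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnState:
    user_asked_for_human: bool
    intent: str
    recent_confidences: list[float]  # most recent turn last
    turn_count: int


def should_escalate(state: TurnState, threshold: float = 0.6,
                    low_conf_turns: int = 2, max_turns: int = 10) -> Optional[str]:
    """Return the trigger name, or None when the bot should keep handling the chat."""
    if state.user_asked_for_human:
        return "explicit_request"
    last = state.recent_confidences[-low_conf_turns:]
    if len(last) == low_conf_turns and all(c < threshold for c in last):
        return "low_confidence"
    if state.intent == "complaint":
        return "complaint"
    if state.turn_count >= max_turns:
        return "max_turns"
    return None


reason = should_escalate(TurnState(False, "information_request", [0.9, 0.4, 0.3], 3))
```

Returning the trigger name rather than a boolean makes the handoff reason available to the human agent dashboard and to later analytics.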
D3
Design an LLM evaluation pipeline at scale (continuous eval for production).
  1. Data collection: Risk-based, privacy-reviewed production sampling to an asynchronous evaluation queue. Oversample rare, high-impact, low-confidence, and newly changed slices.
  2. Judge layer: Use deterministic checks where possible and one or more evaluated judge models for rubric-based criteria. Calibrate judges against blinded human labels, monitor agreement/bias, and version judge prompts/models.
  3. Aggregation: Daily Lambda aggregates scores → writes to RDS → visualised in Grafana dashboard.
  4. CI gate: Every code, prompt, model, retriever, or policy change triggers a representative regression suite. Fail or require review when calibrated quality/safety tolerances or SLOs are breached.
  5. Feedback loop: Low-scoring samples surfaced to human reviewers → added to golden test set → improves future evals.
D4
Design a multi-region, multi-provider GenAI gateway for 2,000 QPS.
Requirements: P95 first token < 700ms, 99.95% availability, tenant quotas, residency, streaming, provider failover.
Core path: Global edge → auth/quota → policy router → provider adapters → safety/output validation → stream response.
State: Tenant policy/config strongly consistent; request state regional and ephemeral; usage events asynchronously aggregated.
Reliability: Timeout budgets, retry only safe failures, circuit breakers, regional/provider health, degraded fallback.
  1. C
    Capacity: derive concurrent streams from QPS × average duration; size connection pools and rate limits against provider quotas, not only CPU.
  2. S
    Security: per-tenant keys/policies, secret isolation, egress allowlists, audit trail and prompt/response retention controls.
  3. X
    Cross-question answers: prevent duplicate billable calls with an invocation ledger, idempotency key and explicit handling of ambiguous timeouts. Route fallback only to providers that satisfy the required schema/tool capability contract, then validate output. Test residency with policy-as-code, regional routing tests, blocked cross-region egress, and audit-log evidence.
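
The capacity step is Little's law: concurrency = arrival rate × average time in system. A back-of-envelope sketch follows. Only the 2,000 QPS figure comes from the requirement; the stream duration, per-instance limit and utilisation target are assumptions to replace with measured values.

```python
import math

QPS = 2000                    # from the stated requirement
AVG_STREAM_SECONDS = 6.0      # assumption: measure from real traffic
STREAMS_PER_INSTANCE = 400    # assumption: from a load test of one gateway instance
TARGET_UTILISATION = 0.7      # assumption: headroom for bursts and failover

# Little's law: concurrent streams = arrival rate x average duration.
concurrent_streams = QPS * AVG_STREAM_SECONDS
instances = math.ceil(concurrent_streams / (STREAMS_PER_INSTANCE * TARGET_UTILISATION))

print(f"{concurrent_streams:.0f} concurrent streams, {instances} gateway instances")
```

Run the same calculation against provider limits (requests and tokens per minute): the binding constraint is often the provider quota rather than gateway CPU.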
D5
Design an enterprise agent platform with tools, memory and human approval.
  1. Control plane: agent definitions, prompt/model/tool versions, policy, evaluations, rollout and audit.
  2. Data plane: isolated runtime executes a bounded state machine; tool gateway authenticates, authorizes, validates arguments and records effects.
  3. Memory: separate immutable event history, short-term working state and curated long-term memory with consent, expiry and deletion.
  4. Safety: risk-tier tools; read-only may auto-run, money/data mutations require deterministic checks and human approval. Never trust model text as authorization.
  5. Operate: trace every reasoning-independent event, tool call, policy decision and approval; cap steps/tokens/time and provide kill switches.
Cross-question answers: resume from a durable checkpoint after the last committed state transition and reacquire a lease. Prevent replayed tool effects with invocation IDs, idempotency keys and stored effect receipts. Evaluate nondeterministic agents across repeated runs using terminal-task success, policy violations, tool correctness, path efficiency, latency and cost rather than requiring one exact trajectory.
D6
Design a multimodal document-intelligence platform for invoices and contracts.

Pipeline

  1. Upload → malware scan → immutable object store.
  2. OCR/layout/table extraction → normalized document graph.
  3. Rules + model extraction → schema validation → confidence routing.
  4. Human review for low-confidence/high-value fields.

Production Concerns

  1. Idempotent jobs, page-level retries and lineage to source coordinates.
  2. Per-field precision/recall, review rate and business-value metrics.
  3. PII isolation, retention, tenant keys and deletion propagation.
  4. Versioned extractors and replayable raw documents.

API/data model: POST /documents, GET /jobs/{id}, webhook completion; Document, Page, Element, Extraction, Evidence, ReviewDecision and ModelVersion.

D7
Design a real-time voice assistant with interruption and tool calling.
  1. P
    Path: WebRTC/media gateway → streaming ASR → dialogue/agent runtime → tools/RAG → streaming TTS. Keep regional session state and a durable event summary.
  2. L
    Latency: budget each stage; stream partial transcripts and speech; use VAD and speculative work carefully. Measure time-to-first-audio and interruption-stop latency.
  3. B
    Barge-in: cancel TTS and downstream work, advance the turn epoch, and reject late results from the old epoch.
  4. R
    Reliability: reconnect token, session checkpoint, provider fallback, bounded silence/timeouts and graceful transfer to human.

Cross-question: How do you prevent a partially heard confirmation from triggering a payment? Require explicit deterministic confirmation before high-impact tools.
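
The turn-epoch idea behind barge-in fits in a few lines. `VoiceSession` and its methods are hypothetical names; a real runtime would also cancel the in-flight TTS and tool tasks rather than only rejecting their late output.

```python
class VoiceSession:
    """Tracks the current turn epoch and drops output from cancelled turns."""

    def __init__(self) -> None:
        self.epoch = 0
        self.spoken: list[str] = []

    def start_turn(self) -> int:
        # Work started now is tagged with the epoch that was current at start.
        return self.epoch

    def barge_in(self) -> int:
        # User interrupted: anything tagged with an older epoch is now stale.
        self.epoch += 1
        return self.epoch

    def deliver(self, turn_epoch: int, audio_chunk: str) -> bool:
        if turn_epoch != self.epoch:
            return False  # late result from a cancelled turn, reject it
        self.spoken.append(audio_chunk)
        return True


session = VoiceSession()
turn_1 = session.start_turn()
session.deliver(turn_1, "Your balance")
session.barge_in()                                   # user starts talking
late = session.deliver(turn_1, " is 1,204 dollars")  # arrives after the interrupt
turn_2 = session.start_turn()
session.deliver(turn_2, "Sure, stopping.")
```

The same epoch check applies to tool results: a payment confirmation produced by a cancelled turn must be rejected, not executed.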

D8
Design an intelligent model-routing and semantic-cache platform.
  • Router inputs: task type, complexity, modality, context size, tenant policy, region, safety tier, live health, measured quality, latency and cost.
  • Policy: deterministic hard constraints first; learned or rules-based ranking second; fallback must preserve schema/tool capability and safety.
  • Cache: key includes normalized intent plus tenant, policy, prompt, model family, knowledge/index version and safety context. Never share across authorization boundaries.
  • Evaluate: counterfactual shadow traffic, quality-regret, cache precision, hit rate, cost per accepted task and tail latency.
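
"Hard constraints first, ranking second" can be sketched as a filter followed by a sort. The `Route` fields, route names and numbers are made up; real quality and cost figures would come from offline evaluation and billing data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    name: str
    regions: frozenset
    supports_tools: bool
    healthy: bool
    quality: float        # measured offline score in [0, 1] (assumed metric)
    cost_per_task: float  # assumed unit: currency per accepted task


def choose_route(routes, region: str, needs_tools: bool,
                 min_quality: float) -> Optional[str]:
    # Stage 1: deterministic hard constraints. A route failing any one is excluded.
    eligible = [
        r for r in routes
        if region in r.regions
        and r.healthy
        and (r.supports_tools or not needs_tools)
        and r.quality >= min_quality
    ]
    if not eligible:
        return None  # caller decides: degrade, queue or fail closed
    # Stage 2: rank the survivors, cheapest first, higher quality as tiebreak.
    return min(eligible, key=lambda r: (r.cost_per_task, -r.quality)).name


ROUTES = [
    Route("large", frozenset({"eu", "us"}), True, True, 0.92, 0.040),
    Route("small", frozenset({"eu"}), False, True, 0.81, 0.004),
    Route("mid", frozenset({"us"}), True, True, 0.88, 0.012),
]
with_tools = choose_route(ROUTES, "eu", needs_tools=True, min_quality=0.8)
without_tools = choose_route(ROUTES, "eu", needs_tools=False, min_quality=0.8)
```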
D9
Design a secure enterprise MCP/tool platform.
Registry: Versioned tool schemas, owners, risk tiers, scopes, regions and deprecation.
Gateway: Authenticate caller/agent, authorize each invocation, validate schema and policy, rate-limit and audit.
Execution: Sandbox/isolated connector, short-lived credentials, egress restrictions, idempotency and effect receipts.
Governance: Approval for mutations, supply-chain checks, kill switch, usage/effect monitoring and incident replay.

Key principle: the model proposes a tool call; deterministic systems decide whether and how it executes. Treat tool output as untrusted input before returning it to an agent.

LLD & Machine Coding for GenAI
Interfaces, data models, state machines, reliability patterns and testable code
Implementation 8 drills
LLD answer order: clarify use cases and invariants → define interfaces/entities → show the critical sequence/state transition → handle concurrency/failures → make it testable and extensible. Patterns are vocabulary, not the goal.
L1
Design a provider-agnostic LLM gateway SDK.
Python · Strategy + Adapter + Factory
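from __future__ import annotations  # request/result types are defined elsewhere
from typing import AsyncIterator, Protocol

class ModelProvider(Protocol):
    async def generate(self, req: GenerationRequest) -> GenerationResult: ...
    # An async generator is typed as a plain def returning AsyncIterator.
    def stream(self, req: GenerationRequest) -> AsyncIterator[TokenEvent]: ...

class Gateway:
    def __init__(self, router, providers, policies, telemetry):
        self.router, self.providers = router, providers
        self.policies, self.telemetry = policies, telemetry

    async def generate(self, req):
        checked = self.policies.validate(req)   # policy check before routing
        route = self.router.choose(checked)     # pick provider and adapted request
        return await self.providers[route.provider].generate(route.request)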

Use canonical request/result types and provider adapters. Keep routing, retry, safety, telemetry and provider calls separate. Preserve provider-specific capabilities through an explicit capability contract rather than leaking arbitrary dictionaries everywhere.

Cross-question answers: Record each invocation ID and outcome so an ambiguous timeout is reconciled before retry. Surface streaming failures as typed terminal events while preserving already emitted tokens and trace IDs. Test adapters with provider contract tests, deterministic fake providers, recorded fixtures and a small gated live integration suite.

L2
Design an idempotent RAG ingestion pipeline.
Core entities and state machine
Document(id, tenant_id, source_uri, content_hash, version)
IngestionJob(id, document_id, version, state, attempt, lease_until)
Chunk(id, document_id, version, ordinal, text_hash, metadata)

RECEIVED -> PARSED -> CHUNKED -> EMBEDDED -> INDEXED -> ACTIVE
                         \-> FAILED_RETRYABLE | FAILED_FINAL
  • Use (tenant_id, source_uri, content_hash) or an explicit idempotency key to deduplicate.
  • Workers claim jobs with leases; every transition uses compare-and-set. Retries resume from durable artifacts.
  • Build a new document version, verify counts/quality, atomically switch active version, then garbage-collect old chunks.
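
Deduplication and compare-and-set transitions can be sketched with an in-memory store. `JobStore`, `dedupe_key` and the sample URI are hypothetical, and only the happy-path states are modelled; a real store would use conditional writes in a database plus the lease fields from IngestionJob.

```python
import hashlib

# Happy-path transitions from the state machine above.
NEXT_STATE = {
    "RECEIVED": "PARSED",
    "PARSED": "CHUNKED",
    "CHUNKED": "EMBEDDED",
    "EMBEDDED": "INDEXED",
    "INDEXED": "ACTIVE",
}


def dedupe_key(tenant_id: str, source_uri: str, content: bytes) -> str:
    content_hash = hashlib.sha256(content).hexdigest()
    return f"{tenant_id}|{source_uri}|{content_hash}"


class JobStore:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self) -> None:
        self._state: dict[str, str] = {}

    def submit(self, key: str) -> bool:
        # Create the job once; resubmitting identical content is a no-op.
        if key in self._state:
            return False
        self._state[key] = "RECEIVED"
        return True

    def advance(self, key: str, expected: str) -> bool:
        # Compare-and-set: move forward only if the state is still `expected`.
        if self._state.get(key) != expected or expected not in NEXT_STATE:
            return False
        self._state[key] = NEXT_STATE[expected]
        return True

    def state(self, key: str) -> str:
        return self._state[key]


store = JobStore()
key = dedupe_key("tenant-a", "s3://bucket/report.pdf", b"same bytes")
created = store.submit(key)
resubmitted = store.submit(dedupe_key("tenant-a", "s3://bucket/report.pdf", b"same bytes"))
worker_a = store.advance(key, "RECEIVED")  # wins the transition
worker_b = store.advance(key, "RECEIVED")  # stale view of the job, rejected
```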
L3
Design a safe agent tool registry and executor.
Contracts
ToolSpec(name, version, input_schema, required_scopes, risk_tier, timeout)
Invocation(id, actor, agent, tool, args_hash, status, effect_receipt)

execute(invocation):
  authenticate -> authorize -> schema_validate -> policy_check
  -> approve_if_required -> run_isolated -> validate_output -> audit

Use Registry/Repository for discovery, Command for invocations, Policy object for authorization, and an idempotency store for mutations. The executor receives short-lived scoped credentials; the LLM never receives a reusable secret.

L4
Design conversation, session and memory data models.
Entity | Purpose | Important Fields
Conversation | User-facing durable container | tenant, participants, policy, created_at
Turn/Event | Append-only source of truth | sequence, role/type, content_ref, trace_id, timestamp
Session | Runtime/checkpoint state | epoch, state, lease, expires_at
Memory | Curated reusable fact | subject, fact, evidence, consent, confidence, expiry

Use optimistic concurrency on sequence/epoch. Summaries are derived views, never the sole source of truth. Support deletion and retention across events, embeddings, caches and analytics copies.

L5
Implement rate limiting, retry and circuit breaking for model calls.
  • Rate limit: hierarchical token buckets for tenant, model/provider and global capacity; account for requests and estimated tokens.
  • Retry: only retry transient errors within the request deadline; exponential backoff with jitter; honor provider retry hints.
  • Circuit breaker: CLOSED → OPEN after threshold → HALF_OPEN probes → CLOSED on recovery. Scope by provider/region/model capability.
  • Correctness: attach invocation IDs and record outcomes so a timeout does not automatically become a duplicate billable/action call.
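
The circuit-breaker state machine can be sketched with an injectable clock so it is testable without sleeping. The class, thresholds and fake clock are assumptions. The sketch lets every call through in HALF_OPEN, whereas a production breaker would admit only a limited number of probes.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay OPEN before probing
        self.clock = clock              # injectable for deterministic tests
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        # After the cool-down, let traffic through as a probe.
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.reset_after:
            self.state = "HALF_OPEN"
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.state, self.failures = "CLOSED", 0

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", self.clock()


now = [0.0]  # fake clock for the demonstration
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
blocked = not breaker.allow()        # OPEN: calls are rejected
now[0] = 31.0                        # cool-down has elapsed
probe_allowed = breaker.allow()      # moves to HALF_OPEN
state_during_probe = breaker.state
breaker.record_success()             # probe succeeded, back to CLOSED
```

Keep one breaker per provider/region/model capability so a single bad region does not block healthy routes.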
L6
Design an extensible LLM evaluation framework.
Evaluator contracts
class Evaluator(Protocol):
    name: str
    async def score(self, case: EvalCase, output: Output) -> Score: ...

EvalCase(id, input, expected, rubric, tags, dataset_version)
Run(id, system_version, dataset_version, seed, status)
Score(metric, value, rationale, evaluator_version, cost)

Support deterministic checks, retrieval metrics, human rubrics and model judges behind one interface. Run asynchronously with bounded concurrency; persist every version and raw artifact. Compare paired cases and confidence intervals, not only averages.

L7
Design a secure semantic cache.

Two-stage lookup: exact canonical key first; semantic ANN lookup second. A candidate is reusable only if similarity passes an evaluated threshold and all hard dimensions match: tenant/ACL, locale, policy, prompt version, model capability, knowledge/index version and tool-state class.

  • Store answer, evidence, creation/expiry, safety decision and provenance; encrypt and enforce tenant boundaries.
  • Invalidate by version bump or event; use short TTL for volatile facts and never cache sensitive/high-impact actions by default.
  • Measure cache precision with sampled replay, not just hit rate. A false hit can be more expensive than a miss.
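
The two-stage lookup can be sketched with tiny vectors standing in for embeddings. `SemanticCache`, the reduced set of hard dimensions and the 0.95 threshold are assumptions. A real cache would use an ANN index, expiry and encryption, and an evaluated threshold.

```python
import math

# Subset of the hard dimensions listed above, kept short for the sketch.
HARD_DIMENSIONS = ("tenant", "locale", "policy", "prompt_version", "index_version")


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


class SemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._exact = {}     # (scope, normalized question) -> answer
        self._entries = []   # (scope, embedding, answer)

    @staticmethod
    def _scope(context: dict) -> tuple:
        return tuple(context[d] for d in HARD_DIMENSIONS)

    def put(self, question, embedding, context, answer) -> None:
        scope = self._scope(context)
        self._exact[(scope, question.strip().lower())] = answer
        self._entries.append((scope, embedding, answer))

    def get(self, question, embedding, context):
        scope = self._scope(context)
        exact = self._exact.get((scope, question.strip().lower()))
        if exact is not None:
            return exact                           # stage 1: exact canonical key
        candidates = [(cosine(embedding, e), a)    # stage 2: same scope only
                      for s, e, a in self._entries if s == scope]
        if not candidates:
            return None
        sim, answer = max(candidates, key=lambda c: c[0])
        return answer if sim >= self.threshold else None


ctx_a = {"tenant": "a", "locale": "en", "policy": "p1",
         "prompt_version": "v3", "index_version": "i7"}
ctx_b = {**ctx_a, "tenant": "b"}
cache = SemanticCache(threshold=0.95)
cache.put("What is the refund window?", (1.0, 0.0), ctx_a, "30 days")
same_tenant = cache.get("How long do I have to get a refund?", (0.99, 0.1), ctx_a)
other_tenant = cache.get("How long do I have to get a refund?", (0.99, 0.1), ctx_b)
```

The scope check runs before any similarity comparison, so a near-identical question from another tenant can never produce a hit.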
L8
Machine coding: build an async streaming chat service.
Expected API behavior
POST /v1/conversations/{id}/messages
Headers: Idempotency-Key, Last-Event-ID
Response: text/event-stream
events: accepted, retrieval, token, citation, completed | failed

cancel_event set -> stop provider stream -> persist terminal status
client disconnect -> bounded cleanup; do not leak provider connections

What interviewer checks: async iteration and backpressure, cancellation, timeout budgets, idempotency, ordered event persistence, reconnect/resume, dependency injection, validation and tests for disconnect/error races.

Tests: duplicate request, provider fails before/after first token, slow client, cancellation race, reconnect from event ID, authorization failure and no leaked tasks.
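
The cancellation path can be sketched with asyncio alone. `fake_provider`, `stream_chat` and `collect` are hypothetical names. The sketch covers event ordering, cancellation and provider cleanup only, and omits the HTTP layer, persistence, idempotency and reconnect.

```python
import asyncio


async def fake_provider(tokens):
    """Deterministic stand-in for a provider token stream."""
    for token in tokens:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def stream_chat(provider, cancel_event):
    """Emit ordered events and always close the provider stream."""
    yield ("accepted", None)
    try:
        async for token in provider:
            if cancel_event.is_set():
                yield ("failed", "cancelled")  # typed terminal event
                return
            yield ("token", token)
        yield ("completed", None)
    finally:
        await provider.aclose()  # do not leak the provider connection


async def collect(tokens, cancel_after=None):
    cancel_event = asyncio.Event()
    events = []
    async for event in stream_chat(fake_provider(tokens), cancel_event):
        events.append(event)
        token_count = sum(1 for kind, _ in events if kind == "token")
        if cancel_after is not None and token_count == cancel_after:
            cancel_event.set()  # simulate a user pressing "stop"
    return events


full = asyncio.run(collect(["Hel", "lo"]))
cancelled = asyncio.run(collect(["Hel", "lo", "!"], cancel_after=1))
```

Tokens emitted before the cancel are preserved, and the stream still ends with exactly one terminal event. That property is what the disconnect and cancellation-race tests should assert.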
DSA — Arrays & Hashing
Most common pattern in GenAI company interviews (Shopify, Databricks, Cohere)
DSA 5 questions
Q1
Two Sum Easy

Given an integer array and a target, return indices of the two numbers that add up to target. You may not use the same element twice.

Key insight: For each element n, we need target - n. Store each value's index in a hash map. On each step, check if the complement is already in the map.

Python · O(n) time, O(n) space
def two_sum(nums: list[int], target: int) -> list[int]:
    seen = {}  # value → index
    for i, n in enumerate(nums):
        diff = target - n
        if diff in seen:
            return [seen[diff], i]
        seen[n] = i
    return []

# Example: two_sum([2,7,11,15], 9) → [0, 1]
Hash Map · O(n) time · O(n) space
Q2
Longest Consecutive Sequence Medium

Find the length of the longest sequence of consecutive integers. Must run in O(n) time.

Key insight: For each number, only start counting if it's the beginning of a sequence (n-1 is not in the set). This ensures each number is visited at most twice — O(n) overall.

Python · O(n) time, O(n) space
def longest_consecutive(nums: list[int]) -> int:
    num_set = set(nums)
    best = 0
    for n in num_set:
        if n - 1 not in num_set:   # sequence start
            cur, length = n, 1
            while cur + 1 in num_set:
                cur += 1; length += 1
            best = max(best, length)
    return best

# [100,4,200,1,3,2] → 4  (sequence 1,2,3,4)
Set · O(n)
Q3
Group Anagrams Medium

Group strings that are anagrams of each other into sublists.

Python · O(n·k·log k) — sort as key
from collections import defaultdict

def group_anagrams(strs: list[str]) -> list[list[str]]:
    groups = defaultdict(list)
    for s in strs:
        key = tuple(sorted(s))    # e.g. 'eat' → ('a','e','t')
        groups[key].append(s)
    return list(groups.values())
O(n·k) optimisation: Use a 26-length character count tuple as key instead of sorting. Avoids O(k log k) sort: key = tuple(Counter(s).get(c, 0) for c in 'abcdefghijklmnopqrstuvwxyz')
Hash Map · Sorting
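
The count-key optimisation can be written out in full. This sketch assumes lowercase a to z input, as the 26-length key implies; `group_anagrams_by_count` is an illustrative name.

```python
from collections import defaultdict


def group_anagrams_by_count(strs: list[str]) -> list[list[str]]:
    groups = defaultdict(list)
    for s in strs:
        counts = [0] * 26                     # one slot per lowercase letter
        for ch in s:
            counts[ord(ch) - ord("a")] += 1   # assumes 'a' <= ch <= 'z'
        groups[tuple(counts)].append(s)       # tuple is hashable, list is not
    return list(groups.values())


grouped = group_anagrams_by_count(["eat", "tea", "tan", "ate", "nat", "bat"])
```

Building each key costs O(k + 26) instead of O(k log k), so the total is O(n·k) for n strings of length up to k.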
Q4
Product of Array Except Self Medium — no division

For each index i, compute the product of all elements except nums[i]. O(n) time, no division operator.

Insight: result[i] = prefix_product[i-1] * suffix_product[i+1]. Do two passes: left prefix in one array, multiply by right suffix on the fly.

Python · O(n) time, O(1) extra space
def product_except_self(nums: list[int]) -> list[int]:
    n = len(nums)
    res = [1] * n
    # Left pass: res[i] = product of nums[0..i-1]
    prefix = 1
    for i in range(n):
        res[i] = prefix
        prefix *= nums[i]
    # Right pass: multiply by product of nums[i+1..n-1]
    suffix = 1
    for i in range(n - 1, -1, -1):
        res[i] *= suffix
        suffix *= nums[i]
    return res
Prefix/Suffix · O(1) extra space
Q5
Top K Frequent Elements Medium

Return the k most frequent elements. O(n) solution using bucket sort.

Python · O(n) bucket sort
from collections import Counter

def top_k_frequent(nums: list[int], k: int) -> list[int]:
    freq = Counter(nums)
    # Bucket index = frequency value
    buckets = [[] for _ in range(len(nums) + 1)]
    for val, cnt in freq.items():
        buckets[cnt].append(val)

    res = []
    for i in range(len(buckets) - 1, -1, -1):
        for val in buckets[i]:
            res.append(val)
            if len(res) == k: return res
    return res  # reached only when k exceeds the number of distinct values
Counter · Bucket Sort · O(n)
DSA — Sliding Window & Two Pointers
Essential for string/array problems common in practical coding rounds
DSA 3 questions
Q1
Longest Substring Without Repeating Characters Medium
Python · O(n) sliding window
def length_of_longest_substring(s: str) -> int:
    seen = {}  # char → last index
    left = best = 0
    for right, ch in enumerate(s):
        if ch in seen and seen[ch] >= left:
            left = seen[ch] + 1  # shrink from left
        seen[ch] = right
        best = max(best, right - left + 1)
    return best

# "abcabcbb" → 3 ("abc")
Sliding Window · Two Pointers · Hash Map
Q2
Minimum Window Substring Hard

Find the minimum window in string s that contains all characters of string t.

Python · O(n+m) sliding window
from collections import Counter

def min_window(s: str, t: str) -> str:
    if not s or not t:
        return ""  # empty t would otherwise shrink past the window and raise KeyError
    need = Counter(t)
    window = {}
    have, total = 0, len(need)
    best = (float("inf"), 0, 0)
    left = 0
    for right, ch in enumerate(s):
        window[ch] = window.get(ch, 0) + 1
        if ch in need and window[ch] == need[ch]:
            have += 1
        while have == total:
            if (right - left + 1) < best[0]:
                best = (right - left + 1, left, right)
            window[s[left]] -= 1
            if s[left] in need and window[s[left]] < need[s[left]]:
                have -= 1
            left += 1
    l, r = best[1], best[2]
    return s[l:r+1] if best[0] != float("inf") else ""
Q3
Container With Most Water Medium
Python · O(n) two pointers
def max_area(height: list[int]) -> int:
    left, right = 0, len(height) - 1
    best = 0
    while left < right:
        area = (right - left) * min(height[left], height[right])
        best = max(best, area)
        # Move the shorter wall inward
        if height[left] < height[right]: left += 1
        else: right -= 1
    return best
DSA — Trees & Graphs
BFS, DFS, binary trees — common in Shopify and Databricks rounds
DSA 4 questions
Q1
Binary Tree Level Order Traversal (BFS) Medium
Python · O(n) BFS with queue
from collections import deque

def level_order(root) -> list[list[int]]:
    if not root: return []
    result, q = [], deque([root])
    while q:
        level = []
        for _ in range(len(q)):   # snapshot size = current level
            node = q.popleft()
            level.append(node.val)
            if node.left:  q.append(node.left)
            if node.right: q.append(node.right)
        result.append(level)
    return result
BFS · Queue (deque) · O(n)
Q2
Number of Islands (DFS on grid) Medium
Python · O(m×n) DFS
def num_islands(grid: list[list[str]]) -> int:
    if not grid or not grid[0]:
        return 0  # empty grid: len(grid[0]) would raise IndexError
    rows, cols = len(grid), len(grid[0])
    count = 0

    def dfs(r, c):
        if r < 0 or c < 0 or r >= rows or c >= cols \
           or grid[r][c] != "1": return
        grid[r][c] = "0"   # mark visited in-place
        for dr, dc in [(1,0),(-1,0),(0,1),(0,-1)]:
            dfs(r+dr, c+dc)

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "1":
                dfs(r, c); count += 1
    return count
DFS · Grid / Flood Fill · O(m×n)
Q3
Course Schedule (Topological Sort / Cycle Detection) Medium

Given n courses and prerequisites, determine if you can finish all courses. Equivalent to detecting a cycle in a directed graph.

Python · O(V+E) DFS cycle detection
def can_finish(n: int, prereqs: list[list[int]]) -> bool:
    adj = [[] for _ in range(n)]
    for a, b in prereqs: adj[b].append(a)

    # 0=unvisited, 1=in-stack (cycle), 2=done
    state = [0] * n

    def dfs(node):
        if state[node] == 1: return False  # cycle
        if state[node] == 2: return True   # already processed
        state[node] = 1
        for nei in adj[node]:
            if not dfs(nei): return False
        state[node] = 2
        return True

    return all(dfs(i) for i in range(n) if state[i] == 0)
Topological Sort · Cycle Detection · DFS
Q4
Lowest Common Ancestor of a Binary Tree Medium
Python · O(n) recursive DFS
def lca(root, p, q):
    if not root or root == p or root == q:
        return root
    left  = lca(root.left,  p, q)
    right = lca(root.right, p, q)
    # If found in both subtrees → current node is LCA
    if left and right: return root
    return left or right
DFS · Post-order · O(n)
DSA — Dynamic Programming
1D DP, 2D DP, memoisation — top patterns for practical AI interviews
DSA 4 questions
Q1
Climbing Stairs / Fibonacci — 1D DP Easy
Python · O(n) time, O(1) space
def climb_stairs(n: int) -> int:
    if n <= 2: return n
    a, b = 1, 2
    for _ in range(3, n + 1):
        a, b = b, a + b  # rolling Fibonacci
    return b

Pattern recognition: Whenever f(n) = f(n-1) + f(n-2) or similar recurrence appears, recognise it as Fibonacci DP. Roll two variables instead of an array for O(1) space.
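To illustrate the "similar recurrence" case, here is a sketch of the same two-variable rolling trick on a different recurrence, f(i) = max(f(i-1), f(i-2) + nums[i]), which is the maximum sum of non-adjacent elements. This problem is not from the list above; it is included only as an assumed example of the pattern.

```python
def max_non_adjacent_sum(nums: list[int]) -> int:
    """Best total when no two chosen elements are adjacent, in O(1) space."""
    prev, curr = 0, 0   # best totals ending two positions back and one position back
    for x in nums:
        # Either skip x (keep curr) or take x on top of the best total two back
        prev, curr = curr, max(curr, prev + x)
    return curr
```

The shape is identical to climb_stairs: each state depends only on the previous two, so two variables replace the array.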

Q2
Longest Common Subsequence — 2D DP Medium

Directly used in diff algorithms, tokenisation comparison, and DNA sequence alignment.

Python · O(m×n) time and space
def lcs(text1: str, text2: str) -> int:
    m, n = len(text1), len(text2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text1[i-1] == text2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]
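Interviewers often follow up by asking for the subsequence itself rather than its length, which is the step diff tools rely on. One way to do this, sketched below under the assumption that any one longest subsequence is acceptable, is to walk back through the filled table. The name lcs_string is hypothetical.

```python
def lcs_string(text1: str, text2: str) -> str:
    """Return one longest common subsequence by backtracking through the DP table."""
    m, n = len(text1), len(text2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text1[i-1] == text2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if text1[i-1] == text2[j-1]:
            out.append(text1[i-1])      # matching character is part of the LCS
            i -= 1
            j -= 1
        elif dp[i-1][j] >= dp[i][j-1]:
            i -= 1                      # follow the larger neighbour upward
        else:
            j -= 1                      # or leftward
    return "".join(reversed(out))
```

Backtracking needs the full 2D table, so the O(m×n) space cannot be rolled down to two rows here.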
Q3
0/1 Knapsack — tabulation Medium
Python · O(n×W) time, O(W) space
def knapsack(weights: list, values: list, W: int) -> int:
    n = len(weights)
    dp = [0] * (W + 1)
    for i in range(n):
        for w in range(W, weights[i] - 1, -1):  # reverse!
            dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
    return dp[W]

Key: Iterate the capacity w in reverse when using 1D DP, so dp[w - weights[i]] still holds the value from before item i was considered and each item is used at most once. Forward iteration gives unbounded knapsack (items can repeat).
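To make the contrast concrete, here is a sketch of the unbounded variant. The only change from the 0/1 version is the direction of the inner loop; the name unbounded_knapsack is hypothetical.

```python
def unbounded_knapsack(weights: list[int], values: list[int], W: int) -> int:
    """Max value for capacity W when each item may be taken any number of times."""
    dp = [0] * (W + 1)
    for wt, val in zip(weights, values):
        # Forward loop: dp[w - wt] may already include this item, so it can repeat
        for w in range(wt, W + 1):
            dp[w] = max(dp[w], dp[w - wt] + val)
    return dp[W]
```

With weights [2, 3], values [3, 5], and W = 7, the 0/1 version can take each item once for a value of 8, while this version can take the weight-2 item twice plus the weight-3 item for a value of 11.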

Q4
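Word Break (bottom-up DP) Medium

Determine whether string s can be segmented into words from a dictionary. The same idea underlies dictionary-based text segmentation, a simpler relative of subword tokenisation.

Python · O(n²) substring checks (each slice costs up to O(n)), O(n) space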
def word_break(s: str, words: list[str]) -> bool:
    word_set = set(words)
    dp = [False] * (len(s) + 1)
    dp[0] = True  # empty string is breakable
    for i in range(1, len(s) + 1):
        for j in range(i):
            if dp[j] and s[j:i] in word_set:
                dp[i] = True; break
    return dp[len(s)]
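The table above is filled bottom-up. If the interviewer asks for the memoised form, the same recurrence can be written top-down with functools.lru_cache. A minimal sketch, assuming inputs short enough that recursion depth is not a concern; the name word_break_memo is hypothetical.

```python
from functools import lru_cache

def word_break_memo(s: str, words: list[str]) -> bool:
    """Top-down Word Break: can_break(i) answers 'can s[i:] be segmented?'."""
    word_set = set(words)

    @lru_cache(maxsize=None)
    def can_break(start: int) -> bool:
        if start == len(s):
            return True                 # consumed the whole string
        # Try every dictionary word that begins at `start`
        return any(
            s[start:end] in word_set and can_break(end)
            for end in range(start + 1, len(s) + 1)
        )

    return can_break(0)
```

Both versions evaluate each start index once; the bottom-up table is the safer choice for very long strings because it does not recurse.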
Python Patterns for GenAI
Async, FastAPI, decorators, generators — practical coding round essentials
Practical 4 questions
Q1
Write a production FastAPI endpoint for async LLM streaming.
Python · FastAPI + Anthropic async streaming
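import asyncio, json, logging

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-6"

class ChatRequest(BaseModel):
    query: str
    session_id: str = "default"

@app.post("/chat")
async def chat_stream(req: ChatRequest):
    async def generate():
        try:
            async with client.messages.stream(
                model=MODEL,
                max_tokens=1024,
                system="You are a helpful GenAI assistant.",
                messages=[{"role": "user", "content": req.query}],
            ) as stream:
                # text_stream is an async iterator attribute, so it is not called
                async for text in stream.text_stream:
                    # JSON-encode so newlines in a chunk cannot break SSE framing
                    yield f"data: {json.dumps({'text': text})}\n\n"
        except Exception:
            # Headers are already sent, so report the failure in-band;
            # log the details server-side instead of leaking them to the client
            logging.exception("LLM stream failed")
            yield 'event: error\ndata: {"error": "upstream failure"}\n\n'

    return StreamingResponse(generate(), media_type="text/event-stream")

async def single_call(query: str) -> str:
    msg = await client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": query}],
    )
    return msg.content[0].text

# Batch concurrent calls; failed calls come back as exception objects, not strings
async def batch_process(queries: list[str]) -> list[str | BaseException]:
    tasks = [single_call(q) for q in queries]
    return await asyncio.gather(*tasks, return_exceptions=True)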
async/await · SSE streaming · asyncio.gather
Q2
Write a retry decorator with exponential backoff and jitter for LLM API calls.
Python · async-compatible retry with backoff
import asyncio, functools, random, logging
from typing import Type

def retry(
    max_tries: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[Type[Exception], ...] = (Exception,)
):
    def decorator(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return await func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_tries - 1: raise
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                    logging.warning(f"Attempt {attempt+1} failed: {e}. Retry in {delay:.1f}s")
                    await asyncio.sleep(delay)
        return async_wrapper
    return decorator

# Usage
@retry(max_tries=3, base_delay=1.0)
async def call_llm(prompt: str) -> str: ...
Q3
Implement a simple semantic cache using Redis and cosine similarity.
Python · semantic cache pattern
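import hashlib, json
import numpy as np, redis
from openai import OpenAI

r = redis.Redis(host="localhost", port=6379)
oai = OpenAI()
THRESHOLD = 0.92      # tune on labelled query pairs before trusting it
TTL_SECONDS = 3600

def embed(text: str) -> np.ndarray:
    resp = oai.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_sim(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_llm(query: str, llm_fn) -> str:
    q_emb = embed(query)
    # Linear scan is O(N) per lookup: fine for a whiteboard answer,
    # but a production cache needs a vector index instead
    for key in r.scan_iter("cache:*"):
        raw = r.get(key)
        if raw is None:               # key expired between SCAN and GET
            continue
        entry = json.loads(raw)
        if cosine_sim(q_emb, entry["emb"]) >= THRESHOLD:
            return entry["response"]   # cache hit
    # Cache miss: call the LLM and store the result
    response = llm_fn(query)
    # sha256 gives a key that is stable across processes; built-in hash() is salted per process
    key = "cache:" + hashlib.sha256(query.encode()).hexdigest()
    r.setex(key, TTL_SECONDS, json.dumps({"emb": q_emb.tolist(), "response": response}))
    return response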
Q4
Python generators vs async generators — when do you use each in GenAI systems?
Pattern | When to use | Example in GenAI
Sync generator (def + yield) | CPU-bound iteration, large dataset processing | Streaming chunked document ingestion from S3
Async generator (async def + yield) | Awaiting I/O in each iteration | Streaming LLM tokens to FastAPI response
asyncio.Queue | Multiple producers/consumers | Agent observation queue (multiple tools running)
Python · async generator for token streaming
async def token_stream(prompt: str):
    """Async generator: yields text chunks as they arrive."""
    # `client` is the anthropic.AsyncAnthropic() instance created in Q1
    async with client.messages.stream(
        model="claude-sonnet-4-6", max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        # text_stream is an async iterator attribute, so it is not called
        async for token in stream.text_stream:
            yield token  # caller sees each chunk immediately

# Consumer
async def main():
    async for token in token_stream("Explain RAG"):
        print(token, end="", flush=True)
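The asyncio.Queue row in the table above can be sketched with the standard library alone. This is an illustrative sketch, not a real agent runtime: the tool names and delays are made up, and asyncio.sleep stands in for real tool calls.

```python
import asyncio

async def run_tool(name: str, delay: float, queue: asyncio.Queue) -> None:
    """Producer: a hypothetical tool that reports its observation when done."""
    await asyncio.sleep(delay)          # stands in for a real tool call
    await queue.put((name, f"{name} finished"))

async def collect_observations(tools: dict[str, float]) -> list[tuple[str, str]]:
    """Consumer: start every tool, then read observations as they arrive."""
    queue: asyncio.Queue = asyncio.Queue()
    producers = [
        asyncio.create_task(run_tool(name, delay, queue))
        for name, delay in tools.items()
    ]
    observations = []
    for _ in producers:                 # exactly one observation per tool
        observations.append(await queue.get())
    await asyncio.gather(*producers)    # surface any producer exception
    return observations
```

Calling asyncio.run(collect_observations({"search": 0.2, "calculator": 0.1})) returns one observation per tool in completion order, which is the main difference from asyncio.gather, where results come back in submission order.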
Behavioural — STAR Stories
Pre-built answers from your real experience at EPAM, PwC, Cognizant, TCS
Always Asked 5 questions
STAR framework: Situation (set context briefly) → Task (your specific responsibility) → Action (what YOU did, use "I" not "we") → Result (quantified outcome). Always end with what you learned or how you'd scale it.
B1
Tell me about a time you significantly improved system performance.
Situation

At PwC, a US healthcare client's multi-agent clinical query system had P95 latency of 3.8 seconds — too slow for real-time clinical use. The system was seeing 200+ QPS during peak hospital hours.

Task

I owned the end-to-end latency reduction initiative — profiling, identifying bottlenecks, proposing and implementing solutions within a 3-week sprint.

Action
  1. Profiled with LangSmith traces and found 60% of latency was redundant sequential sub-agent calls that could run in parallel.
  2. Refactored to async fan-out using asyncio.gather: all sub-agents dispatch concurrently, and only the merge step waits.
  3. Added Redis semantic cache (GPTCache library) with cosine threshold 0.92, which served ~40% of traffic from cache at <10ms.
  4. Routed intent classification to an evaluated low-latency model tier instead of the stronger reasoning tier: 3× cheaper and 5× faster for that subtask.
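If asked to whiteboard the fan-out step, a minimal sketch is enough to show why latency drops from the sum of the sub-agent times to roughly the slowest one. The agent names and latencies below are hypothetical stand-ins, not details of the actual system; asyncio.sleep replaces the real LLM or tool call.

```python
import asyncio
import time

async def sub_agent(name: str, latency: float) -> str:
    """Hypothetical sub-agent; sleep stands in for an LLM or tool call."""
    await asyncio.sleep(latency)
    return f"{name}: ok"

async def run_sequential(agents: list[tuple[str, float]]) -> list[str]:
    # Each call waits for the previous one: total time is the sum of latencies
    return [await sub_agent(name, latency) for name, latency in agents]

async def run_fan_out(agents: list[tuple[str, float]]) -> list[str]:
    # All calls start together: total time is roughly the slowest latency
    return await asyncio.gather(*(sub_agent(n, l) for n, l in agents))

async def compare(agents: list[tuple[str, float]]) -> tuple[float, float]:
    """Return (sequential_seconds, fan_out_seconds) for the same agents."""
    t0 = time.perf_counter()
    await run_sequential(agents)
    sequential = time.perf_counter() - t0
    t0 = time.perf_counter()
    await run_fan_out(agents)
    return sequential, time.perf_counter() - t0
```

Be ready to say that this only helps when the sub-agents are independent; a step that needs another agent's output still has to wait for it.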
Result

70% latency reduction — P95 from 3.8s → 1.1s. The system went live and the client reported a 60% efficiency uplift for clinical staff. I received PwC's Delivery Head Award for this sprint.

B2
Describe a time you owned an ambiguous problem end-to-end with no clear spec.
Situation

At Cognizant's GenAI Lab, I was asked simply to "explore if LLMs can improve our client's enterprise search." No timeline, no team, no defined success metric.

Task

Define the problem, build a prototype, benchmark it, and present a path to production — all independently.

Action
  1. Interviewed 3 internal stakeholders to learn the actual pain: keyword search missed semantic intent ("show me Q3 revenue analysis" found nothing).
  2. Ran a 2-week spike: built a Pinecone-backed semantic search with fine-tuned sentence-transformers embeddings. Collected 200 test queries with relevance labels.
  3. Benchmarked Recall@5 vs existing keyword search: 35% improvement. Presented the spike results with a 3-phase production roadmap.
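Expect a follow-up on how Recall@5 was computed. A minimal sketch of the metric, assuming each test query has a ranked list of retrieved document IDs and a labelled set of relevant IDs; the function names and data shapes are hypothetical, not the original benchmark code.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0                      # no labels: nothing to recall
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average Recall@k over (retrieved, relevant) pairs, one pair per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Run the same labelled queries through both systems and compare the two means. Be explicit about whether an improvement figure is relative or absolute, since interviewers often probe that distinction.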
Result

Prototype became the foundation for a production semantic search shipped to the client — 35% accuracy improvement. Received Cognizant's Rising Star Award.

B3
Tell me about a technical decision you made that you'd change in hindsight.
Situation

Early in a RAG project at PwC, I used fixed-size 512-token chunks with no overlap because it was the fastest to implement. The system went to QA.

What Happened & What I'd Change

QA caught that multi-page financial tables were being split across chunks — the retriever would get half a table, giving wrong answers. I spent 2 days re-chunking with a hierarchical splitter and adding 50-token overlap. If I'd spent 4 hours upfront analysing the document corpus structure, I would have picked the right chunking strategy immediately and avoided the rework.

What I do now: Always spend the first day profiling the document corpus (distribution of doc types, average length, presence of tables/headers) before writing a single line of chunking code.
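If the interviewer asks what overlap means mechanically, a sliding-window sketch is enough. It assumes the text is already tokenised into a list; a real pipeline would use the embedding model's tokenizer and a structure-aware splitter for tables, which this sketch does not attempt. The name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Fixed-size windows where each chunk repeats the last `overlap` tokens of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                       # this window already reached the end
    return chunks
```

Overlap only softens boundary effects. A table longer than one chunk is still split, which is why the fix in this story also needed a hierarchical splitter.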

Why This Answer Works

Shows self-awareness without being self-deprecating. The mistake was reasonable and the lesson is concrete and technical — not vague.

B4
How do you work effectively in a fully remote, distributed team across time zones?
Situation

At EPAM, my team spans India (IST), Poland (CEST, 3.5 hrs behind IST), and the US East Coast (EST, 10.5 hrs behind IST). We have critical weekly demos with a US Finance client.

My System
  1. Async-first documentation: Every design decision gets a Confluence ADR (Architecture Decision Record) written the same day. PRs have detailed descriptions so teammates in other zones can review without a live sync call.
  2. Slack status tagging: I tag every update with [DONE], [BLOCKED: needs X], or [DECISION NEEDED: by EOD]. This tells teammates exactly what action, if any, they need to take.
  3. Flexible overlap windows: 2 days/week I shift my working hours to start at 6:30 PM IST (8:00 AM EST) to get 2 hours of overlap with the US East Coast morning for client demo prep.
  4. Loom walkthroughs: For complex architecture changes, I record a 5-min Loom instead of a document. Faster to create, easier to consume across time zones.
Result

Zero missed client demo milestones in 10 months. The Delivery Head Award (2025) cited "consistent, reliable cross-timezone delivery" as a specific reason.

B5
Tell me about yourself. (90-second pitch — memorise this)
Your 90-Second Pitch

"I'm Purnendu Das, a Senior GenAI Developer with 5 years building production AI systems — not demos, but real systems that clients use every day.

My core expertise is three things: first, RAG architectures that work at scale — hybrid retrieval, re-ranking, evaluation pipelines. Second, multi-agent systems using LangGraph for complex, stateful workflows. Third, LLMOps on AWS — deploying with Bedrock, Lambda, and OpenSearch, with proper monitoring and cost control.

At PwC I built a multi-agent platform for a healthcare client that reduced response latency by 70% and delivered a 60% efficiency improvement for clinical staff. At EPAM right now I'm architecting a Hybrid RAG system for a Tier-1 US Finance client, and we've achieved sub-second inference latency. Before that at Cognizant I founded and ran our GenAI research lab, shipping a semantic search product with a 35% accuracy improvement.

I work fully remote and have delivered consistently across India, Poland, and US time zones. I'm looking for a senior role where I can keep solving enterprise-scale AI problems with a high-quality engineering team."

Delivery tip: Practice this out loud until it flows naturally at conversational speed. The structure is: who I am → three core skills → three concrete proof points → what I'm looking for. Never read from notes — interviewers notice.