GenAI Interview Saviour

2026 GenAI Interview Radar

Current platform vocabulary and durable reasoning, checked against official sources in June 2026

June 14, 2026 · 10 current-topic packs

2026 rule: model names, prices, quotas, and regional availability change quickly. In interviews, state the evaluated capability tier and deployment constraints first, then name a currently available model/service after confirming the role’s cloud and region.

26.1

What changed in senior GenAI interviews by 2026?

Expected Now

Agents as production systems: runtime, identity, memory, tools, observability, evals, and recovery.
Context engineering beyond prompt wording.
Multimodal RAG over layout, tables, charts, images, and provenance.
Continuous evaluation and production incident diagnosis.
Security against injection, excessive agency, leakage, and unbounded consumption.

What No Longer Impresses Alone

Framework-name memorization without trade-offs.
A demo chatbot with no evaluation or failure handling.
Claiming a model is “best” without a workload benchmark.
Using an agent where a workflow is sufficient.
Hard-coded old model names and prices.

26.2

AWS 2026 radar: which GenAI services and design choices matter?

Amazon Bedrock: managed foundation-model access, Knowledge Bases, Agents, Guardrails, model evaluation/customization, and multimodal RAG capabilities.
Amazon Bedrock AgentCore: production agent platform concerns such as secure runtime, enterprise tool/data connectivity, identity, tracing, debugging, evaluation, and AgentCore Policy for agent-to-tool controls.
Inference optimization: evaluate prompt caching and intelligent prompt routing where supported; confirm current model, region, and cost behavior.
Inference choice: Bedrock for managed models; SageMaker AI endpoints/HyperPod or EC2/EKS/ECS when deeper model/runtime control is required.
Architecture: apply the AWS Generative AI Lens and security reference architecture, not only service wiring.

Sources: Amazon Bedrock AgentCore · AgentCore Policy · AWS GenAI Lens

26.3

GCP 2026 radar: which GenAI services and design choices matter?

Gemini Enterprise Agent Platform / Vertex AI: build, deploy, govern, and optimize enterprise agents and GenAI workloads.
Vertex AI Agent Engine: managed agent runtime with sessions, Memory Bank, code execution, observability, private networking, CMEK, resource/concurrency controls, and enterprise compliance support.
Google Agent Development Kit: an open-source, model-agnostic framework with current SDKs across Python, TypeScript, Go, Java, and Kotlin; evaluate it alongside A2A for interoperable agents.
RAG choices: RAG Engine, Vector Search, Agent Search, Feature Store vector retrieval, or custom AlloyDB/Cloud SQL patterns depending on scale and control.
Serving: Vertex AI managed endpoints/Model Garden or Cloud Run/GKE for application and custom-runtime control.

Sources: Gemini Enterprise Agent Platform · Google ADK · GCP RAG architectures

26.4

How should I discuss models in 2026 without becoming outdated?

Capability: Reasoning, coding, multimodal, tool use, structured output, context, language.

Operations: TTFT, throughput, availability, rate limits, regional support, observability.

Governance: Data use, retention, residency, safety, audit, vendor risk.

Economics: Input/output tokens, cache/batch discounts, serving utilization, operational cost.

Strong answer: “I maintain a task-specific evaluation and route to the smallest model tier that passes quality, safety, latency, and cost gates. I avoid choosing from benchmark headlines alone.”
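
The routing rule in the strong answer can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the tier names, scores, thresholds, and the `TierResult` and `pick_tier` names are all hypothetical, and real values would come from your own task-specific evaluation suite.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierResult:
    """Evaluation results for one model tier (all values below are hypothetical)."""
    name: str
    relative_cost: float   # cost per request relative to the cheapest tier
    quality: float         # task success rate on the golden set, 0..1
    safety: float          # share of adversarial cases handled safely, 0..1
    p95_latency_s: float   # measured P95 end-to-end latency in seconds

def pick_tier(results, *, min_quality, min_safety, max_p95_s, max_cost):
    """Return the cheapest tier that passes every gate, or None if none does."""
    passing = [
        r for r in results
        if r.quality >= min_quality
        and r.safety >= min_safety
        and r.p95_latency_s <= max_p95_s
        and r.relative_cost <= max_cost
    ]
    return min(passing, key=lambda r: r.relative_cost, default=None)

results = [
    TierResult("small", 1.0, 0.81, 0.99, 0.6),
    TierResult("medium", 4.0, 0.92, 0.99, 1.1),
    TierResult("large", 15.0, 0.95, 0.99, 2.4),
]
choice = pick_tier(results, min_quality=0.90, min_safety=0.98, max_p95_s=2.0, max_cost=10.0)
print(choice.name if choice else "no tier passes; revisit gates or design")
```

In this made-up data the small tier fails the quality gate and the large tier fails latency and cost, so the medium tier is selected. The point to make in an interview is that the gates, not the leaderboard, drive the choice.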

26.5

Managed agent platform vs custom LangGraph/SDK runtime in 2026.

Dimension	Managed Platform	Custom Runtime
Time to production	Faster; runtime/identity/memory/observability integrated	Slower; assemble and operate components
Control	Platform constraints and regional feature availability	Full orchestration/runtime/model control
Portability	Potential cloud/platform coupling	Potentially portable with more ownership
Best fit	Enterprise team prioritizing managed governance and speed	Differentiated orchestration, unusual runtime, or multi-cloud need

26.6

2026 freshness checklist before every interview.

Confirm role cloud, region, and permitted model providers.
Check current official model/service availability and quotas.
Know one current AWS and one current GCP agent/RAG architecture.
Prepare durable trade-offs independent of product names.
Review current security, evaluation, and governance expectations.
Avoid quoting unverified prices, latency, or benchmark rankings.

26.7

What is the current API and model-selection landscape as of June 14, 2026?

Interview-safe answer: use each provider's current unified API and capability discovery, but select models with an evaluation suite rather than a memorized leaderboard. For OpenAI, the Responses API is the primary agentic build path and works with tools, structured outputs, and the Agents SDK; its latest-model guide currently recommends GPT-5.5 for complex reasoning and coding. On AWS and GCP, confirm the model, feature, region, quota, and data-governance combination before committing to an architecture.

Quality gate: Task success, groundedness, tool correctness, safety, and schema validity.

Operational gate: TTFT, throughput, rate limits, context behavior, availability, and observability.

Governance gate: Residency, retention, data use, IAM, audit, and provider risk.

Cost gate: Measure end-to-end cost per successful task, including retries, tools, retrieval, and judges.
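
The cost gate is easiest to defend with arithmetic. The sketch below assumes you already have total spend per configuration (including retries, tool calls, retrieval, and judge calls) and a measured success rate; the function name and every number are hypothetical.

```python
def cost_per_successful_task(total_cost, tasks_attempted, success_rate):
    """End-to-end spend divided by the number of tasks that actually succeeded."""
    successes = tasks_attempted * success_rate
    if successes <= 0:
        raise ValueError("no successful tasks; cost per success is undefined")
    return total_cost / successes

# Hypothetical totals for 1,000 attempted tasks on two configurations.
cheap = cost_per_successful_task(total_cost=40.0, tasks_attempted=1000, success_rate=0.50)
strong = cost_per_successful_task(total_cost=60.0, tasks_attempted=1000, success_rate=0.96)
print(f"cheap config:  {cheap:.4f} per success")
print(f"strong config: {strong:.4f} per success")
```

With these illustrative numbers the configuration that costs more per request is cheaper per successful task, which is why cost per request alone can mislead.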

Sources: OpenAI latest-model guide · OpenAI models · Responses API guide

26.8

What are the latest MCP and A2A interoperability details?

Exact-date answer: the current stable MCP specification is 2025-11-25. A 2026-07-28 release candidate was announced on May 21, 2026, but its final release is scheduled for July 28, 2026, so do not call it the current stable spec yet. The RC adds a stateless core, Extensions, MCP Apps, Tasks, and authorization hardening.

Interface	Purpose	Production concern
Function/tool calling	A model emits a typed request to an application-owned tool	Schema validation, authorization, idempotency
MCP	An application connects models/agents to reusable tools, resources, and prompts	Server trust, OAuth/authz, data boundaries, approvals
A2A	Agents discover and communicate with other agents across systems	Identity, capability trust, task lifecycle, inter-agent security

Sources: MCP stable spec · MCP 2026 RC announcement · A2A guide

26.9

What is the 2026 security baseline for LLM and agentic systems?

Threat-model against the OWASP Top 10 for LLM Applications 2025 and the OWASP Top 10 for Agentic Applications 2026. The agentic list expands the focus to goal hijacking, tool misuse, identity and privilege abuse, memory poisoning, insecure inter-agent communication, cascading failures, and rogue agents.

1 · Assume model output is untrusted: never use it as authorization; validate every argument and output.
2 · Bound agency: least-privilege identities, egress allowlists, step/time/token budgets, idempotency, and human approval for high-impact actions.
3 · Protect state: isolate tenants, provenance-tag memory, expire/delete it, and prevent untrusted content from becoming durable instructions.
4 · Verify continuously: adversarial evals, red-team tests, audit trails, kill switches, and incident exercises.
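
The first two rules above can be illustrated with a small policy check that runs outside the model. This is a sketch under stated assumptions: the tool names, roles, `ALLOWED_TOOLS` registry, and `check_tool_call` function are hypothetical, and a production system would typically use a schema library and a real policy engine.

```python
ALLOWED_TOOLS = {
    # tool name -> (required argument types, roles permitted to call it); all hypothetical
    "get_invoice": ({"invoice_id": str}, {"analyst", "admin"}),
    "refund_invoice": ({"invoice_id": str, "amount_cents": int}, {"admin"}),
}
HIGH_IMPACT = {"refund_invoice"}

def check_tool_call(call, *, caller_roles, approved_by_human=False):
    """Validate a model-proposed tool call; return (ok, reason)."""
    name, args = call.get("name"), call.get("arguments")
    if name not in ALLOWED_TOOLS:
        return False, "unknown tool"
    schema, permitted_roles = ALLOWED_TOOLS[name]
    if not isinstance(args, dict) or set(args) != set(schema):
        return False, "unexpected or missing arguments"
    for key, expected_type in schema.items():
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(args[key], bool) or not isinstance(args[key], expected_type):
            return False, f"bad type for {key}"
    # Authorization comes from the authenticated caller, never from model output.
    if not (caller_roles & permitted_roles):
        return False, "caller not authorized"
    if name in HIGH_IMPACT and not approved_by_human:
        return False, "human approval required"
    return True, "ok"

refund = {"name": "refund_invoice", "arguments": {"invoice_id": "A1", "amount_cents": 500}}
print(check_tool_call(refund, caller_roles={"analyst"}))
print(check_tool_call(refund, caller_roles={"admin"}))
print(check_tool_call(refund, caller_roles={"admin"}, approved_by_human=True))
```

The design choice worth stating aloud: the model proposes, but identity, schema, and approval are decided by code the model cannot influence.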

Sources: OWASP Agentic 2026 · OWASP LLM 2025 · NIST AI RMF

26.10

Which facts are durable, and which must I re-check before an interview?

Freshness class	Examples	Revision rule
Durable foundations	Attention, embeddings, RAG failure modes, distributed-systems trade-offs, least privilege	Learn deeply; explain from first principles
Review monthly	Agent frameworks, cloud capabilities, protocol specs, security guidance	Check official docs and release notes
Verify before interview	Model names, prices, quotas, context limits, regions, benchmark claims	Use the provider's current documentation
Verify with recruiter	Remote eligibility, interview steps, role scope, team stack	Current job posting and recruiter always win

This guide was audited on June 14, 2026. Treat every volatile product fact as an example of a decision, not as permanent truth.

Interview Saviour Operating System

Readiness gate, answer frameworks, decision matrices, current GenAI radar, and rescue scripts

Start Here · 10 survival packs

S2

How should I structure answers at 30 seconds, 2 minutes, and 10 minutes?

Answer Type	Structure	What Good Sounds Like
30 sec	Definition → distinction → when to use	Answer directly, distinguish the nearest concept, then give one decision rule.
2 min	Problem → options → choice → trade-off → metric	Show judgment, one rejected option, and evidence.
10 min	Requirements → design → deep dive → failure → scale → security → eval	Drive the conversation and invite the interviewer to choose a deep dive.
Behavioral	Situation → stakes → your action → result → lesson	Keep “we” for context and “I” for your contribution.
Debugging	Scope → hypotheses → instrumentation → isolate → mitigate → prevent	Do not jump directly to a favorite fix.

S3

Universal checklist for any GenAI system-design interview.

1 · Clarify: Users, job, risk, latency/QPS, freshness, modalities, tenancy, compliance.

2 · Measure: Quality definition, golden set, business metric, SLO, cost, human review.

3 · Design: Deterministic baseline; model, retrieval, tools, data flow, APIs, storage.

4 · Operate: Failures, fallback, observability, security, rollout, feedback, rollback.

Before Drawing

Which error is unacceptable: false answer, missed answer, unsafe action, or slow response?
Does knowledge change, or does behavior need adaptation?
Is output advisory or does it trigger actions?
What does success mean offline and online?
Which data may enter prompts, logs, vendors, and eval sets?

Close Every Design With

Top three failures and mitigations.
Quality, latency, cost, and safety release gates.
Progressive rollout and rollback plan.
One limitation and next iteration.
How it changes at 10× traffic or corpus.

S4

The decision matrix: RAG, fine-tuning, agents, models, and infrastructure.

Decision	Choose A When	Choose B When	Senior Caveat
RAG vs fine-tune	Facts change; citations/access control matter	Behavior, style, format, or task skill must change	They are complementary
Workflow vs agent	Path is known; auditability matters	Open-ended planning adds measured value	Start deterministic
Hosted vs open model	Fast iteration and managed capability	Control, privacy, specialization, scale economics	Include serving/ops cost
Large vs small model	Hard reasoning and broad tasks	Routing, extraction, classification, high volume	Use eval-driven routing
Managed vs custom RAG	Simple requirements and small team	Custom parsing, ranking, security, evaluation	Speed can outweigh flexibility
Lambda vs containers	Bursty short-lived orchestration	Long-running, streaming, custom runtime, GPU	Know execution limits

S5

Current senior GenAI interview radar: context, agents, MCP, multimodal, inference, security.

Agent & Context Engineering

Context engineering: optimize instructions, tools, memory, evidence, and state inside a limited token budget.
Agent harness: checkpointing, progress, budgets, permissions, isolation, and recovery across long tasks.
Tool ergonomics: clear schemas, bounded outputs, actionable errors, evaluation-driven improvement.
Agent evals: outcome, path efficiency, tools, safety, and recovery across nondeterministic runs.

Platform & Models

MCP: hosts, clients, servers, tools/resources/prompts, transports, authorization, confused-deputy risk.
Multimodal: OCR/layout/table/chart understanding, provenance, modality-specific evaluation.
Inference: prefill/decode, KV cache, continuous batching, quantization, routing, TTFT.
Security: constrain permissions, tools, outputs, consumption, and autonomy.

Sources: Context engineering · Agent evals · MCP architecture · OWASP GenAI Top 10

S6

How do I design evaluations that interviewers trust?

Task success: Did it complete the user’s job?

Quality: Correctness, relevance, faithfulness.

Process: Retrieval, tools, efficiency, recovery.

Safety: Injection, leakage, unauthorized action.

Operations: Latency, cost, errors, stability.

1 · Representative tasks: normal, edge, adversarial, multilingual, long-context, and no-answer cases.
2 · Layered graders: deterministic checks, calibrated human labels, then scalable LLM judges.
3 · Track slices: document type, user, query complexity, tool, model, and route.
4 · Gate releases: block critical safety, quality, latency, or cost regressions.
5 · Close loop: production traces and corrections become new eval cases.
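
The "gate releases" step can be made concrete with a small comparison of candidate metrics against a baseline. This is an illustrative sketch only: the metric names, tolerances, and `release_decision` function are hypothetical, and real gates would also account for sample size and statistical noise.

```python
def release_decision(baseline, candidate, *, max_regression):
    """Compare candidate metrics to baseline; return (ship, list of blocking reasons).

    Higher is better for quality and safety; lower is better for latency and cost.
    """
    higher_is_better = {"quality": True, "safety": True,
                        "p95_latency_s": False, "cost_per_task": False}
    blockers = []
    for metric, allowed in max_regression.items():
        delta = candidate[metric] - baseline[metric]
        regression = -delta if higher_is_better[metric] else delta
        if regression > allowed:
            blockers.append(f"{metric} regressed by {regression:.3f} (allowed {allowed:.3f})")
    return (not blockers), blockers

baseline = {"quality": 0.90, "safety": 0.99, "p95_latency_s": 1.20, "cost_per_task": 0.05}
candidate = {"quality": 0.93, "safety": 0.97, "p95_latency_s": 1.25, "cost_per_task": 0.05}
tolerances = {"quality": 0.01, "safety": 0.0, "p95_latency_s": 0.2, "cost_per_task": 0.01}

ship, blockers = release_decision(baseline, candidate, max_regression=tolerances)
print(ship, blockers)
```

In this made-up example the candidate improves quality but is still blocked, because the safety gate allows no regression at all.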

S7

How do I threat-model an agent or RAG system?

Threat	Failure	Controls
Prompt injection	Untrusted content changes behavior	Trust zones, policy layer, validation; treat retrieved/tool content as data
Excessive agency	Too much autonomy/permission	Least privilege, scoped tools, budgets, approval, reversible actions
Sensitive disclosure	PII/secrets leak via prompts/output/logs	Minimize data, DLP, encryption, retention, isolation
Improper output handling	Model output becomes executable input	Schema validation, escaping, parameterization, sandboxing
Unbounded consumption	Loops cause denial of wallet/service	Token/tool/time budgets, quotas, circuit breakers
MCP authorization	Token theft/confused deputy	Explicit consent, scoped short-lived tokens, trusted redirects

Trap: a system prompt does not solve injection. Assume the model can be manipulated and constrain the surrounding system.
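
One way to show "constrain the surrounding system" is the improper-output-handling row: model output is passed to a database as a bound parameter, never spliced into SQL. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table, data, and `search_filings` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filings (ticker TEXT, tenant TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO filings VALUES (?, ?, ?)",
    [("ACME", "t1", "ACME 10-K"), ("BOLT", "t2", "BOLT 10-K")],
)

def search_filings(model_supplied_ticker, *, tenant):
    """Look up filings using a model-extracted ticker as data, never as SQL text."""
    if not isinstance(model_supplied_ticker, str) or len(model_supplied_ticker) > 32:
        raise ValueError("ticker failed validation")
    rows = conn.execute(
        # tenant comes from the authenticated session, not from the model
        "SELECT title FROM filings WHERE ticker = ? AND tenant = ?",
        (model_supplied_ticker, tenant),
    ).fetchall()
    return [title for (title,) in rows]

print(search_filings("ACME", tenant="t1"))
print(search_filings("x' OR '1'='1", tenant="t1"))   # treated as a literal string
print(search_filings("BOLT", tenant="t1"))           # other tenant's row is filtered out
```

The injected string matches nothing because it is compared as a value, and the tenant filter is applied by the application regardless of what the model asked for.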

S8

Inference and cost deep-dive: answer beyond “use caching.”

Diagnose

Network/queue → retrieval/rerank → prompt → prefill/TTFT → decode → post-process.
Measure P50/P95/P99 by model, route, context, output, concurrency, warm/cold.
Long input raises prefill; long output raises decode. Optimize the correct stage.
Separate perceived streaming latency from total completion time.
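
To back the "measure P50/P95/P99" and "perceived versus total latency" points, here is a minimal sketch that summarizes time-to-first-token and total completion time separately. It assumes a nearest-rank percentile definition and hypothetical millisecond samples; monitoring tools may use a different interpolation method.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

def summarize(samples, field):
    """Return P50/P95/P99 for one latency field across request samples."""
    values = [s[field] for s in samples]
    return {f"p{p}": percentile(values, p) for p in (50, 95, 99)}

# Hypothetical per-request measurements in milliseconds.
samples = [{"ttft_ms": 200 + 10 * i, "total_ms": 2000 + 100 * i} for i in range(20)]

ttft = summarize(samples, "ttft_ms")
total = summarize(samples, "total_ms")
print("TTFT ", ttft)
print("total", total)
```

Reporting both distributions lets you say which stage an optimization moved: a shorter prompt should shift TTFT, while a bounded output should shift total time.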

Optimize Safely

Model routing, context compression, retrieval precision, parallel independent work.
Semantic/prefix cache with freshness and tenant-safe keys.
Continuous batching, KV cache, quantization, speculative decoding.
Bound outputs, tools, retries, loops; verify quality after every change.
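
"Tenant-safe keys" with freshness can be sketched for the exact-match (prefix-style) case. The field names and `cache_key` function are assumptions for illustration; a semantic cache would additionally need a similarity threshold and the same tenant and version scoping.

```python
import hashlib
import json

def cache_key(*, tenant_id, model, prompt_version, index_version, normalized_query):
    """Build an exact-match cache key that cannot collide across tenants or index versions."""
    payload = json.dumps(
        {
            "tenant": tenant_id,
            "model": model,
            "prompt_version": prompt_version,
            "index_version": index_version,  # bump on refresh so stale answers miss
            "query": normalized_query,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

base = dict(model="tier-medium", prompt_version="v3", index_version="2026-06-01",
            normalized_query="q2 revenue guidance")
key_t1 = cache_key(tenant_id="t1", **base)
key_t2 = cache_key(tenant_id="t2", **base)
key_refreshed = cache_key(tenant_id="t1", **{**base, "index_version": "2026-06-08"})
print(key_t1 != key_t2, key_t1 != key_refreshed)
```

Including the tenant and index version in the key means a cross-tenant hit or a stale answer after a refresh becomes a cache miss rather than a leak.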

S9

What high-signal questions should I ask the interviewer?

Engineering Maturity

What are the hardest production failure modes?
How do you evaluate changes and what blocks release?
Where do you intentionally avoid LLMs or agents?
How are quality, latency, cost, and safety trade-offs decided?
What does observability cover across retrieval, models, and tools?

Role & Team

What does excellent impact look like in 90 days?
Which decisions would this role own?
What technical disagreement is the team working through?
How are incidents and async handoffs handled?
Why have strong engineers struggled in this role?

S10

Interview rescue scripts: what do I say when I am stuck?

Situation	Useful Script
I do not know	“I have not implemented that directly. My understanding is X. I would verify Y; the closest system I built was Z.”
Ambiguous	“May I clarify failure cost, traffic, freshness, and whether output triggers actions?”
Wrong assumption	“That conflicts with the new constraint. I would revise X; the trade-off becomes Y.”
Confidential	“I cannot share identifiers or exact volumes, but I can explain architecture, measurement, relative scale, and my contribution.”
Coding stuck	“I will state a brute-force baseline, test it on a small example, then optimize the bottleneck.”
Metric challenged	“Fair point. The metric measured X, not Y. The limitation is Z, and I would measure A next.”

Timed Mocks & Production War Room

Answer aloud before opening each card; score reasoning, not memorized keywords

Active Recall · 8 pressure drills

Mock rule: set a timer, speak before opening the card, then score one point each for requirements, trade-offs, metrics, failures, and security. Strong answers score at least 4/5.

M1

45 min · Full senior GenAI engineer mock loop.

8 min · Resume: career walkthrough; defend sub-second latency and one percentage claim.
8 min · Depth: hybrid retrieval, reranking, evaluation, and when RAG is wrong.
15 min · Design: secure multi-tenant financial research assistant with citations at 500 QPS.
8 min · Incident: P95 doubles and faithfulness drops after an index refresh.
6 min · Behavioral: disagreement, failure, remote collaboration, why this role.

Requirements: Clarified assumptions

Judgment: Compared options

Evidence: Used metrics

Reliability: Handled failure

Communication: Clear and concise

M2

10 min · RAG quality drops after adding 500K documents. Diagnose it.

1 · Scope: slice by source, parser, document/query type, index version; verify eval set.
2 · Separate stages: inspect Recall@K/MRR and retrieved chunks before blaming generation.
3 · Hypotheses: parsing, duplicates/noise, metadata, embedding mismatch, top-k dilution, stale index.
4 · Mitigate: roll back index alias, isolate bad source, restore configuration.
5 · Prevent: ingestion gates, canary index, regression suite, versioned promotion.
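
Step 2 depends on being able to compute Recall@K and MRR from labeled queries. The following is a minimal sketch using the standard definitions; the document IDs and labels are hypothetical, and a real diagnosis would compute these per slice (source, parser, index version).

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of labeled relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("query has no labeled relevant documents")
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(set(relevant_ids))

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Hypothetical golden queries: (retrieved order, labeled relevant IDs).
runs = [
    (["d3", "d1", "d9"], ["d1", "d2"]),
    (["d7", "d8", "d5"], ["d5"]),
]
print(recall_at_k(*runs[0], k=3))
print(mean_reciprocal_rank(runs))
```

Running the same queries against the old and new index versions shows whether the regression is in retrieval before any time is spent on generation.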

M3

10 min · Agent loops, repeats tool calls, and creates duplicate actions.

Immediate

Restrict destructive tool; stop affected runs.
Use idempotency keys and deduplication.
Inspect state transitions, tool results, retries, model output.
Replay safely with the same state in a sandbox.

Prevent

Max steps/time/token/tool budgets and repeated-state detection.
Explicit transitions and completion criteria.
Retry only transient failures; give tools actionable errors.
Human approval and regression evals.
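
Idempotency keys, step budgets, and repeated-call detection can be combined in one small wrapper around tool execution. This is an in-memory sketch with hypothetical names (`ToolGuard`, `create_payment`); a production system would persist the keys and pass the idempotency key to the downstream service as well.

```python
import hashlib
import json

class ToolGuard:
    """Wraps tool execution with a step budget, loop detection, and idempotent replay."""

    def __init__(self, max_steps, max_repeats=2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = {}      # call fingerprint -> times requested
        self.results = {}   # call fingerprint -> stored result

    @staticmethod
    def fingerprint(tool_name, args):
        """Stable key for a (tool, arguments) pair."""
        blob = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def call(self, tool_name, args, tool_fn):
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("step budget exhausted; stop the run and escalate")
        key = self.fingerprint(tool_name, args)
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_repeats:
            raise RuntimeError("repeated identical tool call; likely loop")
        if key in self.results:
            return self.results[key]   # replay the stored result instead of acting twice
        result = tool_fn(**args)
        self.results[key] = result
        return result

payments = []

def create_payment(order_id, amount_cents):   # hypothetical side-effecting tool
    payments.append((order_id, amount_cents))
    return f"paid:{order_id}"

guard = ToolGuard(max_steps=10)
args = {"order_id": "o-1", "amount_cents": 500}
first = guard.call("create_payment", args, create_payment)
second = guard.call("create_payment", args, create_payment)
print(first, second, len(payments))
```

The second call returns the stored result without charging again, and a third identical call would raise, which turns a silent loop into a visible failure.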

M4

10 min · Offline eval improved, but users say the model is worse.

1 · Check eval representativeness and recent production failures.
2 · Slice by user, language, task, context length, tool, and route.
3 · Validate graders for bias, leakage, weak rubrics, and poor agreement.
4 · Compare paired traces and UX/latency changes, not only answer scores.
5 · Roll back/reduce traffic, add failures to evals, recalibrate gates.

M5

10 min · LLM cost doubled and P95 breached the SLO.

Instrument

Cost/request by model, route, tenant, tokens, tools, retries.
Latency waterfall: queue, retrieval, rerank, prefill, decode, tools.
Compare deployment, traffic mix, prompt/context, provider changes.

Act Safely

Stop runaway loops/retries and enforce budgets.
Route simple work to small models; compress irrelevant context.
Parallelize independent work; cache with safe freshness/tenant keys.
Canary and prove quality remains acceptable.

M6

12 min · Design multimodal financial-document intelligence.

Ingest: Layout-aware parsing, OCR confidence, tables, charts, provenance.

Represent: Text/table/image representations linked to page hierarchy.

Retrieve: Query routing, hybrid retrieval, security filters, reranking.

Evaluate: Table QA, chart reasoning, citations, OCR slices, numeric correctness.

Critical risks: wrong units, split tables, OCR digit errors, stale documents, inaccessible citations, confident numeric hallucinations.

M7

10 min · Design a secure enterprise MCP platform with hundreds of tools.

Architecture

Host controls UX, consent, policy, model, and clients.
Focused servers expose bounded tools/resources with schemas.
Discover/load tools on demand; avoid flooding context.
Registry, health, versioning, audit, evaluation, revocation.

Security

Per-user scoped auth; no token passthrough.
Trusted servers, explicit consent, short-lived tokens.
Sandbox execution; minimize data entering model context.
Budgets, validation, approval, and kill switch.

Source: MCP security practices

M8

12 min · Coding: implement a resilient bounded LLM request pipeline.

Clarify: async concurrency, timeout, retryable errors, backoff/jitter, rate limits, idempotency, validation, budget, cancellation, metrics, fallback.

Interview skeleton

async def bounded_llm_call(request, *, timeout_s, max_attempts, budget):
    # validate + reserve budget; acquire concurrency permit
    # call with timeout/idempotency; retry transient errors only
    # validate output; emit trace/metrics; release resources
    ...

State which errors you will not retry and how you prevent retry storms.
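
One possible way to fill in the skeleton is sketched below using only `asyncio` from the standard library. It is an illustration under stated assumptions, not a definitive implementation: the `TransientError`/`PermanentError` split, the `Budget` class, and the injected `call_model` and `validate` callables are hypothetical stand-ins for a real provider client and schema check.

```python
import asyncio
import random

class TransientError(Exception):
    """Retryable failure such as a throttle or 5xx (hypothetical classification)."""

class PermanentError(Exception):
    """Non-retryable failure such as invalid input, bad output, or exhausted budget."""

class Budget:
    """Simple attempt budget shared across calls to cap total spend."""
    def __init__(self, max_calls):
        self.remaining = max_calls

    def reserve(self):
        if self.remaining <= 0:
            raise PermanentError("budget exhausted")
        self.remaining -= 1

async def bounded_llm_call(request, *, call_model, validate, semaphore, budget,
                           timeout_s=10.0, max_attempts=3, base_delay_s=0.2):
    """Call `call_model(request)` with bounded concurrency, a timeout, and retries."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        budget.reserve()                      # every attempt costs budget, retries included
        try:
            async with semaphore:             # permit is held only while the call is in flight
                raw = await asyncio.wait_for(call_model(request), timeout=timeout_s)
            return validate(raw)              # PermanentError from validate is not retried
        except (TransientError, asyncio.TimeoutError) as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            # exponential backoff with full jitter to avoid synchronized retry storms
            await asyncio.sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise TransientError(f"gave up after {max_attempts} attempts") from last_error

async def demo():
    attempts = {"n": 0}

    async def flaky_model(request):           # stand-in for a real provider client
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise TransientError("throttled")
        return {"text": f"echo: {request}"}

    def validate(raw):
        if not isinstance(raw, dict) or "text" not in raw:
            raise PermanentError("schema validation failed")
        return raw["text"]

    result = await bounded_llm_call(
        "hello", call_model=flaky_model, validate=validate,
        semaphore=asyncio.Semaphore(4), budget=Budget(max_calls=10), base_delay_s=0.01,
    )
    return result, attempts["n"]

print(asyncio.run(demo()))
```

Only transient errors and timeouts are retried; validation and budget failures surface immediately. Charging the budget per attempt, together with jittered backoff and the concurrency permit, is what keeps retries from amplifying an outage. Metrics, tracing, and fallback are omitted here for brevity.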

Resume Cross-Examination

Defend every role, metric, technology, award, and transition without bluffing

Highest Priority · 12 drill packs · 70+ cross-questions

Metric defense rule: For every percentage, prepare the exact baseline, final value, dataset/sample size, measurement window, your personal contribution, and one limitation. If any element is confidential, say what you can disclose and explain the method.

R1

Walk me through your resume, career transitions, and “why now?”

Present: Senior GenAI Engineer at EPAM, building production Hybrid RAG and LLM systems for finance.

Pattern: Backend/data foundation → applied ML → enterprise GenAI ownership.

Proof: 5.8+ years; production impact across finance, healthcare, and retail; repeated awards.

Next: A senior remote role with deeper ownership of scalable AI product architecture.

Likely Cross-Questions

Why did you move from full-stack to data engineering, then data science, then GenAI?
Why are you considering a move after joining EPAM in August 2025?
What makes you “senior” beyond years of experience?
Which role changed your technical judgment most?
Why remote, and what evidence shows you succeed remotely?
Which parts of your resume are hands-on versus team-level outcomes?

Strong Answer Anchors

Make the progression intentional: software → data platforms → models → LLM products.
For a current-job move, stay positive and name the specific scope you seek.
Define seniority through ambiguity handling, trade-offs, reliability, mentoring, and hiring.
Keep the walkthrough to 90 seconds; let the interviewer choose the deep dive.

R2

Defend the EPAM Hybrid RAG and multimodal document-intelligence claim.

Problem: Name document types, users, failure cost, corpus scale, and why search was insufficient.

Design: Ingestion → parsing/OCR → chunking → embeddings/BM25 → fusion/rerank → generation → citations.

Trade-off: Explain why Bedrock, OpenSearch, and Lambda fit, plus where they do not.

Evidence: Bring retrieval, generation, latency, reliability, and cost metrics separately.

Architecture Pressure Test

What made it hybrid? How did you fuse sparse and dense results?
What was multimodal: scanned PDFs, tables, charts, images, or all four?
How did you parse tables without losing row/column relationships?
Why OpenSearch instead of Pinecone, pgvector, or Bedrock Knowledge Bases?
Why Lambda; what happens with cold starts, long jobs, or GPU inference?
How did you enforce tenant isolation and finance-domain access control?
How were citations, abstention, and hallucination controls implemented?
What would break first at 10x corpus size or QPS?

Answer Checklist

Draw the full request path in under two minutes.
Separate decisions you owned from platform constraints you inherited.
Give one rejected design and the reason it lost.
Name the golden dataset and the retrieval metric used to tune hybrid weights.
Explain failure modes: parser errors, stale index, no relevant context, model timeout.
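
For the "how did you fuse sparse and dense results?" question, Reciprocal Rank Fusion (RRF) is one common option worth being able to write from memory. The sketch below is a generic illustration and an assumption about what a hybrid system might use, not a description of any specific production system; weighted score fusion is another valid answer. The document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # sort by fused score, breaking ties by document ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25 = ["d1", "d2", "d3"]    # hypothetical sparse (keyword) ranking
dense = ["d3", "d1", "d4"]   # hypothetical dense (embedding) ranking
fused = reciprocal_rank_fusion([bm25, dense])
print(fused)
```

RRF uses only ranks, so it avoids normalizing BM25 and cosine scores onto one scale. The trade-off is that it ignores score magnitude, which is why the fused list usually still goes through a reranker.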

R3

You claim sub-second latency. What exactly was sub-second, and how was it measured?

Do not say only “sub-second.” State whether it was time-to-first-token, retrieval latency, model latency, or end-to-end latency; then give percentile, workload, payload size, and measurement window.

Cross-Questions

Was sub-second measured at P50, P95, or P99?
Did streaming make perceived latency lower than total latency?
What were the before/after values and number of sampled requests?
Which change contributed most: caching, parallelism, smaller model, connection reuse, or retrieval tuning?
How did you handle Lambda cold starts and Bedrock throttling?
What accuracy or cost trade-off did latency optimization introduce?

Measurement Template

Metric: TTFT / retrieval / end-to-end.
Distribution: P50, P95, P99, not average alone.
Conditions: concurrency, prompt/context tokens, model, warm/cold.
Instrumentation: trace spans around every stage.
Guardrail: verify quality and error rate did not regress.

R4

Defend your PEFT and LoRA work on LLaMA and Gemma.

Must-Answer Technical Questions

What task required fine-tuning rather than RAG or prompting?
Which LLaMA/Gemma sizes and base versus instruct variants?
What were the dataset size, schema, train/validation split, and cleaning rules?
Which target modules, rank r, alpha, dropout, learning rate, and epochs?
Did you use QLoRA? Explain NF4, double quantization, and compute dtype.
How did you detect overfitting and catastrophic forgetting?
What baseline and evaluation metric proved improvement?
How were adapters merged, versioned, and served?

Decision Defense

Fine-tune for behavior/style/task adaptation; use RAG for changing factual knowledge.
Explain parameter and memory savings, not only “cheaper.”
Discuss data licensing, PII removal, memorization, and rollback.
Compare LoRA with full fine-tuning, prompt tuning, and continued pretraining.
Bring one failed experiment and what it taught you.
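
"Explain parameter and memory savings" is easier with the arithmetic in hand. For one adapted weight matrix of shape d_out × d_in, LoRA trains two low-rank factors with r × (d_in + d_out) parameters in total. The hidden size and rank below are hypothetical; check the actual model configuration and which modules were targeted.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted weight matrix."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Parameter count of the original dense weight matrix."""
    return d_in * d_out

d = 4096   # hypothetical hidden size
r = 16     # hypothetical LoRA rank
lora = lora_trainable_params(d, d, r)
full = full_params(d, d)
fraction = lora / full
print(lora, full, f"{fraction:.4%}")
```

For this single square matrix the adapter trains 131,072 parameters against 16,777,216, which is 1/128 of the original. Multiply by the number of targeted modules and layers to get the total, and note that optimizer state shrinks with the trainable count while the frozen base weights still occupy memory.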

R5

How did the PwC multi-agent healthcare platform improve processing efficiency by 60%?

Baseline: Define the old process, human/system effort, throughput, and bottleneck.

Mechanism: Explain agents, tools, orchestration, state, and parallelizable work.

Metric: Say how “processing efficiency” was calculated and over what period.

Risk: Cover PHI/PII, auditability, clinical review, and failure escalation.

Cross-Questions

Why multiple agents instead of one workflow with tools?
What did each agent own, and how did they communicate?
How did you prevent duplicate actions and non-deterministic loops?
What state was persisted, and how did retries/checkpointing work?
How were tools authorized and outputs validated?
Where was human-in-the-loop mandatory?
Why microservices, and what operational cost did that add?
How did you evaluate agent-level versus end-to-end success?

Senior-Level Trade-Offs

Agents increase flexibility but reduce predictability and observability.
Use deterministic nodes for rules, calculations, and compliance gates.
Bound execution with budgets, timeouts, allowed transitions, and circuit breakers.
Expose the exact portion you personally architected or implemented.

R6

Defend PwC’s 40% retrieval accuracy, 70% assessment-time, and 68% execution-time improvements.

Resume Claim	Interviewer Will Ask	Your Required Evidence	Common Trap
40% retrieval accuracy	Accuracy means Recall@K, MRR, NDCG, or answer correctness?	Golden queries, labels, baseline, final metric, confidence/error analysis	Calling semantic similarity “accuracy”
70% assessment time	Whose time and from how long to how long?	Workflow timing, sample count, review rate, quality guardrail	Ignoring time shifted to reviewers
68% execution time	Which stages were parallelized and why safe?	Trace waterfall, dependency DAG, before/after percentiles	Comparing unmatched workloads

Follow-Ups

What is chain parallelism, and how does it differ from batching?
How did you handle partial failure when parallel calls diverged?
How did you tune hybrid retrieval and reranking?
What did the risk-intelligence agents produce, and how was correctness verified?
What changed after production feedback?

Answer Pattern

Give the formula before the percentage.
State absolute values as well as relative improvement.
Name the quality metric held constant during speed optimization.
Acknowledge limitations and what you would measure next.
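
"Give the formula before the percentage" can be rehearsed with a helper like the one below. Every number here is invented purely to show the calculation and is not taken from any resume claim; substitute your own measured baseline and final values.

```python
def relative_change(baseline, final):
    """Signed relative change versus baseline, e.g. -0.68 for a 68% reduction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (final - baseline) / baseline

# Hypothetical execution time: 50 s before, 16 s after.
time_change = relative_change(50, 16)

# Hypothetical retrieval hits on 100 golden queries: 50 before, 70 after.
hits_change = relative_change(50, 70)
absolute_points = (70 - 50) / 100

print(f"execution time: {time_change:+.0%} relative (50 s -> 16 s)")
print(f"retrieval hits: {hits_change:+.0%} relative, {absolute_points:+.0%} absolute")
```

The second example shows why the absolute values matter: the same result can be stated as a 40% relative improvement or a 20-point absolute gain, and an interviewer will ask which one you mean.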

R7

Deep-dive the Cognizant semantic search, multi-LLM workflows, metadata linking, and security work.

Semantic Search / RAG

Why Pinecone, which index/metric, and how did namespaces/metadata work?
What did the 35% accuracy improvement measure?
The resume says latency reduced but gives no number; what can you honestly claim?
How did you select embeddings, chunk size, top-k, and reranker?
What was required to move an R&D prototype into production?

Other Cognizant Claims

Why Azure OpenAI versus Gemini Pro for each workflow step?
What is metadata linking, and how was link quality evaluated?
How did Checkmarx, Sonar, and Black Duck differ?
What does “100% security compliance” specifically mean?
How did you handle secrets, prompt injection, PII, and dependency risk?

R8

Defend your TCS data-engineering foundation and the move into data science.

Data Engineering Questions

Draw one GCP ETL pipeline end to end: sources, orchestration, transforms, storage, consumers.
Airflow versus Cloud Functions: what belonged where?
How did you reduce cost by 40%: slots, partitioning, scheduling, storage, or serverless changes?
How was manual effort reduced by 80%, and what remained manual?
How did you ensure idempotency, backfills, schema evolution, and data quality?
Explain BigQuery partitioning, clustering, and query-cost control.

Transition Questions

Why leave data engineering for data science?
How does your data-engineering background make you better at GenAI?
Which parts of MLOps/LLMOps are direct extensions of data-platform work?
What ML knowledge did you have to build deliberately?
Would you still be comfortable owning a production data pipeline today?

R9

What did you personally build in your Django role and Python internship?

Oneness Tech Solutions

What prediction did the analytics app make, using which model and features?
How did Django, REST APIs, database, model inference, and UI connect?
What changed to improve organic traffic by 65%, and which analytics source measured it?
What production incident or performance bottleneck did you solve?
What does “end-to-end product lifecycle” mean in concrete deliverables?

CodeSpeedy Internship

How did the weather app handle API errors, caching, and rate limits?
How was certificate generation automated?
Explain the multithreaded socket system and race-condition risks.
Threads versus processes versus asyncio in Python?
How did technical writing improve your engineering communication?

R10

Your skills section is very broad. Which technologies can you genuinely defend?

Never answer “expert in all.” Classify each listed skill as production ownership, working proficiency, or informed exposure. Interviewers often choose the least-supported item.

Cluster	Likely Deep Probe	Have Ready
LangChain / LangGraph / LlamaIndex	When would you avoid each framework?	One production example, one limitation, one plain-Python alternative
AWS / GCP	Map equivalent services and explain operational trade-offs	One architecture on each cloud, IAM/networking/cost story
PyTorch / TensorFlow / MLflow	Show training/evaluation/registry experience	Real model, experiment setup, artifact/version flow
FastAPI / Flask / Django	Async, validation, middleware, deployment, testing	Why FastAPI for LLM serving; when Django still wins
Docker / Kubernetes / CI/CD	Health checks, secrets, autoscaling, rollback	A deployment you operated, not only packaged

R11

Defend your awards, top-1% rating, and experience interviewing 50+ candidates.

Leadership / Awards Questions

Why did you receive nine PwC awards and seven Cognizant awards? Give two distinct stories.
How was the Cognizant top-1% / 5-of-5 rating determined?
What did the EPAM Delivery Head Award recognize?
How did you influence without formal authority?
Tell me about a teammate you mentored and the observable result.

Interviewer Questions

What was your A3-level interview rubric?
How did you distinguish memorization from real depth?
Tell me about a close hire/no-hire decision.
How did you reduce bias and calibrate with other interviewers?
What common GenAI weakness did you observe across 50+ candidates?

R12

Final resume-defense questions: failures, gaps, ownership, and why hire you?

Behavioral Cross-Questions

What is your biggest technical failure in production?
When did you disagree with an architect or client?
What metric did you improve but later realize was incomplete?
What work are you least proud of, and what changed afterward?
When did a GenAI approach fail and a simpler approach win?
What is the hardest feedback you received?

Closing Questions

Why should we hire you over a stronger ML researcher?
Why should we hire you over a stronger backend engineer?
What are your two biggest gaps for this role?
What would your 30/60/90-day plan be?
Which architecture decision would you revisit first in our product?
What questions do you have for us that reveal engineering maturity?

Your positioning: You are strongest at turning GenAI ideas into measurable enterprise systems because you combine data pipelines, backend engineering, applied ML, cloud delivery, and stakeholder ownership. Do not position yourself as a frontier-model researcher unless the evidence supports it.

Remote Company & Interview Playbooks

Target roles, typical rounds, difficulty, and preparation emphasis

Role-dependent · 10 companies

Process note · audited June 14, 2026: Exact rounds, remote eligibility, and location restrictions change by team and opening. Atlassian and Databricks publish current guides; where a company does not publish a role-specific sequence, the steps below are a preparation hypothesis, not a guaranteed loop. The current job posting and recruiter always win.

C1

Which remote/distributed companies best match this resume?

Company	Remote Signal	Best-Fit Role	Typical Emphasis	Difficulty
Twilio	Remote-first; remote India roles appear	Senior AI/ML, backend platform, applied AI	APIs, distributed systems, coding, ownership	High
Atlassian	Team Anywhere; distributed-first with entity/timezone rules	Senior/Principal ML or platform engineer	DSA, code design, system design, values	High
Databricks	Remote/hybrid varies strongly by role/location	ML platform, GenAI, solutions/field engineering	DSA, distributed data, ML systems, depth	Very high
GitLab	All-remote	AI-powered features, backend, MLOps	Async writing, values, practical technical depth	High
Elastic	Distributed company	Search/GenAI, relevance, platform engineering	Search internals, distributed systems, collaboration	High
Canonical	Remote-first	ML platform, Python/backend, cloud	Written application, academics, technical breadth	High / long
Automattic	Fully remote/global	Applied AI, experienced software engineer	Async communication, paid trial, product judgment	High / practical
Zapier	Fully remote	Applied AI, automation platform, backend	Written evidence, skills assessment, values	Medium-high
Remote	Remote-first, globally async	AI/data/backend platform	Async ownership, product engineering, 4–5 video rounds	Medium-high
Deel	Work-from-anywhere, global	AI automation, data, backend	Speed, ownership, automation, product impact	High

Twilio jobs · Atlassian Team Anywhere · GitLab jobs · Canonical remote work · Automattic careers

C2

Twilio: likely interview steps, difficulty, and targeted preparation.

Typical Loop: Recruiter → hiring manager → coding → system/ML design → behavioral/values → team match.

Difficulty: High. Expect production engineering depth, not GenAI vocabulary alone.

Your Angle: FastAPI/microservices, remote delivery, AWS, LLM reliability, measurable customer impact.

Risk: Your resume does not show CPaaS/telecom; bridge with API reliability and event-driven systems.

Prepare

Design a globally reliable notification or conversational-AI API.
Idempotency keys, retries, rate limits, queues, webhook security, observability.
Medium DSA in Python with clean tests and complexity analysis.
Explain how you would add safe LLM features to customer communications.

Questions to Expect

How would you prevent duplicate message delivery?
Design a multi-tenant AI support platform with strict latency SLOs.
How do you operate an API during provider/model failure?
Tell me about a time you improved reliability across teams.

Twilio remote-first culture

C3

Atlassian: official engineering loop and how to prepare.

Typical Loop: Recruiter → coding (data structures + code design) → system design → manager → values.

Difficulty: High. Clean reasoning, maintainable design, and values are heavily weighted.

Your Angle: Distributed teamwork, enterprise platforms, interviewer experience, written design decisions.

Risk: Do not over-focus on LLMs; show broad distributed-software fundamentals.

Prepare

DSA medium problems while narrating trade-offs and edge cases.
Code design: extensible classes/APIs, tests, readability, change requests.
System design: multi-tenant collaboration, permissions, search, scale, reliability.
Five values stories with conflict, customer impact, and learning.

Likely Questions

Design AI search across Jira and Confluence with permissions preserved.
How would you evaluate and roll out an AI feature safely?
Tell me when you changed your mind after strong disagreement.
Write code that remains easy to extend after a new requirement.

Official engineering interview guide · Team Anywhere

C4

Databricks: likely interview steps, difficulty, and the gap-closing plan.

Official Shape: Talent acquisition → skill assessments → interviews → references → decision/offer.

Engineering Shape: Hiring-manager screen → technical screen → virtual panel → references → hiring committee.

Difficulty: Very high. Strong DSA plus distributed data/ML systems and deep project defense.

Your Gap: Resume shows BigQuery/Dataflow, not Spark/Delta depth; prepare this explicitly.

Prepare

Spark execution: partitions, shuffle, joins, skew, caching, AQE, failure recovery.
Lakehouse concepts: Delta transaction log, ACID, schema evolution, streaming.
ML/LLM platform design: experiment tracking, feature/data lineage, serving, evaluation.
Medium-hard DSA and rigorous complexity analysis.

Likely Questions

Design an enterprise RAG/evaluation platform for thousands of teams.
Why does a distributed join become slow, and how do you fix skew?
How would you make LLM evaluation reproducible and governed?
Compare warehouse, lake, and lakehouse architectures.

Official Databricks hiring process · Engineering interview overview

C5

GitLab: all-remote interview strategy.

Typical emphasis: recruiter/hiring-manager conversations, role-specific technical assessment or interview, cross-functional conversations, values, and strong written async communication. Difficulty is high because remote effectiveness is evaluated as an engineering competency.

Prepare

Write a concise design proposal with assumptions, alternatives, risks, and rollout.
Show documentation-first remote habits and transparent decision-making.
Prepare product-aware answers for AI features across the DevSecOps lifecycle.

Expect

How do you unblock yourself asynchronously?
How would you ship and evaluate an AI coding feature?
Tell me about a documented decision that prevented rework.

GitLab virtual interview guide

C6

Elastic: distributed search and GenAI interview strategy.

Typical loop: recruiter screen followed by role-specific interviews with technical and team stakeholders. Your strongest bridge is OpenSearch, hybrid retrieval, vector search, observability, and distributed remote delivery.

Prepare

BM25, inverted indexes, HNSW, shards/replicas, refresh, mappings, filters.
Hybrid retrieval evaluation and relevance tuning.
Distributed failure, consistency, hot shards, and capacity planning.

Expect

Why do vector and keyword search fail differently?
How would you debug poor relevance?
Design search for 10M changing documents.

Elastic how we hire

C7

Canonical: remote-first but unusually writing-heavy and long.

Typical pattern: detailed written application/questions, domain review, assessments, and multiple interviews shown in a personalized candidate dashboard. Difficulty comes from breadth, written precision, and process length.

Prepare

Polished written evidence for every claim; remove vague adjectives.
Python/Linux/cloud fundamentals and open-source awareness.
Academic and career choices, motivation, remote collaboration.

Expect

Why Canonical and open source?
What have you operated on Linux?
Explain a technical decision clearly in writing.

Canonical hiring process

C8

Automattic: fully remote, async, and paid-trial oriented.

Published process: application → interview → paid trial → final interview → offer. Developer hiring is conducted through the tools the company uses, so communication and practical execution matter as much as discussion.

Prepare

A strong written project narrative with personal ownership and trade-offs.
Practical coding, tests, incremental delivery, and async updates.
A truthful explanation of how AI has changed your workflow.

Expect

Why remote and why Automattic?
Complete a realistic paid work trial.
Show how you communicate progress without meetings.

Automattic process

C9

Zapier: remote automation company with evidence-first applications.

Published early stages: application review → recruiter interview → 45–60 minute hiring-manager interview, followed by role-specific skills assessment/final stages. Applications focus strongly on evidence of what you can do.

Prepare

Automation/product thinking and user impact.
Clear written answers using specific outcomes.
Reliable integrations: retries, idempotency, rate limits, webhooks.

Expect

How would you add AI to an automation safely?
How do you prioritize speed versus reliability?
Show an async ownership example.

Zapier applicant process

C10

Remote and Deel: globally distributed product-engineering targets.

Remote

Published average: roughly four weeks and 4–5 video interviews.
Emphasize async ownership, documentation, product judgment, and compliant global systems.
Prepare AI/data use cases for global HR, payroll, and support.

Deel

Work-from-anywhere with high speed and accountability.
Emphasize automation, measurable impact, and comfort across time zones.
Prepare for product/system questions around global employment and compliance.

Remote interview prep · Deel careers

Concise 14-Day Study Plan

Prioritize defense and active recall instead of passively reading everything

Execution · 2–3 hours/day

P1

What should I study each day?

Days	Primary Work	Output You Must Produce
1–2	Resume defense and metric evidence	90-second pitch; one evidence sheet for every percentage
3–4	RAG, vector DB, evaluation, hallucination	Draw your EPAM/PwC architecture from memory; answer 15 follow-ups aloud
5	Agents, LangGraph, reliability	Defend why agents were needed; design failure controls
6	PEFT/LoRA, Transformers, LLM inference	Explain LoRA math and one real experiment with hyperparameters
7	ML/DL/NLP/statistics fundamentals	Rapid-fire recall without notes
8	AWS/GCP, backend, security, LLMOps	One production design and one incident/debug story
9–10	Python, SQL, DSA	6 timed mediums; explain tests and complexity
11	System and ML system design	Two 45-minute mocks: RAG platform and AI support system
12	Company-specific prep	One-page Twilio, Atlassian, or Databricks brief
13	Behavioral and remote stories	Eight STAR stories with metrics and lessons
14	Full mock and gap repair	Record, review, shorten, and correct weak answers

ML, DL, NLP & LLM Foundations

Compact senior-level rapid fire for everything listed in the resume

Foundation Check · 8 packs · 55+ prompts

F1

Classical ML: what can be asked from the resume?

Core Questions

Bias versus variance; diagnose each from train/validation curves.
L1 versus L2 regularization; effect on coefficients and feature selection.
Bagging versus boosting; Random Forest versus XGBoost.
How trees choose splits; entropy versus Gini.
When scaling matters and when it does not.
How to handle missing values, outliers, high cardinality, and leakage.
Why cross-validation can still be wrong for time series or grouped data.

Senior Answer Signals

Start with the business cost of false positives/negatives.
Choose splits that mimic production.
Keep preprocessing inside the fitted pipeline.
Explain calibration when probabilities drive decisions.
Compare simple baseline before complex model.

F2

Statistics and evaluation: metrics, experiments, and uncertainty.

Question	Concise Answer Anchor
Precision vs recall?	Precision controls false positives; recall controls false negatives; choose from business cost.
ROC-AUC vs PR-AUC?	PR-AUC is more informative for rare positives; ROC-AUC can look optimistic.
Data drift vs concept drift?	Input distribution changes versus relationship between input and target changes.
Statistical vs practical significance?	A small effect can be statistically real but not worth shipping.
Offline vs online metric?	Offline enables fast iteration; online validates real behavior and system effects.
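The precision/recall and ROC-AUC/PR-AUC anchors above can be made concrete with a small worked example. This is a minimal sketch using made-up counts for a hypothetical fraud model (10,000 transactions, 100 truly fraudulent); the numbers are illustrative assumptions, not results from the resume.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 100 positives (tp + fn), 9,900 negatives (fp + tn).
tp, fp, fn, tn = 60, 140, 40, 9760

precision, recall = precision_recall(tp, fp, fn)
false_positive_rate = fp / (fp + tn)  # the x-axis of a ROC curve

print(f"precision={precision:.2f} recall={recall:.2f} fpr={false_positive_rate:.4f}")
```

With rare positives the false-positive rate stays near 1.4% while precision is only 30%, which is why a ROC view can look healthy when most alerts are still wrong.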

Rapid Fire

Confidence interval and bootstrap.
Type I/II error and power.
A/B-test sample ratio mismatch.
Class imbalance and threshold tuning.
Model calibration and Brier score.

LLM Evaluation Bridge

Why LLM-as-judge can be biased.
How to create a golden set.
Pairwise versus pointwise judging.
How to measure inter-rater agreement.
How to gate releases on quality and latency.

F3

Deep learning fundamentals: training, optimization, and debugging.

Core Questions

Backpropagation and the chain rule.
Vanishing/exploding gradients and mitigations.
ReLU, GELU, sigmoid, softmax: where and why.
SGD versus Adam/AdamW; what weight decay changes.
Batch norm versus layer norm.
Dropout behavior in training versus inference.
Learning-rate warmup, schedules, gradient clipping, accumulation.

Debugging Questions

Loss becomes NaN: what do you inspect first?
Train loss falls but validation worsens: what now?
GPU out-of-memory: which levers preserve quality?
Training is slow: data loader, precision, kernels, batching, profiling.
How do mixed precision and loss scaling work?

F4

NLP and embeddings: before and beyond Transformers.

Questions

TF-IDF/BM25 versus dense embeddings.
Word2Vec CBOW versus skip-gram; static versus contextual embeddings.
Cosine, dot product, and Euclidean distance.
Tokenization: BPE, WordPiece, SentencePiece, unknown/rare words.
Why embedding anisotropy matters.
Bi-encoder versus cross-encoder.
NER, classification, sequence labeling, and generation metrics.

Applied Follow-Ups

How do you pick an embedding model for finance?
How do multilingual embeddings change evaluation?
Why can cosine similarity retrieve irrelevant text?
How do you detect embedding/index drift?
When is BM25 the correct answer?

F5

Transformers: explain the architecture and its scaling costs.

Architecture Questions

Derive scaled dot-product attention: softmax(QKᵀ/√d_k)V, where d_k is the key/query dimension.
Why divide by √d_k?
Why multiple attention heads?
Encoder-only versus decoder-only versus encoder-decoder.
Causal masks, padding masks, and cross-attention.
Absolute, sinusoidal, RoPE, and ALiBi positions.
Residual connections, layer norm, and feed-forward blocks.
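To rehearse the scaled dot-product attention formula above, a minimal pure-Python sketch (single head, no masking, no batching) is shown here. The toy Q, K, V values are assumptions chosen for readability. Dividing by the square root of the key dimension keeps the dot products from growing with dimension, which would otherwise push softmax into saturation.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: list[list[float]], K: list[list[float]], V: list[list[float]]) -> list[list[float]]:
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled similarity of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output dimension at a time.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0]]                 # one query
K = [[1.0, 0.0], [0.0, 1.0]]     # two keys
V = [[10.0, 0.0], [0.0, 10.0]]   # two values
result = attention(Q, K, V)
print(result)
```

The query matches the first key more closely, so the output leans toward the first value vector while still mixing in the second.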

Scaling / Inference

Why attention is quadratic in sequence length.
KV cache: what it stores and its memory cost.
Prefill versus decode; why prefill is compute-bound and decode is memory-bandwidth-bound.
Continuous batching and paged attention.
Quantization trade-offs: weights, activations, KV cache.
Mixture-of-Experts benefits and routing challenges.
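The KV-cache memory question above usually comes down to one multiplication. The sketch below assumes a hypothetical decoder configuration (32 layers, head dimension 128, 16-bit cache values); none of these numbers refer to a specific model.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int = 1,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for one K tensor and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model with grouped-query attention (8 KV heads).
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
# Same model if every one of 32 attention heads kept its own K/V.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)

print(gqa / 2**30, "GiB with 8 KV heads")
print(mha / 2**30, "GiB with 32 KV heads")
```

The cache grows linearly with sequence length and batch size, which is why fewer KV heads, KV-cache quantization, and paged allocation matter for serving cost.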

F6

LLM training and alignment: pretraining through preference optimization.

Stage	Purpose	Data / Objective	Main Risk
Pretraining	Learn language/world patterns	Large corpus; next-token prediction	Cost, contamination, unsafe knowledge
Continued pretraining	Adapt domain language	Unlabeled domain corpus	Forgetting and domain overfit
SFT	Teach instruction behavior	Prompt-response examples	Data quality and imitation limits
RLHF	Optimize human preferences	Preference/reward model + RL	Complexity and reward hacking
DPO	Preference alignment directly	Chosen/rejected pairs	Preference-data bias

Rapid Fire

Temperature, top-k, top-p, repetition penalty.
Distillation versus quantization versus pruning.
Catastrophic forgetting.
Data contamination and benchmark leakage.

Decision Questions

Prompting vs RAG vs fine-tuning.
Open model vs hosted API.
Small specialized model vs large general model.
When not to use an LLM.

F7

SQL, data pipelines, and MLOps questions implied by the resume.

SQL / Data

Window functions, CTEs, joins, deduplication, top-N per group.
Query plan, indexes, partition pruning, clustering.
Batch versus streaming and exactly-once semantics.
Idempotency, backfills, late data, schema evolution.
Data quality checks and lineage.

MLOps / LLMOps

Experiment tracking, registry, reproducibility, and rollback.
Feature/data/model/prompt version compatibility.
Shadow, canary, and A/B deployment.
Monitoring drift, quality, cost, latency, and safety.
How LLMOps differs from classical MLOps.

F8

Security, privacy, and responsible-AI questions for enterprise GenAI.

Threats

Direct and indirect prompt injection.
Data exfiltration through tools or retrieved context.
Over-privileged agents and confused-deputy attacks.
Training-data poisoning and unsafe dependencies.
PII/PHI leakage, retention, and regional compliance.

Controls

Least privilege, scoped credentials, allowlisted tools, policy checks.
Separate instructions from untrusted data; sanitize and label context.
Input/output validation, DLP, encryption, audit logs.
Human approval for consequential actions.
Red-team evals, incident response, and kill switches.

RAG Architecture

Retrieval-Augmented Generation — ingestion, retrieval, generation, evaluation

High Priority · 8 questions

Offline ingestion

Build and refresh the searchable knowledge index

Sources: Documents (PDF · API · S3 · tables)

Understand: Parse & chunk (OCR · hierarchy · metadata)

Represent: Embed (dense vectors + text fields)

Index: OpenSearch (HNSW vector index + BM25)

Online query path

Retrieve evidence first, then generate a grounded answer

Request: User query (intent + access context)

Recall: Hybrid retrieval (dense + BM25 · RRF fusion)

Precision: Rerank (cross-encoder top-k)

Generate: Grounded LLM (context + refusal rules)

Response: Cited answer (answer · sources · confidence)

Production RAG separates asynchronous indexing from the latency-sensitive query path.

Q1

Explain Hybrid RAG vs naive (dense-only) RAG. When and why do you use hybrid?

Naive RAG uses only dense vector search (bi-encoder embeddings + ANN). It captures semantic similarity well but fails on exact keywords, entity names, numeric IDs, or ticker symbols.

Hybrid RAG combines dense vector search with sparse BM25 keyword matching (a TF-IDF-style lexical ranking function). The two result lists are merged using Reciprocal Rank Fusion (RRF): each document gets a score of Σ 1/(k + rank_i) across all rankers, where rank_i is its 1-based position in ranker i and k is a smoothing constant (commonly 60), and documents are re-sorted by this fused score.
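The RRF formula is short enough to write out on a whiteboard. The following is a minimal sketch with hypothetical document IDs and the commonly used constant k = 60; it fuses ranked ID lists only and ignores the raw scores from each retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(doc) = sum over rankers of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical dense results
bm25_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 results

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Documents that appear near the top of both lists rise to the top of the fused list, and no score normalization between BM25 and cosine similarity is needed because only ranks are used.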

Dimension	Naive (Dense)	Hybrid (Dense + BM25)
Semantic queries	Excellent	Excellent
Exact keyword/ID	Misses	Catches
Financial codes	Poor	Good
Setup complexity	Low	Medium
Recall improvement	Baseline	Often +15–40% on keyword-heavy corpora; verify on your own eval set

Your real result: At PwC, switching to Hybrid RAG on Amazon OpenSearch Service gave a 40% accuracy improvement on a US financial client's document Q&A system, because financial product codes and ISINs needed exact matching that dense search missed.

Python · LangChain EnsembleRetriever

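# Import paths for LangChain 0.x. On 1.x the legacy retrievers are expected
# under langchain_classic.retrievers; verify against your installed version.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import OpenSearchVectorSearch

# Assumes `docs` (a list of Document) and `embeddings` (an Embeddings
# instance) already exist. BM25Retriever needs the rank_bm25 package.

# Sparse retriever
bm25 = BM25Retriever.from_documents(docs, k=10)

# Dense retriever; the URL is a placeholder for your OpenSearch endpoint
vs = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="https://localhost:9200"
)
dense = vs.as_retriever(search_kwargs={"k": 10})

# Hybrid: weight 0.4 BM25 + 0.6 dense, merged via weighted RRF
hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.4, 0.6]   # tune per domain
)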

Lewis et al. 2020: RAG paper (arXiv) · AWS OpenSearch hybrid search docs · LangChain EnsembleRetriever

Follow-up questions

Chunking strategies · Re-ranking · RAG evaluation metrics

Q2

What chunking strategies exist? Which do you choose and why?

Chunking directly controls what the retriever can find. The wrong chunk size either splits context across boundaries or retrieves too much noise.

Strategy	How it works	Best for	Tradeoff
Fixed-size	Splits every N tokens with optional overlap	Simple docs, fast prototype	Splits mid-sentence
Sentence splitter	Splits at sentence boundaries	General prose	Uneven chunk sizes
Recursive character	Tries separators "\n\n" → "\n" → " " → "" in order until chunks fit	Most text documents (LangChain default)	Slight overhead
Semantic chunking	Groups sentences with cosine similarity shift	Dense research docs	Slow; needs embedding at ingest
Parent-child	Small child chunks indexed; parent context served to LLM	Long reports, contracts	Larger index size
Late chunking	Run the full document through a long-context embedding model, then pool token embeddings per chunk so each chunk vector keeps document context	Context-sensitive passages (jina-embeddings-v3)	Newer, less tooling

Production recommendation: Use parent-child retrieval (LangChain ParentDocumentRetriever) with child chunks of about 256 tokens and parent chunks of about 1024 tokens as a starting point. Retrieve small chunks for high precision, then pass the parent window to the LLM for full context.

Always add chunk overlap (10–20% of chunk size) so sentences at boundaries aren't orphaned. Metadata-enrich every chunk: source_file, page_number, created_at, entity_type — these become filterable fields in the vector DB.
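Overlap and metadata enrichment can be illustrated without any framework. This is a minimal sketch that treats whitespace-separated words as a stand-in for real tokenizer tokens; the function name, field names beyond source_file, and the sample file name are assumptions for illustration.

```python
def chunk_with_overlap(
    tokens: list[str], chunk_size: int, overlap: int, source_file: str
) -> list[dict]:
    """Fixed-size chunks where each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: list[dict] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "source_file": source_file,   # filterable metadata
            "start_token": start,
            "chunk_index": len(chunks),
        })
        if start + chunk_size >= len(tokens):
            break  # the final window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1, source_file="report.pdf")
for chunk in chunks:
    print(chunk["chunk_index"], chunk["text"])
```

In a real pipeline the same metadata dictionary would carry page_number, created_at, and entity_type so the vector store can filter before similarity search.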

LangChain text splitter docs · Late Chunking paper (arXiv 2024)

Follow-up questions

Re-ranking after retrieval · Metadata filtering in vector DBs

Q3

What is re-ranking? When is the extra latency justified?

Initial retrieval uses a bi-encoder (query and document embedded separately) because it's fast — you embed the query once and do ANN lookup. But bi-encoders miss subtle relevance nuances.

A cross-encoder re-ranker takes (query, document) pairs and scores them jointly with full attention — far more accurate. You run it on just the top-20 retrieved results (not the full index), making the cost manageable.

Stage	Model type	Speed	Accuracy	Scale
ANN retrieval	Bi-encoder	~5ms	Good	Millions of docs
Re-ranking	Cross-encoder	~80–200ms	Excellent	Top 20–50 only

Use re-ranking when: precision matters over throughput (finance, healthcare, legal), corpus has many near-duplicate docs, or faithfulness is critical.

Python · Cohere Rerank

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Assumes COHERE_API_KEY is set and `hybrid_retriever` is the
# EnsembleRetriever from Q1, configured to return about 20 candidates.
reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever  # retrieve top-20 first
)

Cohere Rerank docs · BGE Reranker paper (arXiv)

Follow-up questions

RAG evaluation (RAGAS) · Latency optimization

Q4

What is RAG Fusion and how does it differ from standard hybrid RAG?

RAG Fusion uses the LLM itself to generate multiple alternative query reformulations from the user's original query, runs each through the retriever independently, then fuses all result lists with RRF before generating the answer.

Standard hybrid RAG uses one query, two retrieval methods (dense + sparse), then fuses. RAG Fusion uses N query variations, one or more retrieval methods, then fuses — it solves vocabulary mismatch and query ambiguity at the query level.

Python · RAG Fusion pattern

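def rag_fusion(query: str, k: int = 5, rrf_k: int = 60) -> list[str]:
    # Assumes `llm` (a LangChain chat model) and `retriever` are defined,
    # and that every document carries a stable metadata["id"].

    # 1. Generate query variations via LLM, one per line
    response = llm.invoke(
        f"Generate 4 alternative versions of this query, one per line:\n{query}"
    )
    variations = [line.strip() for line in response.content.splitlines() if line.strip()]

    # 2. Retrieve for each variation and accumulate RRF scores
    all_results = {}
    for q in [query] + variations:
        results = retriever.invoke(q)
        for rank, doc in enumerate(results, start=1):  # RRF ranks are 1-based
            doc_id = doc.metadata["id"]
            all_results[doc_id] = all_results.get(doc_id, 0.0) + 1 / (rrf_k + rank)

    # 3. Sort by fused score and return the top-k document IDs
    return sorted(all_results, key=all_results.get, reverse=True)[:k]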

RAG Fusion paper (arXiv 2024)

Q5

How do you evaluate a RAG pipeline? What metrics and tools do you use?

RAG evaluation has two orthogonal dimensions: retrieval quality (did we get the right chunks?) and generation quality (did the LLM use them faithfully?).

Metric	Formula (simplified)	Target	Tool
Context Precision	Relevant retrieved / Total retrieved	>0.8	RAGAS
Context Recall	Facts in context / Total needed facts	>0.75	RAGAS
Faithfulness	Claims supported by context / Total claims	>0.85	RAGAS / LangSmith
Answer Relevancy	Mean cosine(question embedding, embeddings of questions regenerated from the answer)	>0.7	RAGAS
Latency P95	95th pctile end-to-end ms	<2000ms	CloudWatch
Token cost / query	Input + output tokens × price	Track trend	LangSmith
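The simplified formulas in the table can be checked by hand on a labeled example. The sketch below is set-based arithmetic over hypothetical chunk IDs and latencies; it is not how RAGAS computes its metrics (RAGAS uses LLM judgments and rank-aware scoring), only a way to reason about what the ratios mean.

```python
import math


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of needed chunks that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(latencies_ms)
    index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[index]


retrieved = ["c1", "c2", "c3", "c4"]   # hypothetical retriever output
relevant = {"c1", "c3", "c9"}          # hypothetical golden labels
latencies = [100.0 * i for i in range(1, 21)]  # 100 ms .. 2000 ms

print(context_precision(retrieved, relevant))
print(context_recall(retrieved, relevant))
print(p95(latencies))
```

Here precision is 2 of 4 retrieved and recall is 2 of 3 needed, which shows how a retriever can miss both targets in the table at once.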

Python · RAGAS evaluation

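# Written against the RAGAS 0.1-style API; newer releases rename dataset
# columns and metric objects, so check your installed version.
from ragas import evaluate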
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall
)
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_chunks_per_q,   # List[List[str]]
    "ground_truth": reference_answers
})

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results.to_pandas())

RAGAS documentation · RAGAS paper (arXiv 2023) · LangSmith tracing docs

Follow-up questions

Online monitoring in production · Handling hallucinations

Q6

How do you prevent hallucinations in RAG systems?

1. Grounding instructions in system prompt: Explicitly instruct the LLM: "Answer only from the provided context. If the answer is not in the context, say 'I don't have enough information.'" The size of the reduction is workload-dependent, so measure it on your golden set instead of quoting a fixed percentage.
2. Citations at generation time: Ask the LLM to cite the source chunk for every claim. You can then programmatically verify the citation exists in the retrieved context.
3. Faithfulness guardrail: Run a RAGAS faithfulness check on a sample of production traffic. Alert if the score drops below a threshold.
4. Confidence thresholding: If retrieval similarity scores are all below a threshold (e.g., cosine < 0.6), return "no relevant information found" rather than hallucinating.
5. NeMo Guardrails / Bedrock Guardrails: Add a post-generation layer that checks the answer doesn't reference topics outside the retrieved context.
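Two of the controls above, confidence thresholding and citation verification, are plain application code. The sketch below assumes the model is prompted to cite chunks in a bracketed form such as [doc-1]; that citation format, the function names, and the 0.6 threshold are illustrative assumptions.

```python
import re


def should_answer(similarity_scores: list[float], threshold: float = 0.6) -> bool:
    """Answer only when at least one retrieved chunk clears the threshold."""
    return any(score >= threshold for score in similarity_scores)


def unsupported_citations(answer: str, retrieved_ids: set[str]) -> set[str]:
    """Return cited IDs that were never retrieved for this request."""
    cited = set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))
    return cited - retrieved_ids


answer = "Fees rose in 2023 [doc-1]. The fund closed in 2024 [doc-7]."
missing = unsupported_citations(answer, {"doc-1", "doc-2"})

print(should_answer([0.41, 0.55]))  # all scores below threshold
print(should_answer([0.41, 0.72]))
print(missing)
```

A non-empty result from the citation check is a cheap signal to block or regenerate the answer before it reaches the user.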

AWS Bedrock Guardrails · NVIDIA NeMo Guardrails

Q7

What is GraphRAG and when would you use it over standard RAG?

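GraphRAG (Microsoft, 2024) uses an LLM to build a knowledge graph from documents: entities become nodes and relationships become edges. It then clusters the graph into communities and pre-generates community summaries. Retrieval uses the graph and those summaries rather than (or in addition to) vector search over raw chunks.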

It excels when the question requires connecting multiple entities across many documents ("What are all the risk factors shared by companies X and Y?") — a pattern that naive RAG misses because no single chunk contains the answer.

Use case	Standard RAG	GraphRAG
Factual Q&A on single docs	✅ Best	Overkill
Multi-hop reasoning across entities	Struggles	✅ Best
Relationship queries ("who reports to X")	Misses	✅ Best
Real-time index updates	✅ Easy	Slow (graph rebuild)
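The multi-hop row in the table is easier to defend with a toy example. The sketch below is not Microsoft's GraphRAG pipeline; it only shows why edges extracted from separate documents can answer a question that no single chunk contains. The company names and risk labels are hypothetical.

```python
# Each edge is (source entity, relation, target entity), extracted from a
# different hypothetical document.
EDGES = [
    ("Company X", "exposed_to", "FX risk"),              # document 1
    ("Company X", "exposed_to", "interest-rate risk"),   # document 1
    ("Company Y", "exposed_to", "FX risk"),              # document 2
    ("Company Y", "exposed_to", "supply-chain risk"),    # document 2
]


def neighbors(edges: list[tuple[str, str, str]], node: str, relation: str) -> set[str]:
    """Entities reachable from `node` over one edge of the given relation."""
    return {dst for src, rel, dst in edges if src == node and rel == relation}


# "What risk factors are shared by companies X and Y?"
shared_risks = neighbors(EDGES, "Company X", "exposed_to") & neighbors(
    EDGES, "Company Y", "exposed_to"
)
print(shared_risks)
```

Chunk-level vector search would have to retrieve both documents and hope the LLM joins them; the graph makes the join an explicit set operation.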

GraphRAG paper: Microsoft (arXiv 2024) · Microsoft GraphRAG repo

Q8

How do you handle multi-turn conversation memory in RAG?

Multi-turn RAG needs to resolve references ("it", "the same document", "as you mentioned") to prior turns before retrieval — otherwise the query is incomplete.

Query condensation: Pass last N messages + current query to LLM with instruction "Rewrite the final question as a standalone query." Then retrieve with the condensed query.
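ConversationalRetrievalChain: LangChain's legacy built-in chain for this pattern. It is deprecated in recent releases; the documented replacement combines create_history_aware_retriever with create_retrieval_chain, or you can implement the condense step explicitly in LangGraph.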
Context window management: Summarise old turns with a summarisation LLM when conversation exceeds 20 turns, to keep token cost bounded.
External memory stores: For long-running sessions, persist summaries to Redis or a DB keyed by session_id.
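The condensation and summarisation steps above can be sketched as a small memory object. This is a framework-free illustration: the class name, the `summarize` callable (a stand-in for a summarisation LLM call), and the stub used in the demo are all assumptions. It expects max_turns to be at least 1.

```python
from dataclasses import dataclass, field
from typing import Callable

Turn = tuple[str, str]  # (role, text)


@dataclass
class ConversationMemory:
    max_turns: int = 20
    summary: str = ""
    turns: list[Turn] = field(default_factory=list)

    def add(self, role: str, text: str, summarize: Callable[[str, list[Turn]], str]) -> None:
        """Append a turn; fold the oldest turns into the summary when over budget."""
        self.turns.append((role, text))
        if len(self.turns) > self.max_turns:
            overflow = self.turns[: -self.max_turns]
            self.turns = self.turns[-self.max_turns:]
            self.summary = summarize(self.summary, overflow)

    def condensation_prompt(self, question: str) -> str:
        """Prompt asking an LLM to rewrite the question as a standalone query."""
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return (
            f"Summary of earlier turns: {self.summary or '(none)'}\n"
            f"Recent turns:\n{history}\n"
            f"Rewrite the final question as a standalone query: {question}"
        )


def stub_summarize(previous: str, overflow: list[Turn]) -> str:
    """Stand-in for an LLM summariser: concatenates the dropped turns."""
    joined = "; ".join(f"{role}: {text}" for role, text in overflow)
    return f"{previous} {joined}".strip()


memory = ConversationMemory(max_turns=2)
memory.add("user", "What is the fee on fund A?", stub_summarize)
memory.add("assistant", "0.5%.", stub_summarize)
memory.add("user", "And for the same fund last year?", stub_summarize)
prompt = memory.condensation_prompt("And for the same fund last year?")
print(prompt)
```

For long-running sessions the `summary` and `turns` fields are what you would persist to Redis or a database under the session_id.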

LangChain ConversationalRetrievalChain

Multi-Agent Systems

LangGraph, orchestration patterns, tool use, agent safety

High Priority · 8 questions

Controlled autonomy · Orchestrator / Supervisor: Classifies intent, plans work, routes tasks, and owns the shared LangGraph state

Risk gate · Human approval: Required for consequential actions

Evidence: RAG agent (retrieves grounded knowledge and citations)

Action: Tool agent (calls approved APIs and databases)

Compute: Code agent (runs bounded Python or SQL tasks)

Response: Summary agent (synthesizes, validates, and formats output)

Shared state with checkpointing · Conditional edges for routing · Budgets for tools, tokens, and retries

Multi-agent supervisor pattern with human-in-the-loop approval gate

Q1

Agent vs Workflow — what's the fundamental difference? When do you choose each?

Dimension	Workflow (DAG)	Agent
Routing logic	Code decides	LLM decides
Predictability	High — deterministic	Low — emergent
Auditability	Full trace known ahead	Trace varies per run
Flexibility	Fixed paths only	Open-ended tasks
Failure modes	Step failure, timeout	Loops, hallucinated tool calls
Best for	Finance compliance, ETL, pipelines	Research, assistant, open-ended

Production pattern: Use a fixed workflow topology (which nodes exist and how they connect) with LLM-driven tool selection inside each node. This is what you built at PwC — deterministic orchestration, intelligent execution.

Anthropic: Building Effective Agents (2024)

Follow-up questions

LangGraph StateGraph internals · Preventing agent loops

Q2

How does LangGraph work? Explain StateGraph, nodes, edges, and checkpointing.

LangGraph is a low-level orchestration runtime for long-running, stateful agents. A StateGraph models work as nodes and edges; typed shared state flows through the graph and nodes return partial updates. Persistence enables durable execution, streaming, human-in-the-loop interrupts, and recovery. LangChain agents are a higher-level interface built on LangGraph; use LangGraph directly when you need explicit state and control.

Python · LangGraph StateGraph

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict, List, Annotated
import operator

# Assumes `retriever` and `llm` (a chat model) are defined elsewhere.

class AgentState(TypedDict):
    messages: Annotated[List, operator.add]  # reducer: append
    retrieved: List[str]
    answer: str
    iteration: int

def retrieve_node(state: AgentState) -> dict:
    docs = retriever.invoke(state["messages"][-1].content)
    return {"retrieved": [d.page_content for d in docs]}

def generate_node(state: AgentState) -> dict:
    # Pass retrieved text as a system message instead of mixing raw
    # strings into the message list.
    context = "\n\n".join(state["retrieved"])
    prompt = [("system", f"Answer only from this context:\n{context}")] + state["messages"]
    answer = llm.invoke(prompt).content  # store plain text in state
    return {"answer": answer, "iteration": state.get("iteration", 0) + 1}

def should_continue(state: AgentState) -> str:
    if state["iteration"] >= 3: return "end"  # hard loop budget
    if "insufficient" in state["answer"]: return "retrieve"
    return "end"

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", should_continue,
    {"retrieve": "retrieve", "end": END})

# Checkpointing = persistence across turns (human-in-the-loop).
# MemorySaver is in-process only; use a SQLite or Postgres saver for durability.
memory = MemorySaver()
app = graph.compile(checkpointer=memory, interrupt_before=["generate"])
# Invoke with config={"configurable": {"thread_id": "..."}} so the
# checkpointer can key the run.

LangGraph official docs · LangGraph GitHub

Q3

What is the ReAct pattern? How does it differ from function-calling agents?

ReAct interleaves reasoning and action so a model can use observations from tools before choosing the next step. In production, do not depend on exposing the model's private reasoning trace. Log auditable state transitions, tool calls, arguments, results, policy decisions, and concise model-provided rationales instead.

Function/tool calling is the typed application interface: the model emits a structured tool request, the application validates and authorizes it, executes the tool, then returns the result. ReAct is a reasoning-and-action pattern; tool calling is an execution contract, and they can be used together.

Pattern	Transparency	Latency	Best for
ReAct	Observable actions and observations	Usually multi-step	Adaptive tool-using tasks
Function calling	Typed tool request and result	Often lower	Reliable application/tool integration
Plan-and-Execute	Explicit plan and task state	Varies	Long, decomposable tasks
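The tool-calling contract described above (model emits a structured request; the application validates, authorizes, executes, and returns a result) can be sketched without any LLM SDK. The registry, the `get_balance` tool, and the role names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    handler: Callable[..., Any]
    required_args: dict[str, type]
    allowed_roles: frozenset[str]


def get_balance(account_id: str) -> dict:
    """Hypothetical read-only tool."""
    return {"account_id": account_id, "balance": 100.0}


REGISTRY = {
    "get_balance": ToolSpec(get_balance, {"account_id": str}, frozenset({"analyst"})),
}


def execute_tool_call(request: dict, caller_role: str) -> dict:
    """Validate and authorize a model-emitted tool request before running it."""
    spec = REGISTRY.get(request.get("name"))
    if spec is None:
        return {"ok": False, "error": "unknown tool"}  # hallucinated tool name
    if caller_role not in spec.allowed_roles:
        return {"ok": False, "error": "not authorized"}
    args = request.get("arguments", {})
    if set(args) != set(spec.required_args) or not all(
        isinstance(args[name], kind) for name, kind in spec.required_args.items()
    ):
        return {"ok": False, "error": "invalid arguments"}
    return {"ok": True, "result": spec.handler(**args)}


ok = execute_tool_call({"name": "get_balance", "arguments": {"account_id": "A-1"}}, "analyst")
unknown = execute_tool_call({"name": "delete_account", "arguments": {}}, "analyst")
denied = execute_tool_call({"name": "get_balance", "arguments": {"account_id": "A-1"}}, "guest")
bad_args = execute_tool_call({"name": "get_balance", "arguments": {"account_id": 7}}, "analyst")
print(ok, unknown, denied, bad_args, sep="\n")
```

Every rejection is returned as data, so the agent loop can show the error to the model as an observation while the audit log records the decision.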

ReAct paper — Yao et al. (arXiv 2022)

Q4

How do you prevent agents from looping, hallucinating tool calls, or going off-rails?

1
Explicit budgets: Set graph-step, per-tool retry, wall-clock, token, and cost limits. Force a safe terminal state when any budget is exhausted.
2
Tool output validation: Wrap every tool in a Pydantic model. If the tool returns unexpected schema, raise a ToolException — the agent sees the error and can self-correct (max 1 retry per tool).
3
Structured outputs for routing: Force the routing decision via JSON schema (with_structured_output). Eliminates free-text routing that the LLM might misformat.
4
Human-in-the-loop for side-effecting actions: Any tool that writes to a DB, calls an API with side effects, or sends a message gets an interrupt_before checkpoint in LangGraph — requires explicit human approval.
5
Deterministic policy layer: Treat all model and tool output as untrusted. Enforce identity, authorization, egress, argument validation, effect receipts, and audit outside the model.
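A minimal sketch of the budget idea from the list above, assuming a hypothetical `next_action` callable that stands in for the model and a plain dict of tool functions. The limits and names are illustrative, not a LangGraph API; the point is that the loop always ends in an explicit terminal state.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Budget:
    """Hypothetical limits; calibrate them per workload."""
    max_steps: int = 8
    max_retries_per_tool: int = 1
    steps_used: int = 0
    retries: dict = field(default_factory=dict)


def run_agent(next_action: Callable[[list], dict], tools: dict, budget: Budget) -> dict:
    """Run a tool loop until the model finishes or a budget is exhausted."""
    history: list = []
    while budget.steps_used < budget.max_steps:
        budget.steps_used += 1
        action = next_action(history)              # stand-in for the model call
        if action["type"] == "final":
            return {"status": "done", "answer": action["answer"], "steps": budget.steps_used}
        tool = action["tool"]
        if tool not in tools:                      # hallucinated tool name
            history.append({"error": f"unknown tool: {tool}"})
            continue
        try:
            history.append({"tool": tool, "result": tools[tool](**action["args"])})
        except Exception as exc:                   # failures count against the retry budget
            used = budget.retries.get(tool, 0)
            if used >= budget.max_retries_per_tool:
                return {"status": "aborted",
                        "reason": f"retry budget exhausted for {tool}",
                        "steps": budget.steps_used}
            budget.retries[tool] = used + 1
            history.append({"tool": tool, "error": str(exc)})
    # Safe terminal state: never loop past the step budget.
    return {"status": "aborted", "reason": "step budget exhausted", "steps": budget.steps_used}
```

In a real system the aborted result would route to a fallback answer or a human queue instead of being returned to the user as-is.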

LangGraph interrupts

Q5

How do you design a multi-agent system for high concurrency? (your PwC 60% uplift)

1
Microservice isolation: Each agent runs as a separate AWS Lambda / ECS container. Agents communicate through an SQS queue, decoupling their lifecycles and enabling independent scaling.
2
Async fan-out: Orchestrator dispatches sub-tasks via asyncio.gather — all sub-agents run concurrently. Only the final merge step waits.
3
Result caching: Sub-agent results for repeated sub-tasks are cached in Redis (TTL: 5 min for volatile data, 1hr for reference data).
4
Circuit breaker: If a sub-agent fails 3× in 60s, the orchestrator marks it unavailable and falls back to a degraded path rather than propagating failure.

This is the architecture that delivered the 60% efficiency uplift at PwC healthcare. The key insight: sequential sub-agent calls were the bottleneck. Async fan-out was the fix.
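The fan-out and circuit-breaker steps can be sketched in-process as follows. This is an assumption-laden illustration: the sub-agents are plain coroutines, the breaker is a hypothetical in-memory class, and the SQS/Lambda/ECS isolation described above is not modelled.

```python
import asyncio
import time


class CircuitBreaker:
    """Opens after `threshold` failures within `window_s` seconds."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0):
        self.threshold = threshold
        self.window_s = window_s
        self.failures: list = []

    def record_failure(self) -> None:
        self.failures.append(time.monotonic())

    def is_open(self) -> bool:
        cutoff = time.monotonic() - self.window_s
        self.failures = [t for t in self.failures if t >= cutoff]
        return len(self.failures) >= self.threshold


async def call_agent(name, agent, task, breaker, fallback):
    """Call one sub-agent, falling back to a degraded result on failure."""
    if breaker.is_open():
        return name, fallback(task)
    try:
        return name, await agent(task)
    except Exception:
        breaker.record_failure()
        return name, fallback(task)


async def fan_out(task, agents, breakers, fallback):
    """Dispatch the task to every sub-agent concurrently; only the merge waits."""
    calls = [call_agent(n, a, task, breakers[n], fallback) for n, a in agents.items()]
    return dict(await asyncio.gather(*calls))
```

Because each call catches its own exception, one failing sub-agent degrades its own slot in the merged result rather than failing the whole `gather`.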

Q6

What is Model Context Protocol (MCP) and how does it affect agent tool integration?

MCP (Model Context Protocol) is an open client-server protocol for connecting AI applications to reusable tools, resources, and prompts. It reduces one-off integration code, but it does not remove the need to trust the server, authenticate users, authorize every action, validate outputs, and obtain approval for high-impact effects.

As of June 14, 2026: the stable specification is 2025-11-25. The 2026-07-28 release candidate is available for testing, with final release scheduled for July 28, 2026. MCP connects applications to context/tools; A2A focuses on agent-to-agent discovery and communication.

MCP stable specification MCP 2026 release candidate

Q7

MCP vs A2A vs tool calling: which layer solves which problem?

Layer	Use it for	Do not confuse it with
Tool calling	A model requests a typed application function	Cross-vendor integration discovery
MCP	A host connects to reusable tool/resource/prompt servers	Autonomous agent-to-agent delegation
A2A	Agents advertise capabilities and coordinate tasks across systems	Direct database/tool access

Security answer: protocol compatibility is not authorization. Authenticate identities, apply least privilege, validate schemas, constrain egress, record effect receipts, and require approval based on risk.

MCP specification A2A developer guide

Q8

How do you choose an agent framework, runtime, harness, or managed platform in 2026?

Need	Candidate	Decision signal
Fast high-level agent	LangChain agents, OpenAI Agents SDK, Google ADK	Provider/tool fit, tracing, evals, team familiarity
Explicit durable orchestration	LangGraph or custom state machine	Persistence, interrupts, recovery, deterministic control
Agent harness	Deep Agents or an internal harness	Planning, filesystem/subagents, long-running task ergonomics
Managed production runtime	Bedrock AgentCore, Vertex AI Agent Engine, or equivalent	IAM, networking, observability, compliance, regional support

Strong answer: start from the task and operating constraints. Prototype the simplest bounded workflow, add model-driven autonomy only where it improves measured task success, and choose the platform that minimizes undifferentiated operations without trapping critical business logic.

LangChain products Google ADK Bedrock AgentCore

LLMOps & Evaluation

Deployment, monitoring, cost control, drift detection, CI/CD for LLMs

Core Topic 6 questions

Q1

How do you reduce LLM inference latency in production? (How did you achieve sub-second at EPAM?)

1
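Semantic caching (biggest win): Cache (query_embedding, response) pairs in Redis. On each new query, compute embedding similarity; if cosine similarity with a cached query exceeds a tuned threshold (for example 0.92), return the cached response. Hit rate depends on how repetitive the traffic is: 30–50% is plausible for FAQ-like workloads, so measure it rather than assume it. Cache hits return at near-zero latency.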
2
Streaming responses: Use SSE/streaming so the user sees the first token in ~400ms instead of waiting for the full response. Perceived latency drops dramatically.
3
Async concurrent retrieval: Run BM25 and dense retrieval concurrently with asyncio.gather. If retrieval and LLM call can overlap, pipeline them.
4
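Prompt compression: Use LLMLingua to compress long retrieved contexts before sending them to the LLM. Fewer input tokens (the 20–40% range is workload-dependent) means lower prefill latency and cost; re-check answer quality on the golden set after compressing.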
5
Model routing: Route classification and simple extraction to the smallest evaluated low-latency model tier; reserve a stronger reasoning tier for complex generation. Re-benchmark by provider and region because model names, pricing, and latency change.
6
Connection pooling: Reuse HTTP connections to the Bedrock endpoint. An SDK connection pool with keep-alive avoids a fresh TCP/TLS handshake per request, which can cost on the order of 100ms; measure the saving in your region.

At EPAM, the combination of semantic caching + async retrieval + streaming brought P95 latency from ~3.5s → sub-1s.
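A minimal in-memory sketch of the semantic cache described above. The `embed` and `llm` callables are assumptions supplied by the caller, the list scan stands in for a Redis vector lookup, and 0.92 is the example threshold from the text, not a recommended default.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed              # caller-supplied embedding function
        self.threshold = threshold
        self.entries = []               # (embedding, response) pairs

    def get(self, query):
        """Return the cached response of the most similar query, if similar enough."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))


def answer(query, cache, llm):
    """Serve from cache when possible; otherwise call the model and store the result."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = llm(query)
    cache.put(query, response)
    return response, False
```

A production cache would also key on tenant, prompt version, and model version, and expire entries so stale answers are not served.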

LLMLingua paper (arXiv 2023) GPTCache — semantic cache library

Q2

What do you monitor in a production LLM system? What are your alert thresholds?

Metric	Tool	Alert threshold	Why it matters
Latency P95 / P99	CloudWatch	P95 > 2s	Direct UX impact
Error rate (5xx)	CloudWatch Alarms	> 1%	System stability
Token cost / hour	LangSmith + custom	> 30% spike	Runaway costs
Faithfulness score	RAGAS online eval	< 0.80	Hallucination proxy
Context precision	RAGAS	< 0.70	Retrieval degradation
Cache hit rate	Redis metrics	< 20% (unexpected drop)	Cost/latency efficiency
Prompt injection attempts	Custom classifier	Any detection	Security
Embedding drift	Scheduled job	Cosine shift > 0.15 vs baseline	Model / data drift

Treat the thresholds above as examples, not universal targets. Calibrate metrics, alert levels, sample rate, and evaluation cadence against user impact, traffic, risk, cost, and human labels. Monitor rolling trends and slice by tenant, intent, language, and model/version.

AWS Bedrock CloudWatch metrics LangSmith monitoring

Q3

What is prompt versioning and how do you implement CI/CD for prompts?

Prompts are code. Changing a prompt is a deployment. Without versioning, you can't roll back a bad prompt change that's causing hallucinations.

1
Store prompts in LangSmith Hub or a config store (SSM Parameter Store / Git): Every prompt has a version hash. Production always pins a specific version.
2
Evaluation gate in CI: PR changing a prompt triggers an eval pipeline. RAGAS scores are computed on a golden test set. Merge only if faithfulness ≥ 0.82 and answer_relevancy ≥ 0.72.
3
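Shadow, then canary testing: Run the new prompt on a mirrored sample (for example 10% of traffic) in shadow mode, where its outputs are scored but never shown to users. If the metrics hold, serve it to a small canary cohort and promote only if it beats the current version.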
4
Instant rollback: Update SSM parameter to prior version hash. No redeploy required.
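The gate and rollback steps above can be sketched as follows. `PromptRegistry` is a hypothetical in-memory stand-in for LangSmith Hub, SSM Parameter Store, or Git, and the per-example scores are assumed to come from whatever eval pipeline you run; the thresholds are the example values from the text.

```python
from statistics import mean

# Example floors from the text above; calibrate them for your task.
GATES = {"faithfulness": 0.82, "answer_relevancy": 0.72}


def passes_gate(per_example_scores, gates=GATES):
    """Return (passed, mean score per metric, failed metrics) for a golden-set run."""
    means = {m: mean(row[m] for row in per_example_scores) for m in gates}
    failed = [m for m, floor in gates.items() if means[m] < floor]
    return (not failed), means, failed


class PromptRegistry:
    """Keeps prompt versions and an ordered list of promoted (pinned) versions."""

    def __init__(self):
        self.texts = {}
        self.pinned = []

    def register(self, version, text):
        self.texts[version] = text

    def promote(self, version, per_example_scores):
        """Pin a version only if its eval run clears every gate."""
        ok, _, _ = passes_gate(per_example_scores)
        if ok:
            self.pinned.append(version)
        return ok

    def rollback(self):
        """Re-pin the previously promoted version without a redeploy."""
        if len(self.pinned) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.pinned.pop()
        return self.pinned[-1]

    def live_prompt(self):
        return self.texts[self.pinned[-1]]
```

Gating on the mean alone can hide a bad slice, so a fuller version would also gate per intent or tenant.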

LangSmith Prompt Hub

Q4

How do you reduce LLM API costs in production by 50%+?

Technique	Cost reduction	Implementation effort
Semantic caching	30–50% fewer LLM calls	Medium (Redis + GPTCache)
Model routing (smart/cheap)	40–70% on routing tasks	Medium (LLM router)
Bedrock batch inference	About 50% lower price than on-demand for supported models; verify current pricing	Low (async jobs only)
Prompt compression (LLMLingua)	20–40% input token reduction	Medium
Reduce top-K retrieval	5–15% input token reduction	Low (tune k from 20 → 5)
Context window right-sizing	Variable	Low

AWS Bedrock batch pricing LLMLingua compression

Q5

What is LLM observability and how does it differ from traditional APM?

Traditional APM (Datadog, New Relic) tracks: latency, error rate, CPU, memory — all mechanical metrics. LLM observability adds a semantic layer: does the system say the right things?

Traces: Full chain trace — input query → retrieval results → LLM prompt sent → response. LangSmith captures all of this per-run.
Semantic metrics: Faithfulness, relevancy, hallucination rate. At scale these are usually estimated with an LLM evaluator (judge model), which should itself be calibrated against human labels.
Token-level cost attribution: Which component of your chain uses the most tokens? LangSmith shows cost breakdown per node.
User feedback loops: Thumbs up/down signals feed back into your golden test set for future eval runs.

OpenTelemetry for LLM tracing W&B LLMOps guide

Q6

How do you detect and handle model drift or knowledge cutoff issues in production?

Model drift in LLMs shows up as: answer quality degradation (faithfulness drops), output format changes (model update broke JSON parsing), or latency shifts (new model version).

1
Golden test set regression: Run a versioned, representative test set on every deployment. Gate release using task-specific, statistically meaningful tolerances calibrated to business risk.
2
Embedding and traffic drift: Track query/topic distributions, retrieval success, score distributions, and labelled outcomes by version. Investigate material changes relative to calibrated baselines rather than one universal cosine threshold.
3
Knowledge freshness: For time-sensitive domains, add a "freshness" metadata filter. Prefer recently indexed documents. Surface "as of [date]" disclaimers in the response.
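One simple drift signal is the shift between the centroid of baseline query embeddings and the centroid of a recent window. The sketch below assumes the embeddings are already available as lists of floats and takes the tolerance as a required, calibrated input rather than a universal constant.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_shift(baseline, current):
    """1 - cosine similarity between the two centroids (0 means no shift)."""
    a, b = centroid(baseline), centroid(current)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return 1.0 - dot / norm if norm else 1.0


def drift_check(baseline, current, tolerance):
    """Flag the window for investigation when the shift exceeds the calibrated tolerance."""
    shift = cosine_shift(baseline, current)
    return {"shift": shift, "investigate": shift > tolerance}
```

Centroid shift misses changes that leave the mean in place, so pair it with the retrieval-success and labelled-outcome tracking listed above.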

Evaluation and model optimization

Fine-tuning / PEFT (LoRA, QLoRA)

Efficient adaptation of LLaMA, Gemma — your EPAM work

Strong Signal 5 questions

Q1

Explain LoRA mathematically. What problem does it solve?

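Problem: Full fine-tuning of a 7B model updates all ~7 billion weights. With mixed-precision Adam that is roughly 16 bytes per parameter (weights, gradients, and optimizer states), or ~112GB of GPU memory before activations, which is infeasible on a single GPU.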

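LoRA insight: The weight update ΔW learned during fine-tuning is hypothesised to have low intrinsic rank. Instead of updating W directly, decompose the update: ΔW = A × B, where A ∈ ℝ^(d×r) and B ∈ ℝ^(r×k), with r ≪ min(d, k). Only A and B are trained; W stays frozen, and the update is scaled by alpha/r before it is added to W. (The LoRA paper writes the same product as BA, with the two letters swapped.)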

Frozen: pre-trained W, 4096 × 4096 (16.7M parameters)
+
Trainable: matrix A, 4096 × 16 (65K parameters) × matrix B, 16 × 4096 (65K parameters)
=
Adapted: W + ΔW, domain-specific behavior without replacing W

Rank r = 16 vs full rank 4096 · Only A and B receive gradients · ~99.2% fewer trainable parameters
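The parameter arithmetic behind the figures above, as a small check. The per-model total assumes a LLaMA-7B-style layout (32 layers, hidden size 4096) with adapters on two projection matrices per layer; those layout numbers are assumptions, not taken from the diagram.

```python
def lora_trainable(d: int, k: int, r: int) -> int:
    """Trainable parameters for one adapted d x k weight: A is d x r, B is r x k."""
    return d * r + r * k


d = k = 4096
r = 16

full = d * k                              # frozen weight matrix
adapter = lora_trainable(d, k, r)         # A plus B
reduction = 1 - adapter / full            # share of parameters no longer trained

# Assumed layout: 2 adapted matrices per layer, 32 layers.
per_model = adapter * 2 * 32

print(f"full matrix:  {full:,}")
print(f"adapter:      {adapter:,}")
print(f"reduction:    {reduction:.1%}")
print(f"whole model:  {per_model:,} trainable parameters")
```

Doubling r doubles the adapter size, so rank is a direct lever on trainable-parameter count and adapter storage.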

Python · PEFT LoRA config

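from peft import LoraConfig, get_peft_model, TaskType

# base_model: a causal LM loaded beforehand (for example with
# transformers.AutoModelForCausalLM.from_pretrained).
config = LoraConfig(
    r=16,                                 # rank of the update
    lora_alpha=32,                        # scaling = alpha / r = 2
    target_modules=["q_proj", "v_proj"],  # attention query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# For a LLaMA-7B-style model (32 layers, hidden size 4096) this reports
# 8,388,608 trainable parameters, about 0.12% of the total.
# (4,194,304 would correspond to r=8.)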

LoRA paper — Hu et al. (arXiv 2021) HuggingFace PEFT / LoRA guide

Follow-up questions: QLoRA vs LoRA · Fine-tune vs RAG decision

Q2

What is QLoRA? How does it enable fine-tuning on consumer hardware?

QLoRA = Quantised LoRA. It quantises the base model to 4-bit NF4 (NormalFloat4) while keeping LoRA adapter weights in BF16. A 13B model at 4-bit takes ~6.5GB instead of ~26GB.

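Method	Base model precision	Approx. memory (13B)	Quality vs full FT
Full fine-tuning	BF16	~208GB with mixed-precision Adam (~16 bytes/param), before activations	100% (baseline)
LoRA	BF16	~26GB frozen weights, plus small adapter state and activations	~98% (illustrative; task-dependent)
QLoRA	4-bit NF4	~6.5GB frozen weights, plus small adapter state and activations	~97% (illustrative; task-dependent)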

QLoRA also introduces double quantisation (quantise the quantisation constants) and paged optimisers (page optimizer states to CPU RAM on memory spikes). Together these let the QLoRA paper fine-tune a 65B model on a single 48GB GPU, so a single A100 80GB is comfortably sufficient.
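The memory figures above follow from simple arithmetic on bits per parameter. The sketch counts weights only (no activations, adapters, or optimizer state), and the double-quantisation overhead uses the block sizes described in the QLoRA paper (64 for the first level, 256 for the second).

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


# Overhead of quantisation constants, in bits per parameter.
plain_constants = 32 / 64                    # one FP32 constant per block of 64 weights
double_quant = 8 / 64 + 32 / (64 * 256)      # 8-bit constants plus a second-level FP32 constant

for name, size in [("13B", 13), ("65B", 65)]:
    print(
        f"{name}: BF16 {weight_gb(size, 16):.1f} GB | "
        f"NF4 {weight_gb(size, 4):.1f} GB | "
        f"NF4 + double-quantised constants {weight_gb(size, 4 + double_quant):.2f} GB"
    )
```

Double quantisation cuts the constant overhead from 0.5 to about 0.127 bits per parameter, which matters most at the 65B scale.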

QLoRA paper — Dettmers et al. (arXiv 2023)

Q3

When do you fine-tune vs use RAG? Give a decision framework.

Criteria	Use RAG	Use Fine-tuning
Knowledge updates	Frequent (daily/weekly)	Stable domain knowledge
Data availability	Large unstructured corpus	Representative, high-quality labelled examples; required volume is task-dependent
Output style/format	Generic format OK	Specific tone, JSON schema, brand voice
Explainability	Source-grounded citations	Opaque — no citation
Latency	Adds retrieval overhead	No retrieval step
Cost	Low upfront, per-query retrieval cost	GPU training cost upfront

Common production pattern: fine-tune for repeated behaviour and style; use RAG for current, source-grounded facts. Combine them only when the evaluated gain justifies added complexity. Fine-tuning support varies by provider and model, so verify current availability.

Anyscale: Fine-tuning is for form, not facts

Q4

What is DPO and how does it compare to RLHF for alignment?

RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model from human preference pairs, then uses PPO to optimise the LLM against that reward model. Complex, unstable, requires large infra.

DPO (Direct Preference Optimisation) bypasses the explicit reward-model-plus-PPO pipeline. Given (prompt, chosen_response, rejected_response) pairs, it directly updates the model to prefer the chosen response. It is operationally simpler than classic RLHF, but quality and stability still depend on data, objective, hyperparameters, and evaluation.

Method	Reward model	Stability	Data needed
RLHF / PPO	Required	Fragile	Large
DPO	Not needed	Usually more stable than PPO; still data- and hyperparameter-sensitive	Moderate
ORPO	Not needed	Evaluate for the task	Task-dependent
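To make "directly updates the model to prefer the chosen response" concrete, here is the per-pair DPO loss from the paper, written over summed token log-probabilities. The four log-probabilities are assumed to come from the policy and a frozen reference model; beta=0.1 is only a common starting value.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) pair.

    Arguments are log-probabilities of each full response under the policy (pi_*)
    and the frozen reference model (ref_*).
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written with log1p for numerical stability.
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: no preference learned yet, loss = log 2.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

There is no reward model in this objective; the reference model plays the role of the KL anchor that PPO-based RLHF enforces separately.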

DPO paper — Rafailov et al. (arXiv 2023)

Q5

How do you prepare fine-tuning data? What quality signals matter most?

1
Format: Use instruction-following format — {"instruction": "...", "input": "...", "output": "..."} (Alpaca format) or multi-turn chat format for conversational models.
2
Quality before scale: Start with a small, curated, representative dataset; measure the gain on a held-out eval set, inspect failures, then add targeted examples. Deduplicate and use calibrated human/model review rather than trusting one judge.
3
Diversity: Cover all intended use cases. Uneven distribution → model will overfit the majority class.
4
Contamination check: Ensure test set has no overlap with training data. Use exact and near-duplicate detection.
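A small sketch of the contamination check, combining exact matching after normalisation with near-duplicate detection. Word 3-gram Jaccard similarity and the 0.8 threshold are illustrative choices made here, not something the steps above prescribe; large corpora would typically use MinHash or embedding search instead of this pairwise scan.

```python
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide duplicates."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; short texts fall back to a single shingle."""
    words = normalise(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def contaminated(train: list, test: list, threshold: float = 0.8) -> list:
    """Return test items that exactly or nearly duplicate a training item."""
    train_exact = {normalise(t) for t in train}
    train_shingles = [shingles(t) for t in train]
    flagged = []
    for item in test:
        item_shingles = shingles(item)
        if normalise(item) in train_exact or any(
            jaccard(item_shingles, s) >= threshold for s in train_shingles
        ):
            flagged.append(item)
    return flagged
```

Flagged items should be removed from the test set (or the training set) before reporting any held-out metric.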

Prompt Engineering

CoT, few-shot, context engineering, structured outputs, adversarial robustness

Foundational 5 questions

Q1

Explain Chain-of-Thought prompting. What variants exist and when do you use each?

Chain-of-Thought (CoT) prompting encourages intermediate reasoning on multi-step tasks. It can improve performance, but production systems should not rely on exposing a model's private reasoning. Ask for a concise rationale, assumptions, calculations, citations, or verification artifacts that can be audited.

Variant	How to trigger	Best for
Zero-shot reasoning	Ask for careful analysis plus a concise, checkable answer	Baseline to evaluate, not a guaranteed win
Few-shot reasoning	Provide representative worked examples and expected artifacts	Domain-specific consistency
Tree of Thought (ToT)	Prompt to explore N paths, evaluate each, select best	Complex planning, search problems
ReAct	Thought/Action/Observation loop	Tool-using agents
Self-consistency	Sample multiple solutions and aggregate/verify	Cases where added cost is justified; never the only high-stakes control

Prompt · CoT system instruction

system = """You are a financial analyst. Before answering any question:
1. State the key facts given in the context.
2. Identify any missing information.
3. Show the calculations, citations, or checks needed to verify the result.
4. State the final answer and concise rationale clearly.

If you are uncertain, say so explicitly."""
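The self-consistency row in the table above amounts to sampling several answers and aggregating them. A minimal majority-vote sketch, assuming a hypothetical `sample` callable that returns one final answer per model call and an agreement floor you would tune:

```python
from collections import Counter


def self_consistent_answer(sample, n=5, min_agreement=0.6):
    """Sample n answers and keep the majority answer only if agreement is high enough."""
    answers = [sample() for _ in range(n)]
    top, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n
    return {
        "answer": top if agreement >= min_agreement else None,  # None means escalate or verify
        "agreement": agreement,
        "samples": answers,
    }
```

Cost scales linearly with n, and agreement is not correctness, which is why the table treats this as one control among several.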

Chain-of-Thought Prompting — Wei et al. (arXiv 2022) Tree of Thoughts — Yao et al. (arXiv 2023)

Q2

What is context engineering? How is it different from prompt engineering?

Prompt engineering = crafting the instruction/template text. It's about what you say.

Context engineering = dynamically constructing the entire model input — which retrieved documents to include, how to summarise prior conversation, which tool outputs to pass, how to structure memory, and where to place instructions. It is the broader discipline of managing a model's available context budget intelligently; the exact window varies by model and provider.

Primacy + recency: Models tend to use content at the very start and very end of the context more reliably than content in the middle. Place the most critical instructions in both positions.
Structured delimiters: Use XML tags (<context>, <instructions>, <examples>) to cleanly separate sections — reduces instruction-following errors.
Conversation compression: Summarise turns older than 10 with a cheap LLM. Keeps token cost bounded without losing thread.
Lost in the middle: Long contexts suffer from "lost in the middle" — relevant info in the middle of a 100K context gets underweighted. Front-load the most important chunks.
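The four practices above can be combined in one context builder. This is a sketch under stated assumptions: whitespace word count stands in for a real tokenizer, `summarise` is a hypothetical hook for the cheap summarisation model, and chunks arrive sorted most relevant first.

```python
def count_tokens(text: str) -> int:
    """Crude token proxy (whitespace words); swap in the model's tokenizer."""
    return len(text.split())


def build_context(instructions, chunks, history, question, budget, keep_turns=10, summarise=None):
    """Assemble a delimited prompt that fits a token budget."""
    summarise = summarise or (lambda turns: f"{len(turns)} earlier turns omitted")
    older, recent = history[:-keep_turns], history[-keep_turns:]
    # Conversation compression: older turns collapse into one summary line.
    convo = ([f"[summary] {summarise(older)}"] if older else []) + recent

    # Instructions are counted twice because they appear at both ends.
    fixed = 2 * count_tokens(instructions) + count_tokens(question) + sum(map(count_tokens, convo))
    remaining = budget - fixed

    kept = []
    for chunk in chunks:                      # most relevant first (front-loading)
        cost = count_tokens(chunk)
        if cost > remaining:
            break
        kept.append(chunk)
        remaining -= cost

    return "\n".join([
        f"<instructions>{instructions}</instructions>",
        "<context>" + "\n".join(kept) + "</context>",
        "<history>" + "\n".join(convo) + "</history>",
        f"<question>{question}</question>",
        f"<instructions>{instructions}</instructions>",   # repeated for recency
    ])
```

Dropping whole low-ranked chunks, rather than truncating mid-chunk, keeps every included passage citable.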

Lost in the Middle (arXiv 2023)

Q3

How do you get reliable structured output (JSON) from an LLM?

1
Provider-native schema-constrained output: Prefer the current API's Structured Outputs / JSON Schema or typed tool-calling feature when supported. This constrains shape more strongly than plain JSON mode.
2
Application validation: Parse into Pydantic or an equivalent schema, enforce semantic/business constraints, reject unknown fields, and retry only bounded, repairable failures.
3
Fallbacks: Plain JSON mode guarantees syntax, not business correctness. Grammar/constrained-decoding libraries can help with self-hosted models, but still validate after generation.

Python · structured output via Pydantic

from typing import Literal

from pydantic import BaseModel, Field
from langchain_anthropic import ChatAnthropic

class RiskAssessment(BaseModel):
    # Literal puts the allowed values in the schema instead of only in a description.
    risk_level: Literal["low", "medium", "high", "critical"]
    key_factors: list[str] = Field(description="Top 3 risk factors")
    recommendation: str

llm = ChatAnthropic(model="current_evaluated_model")
structured_llm = llm.with_structured_output(RiskAssessment)
result = structured_llm.invoke("Assess the credit risk for...")
# Then apply business validation and bounded error handling.

Structured Outputs guide

Q4

How do you defend against prompt injection attacks?

Prompt injection occurs when untrusted content influences model behavior against the application's intent. It cannot be solved reliably by one prompt, delimiter, classifier, or blocklist; design the surrounding system so a successful injection still cannot perform an unauthorized action.

Separate instructions from untrusted data: mark provenance and prevent retrieved/user/tool content from becoming durable policy or memory without review.
Deterministic authorization: the model may propose an action, but code validates identity, permission, arguments, data scope, and business rules.
Least privilege and approvals: use scoped credentials, egress allowlists, read-only defaults, idempotency, and human confirmation for high-impact actions.
Defense in depth: input/output classifiers and guardrails are useful signals but not sufficient on their own; add adversarial evals, logging, anomaly detection, kill switches, and incident response.
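A minimal sketch of deterministic authorization for a model-proposed tool call. The policy table, scope names, and `Principal` type are hypothetical; the point is that the decision is made by code from authenticated identity, never by the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """Authenticated caller, derived from the session, not from model output."""
    user_id: str
    tenant: str
    scopes: frozenset


# Hypothetical policy table: tool name -> (required scope, has side effects).
POLICY = {
    "search_docs": ("docs:read", False),
    "send_email": ("email:send", True),
}


def authorize(principal: Principal, tool: str, args: dict, approved: bool = False) -> str:
    """Decide whether a proposed tool call may run."""
    if tool not in POLICY:
        return "deny: unknown tool"
    scope, side_effect = POLICY[tool]
    if scope not in principal.scopes:
        return "deny: missing scope"
    if args.get("tenant") != principal.tenant:      # data-scope check
        return "deny: cross-tenant"
    if side_effect and not approved:                # high-impact actions need a human
        return "needs_approval"
    return "allow"
```

Even if injected content convinces the model to request `send_email` for another tenant, the call is rejected before any tool code runs.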

OpenAI agent safety OWASP Agentic 2026

Q5

What is the difference between system prompt, few-shot examples, and RAG context?

Component	Purpose	Changes per request?	Token budget
System prompt	Role, rules, output format, persona	No (static)	500–2000 tokens
Few-shot examples	Demonstrate expected format/style	Rarely	500–3000 tokens
RAG context	Current factual knowledge for this query	Yes (per query)	2000–8000 tokens
Conversation history	Prior turns context	Yes (grows)	500–10000 tokens
User message	The actual query	Yes	50–500 tokens

AWS GenAI Stack — 2026

Bedrock, AgentCore, SageMaker AI, OpenSearch, serverless and containers

Your Stack 10 questions

Q1

Design a production Hybrid RAG system on AWS from scratch.

Ingestion plane

Event-driven, asynchronous, and independently scalable

Store: S3 documents (raw files + versioned source)
Buffer: SQS events (retries + dead-letter queue)
Process: Lambda ingest (parse · chunk · metadata)
Represent: Evaluated embedding model (Bedrock model invocation)
Index: OpenSearch (kNN vectors + BM25 text)

Serving plane

Latency-sensitive request path with guardrails and citations

Edge: API Gateway (auth · quota · user query)
Orchestrate: Lambda query (parallel retrieval + policy)
Retrieve: OpenSearch (hybrid search + filters)
Generate: Evaluated Bedrock model (guardrails + grounded prompt)
Return: Cited response (streamed answer + sources)

Cross-cutting: CloudWatch traces & alarms · IAM least privilege · KMS encryption · Cost & quality evaluation

1
Ingestion: S3 PUT event → SQS → Lambda or container worker (parse, normalize, chunk, attach provenance) → an evaluated Bedrock embedding model → OpenSearch Serverless or the selected supported vector store. Pin model/version and plan re-indexing.
2
Query path: API Gateway → orchestrator → embed query → hybrid retrieval with tenant filters → evaluated reranker when it improves quality enough to justify latency/cost → top evidence into an evaluated Bedrock generation model → validate, cite, and stream.
3
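Security: IAM roles per service (least privilege). Bedrock Guardrails for PII redaction. Tenant isolation at retrieval time: document-level security on OpenSearch domains, or mandatory tenant filters enforced by the orchestrator, so users only see their org's documents. All queries logged to S3 for audit.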
4
Monitoring: CloudWatch custom metrics for faithfulness, latency P95, token cost. Alarm on any metric breaching threshold.
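Hybrid retrieval needs a rule for merging the BM25 and kNN rankings. The steps above do not name one; reciprocal rank fusion (RRF) is one common choice, sketched here client-side with the conventional constant k=60. OpenSearch can also fuse scores server-side, so treat this only as an illustration of the idea.

```python
def rrf(rankings, k=60, top_n=5):
    """Fuse ranked lists of document IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["d1", "d2", "d3"]     # lexical ranking (hypothetical IDs)
dense_hits = ["d3", "d1", "d4"]    # vector ranking (hypothetical IDs)
print(rrf([bm25_hits, dense_hits]))
```

RRF uses only ranks, so it avoids normalising BM25 scores against cosine similarities; apply tenant filters inside each retriever before fusing.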

AWS OpenSearch Serverless vector search AWS Bedrock overview

Q2

What is Amazon Bedrock Knowledge Bases? When do you use it vs custom RAG?

Bedrock Knowledge Bases is AWS's managed RAG capability for connecting supported data sources, parsing/chunking content, creating or using supported vector stores, retrieving evidence, reranking where supported, and integrating retrieval with generation. Exact sources, stores, models, and features vary by region and configuration.

Dimension	Bedrock KB (managed)	Custom RAG
Delivery effort	Usually faster; managed integrations and operations	More engineering and operational ownership
Customisation	Configurable within currently supported sources, stores, models, and workflows	Full control over every retrieval and serving stage
Hybrid search	Yes — verify region and data-store support	Yes (full control)
Re-ranking	Built-in (Amazon Rerank)	Any reranker
Cost at scale	Benchmark total managed-service cost	Can optimize deeply, but include engineering and operations

Choose Bedrock Knowledge Bases when its current feature set meets quality, control, governance, and cost requirements. Choose custom RAG when differentiated parsing, retrieval, data stores, orchestration, or operations create measurable value.

Q3

How do AWS Lambda and ECS differ for GenAI workloads? When do you use which?

Dimension	Lambda	ECS / Fargate
Execution model	Event/request-driven functions with platform limits	Long-running container services and tasks
Startup	Cold starts depend on runtime, package, networking, and configuration	Keep desired tasks warm; task startup still matters during scaling
Duration/resources	Bounded by current Lambda quotas	Broader task sizing and duration control; verify Fargate/EC2 limits
Streaming	Supported patterns, with integration-specific constraints	Native application HTTP/WebSocket patterns
Cost model	Usage-based function execution	Provisioned task resources while running

Rule of thumb: use Lambda for bounded, event-driven orchestration and transformation when its current quotas fit. Use ECS/Fargate or EKS for long-running agents, custom runtimes, connection-heavy streaming, background workers, or sustained workloads. Verify today's quotas and benchmark both.

Q4

What is Bedrock Guardrails and how do you implement content safety?

Bedrock Guardrails is a managed safety layer that intercepts LLM inputs and outputs. It operates independently of the model — wraps any Bedrock-hosted model.

Topic blocking: Define topics the model must not discuss (e.g., competitor products, investment advice). Guardrail intercepts and returns a pre-defined safe response.
PII redaction: Automatically detects and redacts (or masks) PII in both input and output — names, SSNs, credit card numbers.
Contextual grounding check: Compares the generated response against the supplied source context and the user query. Flags responses that introduce facts not present in the context (a hallucination signal) or that are not relevant to the query.
Word filters: Block specific words, phrases, or regex patterns in input/output.

AWS Bedrock Guardrails docs

Q5

How does AWS Step Functions integrate with multi-agent GenAI workflows?

Step Functions provides a managed, visual workflow orchestrator. For deterministic multi-step GenAI pipelines, it's a strong alternative to LangGraph — especially when you need:

Built-in retry logic with exponential backoff on each step.
Long-running workflows (days): a single Lambda invocation is capped at 15 minutes, while a Standard workflow can run for up to one year. Verify current quotas.
Human approval steps via waitForTaskToken — workflow pauses until a human approves, then resumes.
Parallel fan-out via Map state: process a batch of documents in parallel, each calling a Lambda. Inline Map concurrency is limited, so use Distributed Map for batches in the thousands.
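The retry, Map, and waitForTaskToken features above can be seen together in one Amazon States Language definition, built here as a Python dict. The Lambda function names, paths, and limits are hypothetical placeholders, and the Map state is shown in inline mode for brevity.

```python
import json

state_machine = {
    "Comment": "Sketch: summarise documents in parallel, then wait for human approval",
    "StartAt": "SummariseDocuments",
    "States": {
        "SummariseDocuments": {
            "Type": "Map",
            "ItemsPath": "$.documents",
            "MaxConcurrency": 10,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "Summarise",
                "States": {
                    "Summarise": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {"FunctionName": "summarise-doc", "Payload.$": "$"},
                        # Exponential backoff: 2s, 4s, 8s between attempts.
                        "Retry": [{
                            "ErrorEquals": ["States.TaskFailed"],
                            "IntervalSeconds": 2,
                            "MaxAttempts": 3,
                            "BackoffRate": 2.0,
                        }],
                        "End": True,
                    }
                },
            },
            "ResultPath": "$.summaries",
            "Next": "HumanApproval",
        },
        "HumanApproval": {
            "Type": "Task",
            # The workflow pauses here until the task token is returned.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "request-approval",
                "Payload": {"taskToken.$": "$$.Task.Token", "summaries.$": "$.summaries"},
            },
            "TimeoutSeconds": 86400,
            "End": True,
        },
    },
}

definition = json.dumps(state_machine, indent=2)
```

The approval Lambda only forwards the token to a reviewer; the reviewer's decision resumes the execution through the SendTaskSuccess or SendTaskFailure API.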

Q6

Amazon Bedrock vs SageMaker AI endpoints vs EKS/EC2 for GenAI inference.

Option	Choose When	You Own	Interview Trade-off
Bedrock	Managed access to supported foundation/custom models; fastest product delivery	Application, prompts, evals, governance configuration	Less runtime/inference-engine control
SageMaker AI endpoint	Custom model serving, MLOps integration, managed endpoints	Model packaging/configuration and endpoint operations	More control and more operational work
HyperPod / EKS / EC2	Specialized training/serving, custom kernels/runtime, maximum control	Capacity, orchestration, scaling, resilience, cost utilization	Highest flexibility and operational burden

AWS inference architecture guidance

Q7

What is Amazon Bedrock AgentCore, and when would you use it?

2026 positioning: AgentCore is the AWS platform layer for securely deploying and operating agents built with different frameworks and models. Discuss runtime isolation, enterprise tool/data connectivity, identity/access controls, tracing, debugging, and evaluation rather than treating an agent as only an LLM loop.

Use It When

You need managed production agent runtime and governance.
Agents connect to enterprise systems with controlled authentication.
Tracing, debugging, evaluation, and operational consistency matter.

Custom Runtime When

Runtime behavior is highly differentiated.
You need unusual isolation, portability, or multi-cloud control.
Platform feature/region constraints do not meet requirements.

Amazon Bedrock AgentCore

Q8

Design AWS security and networking for a regulated enterprise GenAI platform.

Identity & Data

Separate accounts/environments; least-privilege IAM roles and scoped agent/tool permissions.
KMS encryption, Secrets Manager, tenant-level authorization, data classification and retention.
Guardrails/DLP, immutable audit trail, prompt/response logging policy.

Network & Operations

Private subnets and VPC endpoints/PrivateLink where supported; restrict egress.
WAF/API Gateway throttling, CloudTrail, Config, Security Hub, CloudWatch/X-Ray.
Threat-model model/vendor calls, tools, RAG data, agents, and human approvals.

AWS GenAI Security Reference Architecture

Q9

How do 2026 Bedrock Guardrails and automated reasoning checks fit into safety?

Guardrails can be applied consistently around model interactions to filter harmful content, protect sensitive information, assess grounding, and enforce use-case policies. Automated Reasoning checks can detect logical issues and unstated assumptions against formalized policies, but return findings in detect mode; your application decides whether to serve, revise, clarify, or escalate.

Do not oversell: guardrails are one layer. Authorization, tool restrictions, output validation, monitoring, red-teaming, and human approval remain necessary.

Automated Reasoning checks

Q10

How do you apply the AWS Well-Architected GenAI Lens?

Operational excellence: Version prompts/models/data, automate evals, trace and debug.

Security: Identity, data protection, injection/tool threats, approvals.

Reliability: Quotas, multi-AZ/regional strategy, fallback, idempotency, recovery.

Performance & cost: Model routing, context budget, caching, batching, capacity/utilization.

Also discuss sustainability and business impact: avoid unnecessary large-model calls, measure value, and retire low-value workloads.

AWS Generative AI Lens

GCP GenAI Stack — 2026

Vertex AI, Agent Engine, Gemini Enterprise, RAG, data, security and operations

Cloud Depth 10 questions

Q1

Design a production enterprise RAG platform on GCP.

I
Ingest: Cloud Storage events → Pub/Sub/Eventarc → Cloud Run or Dataflow workers; use Document AI where layout/OCR matters. Make jobs idempotent and dead-letter failures.
R
Retrieve: Choose Vertex AI RAG Engine for managed orchestration, Vector Search for high-scale ANN, or AlloyDB/pgvector when relational joins and transactions dominate. Store tenant and ACL metadata with every chunk.
G
Generate: Call an evaluated Vertex AI model, enforce grounded citations and structured output, then validate before serving through Cloud Run/API Gateway.
O
Operate: IAM service accounts, VPC Service Controls/Private Service Connect where applicable, CMEK, Cloud Logging/Monitoring/Trace, offline golden sets and sampled online evaluation.

Cross-question answers: Use Agent Search when managed enterprise discovery, connectors and relevance controls are the product; use custom RAG when pipeline control and application-specific retrieval matter. Propagate deletes with tombstone events across source, chunks, index and caches, then audit completion. Enforce ACLs as retrieval-time pre-filters derived from authenticated identity, never only after retrieval. A dimension/model change requires a versioned new index, dual-write or backfill, evaluation, then an alias cutover.

Google Cloud RAG architectures

Q2

Vertex AI Agent Engine vs Gemini Enterprise Agent Platform vs a custom agent runtime.

Choice	Best Fit	Trade-off
Agent Engine	Managed deployment and operations for custom agents; sessions, memory, code execution and governance	Platform constraints and regional feature checks
Gemini Enterprise Agent Platform	Enterprise discovery, connected agents and governed employee workflows	Less freedom than a fully custom product runtime
Cloud Run/GKE custom	Maximum framework, portability and runtime control	You own isolation, scaling, tracing, state and upgrades

Interview answer: start with compliance, integration, latency, portability and operating-model requirements; then choose. Do not select an agent platform merely because the workflow calls an LLM.

Gemini Enterprise Agent Platform Vertex AI release notes

Q3

RAG Engine vs Vector Search vs AlloyDB/pgvector vs Agent Search — how do you choose?

Service	Choose When	Probe
RAG Engine	You want managed ingestion/retrieval integration with Vertex AI	Supported sources, regions, quotas and customization
Vector Search	Large-scale, low-latency ANN is central	Index update strategy, filtering and cost
AlloyDB/pgvector	Vectors must live beside relational data and transactions	Scale ceiling and query plan behavior
Agent Search	Enterprise search/discovery is the product	Connectors, relevance controls and permissions

Cross-question: Prove the choice with corpus size, update rate, filter selectivity, recall@k, P95 latency, data residency, team skills and total cost.

Q4

Cloud Run vs Cloud Run functions vs GKE for GenAI workloads.

Runtime	Use It For	Watch-outs
Cloud Run	Stateless APIs, streaming gateways, workers and portable containers	Cold starts, request limits, concurrency and downstream quotas
Cloud Run functions	Small event handlers and glue logic	Keep complex orchestration out of tiny handlers
GKE	Custom serving, GPUs, sidecars, service mesh and deep scheduling control	Cluster operations and utilization

Separate the latency-sensitive serving path from asynchronous ingestion/evaluation. Scale each independently and put Pub/Sub between bursty producers and workers.

Q5

Vertex AI managed endpoints and Model Garden vs custom model serving.

Managed endpoint: prefer when managed autoscaling, model lifecycle, monitoring and IAM integration outweigh runtime customization.
Cloud Run/GKE: prefer for custom inference engines, unusual dependencies, portability or tight hardware control.
Decision evidence: quality benchmark, tokens/sec, first-token latency, concurrency, accelerator utilization, availability, region, safety and cost per successful task.

Never answer with only a model name. In 2026, explain the evaluated capability tier and the routing/fallback policy.

Q6

Design security and networking for a regulated GCP GenAI platform.

Prevent

Dedicated service accounts and least-privilege IAM.
VPC Service Controls, Private Service Connect/private access and restricted egress where supported.
CMEK, Secret Manager, DLP, tenant authorization and tool allowlists.

Detect & Respond

Cloud Audit Logs, Security Command Center, Cloud Logging and alerting.
Trace prompts, retrieval, tools and policy decisions with redaction.
Human approval for high-impact actions; incident kill switch and credential rotation.

Threat model: prompt injection, data exfiltration, poisoned retrieval, excessive agency, insecure tool arguments, cross-tenant leakage and model/vendor outages.

Q7

How would you use Pub/Sub, Dataflow and BigQuery for GenAI data and evaluation pipelines?

1
Publish immutable ingestion/evaluation events with schema version, tenant, correlation ID and idempotency key.
2
Use Dataflow when transformations are high-volume, streaming, windowed or need replay; use Cloud Run workers for simpler task queues.
3
Land sanitized traces and evaluation facts in BigQuery; partition by event date, cluster by tenant/model/version, and enforce retention/access policies.
4
Monitor backlog age, dead letters, duplicate rate, processing latency, quality drift and evaluation cost.

Q8

BigQuery vs AlloyDB vs Spanner for GenAI application data.

Store	Primary Role	Example
BigQuery	Analytical warehouse	Trace analytics, offline evals, cohorts and cost reporting
AlloyDB	Relational operational data with PostgreSQL compatibility	Conversation metadata, app transactions and vector-relational queries
Spanner	Globally consistent, horizontally scalable operational data	Multi-region agent state requiring strong consistency

State your access patterns, consistency needs, region topology, transaction boundaries, retention and expected growth before naming a database.

Q9

Describe CI/CD, observability and evaluation for a GCP GenAI service.

Build/deploy: Cloud Build → Artifact Registry → Cloud Deploy or controlled infrastructure pipeline; promote immutable application, prompt, model-policy and dataset versions.
Pre-production gates: unit/contract/security tests plus golden-set quality, latency and cost thresholds; canary before full rollout.
Runtime: Cloud Logging/Monitoring/Trace with correlation IDs across retrieval, model and tools. Alert on SLOs, quotas, safety, groundedness, tool failures and spend.
Rollback: independently roll back prompt, model route, index alias and application version.

Q10

Map an AWS GenAI architecture to GCP without pretending the services are identical.

Concern	AWS Example	GCP Example
Managed foundation models	Amazon Bedrock	Vertex AI
Managed agent operations	Bedrock AgentCore	Vertex AI Agent Engine / Gemini Enterprise Agent Platform
Serverless containers/functions	ECS/Fargate, Lambda	Cloud Run, Cloud Run functions
Messaging/streaming	SQS/SNS/Kinesis	Pub/Sub
Vector/search	OpenSearch, Bedrock Knowledge Bases	Vector Search, RAG Engine, Agent Search
Warehouse/analytics	Redshift/Athena	BigQuery

Senior answer: map capabilities, then re-evaluate IAM, networking, regional availability, quotas, reliability, operating skills and cost. A service-name translation is not an architecture migration.

Vector Databases

HNSW, ANN algorithms, Pinecone vs OpenSearch, metadata filtering

Core Topic 5 questions

Q1

How does HNSW work? Why is it the dominant ANN algorithm?

HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph. Higher layers have few nodes with long-range "highway" connections; lower layers get progressively denser with short-range connections — mimicking how road networks work (motorways → main roads → local streets).

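Search: Starts at an entry point in the top layer, greedily walks toward the query vector, then descends layer by layer and widens the candidate list (ef_search) at the bottom layer for precision. Search cost grows roughly logarithmically with n in practice, with high recall when the parameters are tuned.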
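The greedy walk is the core step of the search described above. Below is a simplified sketch of that step on a single layer with a beam width of 1. Real HNSW repeats it per layer and keeps a candidate list of size ef_search at the bottom layer. The toy graph, vectors and function name are hypothetical.

```python
import math


def greedy_walk(
    graph: dict[int, list[int]],
    vectors: dict[int, tuple[float, ...]],
    entry: int,
    query: tuple[float, ...],
) -> int:
    """Move to a closer neighbour until no neighbour improves the distance."""
    current = entry
    best = math.dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for neighbour in graph[current]:
            d = math.dist(vectors[neighbour], query)
            if d < best:
                best, current, improved = d, neighbour, True
    return current


# Five points on a line, each linked to its immediate neighbours.
vectors = {i: (float(i),) for i in range(5)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nearest = greedy_walk(graph, vectors, entry=0, query=(3.2,))  # node 3
```

A greedy walk can stop in a local minimum, which is why the bottom layer uses a wider candidate list rather than a single current node.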

Algorithm	Speed	Recall	Memory	Best for
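HNSW	Low query latency	High (tunable)	High (vectors + graph)	Low-latency production
IVF-Flat	Fast	Medium (depends on probes)	Moderate (raw vectors + centroids)	Large-scale, budget memory
IVF-PQ	Fast	Medium to lower (lossy codes)	Very low	Billion-scale (with compression)
Flat (exact)	Slow O(n)	Perfect	Raw vectors only, no index overhead	Eval benchmarks, <100K docs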

Key HNSW parameters: ef_construction (build quality, more = slower build but better graph), M (connections per node, more = better recall but more memory), ef_search (search time recall trade-off).

HNSW paper — Malkov & Yashunin (arXiv 2016)

Q2

How do metadata filters work in vector search? What are the performance tradeoffs?

Metadata filtering restricts ANN search to a subset of vectors matching a filter condition (e.g., doc_type == "earnings_report" AND date >= 2026-01-01).

Three approaches with different tradeoffs:

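Pre-filtering (filter then search): Narrow to matching documents first, then search only that subset. Every result satisfies the filter, but the graph index may not help: engines often fall back to exact (brute-force) search over the subset, which is cheap when the subset is small and slow when it is large.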
Post-filtering (search then filter): Run ANN over full index, then discard non-matching results. Fast but can return fewer than k results if many are filtered out.
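ACORN / filtered HNSW: Filter-aware graph traversal. Metadata is stored alongside vectors and the filter is applied while walking the graph, so the search stays on the index and still returns k matching results. Weaviate documents an ACORN-based strategy, and other engines such as Pinecone and Qdrant apply filters during search in their own ways; confirm the behaviour for your engine and version.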

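Index the fields you filter on most often. In OpenSearch, map metadata fields as keyword type (not text) for exact-match filtering. Choose the strategy by filter selectivity rather than field cardinality: pre-filter when the filter keeps only a small fraction of the corpus (e.g., a single tenant_id in a many-tenant index), and post-filter with an oversampled k when most vectors pass the filter.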
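The difference between pre- and post-filtering can be reproduced with exact search standing in for ANN. This is an illustrative sketch only: the item layout, the tenant field and the function names are hypothetical, and a real engine would use its index rather than sorting every item.

```python
import math


def top_k(query, items, k):
    """Exact nearest neighbours by Euclidean distance (stand-in for ANN)."""
    return sorted(items, key=lambda it: math.dist(it["vec"], query))[:k]


def pre_filter_search(query, items, k, predicate):
    # Filter first, then search only the matching subset.
    return top_k(query, [it for it in items if predicate(it)], k)


def post_filter_search(query, items, k, predicate):
    # Search everything, then drop non-matching hits; may return fewer than k.
    return [it for it in top_k(query, items, k) if predicate(it)]


items = [
    {"id": "a", "vec": (0.0, 0.0), "tenant": "t1"},
    {"id": "b", "vec": (0.1, 0.0), "tenant": "t2"},
    {"id": "c", "vec": (0.2, 0.0), "tenant": "t2"},
    {"id": "d", "vec": (5.0, 5.0), "tenant": "t1"},
]


def is_t1(item):
    return item["tenant"] == "t1"


pre = [it["id"] for it in pre_filter_search((0.0, 0.0), items, 2, is_t1)]
post = [it["id"] for it in post_filter_search((0.0, 0.0), items, 2, is_t1)]
```

Here pre-filtering returns two results for tenant t1, while post-filtering returns only one because the second global hit belongs to another tenant.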

Q3

Pinecone vs OpenSearch vs Qdrant vs pgvector — how do you choose?

DB	Best for	Hybrid search	Self-host?	Notes
OpenSearch	Search-heavy workloads, AWS integration, BM25+vector	Yes	Yes	Strong search feature set; benchmark relevance and operations
Pinecone	Managed vector search with minimal operations	Yes (sparse+dense)	No	Benchmark cost, filters, tenancy, and region fit
Qdrant	Vector-first workloads and payload filtering	Yes	Yes	Managed and self-hosted choices; benchmark your workload
Weaviate	Multi-tenancy, schema-rich	Yes (BM25+vector)	Yes	Strong for SaaS multi-tenant products
pgvector	Postgres-native joins, transactions, and vector search	Combine with Postgres FTS	Yes	Can scale significantly with tuning/partitioning; benchmark recall, latency, and ops
Chroma	Local, single-node, or distributed/Cloud vector workloads	Product-dependent	Yes	Choose deployment mode by scale and operational needs

Decision rule: benchmark your corpus, filters, update rate, QPS, recall target, p95 latency, tenancy, joins/transactions, regions, recovery, team skill, and total cost. There is no universally fastest or cheapest vector database.

pgvector Chroma architecture OpenSearch hybrid search

Q4

What is quantisation in vector search? How does Product Quantisation (PQ) work?

Storing a 1536-dim float32 embedding takes 1536 × 4 = 6,144 bytes (about 6KB). For 10M vectors that is roughly 61GB before any index overhead. Quantisation compresses vectors to reduce memory at the cost of some recall.

Product Quantisation (PQ) splits the 1536-dim vector into M sub-vectors (e.g., M=64, each 24-dim). Each sub-vector is mapped to its nearest centroid in a codebook of 256 centroids. Result: 64 bytes instead of 6144 bytes — 96× compression. Distance computation uses a precomputed lookup table, keeping it fast.
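The arithmetic above can be checked in a few lines. This sketch only computes sizes, it does not train codebooks. Rounding each centroid index up to whole bytes and the 10M-vector corpus are assumptions taken from the example in the text.

```python
def raw_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Size of one uncompressed float32 vector."""
    return dim * bytes_per_value


def pq_code_bytes(num_subvectors: int, centroids_per_codebook: int = 256) -> int:
    """Size of one PQ code: one centroid index per sub-vector."""
    bits = (centroids_per_codebook - 1).bit_length()  # 256 centroids -> 8 bits
    return num_subvectors * ((bits + 7) // 8)


dim, m = 1536, 64
raw = raw_bytes(dim)                      # 6144 bytes
code = pq_code_bytes(m)                   # 64 bytes
ratio = raw / code                        # 96.0
corpus_raw_gb = 10_000_000 * raw / 1e9    # 61.44 GB uncompressed
corpus_pq_gb = 10_000_000 * code / 1e9    # 0.64 GB of codes
codebook_bytes = 256 * dim * 4            # shared codebooks, about 1.5 MB in total
```

The codebooks are shared across all vectors, so their cost is negligible at this scale.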

Binary quantisation: Convert each float to a single bit (positive=1, negative=0). 32× compression. Works surprisingly well for high-dimensional embeddings with cosine similarity.
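A minimal sketch of the sign-bit encoding described above, with Hamming distance as the comparison. Packing the bits into a Python int is an illustrative choice. Production systems typically use packed byte arrays and often rescore the top candidates with the full-precision vectors.

```python
def binarize(vec: list[float]) -> int:
    """Pack sign bits into an int: bit i is 1 when vec[i] is positive."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two codes differ."""
    return bin(a ^ b).count("1")


a = binarize([0.5, -0.2, 0.1])   # bits 0 and 2 set
b = binarize([0.4, 0.3, -0.9])   # bits 0 and 1 set
distance = hamming(a, b)         # the codes differ in bits 1 and 2
```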

Q5

How do you handle vector index updates when documents change frequently?

Upsert by document ID: Every document has a stable ID. Reprocess and upsert when the document changes. All major vector DBs support upsert (Pinecone, OpenSearch, Qdrant).
Chunk deletion on update: When a document is updated, delete all chunks with parent_doc_id == document.id, then re-ingest. Track chunk-to-document mapping in a separate metadata store (DynamoDB/RDS).
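Event-driven ingestion: S3 Event Notifications (or DynamoDB Streams forwarded through EventBridge Pipes or a small Lambda) → SQS → Lambda ingestion pipeline. Near-real-time index updates without polling. S3 Object Lambda is not an event source; it transforms objects on read.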
Versioned indexes: For zero-downtime updates on large corpora, build the new index alongside the old one, then do an atomic alias swap (OpenSearch index aliases).
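The delete-then-reingest step can be sketched with an in-memory stand-in for the vector index and the chunk-to-document mapping. The class and the `doc_id#ordinal` chunk ID format are hypothetical. A real implementation would issue delete and upsert calls to the vector store and keep the mapping in a durable table.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Stand-in for a vector index plus a chunk-to-document mapping."""

    chunks: dict[str, dict] = field(default_factory=dict)      # chunk_id -> record
    by_doc: dict[str, set[str]] = field(default_factory=dict)  # doc_id -> chunk IDs

    def replace_document(self, doc_id: str, new_chunks: list[str]) -> None:
        # Delete every chunk of the old version so a shorter document leaves no orphans.
        for chunk_id in self.by_doc.pop(doc_id, set()):
            del self.chunks[chunk_id]
        ids = set()
        for ordinal, text in enumerate(new_chunks):
            chunk_id = f"{doc_id}#{ordinal}"  # stable, deterministic chunk ID
            self.chunks[chunk_id] = {"parent_doc_id": doc_id, "text": text}
            ids.add(chunk_id)
        self.by_doc[doc_id] = ids


index = InMemoryIndex()
index.replace_document("doc-1", ["intro", "body", "appendix"])
index.replace_document("doc-1", ["rewritten"])  # document shrank to one chunk
```

Upserting by chunk ID alone would have left the old second and third chunks in the index, which is why the explicit delete matters.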

HLD System Design (AI Systems)

End-to-end architecture, capacity, reliability, security, evaluation and cost

Must-ace 9 designs

Framework for every AI system design answer: Ingestion → Data model → Embedding/Retrieval → Generation → Evaluation → Monitoring → Security. Always mention latency targets, failure modes, and cost tradeoffs.

D1

Design an enterprise document Q&A for 10M documents (finance use case). Handle 500 QPS.

I
Ingestion: S3 → SQS (decouple) → Lambda fleet (parse PDF/DOCX/XLSX, hierarchical chunk: parent 1024t / child 256t, 50t overlap) → an evaluated Bedrock embedding model → OpenSearch Serverless (kNN + BM25 fields). Metadata: doc_type, entity_id, created_at, source. Row-level security by entity_id.
D
Data model: OpenSearch index with both embedding: knn_vector(dimension) and content: text (for BM25). Store the evaluated embedding dimension and model version in the index schema. Separate metadata fields as keyword type. DynamoDB tracks chunk-to-document mapping for updates/deletions.
E
Retrieval: Hybrid BM25 + kNN with evaluated fusion → an evaluated reranker when useful → evidence budget sent to the generation model. Add a version-aware semantic cache only after measuring correctness, staleness, latency, and hit rate.
G
Generation: An evaluated Bedrock reasoning/generation model. System prompt enforces: cite source for every claim, flag if answer not in context. Bedrock Guardrails for PII + topic blocking. Response streaming via Lambda Response Streaming.
Ev
Evaluation: versioned offline set plus risk-based production sampling. Calibrate retrieval, groundedness, task success, safety, latency, and cost gates against human review and business impact; do not present generic RAGAS thresholds as universal.
M
Monitoring: CloudWatch: latency P95 < 1.5s, error rate < 0.5%, cost per query. LangSmith: full trace per request. X-Ray for distributed tracing across Lambda chains.
S
Security: IAM roles per service. Bedrock Guardrails PII redaction. Entity-level index partitioning (user A cannot retrieve user B's documents). All queries audit-logged to S3.

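Tradeoffs to mention: Re-ranking adds latency (on the order of 100ms, depending on model and candidate count) but usually improves the precision and ordering of the top results. It cannot recover documents that first-stage retrieval missed, so measure whether the gain justifies the cost on finance queries. A semantic cache saves cost but can serve stale answers, so set TTLs by how quickly the underlying data changes (for example, about 1 hour for fast-moving financial data and 24 hours for slower-changing regulatory documents) and invalidate on index version changes.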
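The hybrid fusion step in the retrieval design can be sketched with reciprocal rank fusion (RRF). RRF is one common fusion method and is an assumption here, since the design only calls for an evaluated fusion. The constant k=60 is the conventional default, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d1", "d2", "d3"]   # lexical ranking
knn_hits = ["d3", "d1", "d4"]    # vector ranking
fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
```

RRF uses only ranks, so it avoids normalising BM25 and cosine scores onto a common scale. Documents found by both retrievers rise to the top.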

D2

Design a multi-agent customer support bot with escalation to human agents.

1
Architecture — Supervisor pattern in LangGraph: IntentClassifier node (routes query type) → either KnowledgeAgent (RAG over support docs), ActionAgent (CRM/order system APIs), or EscalationAgent (human handoff).
2
Intent classification: An evaluated low-latency model tier classifies intent: information_request | account_action | complaint | escalation. Select it against an intent test set and explicit latency/cost SLO rather than relying on a model-name assumption.
3
Escalation triggers: (a) User explicitly asks for human, (b) agent confidence < threshold for 2 turns, (c) complaint category detected, (d) max turns (10) reached. LangGraph uses interrupt_before at escalation node.
4
Context handoff: Full conversation summary + entity extraction (customer ID, issue category, sentiment score) passed to human agent dashboard via Amazon Connect + DynamoDB.
5
Memory: Short-term in LangGraph state. Long-term in DynamoDB keyed by customer_id (preferences, past issues) — injected as context on next session.
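The escalation triggers in step 3 can be kept deterministic and testable outside the graph. A minimal sketch follows. The TurnState fields, the 0.6 confidence threshold and the returned reason strings are hypothetical, and the function is independent of LangGraph.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnState:
    turn_count: int
    intent: str
    asked_for_human: bool
    recent_confidences: list[float]


def should_escalate(
    state: TurnState,
    confidence_threshold: float = 0.6,  # assumed value; calibrate on real traffic
    low_conf_turns: int = 2,
    max_turns: int = 10,
) -> Optional[str]:
    """Return the escalation reason, or None to keep the bot in control."""
    if state.asked_for_human:
        return "user_request"
    recent = state.recent_confidences[-low_conf_turns:]
    if len(recent) == low_conf_turns and all(c < confidence_threshold for c in recent):
        return "low_confidence"
    if state.intent == "complaint":
        return "complaint"
    if state.turn_count >= max_turns:
        return "max_turns"
    return None
```

Returning a reason rather than a boolean lets the handoff payload tell the human agent why the conversation was escalated.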

D3

Design an LLM evaluation pipeline at scale (continuous eval for production).

1
Data collection: Risk-based, privacy-reviewed production sampling to an asynchronous evaluation queue. Oversample rare, high-impact, low-confidence, and newly changed slices.
2
Judge layer: Use deterministic checks where possible and one or more evaluated judge models for rubric-based criteria. Calibrate judges against blinded human labels, monitor agreement/bias, and version judge prompts/models.
3
Aggregation: Daily Lambda aggregates scores → writes to RDS → visualised in Grafana dashboard.
4
CI gate: Every code, prompt, model, retriever, or policy change triggers a representative regression suite. Fail or require review when calibrated quality/safety tolerances or SLOs are breached.
5
Feedback loop: Low-scoring samples surfaced to human reviewers → added to golden test set → improves future evals.

D4

Design a multi-region, multi-provider GenAI gateway for 2,000 QPS.

Requirements: P95 first token < 700ms, 99.95% availability, tenant quotas, residency, streaming, provider failover.

Core path: Global edge → auth/quota → policy router → provider adapters → safety/output validation → stream response.

State: Tenant policy/config strongly consistent; request state regional and ephemeral; usage events asynchronously aggregated.

Reliability: Timeout budgets, retry only safe failures, circuit breakers, regional/provider health, degraded fallback.

C
Capacity: derive concurrent streams from QPS × average duration; size connection pools and rate limits against provider quotas, not only CPU.
S
Security: per-tenant keys/policies, secret isolation, egress allowlists, audit trail and prompt/response retention controls.
X
Cross-question answers: prevent duplicate billable calls with an invocation ledger, idempotency key and explicit handling of ambiguous timeouts. Route fallback only to providers that satisfy the required schema/tool capability contract, then validate output. Test residency with policy-as-code, regional routing tests, blocked cross-region egress, and audit-log evidence.
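The capacity rule above is Little's law: average concurrency equals arrival rate times average time in the system. A worked sketch for the 2,000 QPS target follows, where the 4-second average stream duration and 1,500 tokens per request are assumed figures for illustration only.

```python
def concurrent_streams(qps: float, avg_stream_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate x average duration."""
    return qps * avg_stream_seconds


def tokens_per_minute(qps: float, avg_tokens_per_request: float) -> float:
    """Throughput to compare against provider tokens-per-minute quotas."""
    return qps * 60 * avg_tokens_per_request


streams = concurrent_streams(2000, 4.0)   # 8000.0 open streams on average
tpm = tokens_per_minute(2000, 1500)       # 180,000,000 tokens per minute
```

These are averages. Size connection pools and quotas for peak traffic and tail durations, not the mean.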

D5

Design an enterprise agent platform with tools, memory and human approval.

1
Control plane: agent definitions, prompt/model/tool versions, policy, evaluations, rollout and audit.
2
Data plane: isolated runtime executes a bounded state machine; tool gateway authenticates, authorizes, validates arguments and records effects.
3
Memory: separate immutable event history, short-term working state and curated long-term memory with consent, expiry and deletion.
4
Safety: risk-tier tools; read-only may auto-run, money/data mutations require deterministic checks and human approval. Never trust model text as authorization.
5
Operate: trace every event, tool call, policy decision and approval without depending on the model's own reasoning text; cap steps/tokens/time and provide kill switches.

Cross-question answers: resume from a durable checkpoint after the last committed state transition and reacquire a lease. Prevent replayed tool effects with invocation IDs, idempotency keys and stored effect receipts. Evaluate nondeterministic agents across repeated runs using terminal-task success, policy violations, tool correctness, path efficiency, latency and cost rather than requiring one exact trajectory.
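Replay protection with invocation IDs and effect receipts can be sketched as follows. The EffectLedger class and the refund tool are hypothetical, and the in-memory dict stands in for a durable store.

```python
class EffectLedger:
    """In-memory stand-in for a durable invocation ledger."""

    def __init__(self) -> None:
        self._receipts: dict[str, dict] = {}

    def execute_once(self, invocation_id: str, tool, args: dict) -> dict:
        # A replay after a crash or resume returns the stored receipt
        # instead of running the side effect again.
        if invocation_id in self._receipts:
            return self._receipts[invocation_id]
        result = tool(**args)
        receipt = {"invocation_id": invocation_id, "result": result}
        self._receipts[invocation_id] = receipt
        return receipt


calls = []


def refund(order_id: str, amount: int) -> str:
    calls.append((order_id, amount))
    return "refunded"


ledger = EffectLedger()
first = ledger.execute_once("inv-1", refund, {"order_id": "o1", "amount": 5})
second = ledger.execute_once("inv-1", refund, {"order_id": "o1", "amount": 5})
```

This sketch does not cover a crash between running the tool and storing the receipt. A production ledger would record an in-progress claim first and pass the idempotency key to the downstream system as well.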

D6

Design a multimodal document-intelligence platform for invoices and contracts.

Pipeline

Upload → malware scan → immutable object store.
OCR/layout/table extraction → normalized document graph.
Rules + model extraction → schema validation → confidence routing.
Human review for low-confidence/high-value fields.

Production Concerns

Idempotent jobs, page-level retries and lineage to source coordinates.
Per-field precision/recall, review rate and business-value metrics.
PII isolation, retention, tenant keys and deletion propagation.
Versioned extractors and replayable raw documents.

API/data model: POST /documents, GET /jobs/{id}, webhook completion; Document, Page, Element, Extraction, Evidence, ReviewDecision and ModelVersion.

D7

Design a real-time voice assistant with interruption and tool calling.

P
Path: WebRTC/media gateway → streaming ASR → dialogue/agent runtime → tools/RAG → streaming TTS. Keep regional session state and a durable event summary.
L
Latency: budget each stage; stream partial transcripts and speech; use VAD and speculative work carefully. Measure time-to-first-audio and interruption-stop latency.
B
Barge-in: cancel TTS and downstream work, advance the turn epoch, and reject late results from the old epoch.
R
Reliability: reconnect token, session checkpoint, provider fallback, bounded silence/timeouts and graceful transfer to human.

Cross-question: How do you prevent a partially heard confirmation from triggering a payment? Require explicit deterministic confirmation before high-impact tools.
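The turn-epoch rule from the barge-in step can be sketched as a small tracker that tags work with the epoch in which it started and drops anything that arrives late. The class and method names are hypothetical.

```python
class TurnTracker:
    def __init__(self) -> None:
        self.epoch = 0
        self.spoken: list[str] = []

    def start_work(self) -> int:
        """Tag new ASR/LLM/TTS work with the epoch current when it started."""
        return self.epoch

    def barge_in(self) -> None:
        """User interrupted: work tagged with an older epoch is now stale."""
        self.epoch += 1

    def deliver(self, work_epoch: int, text: str) -> bool:
        if work_epoch != self.epoch:
            return False  # late result from a cancelled turn; drop it
        self.spoken.append(text)
        return True


tracker = TurnTracker()
old_turn = tracker.start_work()
tracker.barge_in()                  # user starts speaking over the assistant
new_turn = tracker.start_work()
late = tracker.deliver(old_turn, "Your payment of ...")
fresh = tracker.deliver(new_turn, "Sure, what would you like to change?")
```

The same epoch check belongs in front of tool execution, so a tool call planned in a cancelled turn cannot run after the interruption.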

D8

Design an intelligent model-routing and semantic-cache platform.

Router inputs: task type, complexity, modality, context size, tenant policy, region, safety tier, live health, measured quality, latency and cost.
Policy: deterministic hard constraints first; learned or rules-based ranking second; fallback must preserve schema/tool capability and safety.
Cache: key includes normalized intent plus tenant, policy, prompt, model family, knowledge/index version and safety context. Never share across authorization boundaries.
Evaluate: counterfactual shadow traffic, quality-regret, cache precision, hit rate, cost per accepted task and tail latency.

D9

Design a secure enterprise MCP/tool platform.

Registry: Versioned tool schemas, owners, risk tiers, scopes, regions and deprecation.

Gateway: Authenticate caller/agent, authorize each invocation, validate schema and policy, rate-limit and audit.

Execution: Sandbox/isolated connector, short-lived credentials, egress restrictions, idempotency and effect receipts.

Governance: Approval for mutations, supply-chain checks, kill switch, usage/effect monitoring and incident replay.

Key principle: the model proposes a tool call; deterministic systems decide whether and how it executes. Treat tool output as untrusted input before returning it to an agent.

LLD & Machine Coding for GenAI

Interfaces, data models, state machines, reliability patterns and testable code

Implementation 8 drills

LLD answer order: clarify use cases and invariants → define interfaces/entities → show the critical sequence/state transition → handle concurrency/failures → make it testable and extensible. Patterns are vocabulary, not the goal.

L1

Design a provider-agnostic LLM gateway SDK.

Python · Strategy + Adapter + Factory

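from typing import AsyncIterator, Protocol

# GenerationRequest, GenerationResult and TokenEvent are the canonical types, defined elsewhere.
class ModelProvider(Protocol):
    async def generate(self, req: "GenerationRequest") -> "GenerationResult": ...
    # Plain def: an async-generator implementation is iterated directly, not awaited.
    def stream(self, req: "GenerationRequest") -> AsyncIterator["TokenEvent"]: ...

class Gateway:
    def __init__(self, router, providers, policies, telemetry):
        self.router, self.providers = router, providers
        self.policies, self.telemetry = policies, telemetry

    async def generate(self, req):
        checked = self.policies.validate(req)   # reject or normalise before routing
        route = self.router.choose(checked)     # selects provider and adapted request
        return await self.providers[route.provider].generate(route.request)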

Use canonical request/result types and provider adapters. Keep routing, retry, safety, telemetry and provider calls separate. Preserve provider-specific capabilities through an explicit capability contract rather than leaking arbitrary dictionaries everywhere.

Cross-question answers: Record each invocation ID and outcome so an ambiguous timeout is reconciled before retry. Surface streaming failures as typed terminal events while preserving already emitted tokens and trace IDs. Test adapters with provider contract tests, deterministic fake providers, recorded fixtures and a small gated live integration suite.

L2

Design an idempotent RAG ingestion pipeline.

Core entities and state machine

Document(id, tenant_id, source_uri, content_hash, version)
IngestionJob(id, document_id, version, state, attempt, lease_until)
Chunk(id, document_id, version, ordinal, text_hash, metadata)

RECEIVED -> PARSED -> CHUNKED -> EMBEDDED -> INDEXED -> ACTIVE
                         \-> FAILED_RETRYABLE | FAILED_FINAL

Use (tenant_id, source_uri, content_hash) or an explicit idempotency key to deduplicate.
Workers claim jobs with leases; every transition uses compare-and-set. Retries resume from durable artifacts.
Build a new document version, verify counts/quality, atomically switch active version, then garbage-collect old chunks.
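The deduplication key and compare-and-set transitions can be sketched with an in-memory store. The JobStore class, the key format and the example URI are hypothetical. A real store would use an atomic conditional write (for example a conditional update in the database) rather than a Python dict.

```python
import hashlib

# Legal forward transitions of the happy path.
NEXT_STATE = {
    "RECEIVED": "PARSED",
    "PARSED": "CHUNKED",
    "CHUNKED": "EMBEDDED",
    "EMBEDDED": "INDEXED",
    "INDEXED": "ACTIVE",
}


def dedup_key(tenant_id: str, source_uri: str, content: bytes) -> str:
    """Same tenant, source and bytes always map to the same key."""
    content_hash = hashlib.sha256(content).hexdigest()
    return f"{tenant_id}|{source_uri}|{content_hash}"


class JobStore:
    def __init__(self) -> None:
        self._state: dict[str, str] = {}

    def submit(self, key: str) -> bool:
        """Return True for a new job, False for a duplicate submission."""
        if key in self._state:
            return False
        self._state[key] = "RECEIVED"
        return True

    def transition(self, key: str, expected: str, new: str) -> bool:
        """Compare-and-set: move only if the current state is the expected one."""
        if self._state.get(key) != expected or NEXT_STATE.get(expected) != new:
            return False
        self._state[key] = new
        return True


store = JobStore()
key = dedup_key("t1", "s3://bucket/a.pdf", b"hello")
```

A worker that loses its lease and retries a transition gets False back and can stop, instead of repeating work that another worker already completed.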

L3

Design a safe agent tool registry and executor.

Contracts

ToolSpec(name, version, input_schema, required_scopes, risk_tier, timeout)
Invocation(id, actor, agent, tool, args_hash, status, effect_receipt)

execute(invocation):
  authenticate -> authorize -> schema_validate -> policy_check
  -> approve_if_required -> run_isolated -> validate_output -> audit

Use Registry/Repository for discovery, Command for invocations, Policy object for authorization, and an idempotency store for mutations. The executor receives short-lived scoped credentials; the LLM never receives a reusable secret.
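A reduced sketch of the executor's deterministic checks follows. It keeps only a subset of the ToolSpec fields, treats schema validation as an exact match on argument names, and uses two hypothetical risk tiers ("read" and "mutate"). Isolation, timeouts, output validation and auditing are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    required_fields: frozenset[str]
    required_scopes: frozenset[str]
    risk_tier: str  # "read" or "mutate" (hypothetical tiers)


class Denied(Exception):
    """Raised when a deterministic check rejects the invocation."""


def execute(spec: ToolSpec, actor_scopes: set[str], args: dict, handler, approved: bool = False):
    # Authorization comes from the caller's scopes, never from model text.
    if not spec.required_scopes <= actor_scopes:
        raise Denied("missing scope")
    if set(args) != set(spec.required_fields):
        raise Denied("arguments do not match schema")
    if spec.risk_tier == "mutate" and not approved:
        raise Denied("human approval required")
    return handler(**args)


read_spec = ToolSpec("get_order", frozenset({"order_id"}), frozenset({"orders:read"}), "read")
refund_spec = ToolSpec("refund", frozenset({"order_id"}), frozenset({"orders:write"}), "mutate")
status = execute(read_spec, {"orders:read"}, {"order_id": "o1"}, lambda order_id: f"{order_id}:shipped")
```

The `approved` flag would be set by the approval workflow after a human decision, not by the agent.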

L4

Design conversation, session and memory data models.

Entity	Purpose	Important Fields
Conversation	User-facing durable container	tenant, participants, policy, created_at
Turn/Event	Append-only source of truth	sequence, role/type, content_ref, trace_id, timestamp
Session	Runtime/checkpoint state	epoch, state, lease, expires_at
Memory	Curated reusable fact	subject, fact, evidence, consent, confidence, expiry

Use optimistic concurrency on sequence/epoch. Summaries are derived views, never the sole source of truth. Support deletion and retention across events, embeddings, caches and analytics copies.
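Optimistic concurrency on the event sequence can be sketched as an append that succeeds only when the caller's expected sequence matches the log. The EventLog class and the exception name are hypothetical, and a real store would enforce the check with a conditional write or a unique constraint on (conversation, sequence).

```python
class SequenceConflict(Exception):
    """Another writer appended since the caller last read the log."""


class EventLog:
    def __init__(self) -> None:
        self._events: list[dict] = []

    def append(self, expected_sequence: int, role: str, content: str) -> int:
        """Append at expected_sequence and return the next expected sequence."""
        if expected_sequence != len(self._events):
            raise SequenceConflict(
                f"expected {expected_sequence}, log is at {len(self._events)}"
            )
        self._events.append(
            {"sequence": expected_sequence, "role": role, "content": content}
        )
        return len(self._events)


log = EventLog()
next_seq = log.append(0, "user", "Where is my order?")
next_seq = log.append(next_seq, "assistant", "Let me check.")
```

On a conflict the writer re-reads the log, rebuilds its state and retries, so two concurrent workers cannot both write turn N.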

L5

Implement rate limiting, retry and circuit breaking for model calls.

Rate limit: hierarchical token buckets for tenant, model/provider and global capacity; account for requests and estimated tokens.
Retry: only retry transient errors within the request deadline; exponential backoff with jitter; honor provider retry hints.
Circuit breaker: CLOSED → OPEN after threshold → HALF_OPEN probes → CLOSED on recovery. Scope by provider/region/model capability.
Correctness: attach invocation IDs and record outcomes so a timeout does not automatically become a duplicate billable/action call.
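The circuit breaker state machine above can be sketched in a few lines. The thresholds and the injectable clock are illustrative, and this version allows any number of probes while HALF_OPEN, whereas production breakers usually limit them.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may be attempted right now."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.reset_after:
            self.state = "HALF_OPEN"  # cool-down elapsed; permit probe calls
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.state, self.failures = "CLOSED", 0

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", self.clock()
```

Injecting the clock keeps the breaker testable without sleeping. Keep one instance per provider, region and model capability, as the scoping rule above suggests.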

L6

Design an extensible LLM evaluation framework.

Evaluator contracts

class Evaluator(Protocol):
    name: str
    async def score(self, case: EvalCase, output: Output) -> Score: ...

EvalCase(id, input, expected, rubric, tags, dataset_version)
Run(id, system_version, dataset_version, seed, status)
Score(metric, value, rationale, evaluator_version, cost)

Support deterministic checks, retrieval metrics, human rubrics and model judges behind one interface. Run asynchronously with bounded concurrency; persist every version and raw artifact. Compare paired cases and confidence intervals, not only averages.
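Comparing paired cases with a confidence interval can be sketched with a percentile bootstrap over per-case score differences. The function name, the resample count and the fixed seed are illustrative assumptions. Other interval methods are equally valid.

```python
import random
from statistics import mean


def paired_bootstrap_ci(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for candidate - baseline."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need equal-length, non-empty score lists")
    # Scores are paired by eval case, so resample the per-case differences.
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi
```

If the interval excludes zero, the change is unlikely to be noise on this dataset. If it straddles zero, a higher average alone is not evidence of improvement.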

L7

Design a secure semantic cache.

Two-stage lookup: exact canonical key first; semantic ANN lookup second. A candidate is reusable only if similarity passes an evaluated threshold and all hard dimensions match: tenant/ACL, locale, policy, prompt version, model capability, knowledge/index version and tool-state class.

Store answer, evidence, creation/expiry, safety decision and provenance; encrypt and enforce tenant boundaries.
Invalidate by version bump or event; use short TTL for volatile facts and never cache sensitive/high-impact actions by default.
Measure cache precision with sampled replay, not just hit rate. A false hit can be more expensive than a miss.
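The two-stage lookup can be sketched as follows. The hard dimensions are hashed into a scope key so a semantic candidate can never cross a tenant or version boundary. The class, the dimension names, the toy embeddings and the linear scan (in place of an ANN index) are all hypothetical.

```python
import hashlib
import json
import math


def scope_key(hard_dims: dict) -> str:
    """Hash of the hard dimensions; lookups never cross this boundary."""
    return hashlib.sha256(json.dumps(hard_dims, sort_keys=True).encode()).hexdigest()


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # must come from evaluation, not a guess
        self._exact: dict[tuple[str, str], str] = {}
        self._entries: dict[str, list[tuple[list[float], str]]] = {}

    def put(self, hard_dims: dict, query: str, embedding: list[float], answer: str) -> None:
        scope = scope_key(hard_dims)
        self._exact[(scope, query.strip().lower())] = answer
        self._entries.setdefault(scope, []).append((embedding, answer))

    def get(self, hard_dims: dict, query: str, embedding: list[float]):
        scope = scope_key(hard_dims)
        hit = self._exact.get((scope, query.strip().lower()))  # stage 1: exact key
        if hit is not None:
            return hit
        candidates = self._entries.get(scope, [])              # stage 2: same scope only
        best = max(candidates, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache(threshold=0.95)
tenant_a = {"tenant": "a", "prompt_version": "v3", "index_version": "2026-06-01"}
tenant_b = {"tenant": "b", "prompt_version": "v3", "index_version": "2026-06-01"}
cache.put(tenant_a, "What is the refund policy?", [1.0, 0.0], "30 days")
```

Bumping any version in the hard dimensions changes the scope key, which invalidates old entries without an explicit delete.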

L8

Machine coding: build an async streaming chat service.

Expected API behavior

POST /v1/conversations/{id}/messages
Headers: Idempotency-Key, Last-Event-ID
Response: text/event-stream
events: accepted, retrieval, token, citation, completed | failed

cancel_event set -> stop provider stream -> persist terminal status
client disconnect -> bounded cleanup; do not leak provider connections

What interviewer checks: async iteration and backpressure, cancellation, timeout budgets, idempotency, ordered event persistence, reconnect/resume, dependency injection, validation and tests for disconnect/error races.

Tests: duplicate request, provider fails before/after first token, slow client, cancellation race, reconnect from event ID, authorization failure and no leaked tasks.
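The cancellation and cleanup behaviour can be sketched with plain asyncio, without a web framework. The event dictionaries mirror the event names above. The fake provider, the `collect` helper and the choice to report cancellation as a `failed` event are assumptions for illustration.

```python
import asyncio
from typing import AsyncIterator


async def fake_provider(tokens: list[str]) -> AsyncIterator[str]:
    """Stand-in for a provider token stream."""
    for token in tokens:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def stream_chat(provider, cancel: asyncio.Event) -> AsyncIterator[dict]:
    yield {"event": "accepted"}
    try:
        async for token in provider:
            if cancel.is_set():
                yield {"event": "failed", "reason": "cancelled"}
                return
            yield {"event": "token", "data": token}
        yield {"event": "completed"}
    finally:
        # Always close the upstream stream so provider connections do not leak.
        aclose = getattr(provider, "aclose", None)
        if aclose is not None:
            await aclose()


async def collect(tokens: list[str], cancel_after=None) -> list[dict]:
    """Test helper: consume the stream, optionally cancelling after N tokens."""
    cancel = asyncio.Event()
    events = []
    async for event in stream_chat(fake_provider(tokens), cancel):
        events.append(event)
        sent = sum(1 for e in events if e["event"] == "token")
        if cancel_after is not None and sent == cancel_after:
            cancel.set()
    return events
```

A full service would also persist each event in order before sending it, so a reconnect with Last-Event-ID can resume from the stored log.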

DSA — Arrays & Hashing

Most common pattern in GenAI company interviews (Shopify, Databricks, Cohere)

DSA 5 questions

Q1

Two Sum Easy

Given an integer array and a target, return indices of the two numbers that add up to target. You may not use the same element twice.

Key insight: For each element n, we need target - n. Store each value's index in a hash map. On each step, check if the complement is already in the map.

Python · O(n) time, O(n) space

def two_sum(nums: list[int], target: int) -> list[int]:
    seen = {}  # value → index
    for i, n in enumerate(nums):
        diff = target - n
        if diff in seen:
            return [seen[diff], i]
        seen[n] = i
    return []

# Example: two_sum([2,7,11,15], 9) → [0, 1]

Hash Map · O(n) time · O(n) space

Q2

Longest Consecutive Sequence Medium

Find the length of the longest sequence of consecutive integers. Must run in O(n) time.

Key insight: For each number, only start counting if it's the beginning of a sequence (n-1 is not in the set). This ensures each number is visited at most twice — O(n) overall.

Python · O(n) time, O(n) space

def longest_consecutive(nums: list[int]) -> int:
    num_set = set(nums)
    best = 0
    for n in num_set:
        if n - 1 not in num_set:   # sequence start
            cur, length = n, 1
            while cur + 1 in num_set:
                cur += 1
                length += 1
            best = max(best, length)
    return best

# [100,4,200,1,3,2] → 4  (sequence 1,2,3,4)

Set · O(n)

Q3

Group Anagrams Medium

Group strings that are anagrams of each other into sublists.

Python · O(n·k·log k) — sort as key

from collections import defaultdict

def group_anagrams(strs: list[str]) -> list[list[str]]:
    groups = defaultdict(list)
    for s in strs:
        key = tuple(sorted(s))    # e.g. 'eat' → ('a','e','t')
        groups[key].append(s)
    return list(groups.values())

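O(n·k) optimisation: Use a 26-length character-count tuple as the key instead of sorting, which avoids the O(k log k) sort. Build the counter once per string (this assumes lowercase a-z input and needs from collections import Counter): counts = Counter(s); key = tuple(counts[c] for c in 'abcdefghijklmnopqrstuvwxyz')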

Hash Map · Sorting

Q4

Product of Array Except Self Medium — no division

For each index i, compute the product of all elements except nums[i]. O(n) time, no division operator.

Insight: result[i] = prefix_product[i-1] * suffix_product[i+1]. Do two passes: left prefix in one array, multiply by right suffix on the fly.

Python · O(n) time, O(1) extra space

def product_except_self(nums: list[int]) -> list[int]:
    n = len(nums)
    res = [1] * n
    # Left pass: res[i] = product of nums[0..i-1]
    prefix = 1
    for i in range(n):
        res[i] = prefix
        prefix *= nums[i]
    # Right pass: multiply by product of nums[i+1..n-1]
    suffix = 1
    for i in range(n - 1, -1, -1):
        res[i] *= suffix
        suffix *= nums[i]
    return res

Prefix/Suffix · O(1) extra space

Q5

Top K Frequent Elements Medium

Return the k most frequent elements. O(n) solution using bucket sort.

Python · O(n) bucket sort

from collections import Counter

def top_k_frequent(nums: list[int], k: int) -> list[int]:
    freq = Counter(nums)
    # Bucket index = frequency value
    buckets = [[] for _ in range(len(nums) + 1)]
    for val, cnt in freq.items():
        buckets[cnt].append(val)

    res = []
    for i in range(len(buckets) - 1, -1, -1):
        for val in buckets[i]:
            res.append(val)
            if len(res) == k:
                return res
    return res  # fewer than k distinct values: return what exists instead of None

Counter · Bucket Sort · O(n)

DSA — Sliding Window & Two Pointers

Essential for string/array problems common in practical coding rounds

DSA 3 questions

Q1

Longest Substring Without Repeating Characters Medium

Python · O(n) sliding window

def length_of_longest_substring(s: str) -> int:
    seen = {}  # char → last index
    left = best = 0
    for right, ch in enumerate(s):
        if ch in seen and seen[ch] >= left:
            left = seen[ch] + 1  # shrink from left
        seen[ch] = right
        best = max(best, right - left + 1)
    return best

# "abcabcbb" → 3 ("abc")

Sliding Window · Two Pointers · Hash Map

Q2

Minimum Window Substring Hard

Find the minimum window in string s that contains all characters of string t.

Python · O(n+m) sliding window

from collections import Counter

def min_window(s: str, t: str) -> str:
    if not s or not t:
        return ""  # an empty t would make the shrink loop run past the end of s
    need = Counter(t)
    window = {}
    have, total = 0, len(need)
    best = (float("inf"), 0, 0)
    left = 0
    for right, ch in enumerate(s):
        window[ch] = window.get(ch, 0) + 1
        if ch in need and window[ch] == need[ch]:
            have += 1
        while have == total:
            if (right - left + 1) < best[0]:
                best = (right - left + 1, left, right)
            window[s[left]] -= 1
            if s[left] in need and window[s[left]] < need[s[left]]:
                have -= 1
            left += 1
    l, r = best[1], best[2]
    return s[l:r+1] if best[0] != float("inf") else ""

Q3

Container With Most Water Medium

Python · O(n) two pointers

def max_area(height: list[int]) -> int:
    left, right = 0, len(height) - 1
    best = 0
    while left < right:
        area = (right - left) * min(height[left], height[right])
        best = max(best, area)
        # Move the shorter wall inward
        if height[left] < height[right]: left += 1
        else: right -= 1
    return best

DSA — Trees & Graphs

BFS, DFS, binary trees — common in Shopify and Databricks rounds

DSA 4 questions

Q1

Binary Tree Level Order Traversal (BFS) Medium

Python · O(n) BFS with queue

from collections import deque

def level_order(root) -> list[list[int]]:
    if not root: return []
    result, q = [], deque([root])
    while q:
        level = []
        for _ in range(len(q)):   # snapshot size = current level
            node = q.popleft()
            level.append(node.val)
            if node.left:  q.append(node.left)
            if node.right: q.append(node.right)
        result.append(level)
    return result

BFS · Queue (deque) · O(n)

Q2

Number of Islands (DFS on grid) Medium

Python · O(m×n) DFS

def num_islands(grid: list[list[str]]) -> int:
    if not grid or not grid[0]:
        return 0  # empty grid has no islands
    rows, cols = len(grid), len(grid[0])
    count = 0

    def dfs(r, c):
        if r < 0 or c < 0 or r >= rows or c >= cols \
           or grid[r][c] != "1": return
        grid[r][c] = "0"   # mark visited in-place
        for dr, dc in [(1,0),(-1,0),(0,1),(0,-1)]:
            dfs(r+dr, c+dc)

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "1":
                dfs(r, c)
                count += 1
    return count

DFS · Grid / Flood Fill · O(m×n)

Q3

Course Schedule (Topological Sort / Cycle Detection) Medium

Given n courses and prerequisites, determine if you can finish all courses. Equivalent to detecting a cycle in a directed graph.

Python · O(V+E) DFS cycle detection

def can_finish(n: int, prereqs: list[list[int]]) -> bool:
    adj = [[] for _ in range(n)]
    for a, b in prereqs: adj[b].append(a)

    # 0=unvisited, 1=in-stack (cycle), 2=done
    state = [0] * n

    def dfs(node):
        if state[node] == 1: return False  # cycle
        if state[node] == 2: return True   # already processed
        state[node] = 1
        for nei in adj[node]:
            if not dfs(nei): return False
        state[node] = 2
        return True

    return all(dfs(i) for i in range(n) if state[i] == 0)

Topological Sort · Cycle Detection · DFS

Q4

Lowest Common Ancestor of a Binary Tree Medium

Python · O(n) recursive DFS

def lca(root, p, q):
    if not root or root is p or root is q:  # compare node identity, not value
        return root
    left  = lca(root.left,  p, q)
    right = lca(root.right, p, q)
    # If found in both subtrees → current node is LCA
    if left and right: return root
    return left or right

DFS · Post-order · O(n)

DSA — Dynamic Programming

1D DP, 2D DP, memoisation — top patterns for practical AI interviews

DSA 4 questions

Q1

Climbing Stairs / Fibonacci — 1D DP Easy

Python · O(n) time, O(1) space

def climb_stairs(n: int) -> int:
    if n <= 2: return n
    a, b = 1, 2
    for _ in range(3, n + 1):
        a, b = b, a + b  # rolling Fibonacci
    return b

Pattern recognition: Whenever f(n) = f(n-1) + f(n-2) or similar recurrence appears, recognise it as Fibonacci DP. Roll two variables instead of an array for O(1) space.
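For the "or similar recurrence" case, a top-down memoised version generalises more easily than two rolling variables. The sketch below assumes a hypothetical variant where the allowed step sizes are a parameter; `climb_stairs_memo` and `steps` are illustrative names, not part of the original problem.

```python
from functools import lru_cache

def climb_stairs_memo(n: int, steps: tuple[int, ...] = (1, 2)) -> int:
    """Count the ways to climb n stairs using any of the allowed step sizes."""
    @lru_cache(maxsize=None)
    def ways(k: int) -> int:
        if k == 0:
            return 1  # exactly one way to finish: take no further steps
        if k < 0:
            return 0  # overshot the top
        return sum(ways(k - s) for s in steps)

    return ways(n)
```

The trade-off is O(n) memory for the cache and recursion depth proportional to n, so the rolling-variable form is still preferable when the recurrence has a fixed, small window.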

Q2

Longest Common Subsequence — 2D DP Medium

Directly used in diff algorithms, tokenisation comparison, and DNA sequence alignment.

Python · O(m×n) time and space

def lcs(text1: str, text2: str) -> int:
    m, n = len(text1), len(text2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text1[i-1] == text2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]
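Diff-style use cases need the subsequence itself, not only its length. A minimal sketch of the usual follow-up is to fill the same table and then walk back from the bottom-right corner. The name `lcs_string` is illustrative, and when several subsequences share the maximum length this returns just one of them.

```python
def lcs_string(text1: str, text2: str) -> str:
    """Return one longest common subsequence of text1 and text2."""
    m, n = len(text1), len(text2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text1[i - 1] == text2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtrack: a match contributes a character; otherwise follow the larger neighbour.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if text1[i - 1] == text2[j - 1]:
            out.append(text1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```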

Q3

0/1 Knapsack — tabulation Medium

Python · O(n×W) time, O(W) space

def knapsack(weights: list, values: list, W: int) -> int:
    n = len(weights)
    dp = [0] * (W + 1)
    for i in range(n):
        for w in range(W, weights[i] - 1, -1):  # reverse!
            dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
    return dp[W]

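Key: Iterate capacities (the inner loop over w) in reverse when using 1D DP. This ensures dp[w - weights[i]] still holds the value from before item i was considered, so each item is used at most once. Forward iteration = unbounded knapsack (items can repeat).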
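To make the direction rule concrete, the sketch below puts both loop orders behind one flag. The function name `knapsack_1d` and the `reuse_items` parameter are assumptions introduced for this comparison only.

```python
def knapsack_1d(weights: list[int], values: list[int], W: int,
                reuse_items: bool = False) -> int:
    """1D knapsack; reuse_items=False is 0/1, reuse_items=True is unbounded."""
    dp = [0] * (W + 1)
    for wt, val in zip(weights, values):
        if reuse_items:
            # Forward: dp[w - wt] may already include this item, so it can repeat.
            capacities = range(wt, W + 1)
        else:
            # Reverse: dp[w - wt] is still the value from before this item.
            capacities = range(W, wt - 1, -1)
        for w in capacities:
            dp[w] = max(dp[w], dp[w - wt] + val)
    return dp[W]
```

With weights [2, 3], values [3, 4] and W = 4, the 0/1 version can take only one item, while the unbounded version takes the weight-2 item twice.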

Q4

Word Break — bottom-up DP Medium

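Determine if string s can be segmented into words from a dictionary. Conceptually close to dictionary-based segmentation, which is why it comes up alongside tokenisation questions.

Python · O(n²) substring checks (each slice and lookup costs up to O(n)), O(n) space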

def word_break(s: str, words: list[str]) -> bool:
    word_set = set(words)
    dp = [False] * (len(s) + 1)
    dp[0] = True  # empty string is breakable
    for i in range(1, len(s) + 1):
        for j in range(i):
            if dp[j] and s[j:i] in word_set:
                dp[i] = True; break
    return dp[len(s)]
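The loop above fills the table bottom-up. If the interviewer asks for the memoised (top-down) form, one possible sketch is below. The name `word_break_memo` and the `max_len` pruning are illustrative additions, not part of the original answer.

```python
from functools import lru_cache

def word_break_memo(s: str, words: list[str]) -> bool:
    """Top-down Word Break: can s[start:] be segmented into dictionary words?"""
    word_set = set(words)
    max_len = max(map(len, word_set), default=0)  # no candidate can be longer than this

    @lru_cache(maxsize=None)
    def can_break(start: int) -> bool:
        if start == len(s):
            return True  # consumed the whole string
        last_end = min(len(s), start + max_len)
        for end in range(start + 1, last_end + 1):
            if s[start:end] in word_set and can_break(end):
                return True
        return False

    return can_break(0)
```

Recursion depth grows with the number of words in the segmentation, so the bottom-up table remains the safer choice for very long inputs.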

Python Patterns for GenAI

Async, FastAPI, decorators, generators — practical coding round essentials

Practical 4 questions

Q1

Write a production FastAPI endpoint for async LLM streaming.

Python · FastAPI + Anthropic async streaming

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import anthropic, asyncio

app = FastAPI()
client = anthropic.AsyncAnthropic()

class ChatRequest(BaseModel):
    query: str
    session_id: str = "default"

@app.post("/chat")
async def chat_stream(req: ChatRequest):
    async def generate():
        try:
            async with client.messages.stream(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system="You are a helpful GenAI assistant.",
                messages=[{"role": "user", "content": req.query}]
            ) as stream:
                async for text in stream.text_stream:  # text_stream is an attribute, not a method
                    yield f"data: {text}\n\n"  # SSE format
        except Exception as e:
            yield f"data: [ERROR] {str(e)}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

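# Batch concurrent calls
async def single_call(query: str) -> str:
    # Non-streaming helper used by batch_process
    msg = await client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        messages=[{"role": "user", "content": query}])
    return msg.content[0].text

async def batch_process(queries: list[str]) -> list:
    # return_exceptions=True: a failed call comes back as an exception object
    # in its slot instead of cancelling the whole batch, so check each item.
    tasks = [single_call(q) for q in queries]
    return await asyncio.gather(*tasks, return_exceptions=True)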

async/await · SSE streaming · asyncio.gather

Q2

Write a retry decorator with exponential backoff and jitter for LLM API calls.

Python · async-compatible retry with backoff

import asyncio, functools, random, logging
from typing import Type

def retry(
    max_tries: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[Type[Exception], ...] = (Exception,)
):
    def decorator(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return await func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_tries - 1: raise
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                    logging.warning(f"Attempt {attempt+1} failed: {e}. Retry in {delay:.1f}s")
                    await asyncio.sleep(delay)
        return async_wrapper
    return decorator

# Usage
@retry(max_tries=3, base_delay=1.0)
async def call_llm(prompt: str) -> str: ...
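The decorator above wraps coroutine functions only. If the same policy has to cover synchronous SDK calls too, one hedged option is to branch on `inspect.iscoroutinefunction`. The name `retry_any` is hypothetical, and scaling the jitter with `base_delay` is a deliberate change from the fixed 0 to 0.5 s jitter above.

```python
import asyncio
import functools
import inspect
import random
import time

def retry_any(max_tries: int = 3, base_delay: float = 1.0,
              exceptions: tuple = (Exception,)):
    """Retry with exponential backoff and jitter for sync or async callables."""
    def backoff(attempt: int) -> float:
        # Exponential growth plus jitter proportional to the base delay.
        return base_delay * (2 ** attempt) + random.uniform(0, 0.5 * base_delay)

    def decorator(func):
        if inspect.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                for attempt in range(max_tries):
                    try:
                        return await func(*args, **kwargs)
                    except exceptions:
                        if attempt == max_tries - 1:
                            raise  # out of attempts: surface the last error
                        await asyncio.sleep(backoff(attempt))
            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries - 1:
                        raise
                    time.sleep(backoff(attempt))
        return sync_wrapper
    return decorator
```

In practice, pass only retryable error types (rate limits, timeouts) in `exceptions`; retrying on every `Exception` also retries validation errors that will never succeed.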

Q3

Implement a simple semantic cache using Redis and cosine similarity.

Python · semantic cache pattern

import hashlib, json
import numpy as np, redis
from openai import OpenAI

r = redis.Redis(host="localhost", port=6379)
oai = OpenAI()
THRESHOLD = 0.92

def embed(text: str) -> np.ndarray:
    resp = oai.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_sim(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)   # cached embeddings come back as lists
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_llm(query: str, llm_fn) -> str:
    q_emb = embed(query)
    # Linear scan over cached entries: fine for a demo, O(N) per lookup.
    # At scale, replace the scan with a vector index.
    for key in r.scan_iter("cache:*"):
        raw = r.get(key)
        if raw is None:   # key expired between scan and get
            continue
        entry = json.loads(raw)
        if cosine_sim(q_emb, entry["emb"]) >= THRESHOLD:
            return entry["response"]   # cache hit
    # Cache miss: call the LLM
    response = llm_fn(query)
    # sha256 gives a stable key; built-in hash() of a str is salted per process
    key = "cache:" + hashlib.sha256(query.encode()).hexdigest()
    r.setex(key, 3600, json.dumps({"emb": q_emb.tolist(), "response": response}))
    return response
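The lookup above returns the first entry that clears the threshold, which is not necessarily the closest one. A dependency-free sketch of a best-match lookup is shown below. It assumes the entries have already been loaded into a list of dicts with the same "emb" and "response" keys; `best_cache_hit` is an illustrative name.

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity for plain Python lists; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_cache_hit(query_emb: list[float], entries: list[dict],
                   threshold: float = 0.92):
    """Return the response of the most similar entry at or above threshold, else None."""
    best_score, best_response = threshold, None
    for entry in entries:
        score = cosine_sim(query_emb, entry["emb"])
        if score >= best_score:
            best_score, best_response = score, entry["response"]
    return best_response
```

A follow-up worth mentioning in the interview: the threshold should be tuned on labelled query pairs, because a value that is too low serves wrong answers from cache.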

Q4

Python generators vs async generators — when do you use each in GenAI systems?

Pattern	When to use	Example in GenAI
Sync generator `yield`	CPU-bound iteration, large dataset processing	Streaming chunked document ingestion from S3
Async generator (`yield` inside `async def`, consumed with `async for`)	Awaiting I/O in each iteration	Streaming LLM tokens to FastAPI response
`asyncio.Queue`	Multiple producers/consumers	Agent observation queue (multiple tools running)

Python · async generator for token streaming

async def token_stream(prompt: str):
    """Async generator: yields tokens as they arrive"""
    async with client.messages.stream(
        model="claude-sonnet-4-6", max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        async for token in stream.text_stream:  # text_stream is an attribute, not a method
            yield token  # caller sees each token immediately

# Consumer
async def main():
    async for token in token_stream("Explain RAG"):
        print(token, end="", flush=True)
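The third row of the table, `asyncio.Queue`, has no example above. The sketch below shows one way several tools could push observations onto a shared queue while a single consumer reads them in arrival order. The tool names, the delays, and the helpers `tool_worker` and `collect_observations` are all hypothetical stand-ins for real tool calls.

```python
import asyncio

async def tool_worker(name: str, delay: float, queue: asyncio.Queue) -> None:
    """Simulated tool: waits, then publishes its observation to the shared queue."""
    await asyncio.sleep(delay)  # stands in for a real tool or API call
    await queue.put((name, f"{name} result"))

async def collect_observations(tools: dict[str, float]) -> list[tuple[str, str]]:
    """Run all tools concurrently and return observations in arrival order."""
    queue: asyncio.Queue = asyncio.Queue()
    producers = [
        asyncio.create_task(tool_worker(name, delay, queue))
        for name, delay in tools.items()
    ]
    observations = []
    for _ in producers:  # exactly one observation is expected per tool
        observations.append(await queue.get())
    await asyncio.gather(*producers)  # surface any producer exception
    return observations
```

Unlike `asyncio.gather`, which returns results in submission order once everything finishes, the queue lets the consumer act on each observation as soon as it arrives.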

Behavioural — STAR Stories

Pre-built answers from your real experience at EPAM, PwC, Cognizant, TCS

Always Asked 5 questions

STAR framework: Situation (set context briefly) → Task (your specific responsibility) → Action (what YOU did, use "I" not "we") → Result (quantified outcome). Always end with what you learned or how you'd scale it.

B1

Tell me about a time you significantly improved system performance.

Situation

At PwC, a US healthcare client's multi-agent clinical query system had P95 latency of 3.8 seconds — too slow for real-time clinical use. The system was seeing 200+ QPS during peak hospital hours.

Task

I owned the end-to-end latency reduction initiative — profiling, identifying bottlenecks, proposing and implementing solutions within a 3-week sprint.

Action

1. Profiled with LangSmith traces: found 60% of latency was redundant sequential sub-agent calls that could run in parallel.
2. Refactored to async fan-out using asyncio.gather: all sub-agents dispatch concurrently, only the merge step waits.
3. Added Redis semantic cache (GPTCache library) with cosine threshold 0.92: served ~40% of traffic from cache at <10ms.
4. Routed intent classification to an evaluated low-latency model tier instead of the stronger reasoning tier: 3× cheaper and 5× faster for that subtask.

Result

70% latency reduction — P95 from 3.8s → 1.1s. The system went live and the client reported a 60% efficiency uplift for clinical staff. I received PwC's Delivery Head Award for this sprint.

Related technical follow-ups

Latency reduction techniques · Multi-agent concurrency design

B2

Describe a time you owned an ambiguous problem end-to-end with no clear spec.

Situation

At Cognizant's GenAI Lab, I was asked simply to "explore if LLMs can improve our client's enterprise search." No timeline, no team, no defined success metric.

Task

Define the problem, build a prototype, benchmark it, and present a path to production — all independently.

Action

1. Interviewed 3 internal stakeholders to learn the actual pain: keyword search missed semantic intent ("show me Q3 revenue analysis" found nothing).
2. Ran a 2-week spike: built a Pinecone-backed semantic search with fine-tuned sentence-transformers embeddings. Collected 200 test queries with relevance labels.
3. Benchmarked Recall@5 vs existing keyword search: 35% improvement. Presented the spike results with a 3-phase production roadmap.

Result

Prototype became the foundation for a production semantic search shipped to the client — 35% accuracy improvement. Received Cognizant's Rising Star Award.

B3

Tell me about a technical decision you made that you'd change in hindsight.

Situation

Early in a RAG project at PwC, I used fixed-size 512-token chunks with no overlap because it was the fastest to implement. The system went to QA.

What Happened & What I'd Change

QA caught that multi-page financial tables were being split across chunks — the retriever would get half a table, giving wrong answers. I spent 2 days re-chunking with a hierarchical splitter and adding 50-token overlap. If I'd spent 4 hours upfront analysing the document corpus structure, I would have picked the right chunking strategy immediately and avoided the rework.

What I do now: Always spend the first day profiling the document corpus (distribution of doc types, average length, presence of tables/headers) before writing a single line of chunking code.
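If the interviewer asks what the overlap fix looks like, a sliding-window chunker is enough to show the idea. This is a minimal sketch over a pre-tokenised list; `chunk_with_overlap` is an illustrative name, and it does not replace the hierarchical splitter needed to keep tables intact.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512,
                       overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks
```

For example, ten tokens with size 4 and overlap 1 produce three windows, each starting on the last token of the previous one.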

Why This Answer Works

Shows self-awareness without being self-deprecating. The mistake was reasonable and the lesson is concrete and technical — not vague.

Related technical follow-up

Chunking strategies deep-dive

B4

How do you work effectively in a fully remote, distributed team across time zones?

Situation

At EPAM, my team spans India (IST), Poland (CET/CEST, 3.5 to 4.5 hours behind IST depending on daylight saving), and the US East Coast (EST/EDT, 9.5 to 10.5 hours behind IST). We have critical weekly demos with a US Finance client.

My System

1. Async-first documentation: Every design decision gets a Confluence ADR (Architecture Decision Record) written the same day. PRs have detailed descriptions so teammates in other zones can review without a live sync call.
2. Slack status tagging: I tag every update with [DONE], [BLOCKED: needs X], or [DECISION NEEDED: by EOD]. This tells teammates exactly what action, if any, they need to take.
3. Flexible overlap windows: 2 days/week I shift my hours to cover 6:30-8:30 PM IST (8:00-10:00 AM EST), which gives 2 hours of overlap with the US East Coast morning for client demo prep.
4. Loom walkthroughs: For complex architecture changes, I record a 5-min Loom instead of a document. Faster to create, easier to consume across time zones.

Result

Zero missed client demo milestones in 10 months. The Delivery Head Award (2025) cited "consistent, reliable cross-timezone delivery" as a specific reason.

B5

Tell me about yourself. (90-second pitch — memorise this)

Your 90-Second Pitch

"I'm Purnendu Das, a Senior GenAI Developer with 5 years building production AI systems — not demos, but real systems that clients use every day.

My core expertise is three things: first, RAG architectures that work at scale — hybrid retrieval, re-ranking, evaluation pipelines. Second, multi-agent systems using LangGraph for complex, stateful workflows. Third, LLMOps on AWS — deploying with Bedrock, Lambda, and OpenSearch, with proper monitoring and cost control.

At PwC I built a multi-agent platform for a healthcare client that reduced response latency by 70% and delivered a 60% efficiency improvement for clinical staff. At EPAM right now I'm architecting a Hybrid RAG system for a Tier-1 US Finance client, and we've achieved sub-second inference latency. Before that at Cognizant I founded and ran our GenAI research lab, shipping a semantic search product with a 35% accuracy improvement.

I work fully remote and have delivered consistently across India, Poland, and US time zones. I'm looking for a senior role where I can keep solving enterprise-scale AI problems with a high-quality engineering team."

Delivery tip: Practice this out loud until it flows naturally at conversational speed. The structure is: who I am → three core skills → three concrete proof points → what I'm looking for. Never read from notes — interviewers notice.
