GenAI Interview Saviour
A pressure-tested system for resume defense, GenAI depth, production debugging, coding, system design, mocks, and remote-company interviews. Every main question has a complete answer, hidden initially and revealed on click.
No sheet can guarantee success in every interview. This one is designed to make gaps visible and train the repeatable reasoning that strong GenAI interviews reward.
Expected Now
- Agents as production systems: runtime, identity, memory, tools, observability, evals, and recovery.
- Context engineering beyond prompt wording.
- Multimodal RAG over layout, tables, charts, images, and provenance.
- Continuous evaluation and production incident diagnosis.
- Security against injection, excessive agency, leakage, and unbounded consumption.
What No Longer Impresses Alone
- Framework-name memorization without trade-offs.
- A demo chatbot with no evaluation or failure handling.
- Claiming a model is “best” without a workload benchmark.
- Using an agent where a workflow is sufficient.
- Hard-coded old model names and prices.
- Amazon Bedrock: managed foundation-model access, Knowledge Bases, Agents, Guardrails, model evaluation/customization, and multimodal RAG capabilities.
- Amazon Bedrock AgentCore: production agent platform concerns such as secure runtime, enterprise tool/data connectivity, identity, tracing, debugging, evaluation, and AgentCore Policy for agent-to-tool controls.
- Inference optimization: evaluate prompt caching and intelligent prompt routing where supported; confirm current model, region, and cost behavior.
- Inference choice: Bedrock for managed models; SageMaker AI endpoints/HyperPod or EC2/EKS/ECS when deeper model/runtime control is required.
- Architecture: apply the AWS Generative AI Lens and security reference architecture, not only service wiring.
- Gemini Enterprise Agent Platform / Vertex AI: build, deploy, govern, and optimize enterprise agents and GenAI workloads.
- Vertex AI Agent Engine: managed agent runtime with sessions, Memory Bank, code execution, observability, private networking, CMEK, resource/concurrency controls, and enterprise compliance support.
- Google Agent Development Kit: an open-source, model-agnostic framework with current SDKs across Python, TypeScript, Go, Java, and Kotlin; evaluate it alongside A2A for interoperable agents.
- RAG choices: RAG Engine, Vector Search, Agent Search, Feature Store vector retrieval, or custom AlloyDB/Cloud SQL patterns depending on scale and control.
- Serving: Vertex AI managed endpoints/Model Garden or Cloud Run/GKE for application and custom-runtime control.
Strong answer: “I maintain a task-specific evaluation and route to the smallest model tier that passes quality, safety, latency, and cost gates. I avoid choosing from benchmark headlines alone.”
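The routing rule in that answer can be sketched as follows. This is a minimal illustration, not a provider API: the tier names, the `TierResult` and `Gates` fields, and every threshold are hypothetical placeholders for whatever your own evaluation suite reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierResult:
    """Eval results for one model tier on a task-specific suite (hypothetical fields)."""
    name: str
    relative_cost: float   # cost per request relative to the cheapest tier
    quality: float         # fraction of quality cases passed, 0..1
    safety: float          # fraction of safety cases passed, 0..1
    p95_latency_s: float


@dataclass(frozen=True)
class Gates:
    min_quality: float
    min_safety: float
    max_p95_latency_s: float
    max_relative_cost: float


def passes(result: TierResult, gates: Gates) -> bool:
    """A tier qualifies only if it clears all four gates."""
    return (
        result.quality >= gates.min_quality
        and result.safety >= gates.min_safety
        and result.p95_latency_s <= gates.max_p95_latency_s
        and result.relative_cost <= gates.max_relative_cost
    )


def pick_tier(results, gates):
    """Return the cheapest tier that passes every gate, or None if none does."""
    passing = [r for r in results if passes(r, gates)]
    return min(passing, key=lambda r: r.relative_cost, default=None)


# Illustrative numbers only; real values come from your own eval runs.
results = [
    TierResult("small", 1.0, 0.81, 0.99, 0.6),
    TierResult("medium", 4.0, 0.93, 0.99, 1.1),
    TierResult("large", 15.0, 0.96, 0.99, 2.4),
]
gates = Gates(min_quality=0.90, min_safety=0.98, max_p95_latency_s=2.0, max_relative_cost=20.0)
chosen = pick_tier(results, gates)
```

With these made-up numbers the small tier fails the quality gate and the large tier fails the latency gate, so the medium tier is selected. Returning `None` when nothing passes forces an explicit decision instead of a silent fallback.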
| Dimension | Managed Platform | Custom Runtime |
|---|---|---|
| Time to production | Faster; runtime/identity/memory/observability integrated | Slower; assemble and operate components |
| Control | Platform constraints and regional feature availability | Full orchestration/runtime/model control |
| Portability | Potential cloud/platform coupling | Potentially portable with more ownership |
| Best fit | Enterprise team prioritizing managed governance and speed | Differentiated orchestration, unusual runtime, or multi-cloud need |
Interview-safe answer: use each provider's current unified API and capability discovery, but select models with an evaluation suite rather than a memorized leaderboard. For OpenAI, the Responses API is the primary agentic build path and works with tools, structured outputs, and the Agents SDK; its latest-model guide currently recommends GPT-5.5 for complex reasoning and coding. On AWS and GCP, confirm the model, feature, region, quota, and data-governance combination before committing to an architecture.
Exact-date answer: the current stable MCP specification is 2025-11-25. A 2026-07-28 release candidate was announced on May 21, 2026, but its final release is scheduled for July 28, 2026, so do not call it the current stable spec yet. The RC adds a stateless core, Extensions, MCP Apps, Tasks, and authorization hardening.
| Interface | Purpose | Production concern |
|---|---|---|
| Function/tool calling | A model emits a typed request to an application-owned tool | Schema validation, authorization, idempotency |
| MCP | An application connects models/agents to reusable tools, resources, and prompts | Server trust, OAuth/authz, data boundaries, approvals |
| A2A | Agents discover and communicate with other agents across systems | Identity, capability trust, task lifecycle, inter-agent security |
Threat-model against the OWASP Top 10 for LLM Applications 2025 and the OWASP Top 10 for Agentic Applications 2026. The agentic list expands the focus to goal hijacking, tool misuse, identity and privilege abuse, memory poisoning, insecure inter-agent communication, cascading failures, and rogue agents.
1. Assume model output is untrusted: never use it as authorization; validate every argument and output.
2. Bound agency: least-privilege identities, egress allowlists, step/time/token budgets, idempotency, and human approval for high-impact actions.
3. Protect state: isolate tenants, provenance-tag memory, expire/delete it, and prevent untrusted content from becoming durable instructions.
4. Verify continuously: adversarial evals, red-team tests, audit trails, kill switches, and incident exercises.
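The first two controls (treat model output as untrusted, bound agency) can be sketched as a validation layer between the model and the tools. Everything named here is hypothetical: the tool names, the `ALLOWED_TOOLS` registry shape, and the `user_scopes` parameter stand in for your own schema and identity system.

```python
# Hypothetical tool registry: declared parameter types plus an impact flag.
ALLOWED_TOOLS = {
    "get_invoice": {"params": {"invoice_id": str}, "high_impact": False},
    "issue_refund": {"params": {"invoice_id": str, "amount_cents": int}, "high_impact": True},
}


class ToolCallRejected(Exception):
    """Raised when a model-proposed tool call fails a policy check."""


def authorize_tool_call(call: dict, *, user_scopes: set, approved_by_human: bool = False) -> dict:
    """Validate a model-emitted tool call; return a clean copy or raise."""
    name = call.get("name")
    spec = ALLOWED_TOOLS.get(name)
    if spec is None:
        raise ToolCallRejected(f"unknown tool: {name!r}")
    # Authorization comes from the caller's identity, never from model output.
    if name not in user_scopes:
        raise ToolCallRejected(f"caller lacks scope for {name}")
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(spec["params"]):
        raise ToolCallRejected("arguments do not match the declared schema")
    for key, expected in spec["params"].items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(args[key], bool) or not isinstance(args[key], expected):
            raise ToolCallRejected(f"{key} must be {expected.__name__}")
    # High-impact actions need an approval signal from outside the model.
    if spec["high_impact"] and not approved_by_human:
        raise ToolCallRejected(f"{name} requires human approval")
    return {"name": name, "arguments": dict(args)}
```

The design choice worth defending in an interview is that scopes and approval are passed in by the application, so no text the model produces can grant itself permission.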
| Freshness class | Examples | Revision rule |
|---|---|---|
| Durable foundations | Attention, embeddings, RAG failure modes, distributed-systems trade-offs, least privilege | Learn deeply; explain from first principles |
| Review monthly | Agent frameworks, cloud capabilities, protocol specs, security guidance | Check official docs and release notes |
| Verify before interview | Model names, prices, quotas, context limits, regions, benchmark claims | Use the provider's current documentation |
| Verify with recruiter | Remote eligibility, interview steps, role scope, team stack | Current job posting and recruiter always win |
| Answer Type | Structure | What Good Sounds Like |
|---|---|---|
| 30 sec | Definition → distinction → when to use | Answer directly, distinguish the nearest concept, then give one decision rule. |
| 2 min | Problem → options → choice → trade-off → metric | Show judgment, one rejected option, and evidence. |
| 10 min | Requirements → design → deep dive → failure → scale → security → eval | Drive the conversation and invite the interviewer to choose a deep dive. |
| Behavioral | Situation → stakes → your action → result → lesson | Keep “we” for context and “I” for your contribution. |
| Debugging | Scope → hypotheses → instrumentation → isolate → mitigate → prevent | Do not jump directly to a favorite fix. |
Before Drawing
- Which error is unacceptable: false answer, missed answer, unsafe action, or slow response?
- Does knowledge change, or does behavior need adaptation?
- Is output advisory or does it trigger actions?
- What does success mean offline and online?
- Which data may enter prompts, logs, vendors, and eval sets?
Close Every Design With
- Top three failures and mitigations.
- Quality, latency, cost, and safety release gates.
- Progressive rollout and rollback plan.
- One limitation and next iteration.
- How it changes at 10× traffic or corpus.
| Decision | Choose A When | Choose B When | Senior Caveat |
|---|---|---|---|
| RAG vs fine-tune | Facts change; citations/access control matter | Behavior, style, format, or task skill must change | They are complementary |
| Workflow vs agent | Path is known; auditability matters | Open-ended planning adds measured value | Start deterministic |
| Hosted vs open model | Fast iteration and managed capability | Control, privacy, specialization, scale economics | Include serving/ops cost |
| Large vs small model | Hard reasoning and broad tasks | Routing, extraction, classification, high volume | Use eval-driven routing |
| Managed vs custom RAG | Simple requirements and small team | Custom parsing, ranking, security, evaluation | Speed can outweigh flexibility |
| Lambda vs containers | Bursty short-lived orchestration | Long-running, streaming, custom runtime, GPU | Know execution limits |
Agent & Context Engineering
- Context engineering: optimize instructions, tools, memory, evidence, and state inside a limited token budget.
- Agent harness: checkpointing, progress, budgets, permissions, isolation, and recovery across long tasks.
- Tool ergonomics: clear schemas, bounded outputs, actionable errors, evaluation-driven improvement.
- Agent evals: outcome, path efficiency, tools, safety, and recovery across nondeterministic runs.
Platform & Models
- MCP: hosts, clients, servers, tools/resources/prompts, transports, authorization, confused-deputy risk.
- Multimodal: OCR/layout/table/chart understanding, provenance, modality-specific evaluation.
- Inference: prefill/decode, KV cache, continuous batching, quantization, routing, TTFT.
- Security: constrain permissions, tools, outputs, consumption, and autonomy.
1. Representative tasks: normal, edge, adversarial, multilingual, long-context, and no-answer cases.
2. Layered graders: deterministic checks, calibrated human labels, then scalable LLM judges.
3. Track slices: document type, user, query complexity, tool, model, and route.
4. Gate releases: block critical safety, quality, latency, or cost regressions.
5. Close loop: production traces and corrections become new eval cases.
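The release-gating step can be sketched as a comparison of a candidate against a baseline with per-metric tolerances. The metric names and tolerance values below are assumptions chosen for illustration; a real gate would read them from your evaluation pipeline.

```python
# Hypothetical metric names; direction decides what counts as a regression.
HIGHER_IS_BETTER = {"quality", "safety"}
LOWER_IS_BETTER = {"p95_latency_s", "cost_per_request"}


def release_blockers(baseline: dict, candidate: dict, tolerance: dict) -> list:
    """Return human-readable reasons to block; an empty list means ship."""
    blockers = []
    for metric, allowed in tolerance.items():
        delta = candidate[metric] - baseline[metric]
        if metric in HIGHER_IS_BETTER:
            regression = -delta
        elif metric in LOWER_IS_BETTER:
            regression = delta
        else:
            raise ValueError(f"unknown metric direction: {metric}")
        if regression > allowed:
            blockers.append(f"{metric}: regressed by {regression:.3f} (allowed {allowed:.3f})")
    return blockers


# Illustrative numbers only.
baseline = {"quality": 0.92, "safety": 0.99, "p95_latency_s": 1.20, "cost_per_request": 0.010}
candidate = {"quality": 0.90, "safety": 0.99, "p95_latency_s": 1.25, "cost_per_request": 0.011}
tolerance = {"quality": 0.01, "safety": 0.0, "p95_latency_s": 0.10, "cost_per_request": 0.002}
blockers = release_blockers(baseline, candidate, tolerance)
```

Here only the quality drop exceeds its tolerance, so the release is blocked for one stated reason. A safety tolerance of zero encodes "any safety regression blocks".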
| Threat | Failure | Controls |
|---|---|---|
| Prompt injection | Untrusted content changes behavior | Trust zones, policy layer, validation; treat retrieved/tool content as data |
| Excessive agency | Too much autonomy/permission | Least privilege, scoped tools, budgets, approval, reversible actions |
| Sensitive disclosure | PII/secrets leak via prompts/output/logs | Minimize data, DLP, encryption, retention, isolation |
| Improper output handling | Model output becomes executable input | Schema validation, escaping, parameterization, sandboxing |
| Unbounded consumption | Loops cause denial of wallet/service | Token/tool/time budgets, quotas, circuit breakers |
| MCP authorization | Token theft/confused deputy | Explicit consent, scoped short-lived tokens, trusted redirects |
Diagnose
- Network/queue → retrieval/rerank → prompt → prefill/TTFT → decode → post-process.
- Measure P50/P95/P99 by model, route, context, output, concurrency, warm/cold.
- Long input raises prefill time; long output raises decode time. Optimize the correct stage.
- Separate perceived streaming latency from total completion time.
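The percentile breakdown in the Diagnose list can be sketched as follows. This is a minimal sketch using the nearest-rank definition of a percentile; the `(route, latency)` sample shape is an assumption, and production systems would usually get these figures from their tracing or metrics backend instead.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples):
    """samples: iterable of (route, latency_seconds) pairs; returns P50/P95/P99 per route."""
    by_route = defaultdict(list)
    for route, latency in samples:
        by_route[route].append(latency)
    return {
        route: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for route, vals in by_route.items()
    }
```

Grouping by route first matters because a blended P95 can hide one slow route behind many fast ones. The same grouping key can be extended to model, context size, or warm/cold state.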
Optimize Safely
- Model routing, context compression, retrieval precision, parallel independent work.
- Semantic/prefix cache with freshness and tenant-safe keys.
- Continuous batching, KV cache, quantization, speculative decoding.
- Bound outputs, tools, retries, loops; verify quality after every change.
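The "freshness and tenant-safe keys" point can be sketched as a cache key that includes the tenant and an index version, so one tenant can never read another's cached answer and a refreshed index invalidates old entries. The field names are assumptions for illustration.

```python
import hashlib
import json


def cache_key(*, tenant_id, model, prompt, index_version, params=None):
    """Build a deterministic key scoped to tenant, model, prompt, and index version."""
    payload = json.dumps(
        {
            "tenant": tenant_id,             # prevents cross-tenant cache hits
            "model": model,                  # different models give different answers
            "prompt": prompt,
            "index_version": index_version,  # freshness: a new index means a new key
            "params": params or {},
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

This covers exact-match caching only. A semantic cache needs the same tenant and version scoping applied as a filter before any similarity lookup.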
Engineering Maturity
- What are the hardest production failure modes?
- How do you evaluate changes and what blocks release?
- Where do you intentionally avoid LLMs or agents?
- How are quality, latency, cost, and safety trade-offs decided?
- What does observability cover across retrieval, models, and tools?
Role & Team
- What does excellent impact look like in 90 days?
- Which decisions would this role own?
- What technical disagreement is the team working through?
- How are incidents and async handoffs handled?
- Why have strong engineers struggled in this role?
| Situation | Useful Script |
|---|---|
| I do not know | “I have not implemented that directly. My understanding is X. I would verify Y; the closest system I built was Z.” |
| Ambiguous | “May I clarify failure cost, traffic, freshness, and whether output triggers actions?” |
| Wrong assumption | “That conflicts with the new constraint. I would revise X; the trade-off becomes Y.” |
| Confidential | “I cannot share identifiers or exact volumes, but I can explain architecture, measurement, relative scale, and my contribution.” |
| Coding stuck | “I will state a brute-force baseline, test it on a small example, then optimize the bottleneck.” |
| Metric challenged | “Fair point. The metric measured X, not Y. The limitation is Z, and I would measure A next.” |
- Resume (8 min): career walkthrough; defend sub-second latency and one percentage claim.
- Depth (8 min): hybrid retrieval, reranking, evaluation, and when RAG is wrong.
- Design (15 min): secure multi-tenant financial research assistant with citations at 500 QPS.
- Incident (8 min): P95 doubles and faithfulness drops after an index refresh.
- Behavioral (6 min): disagreement, failure, remote collaboration, why this role.
1. Scope: slice by source, parser, document/query type, index version; verify eval set.
2. Separate stages: inspect Recall@K/MRR and retrieved chunks before blaming generation.
3. Hypotheses: parsing, duplicates/noise, metadata, embedding mismatch, top-k dilution, stale index.
4. Mitigate: roll back index alias, isolate bad source, restore configuration.
5. Prevent: ingestion gates, canary index, regression suite, versioned promotion.
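The two retrieval metrics named in the "separate stages" step can be computed as below. This is a minimal sketch with binary relevance labels; the document IDs in the usage note are made up.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_list, relevant_set) pairs, one per query."""
    if not runs:
        raise ValueError("no queries")
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

For example, `recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, 3)` is 0.5 because one of the two relevant documents was retrieved. Computing these per slice (source, parser, index version) is what localizes a regression to retrieval before anyone touches the prompt.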
Immediate
- Restrict destructive tool; stop affected runs.
- Use idempotency keys and deduplication.
- Inspect state transitions, tool results, retries, model output.
- Replay safely with the same state in a sandbox.
Prevent
- Max steps/time/token/tool budgets and repeated-state detection.
- Explicit transitions and completion criteria.
- Retry only transient failures; give tools actionable errors.
- Human approval and regression evals.
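The step budget and repeated-state detection from the Prevent list can be sketched as a small guard called once per agent step. The class name, the state-fingerprint approach, and the default limits are assumptions; agent frameworks typically expose their own recursion or step limits.

```python
import hashlib
import json


class BudgetExceeded(Exception):
    """Raised when an agent run exceeds its step budget or keeps revisiting a state."""


class LoopGuard:
    def __init__(self, max_steps, max_repeats=2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self._seen = {}  # state fingerprint -> number of visits

    def check(self, state: dict) -> None:
        """Call once per step, before executing the next action."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"exceeded {self.max_steps} steps")
        # Fingerprint the state so identical states are detected cheaply.
        blob = json.dumps(state, sort_keys=True, default=str).encode("utf-8")
        fingerprint = hashlib.sha256(blob).hexdigest()
        self._seen[fingerprint] = self._seen.get(fingerprint, 0) + 1
        if self._seen[fingerprint] > self.max_repeats:
            raise BudgetExceeded("agent is revisiting the same state")
```

What goes into `state` is the real design decision: include the current node and tool arguments, and exclude timestamps or counters that would make every state look unique.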
1. Check eval representativeness and recent production failures.
2. Slice by user, language, task, context length, tool, and route.
3. Validate graders for bias, leakage, weak rubrics, and poor agreement.
4. Compare paired traces and UX/latency changes, not only answer scores.
5. Roll back/reduce traffic, add failures to evals, recalibrate gates.
Instrument
- Cost/request by model, route, tenant, tokens, tools, retries.
- Latency waterfall: queue, retrieval, rerank, prefill, decode, tools.
- Compare deployment, traffic mix, prompt/context, provider changes.
Act Safely
- Stop runaway loops/retries and enforce budgets.
- Route simple work to small models; compress irrelevant context.
- Parallelize independent work; cache with safe freshness/tenant keys.
- Canary and prove quality remains acceptable.
Architecture
- Host controls UX, consent, policy, model, and clients.
- Focused servers expose bounded tools/resources with schemas.
- Discover/load tools on demand; avoid flooding context.
- Registry, health, versioning, audit, evaluation, revocation.
Security
- Per-user scoped auth; no token passthrough.
- Trusted servers, explicit consent, short-lived tokens.
- Sandbox execution; minimize data entering model context.
- Budgets, validation, approval, and kill switch.
Clarify: async concurrency, timeout, retryable errors, backoff/jitter, rate limits, idempotency, validation, budget, cancellation, metrics, fallback.
    async def bounded_llm_call(request, *, timeout_s, max_attempts, budget):
        # validate + reserve budget; acquire concurrency permit
        # call with timeout/idempotency; retry transient errors only
        # validate output; emit trace/metrics; release resources
        ...
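One way to fill in that stub is sketched below, using only the standard library. Several names are assumptions that are not in the original signature: `call_model` (an injected async callable standing in for the provider client), `semaphore`, `base_delay_s`, the `TransientError` and `BudgetExhausted` classes, and the `remaining_calls` budget field. Idempotency keys and trace/metric emission are left out to keep the sketch short.

```python
import asyncio
import random


class TransientError(Exception):
    """Retryable failure, e.g. a throttle or 5xx (hypothetical classification)."""


class BudgetExhausted(Exception):
    """Raised when the caller's attempt budget is spent."""


async def bounded_llm_call(request, *, call_model, semaphore, timeout_s,
                           max_attempts, budget, base_delay_s=0.2):
    # Validate the request before spending any budget.
    if not isinstance(request.get("prompt"), str) or not request["prompt"]:
        raise ValueError("request needs a non-empty 'prompt' string")

    last_error = None
    async with semaphore:  # concurrency permit; released on every exit path
        for attempt in range(1, max_attempts + 1):
            if budget["remaining_calls"] < 1:
                raise BudgetExhausted("attempt budget is spent")
            budget["remaining_calls"] -= 1  # every attempt costs budget
            try:
                response = await asyncio.wait_for(call_model(request), timeout=timeout_s)
            except (TransientError, asyncio.TimeoutError) as exc:
                last_error = exc
                if attempt < max_attempts:
                    # Exponential backoff with full jitter.
                    cap = base_delay_s * 2 ** (attempt - 1)
                    await asyncio.sleep(random.uniform(0, cap))
                continue
            # Validate output before returning it to the caller.
            if not isinstance(response, dict) or not isinstance(response.get("text"), str):
                raise ValueError("model response failed validation")
            return response
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Non-transient exceptions propagate immediately instead of being retried. The sketch holds the permit while backing off, which is simple but reduces throughput under throttling; releasing it during the sleep is a reasonable variation to discuss.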
Likely Cross-Questions
- Why did you move from full-stack to data engineering, then data science, then GenAI?
- Why are you considering a move after joining EPAM in August 2025?
- What makes you “senior” beyond years of experience?
- Which role changed your technical judgment most?
- Why remote, and what evidence shows you succeed remotely?
- Which parts of your resume are hands-on versus team-level outcomes?
Strong Answer Anchors
- Make the progression intentional: software → data platforms → models → LLM products.
- For a current-job move, stay positive and name the specific scope you seek.
- Define seniority through ambiguity handling, trade-offs, reliability, mentoring, and hiring.
- Keep the walkthrough to 90 seconds; let the interviewer choose the deep dive.
Architecture Pressure Test
- What made it hybrid? How did you fuse sparse and dense results?
- What was multimodal: scanned PDFs, tables, charts, images, or all four?
- How did you parse tables without losing row/column relationships?
- Why OpenSearch instead of Pinecone, pgvector, or Bedrock Knowledge Bases?
- Why Lambda; what happens with cold starts, long jobs, or GPU inference?
- How did you enforce tenant isolation and finance-domain access control?
- How were citations, abstention, and hallucination controls implemented?
- What would break first at 10x corpus size or QPS?
Answer Checklist
- Draw the full request path in under two minutes.
- Separate decisions you owned from platform constraints you inherited.
- Give one rejected design and the reason it lost.
- Name the golden dataset and the retrieval metric used to tune hybrid weights.
- Explain failure modes: parser errors, stale index, no relevant context, model timeout.
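For the question about fusing sparse and dense results, one common answer is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not a claim about what any particular system used; `k=60` is a commonly used default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse = ["a", "b", "c"]   # e.g. keyword/BM25 order
dense = ["b", "d", "a"]    # e.g. vector-similarity order
fused = reciprocal_rank_fusion([sparse, dense])
```

RRF uses ranks only, so it avoids normalizing BM25 and cosine scores onto one scale. If you instead tuned a weighted score sum, be ready to name the golden dataset and retrieval metric that chose the weights.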
Cross-Questions
- Was sub-second measured at P50, P95, or P99?
- Did streaming make perceived latency lower than total latency?
- What were the before/after values and number of sampled requests?
- Which change contributed most: caching, parallelism, smaller model, connection reuse, or retrieval tuning?
- How did you handle Lambda cold starts and Bedrock throttling?
- What accuracy or cost trade-off did latency optimization introduce?
Measurement Template
- Metric: TTFT / retrieval / end-to-end.
- Distribution: P50, P95, P99, not average alone.
- Conditions: concurrency, prompt/context tokens, model, warm/cold.
- Instrumentation: trace spans around every stage.
- Guardrail: verify quality and error rate did not regress.
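The split between time to first token and end-to-end latency in this template can be sketched with a wrapper around any chunk iterator. The iterator stands in for a provider's streaming response, which is an assumption; real clients differ in how they expose chunks.

```python
import time


def measure_stream(chunks):
    """Consume an iterator of text chunks and record TTFT and total time in seconds."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # First chunk arrived: this is what the user perceives as responsiveness.
            ttft = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(parts)}
```

Reporting both numbers answers the "did streaming lower perceived latency?" question directly. `ttft_s` is `None` for an empty stream, which is worth counting as an error rather than a fast response.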
Must-Answer Technical Questions
- What task required fine-tuning rather than RAG or prompting?
- Which LLaMA/Gemma sizes and base versus instruct variants?
- What were the dataset size, schema, train/validation split, and cleaning rules?
- Which target modules, rank r, alpha, dropout, learning rate, and epochs?
- Did you use QLoRA? Explain NF4, double quantization, and compute dtype.
- How did you detect overfitting and catastrophic forgetting?
- What baseline and evaluation metric proved improvement?
- How were adapters merged, versioned, and served?
Decision Defense
- Fine-tune for behavior/style/task adaptation; use RAG for changing factual knowledge.
- Explain parameter and memory savings, not only “cheaper.”
- Discuss data licensing, PII removal, memorization, and rollback.
- Compare LoRA with full fine-tuning, prompt tuning, and continued pretraining.
- Bring one failed experiment and what it taught you.
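The "parameter savings" point can be made concrete with arithmetic. For a weight matrix of shape d_out × d_in, LoRA trains two low-rank factors of shapes d_out × r and r × d_in, so it adds r · (d_in + d_out) trainable parameters. The 4096 × 4096 layer and rank 8 below are illustrative values, not taken from any specific model.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one adapted weight matrix."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the original dense weight matrix."""
    return d_in * d_out


d_in = d_out = 4096   # illustrative layer size
rank = 8              # illustrative LoRA rank
added = lora_trainable_params(d_in, d_out, rank)   # 65,536
dense = full_params(d_in, d_out)                   # 16,777,216
fraction = added / dense                           # 1/256, about 0.39%
```

The memory saving follows from the same count: gradients and optimizer state are kept only for the adapter parameters, while the frozen base weights need no optimizer state.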
Cross-Questions
- Why multiple agents instead of one workflow with tools?
- What did each agent own, and how did they communicate?
- How did you prevent duplicate actions and non-deterministic loops?
- What state was persisted, and how did retries/checkpointing work?
- How were tools authorized and outputs validated?
- Where was human-in-the-loop mandatory?
- Why microservices, and what operational cost did that add?
- How did you evaluate agent-level versus end-to-end success?
Senior-Level Trade-Offs
- Agents increase flexibility but reduce predictability and observability.
- Use deterministic nodes for rules, calculations, and compliance gates.
- Bound execution with budgets, timeouts, allowed transitions, and circuit breakers.
- Expose the exact portion you personally architected or implemented.
| Resume Claim | Interviewer Will Ask | Your Required Evidence | Common Trap |
|---|---|---|---|
| 40% retrieval accuracy | Does accuracy mean Recall@K, MRR, NDCG, or answer correctness? | Golden queries, labels, baseline, final metric, confidence/error analysis | Calling semantic similarity “accuracy” |
| 70% assessment time | Whose time and from how long to how long? | Workflow timing, sample count, review rate, quality guardrail | Ignoring time shifted to reviewers |
| 68% execution time | Which stages were parallelized and why safe? | Trace waterfall, dependency DAG, before/after percentiles | Comparing unmatched workloads |
Follow-Ups
- What is chain parallelism, and how does it differ from batching?
- How did you handle partial failure when parallel calls diverged?
- How did you tune hybrid retrieval and reranking?
- What did the risk-intelligence agents produce, and how was correctness verified?
- What changed after production feedback?
Answer Pattern
- Give the formula before the percentage.
- State absolute values as well as relative improvement.
- Name the quality metric held constant during speed optimization.
- Acknowledge limitations and what you would measure next.
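"Formula before percentage" can be practiced with a small helper that reports absolute values alongside the relative change. The numbers are hypothetical and chosen only to keep the arithmetic easy to check.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction = (before - after) / before."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


def describe(metric: str, before: float, after: float, unit: str) -> str:
    """State absolute values first, then the derived percentage."""
    pct = relative_reduction(before, after) * 100
    return f"{metric}: {before:g} {unit} -> {after:g} {unit} ({pct:.0f}% reduction)"


# Hypothetical example: P95 latency measured before and after a change.
summary = describe("P95 latency", 1200, 900, "ms")
```

Saying "from 1200 ms to 900 ms at P95, a 25% reduction" preempts the follow-up about what the percentage was computed from.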
Semantic Search / RAG
- Why Pinecone, which index/metric, and how did namespaces/metadata work?
- What did the 35% accuracy improvement measure?
- The resume says latency reduced but gives no number; what can you honestly claim?
- How did you select embeddings, chunk size, top-k, and reranker?
- What was required to move an R&D prototype into production?
Other Cognizant Claims
- Why Azure OpenAI versus Gemini Pro for each workflow step?
- What is metadata linking, and how was link quality evaluated?
- How did Checkmarx, Sonar, and Black Duck differ?
- What does “100% security compliance” specifically mean?
- How did you handle secrets, prompt injection, PII, and dependency risk?
Data Engineering Questions
- Draw one GCP ETL pipeline end to end: sources, orchestration, transforms, storage, consumers.
- Airflow versus Cloud Functions: what belonged where?
- How did you reduce cost by 40%: slots, partitioning, scheduling, storage, or serverless changes?
- How was manual effort reduced by 80%, and what remained manual?
- How did you ensure idempotency, backfills, schema evolution, and data quality?
- Explain BigQuery partitioning, clustering, and query-cost control.
Transition Questions
- Why leave data engineering for data science?
- How does your data-engineering background make you better at GenAI?
- Which parts of MLOps/LLMOps are direct extensions of data-platform work?
- What ML knowledge did you have to build deliberately?
- Would you still be comfortable owning a production data pipeline today?
Oneness Tech Solutions
- What prediction did the analytics app make, using which model and features?
- How did Django, REST APIs, database, model inference, and UI connect?
- What changed to improve organic traffic by 65%, and which analytics source measured it?
- What production incident or performance bottleneck did you solve?
- What does “end-to-end product lifecycle” mean in concrete deliverables?
CodeSpeedy Internship
- How did the weather app handle API errors, caching, and rate limits?
- How was certificate generation automated?
- Explain the multithreaded socket system and race-condition risks.
- Threads versus processes versus asyncio in Python?
- How did technical writing improve your engineering communication?
| Cluster | Likely Deep Probe | Have Ready |
|---|---|---|
| LangChain / LangGraph / LlamaIndex | When would you avoid each framework? | One production example, one limitation, one plain-Python alternative |
| AWS / GCP | Map equivalent services and explain operational trade-offs | One architecture on each cloud, IAM/networking/cost story |
| PyTorch / TensorFlow / MLflow | Show training/evaluation/registry experience | Real model, experiment setup, artifact/version flow |
| FastAPI / Flask / Django | Async, validation, middleware, deployment, testing | Why FastAPI for LLM serving; when Django still wins |
| Docker / Kubernetes / CI/CD | Health checks, secrets, autoscaling, rollback | A deployment you operated, not only packaged |
Leadership / Awards Questions
- Why did you receive nine PwC awards and seven Cognizant awards? Give two distinct stories.
- How was the Cognizant top-1% / 5-of-5 rating determined?
- What did the EPAM Delivery Head Award recognize?
- How did you influence without formal authority?
- Tell me about a teammate you mentored and the observable result.
Interviewer Questions
- What was your A3-level interview rubric?
- How did you distinguish memorization from real depth?
- Tell me about a close hire/no-hire decision.
- How did you reduce bias and calibrate with other interviewers?
- What common GenAI weakness did you observe across 50+ candidates?
Behavioral Cross-Questions
- What is your biggest technical failure in production?
- When did you disagree with an architect or client?
- What metric did you improve but later realize was incomplete?
- What work are you least proud of, and what changed afterward?
- When did a GenAI approach fail and a simpler approach win?
- What is the hardest feedback you received?
Closing Questions
- Why should we hire you over a stronger ML researcher?
- Why should we hire you over a stronger backend engineer?
- What are your two biggest gaps for this role?
- What would your 30/60/90-day plan be?
- Which architecture decision would you revisit first in our product?
- What questions do you have for us that reveal engineering maturity?
| Company | Remote Signal | Best-Fit Role | Typical Emphasis | Difficulty |
|---|---|---|---|---|
| Twilio | Remote-first; remote India roles appear | Senior AI/ML, backend platform, applied AI | APIs, distributed systems, coding, ownership | High |
| Atlassian | Team Anywhere; distributed-first with entity/timezone rules | Senior/Principal ML or platform engineer | DSA, code design, system design, values | High |
| Databricks | Remote/hybrid varies strongly by role/location | ML platform, GenAI, solutions/field engineering | DSA, distributed data, ML systems, depth | Very high |
| GitLab | All-remote | AI-powered features, backend, MLOps | Async writing, values, practical technical depth | High |
| Elastic | Distributed company | Search/GenAI, relevance, platform engineering | Search internals, distributed systems, collaboration | High |
| Canonical | Remote-first | ML platform, Python/backend, cloud | Written application, academics, technical breadth | High / long |
| Automattic | Fully remote/global | Applied AI, experienced software engineer | Async communication, paid trial, product judgment | High / practical |
| Zapier | Fully remote | Applied AI, automation platform, backend | Written evidence, skills assessment, values | Medium-high |
| Remote | Remote-first, globally async | AI/data/backend platform | Async ownership, product engineering, 4–5 video rounds | Medium-high |
| Deel | Work-from-anywhere, global | AI automation, data, backend | Speed, ownership, automation, product impact | High |
Prepare
- Design a globally reliable notification or conversational-AI API.
- Idempotency keys, retries, rate limits, queues, webhook security, observability.
- Medium DSA in Python with clean tests and complexity analysis.
- Explain how you would add safe LLM features to customer communications.
Questions to Expect
- How would you prevent duplicate message delivery?
- Design a multi-tenant AI support platform with strict latency SLOs.
- How do you operate an API during provider/model failure?
- Tell me about a time you improved reliability across teams.
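One common answer to the duplicate-delivery question is an idempotency key checked before the side effect. The sketch below is in-memory and single-process, with a hypothetical `IdempotentSender` class and an injected `send` callable; it is not any vendor's API. A production version would need a durable store with an atomic insert-if-absent and a retention window.

```python
class IdempotentSender:
    """Deliver each idempotency key at most once and replay the stored result on retries."""

    def __init__(self, send):
        self._send = send      # callable that performs the real delivery
        self._results = {}     # idempotency key -> result of the first delivery

    def send_once(self, idempotency_key: str, payload: dict):
        if idempotency_key in self._results:
            # Retry of a request we already handled: return the original result.
            return self._results[idempotency_key]
        result = self._send(payload)
        self._results[idempotency_key] = result
        return result
```

The key should be generated by the client per logical message, so a network retry reuses it while a genuinely new message gets a new one.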
Prepare
- DSA medium problems while narrating trade-offs and edge cases.
- Code design: extensible classes/APIs, tests, readability, change requests.
- System design: multi-tenant collaboration, permissions, search, scale, reliability.
- Five values stories with conflict, customer impact, and learning.
Likely Questions
- Design AI search across Jira and Confluence with permissions preserved.
- How would you evaluate and roll out an AI feature safely?
- Tell me when you changed your mind after strong disagreement.
- Write code that remains easy to extend after a new requirement.
Prepare
- Spark execution: partitions, shuffle, joins, skew, caching, AQE, failure recovery.
- Lakehouse concepts: Delta transaction log, ACID, schema evolution, streaming.
- ML/LLM platform design: experiment tracking, feature/data lineage, serving, evaluation.
- Medium-hard DSA and rigorous complexity analysis.
Likely Questions
- Design an enterprise RAG/evaluation platform for thousands of teams.
- Why does a distributed join become slow, and how do you fix skew?
- How would you make LLM evaluation reproducible and governed?
- Compare warehouse, lake, and lakehouse architectures.
Typical emphasis: recruiter/hiring-manager conversations, role-specific technical assessment or interview, cross-functional conversations, values, and strong written async communication. Difficulty is high because remote effectiveness is evaluated as an engineering competency.
Prepare
- Write a concise design proposal with assumptions, alternatives, risks, and rollout.
- Show documentation-first remote habits and transparent decision-making.
- Prepare product-aware answers for AI features across the DevSecOps lifecycle.
Expect
- How do you unblock yourself asynchronously?
- How would you ship and evaluate an AI coding feature?
- Tell me about a documented decision that prevented rework.
Typical loop: recruiter screen followed by role-specific interviews with technical and team stakeholders. Your strongest bridge is OpenSearch, hybrid retrieval, vector search, observability, and distributed remote delivery.
Prepare
- BM25, inverted indexes, HNSW, shards/replicas, refresh, mappings, filters.
- Hybrid retrieval evaluation and relevance tuning.
- Distributed failure, consistency, hot shards, and capacity planning.
Expect
- Why do vector and keyword search fail differently?
- How would you debug poor relevance?
- Design search for 10M changing documents.
Typical pattern: detailed written application/questions, domain review, assessments, and multiple interviews shown in a personalized candidate dashboard. Difficulty comes from breadth, written precision, and process length.
Prepare
- Polished written evidence for every claim; remove vague adjectives.
- Python/Linux/cloud fundamentals and open-source awareness.
- Academic and career choices, motivation, remote collaboration.
Expect
- Why Canonical and open source?
- What have you operated on Linux?
- Explain a technical decision clearly in writing.
Published process: application → interview → paid trial → final interview → offer. Developer hiring is conducted through the tools the company uses, so communication and practical execution matter as much as discussion.
Prepare
- A strong written project narrative with personal ownership and trade-offs.
- Practical coding, tests, incremental delivery, and async updates.
- A truthful explanation of how AI has changed your workflow.
Expect
- Why remote and why Automattic?
- Complete a realistic paid work trial.
- Show how you communicate progress without meetings.
Published early stages: application review → recruiter interview → 45–60 minute hiring-manager interview, followed by role-specific skills assessment/final stages. Applications focus strongly on evidence of what you can do.
Prepare
- Automation/product thinking and user impact.
- Clear written answers using specific outcomes.
- Reliable integrations: retries, idempotency, rate limits, webhooks.
Expect
- How would you add AI to an automation safely?
- How do you prioritize speed versus reliability?
- Show an async ownership example.
Remote
- Published average: roughly four weeks and 4–5 video interviews.
- Emphasize async ownership, documentation, product judgment, and compliant global systems.
- Prepare AI/data use cases for global HR, payroll, and support.
Deel
- Work-from-anywhere with high speed and accountability.
- Emphasize automation, measurable impact, and comfort across time zones.
- Prepare for product/system questions around global employment and compliance.
| Days | Primary Work | Output You Must Produce |
|---|---|---|
| 1–2 | Resume defense and metric evidence | 90-second pitch; one evidence sheet for every percentage |
| 3–4 | RAG, vector DB, evaluation, hallucination | Draw your EPAM/PwC architecture from memory; answer 15 follow-ups aloud |
| 5 | Agents, LangGraph, reliability | Defend why agents were needed; design failure controls |
| 6 | PEFT/LoRA, Transformers, LLM inference | Explain LoRA math and one real experiment with hyperparameters |
| 7 | ML/DL/NLP/statistics fundamentals | Rapid-fire recall without notes |
| 8 | AWS/GCP, backend, security, LLMOps | One production design and one incident/debug story |
| 9–10 | Python, SQL, DSA | 6 timed mediums; explain tests and complexity |
| 11 | System and ML system design | Two 45-minute mocks: RAG platform and AI support system |
| 12 | Company-specific prep | One-page Twilio, Atlassian, or Databricks brief |
| 13 | Behavioral and remote stories | Eight STAR stories with metrics and lessons |
| 14 | Full mock and gap repair | Record, review, shorten, and correct weak answers |
Core Questions
- Bias versus variance; diagnose each from train/validation curves.
- L1 versus L2 regularization; effect on coefficients and feature selection.
- Bagging versus boosting; Random Forest versus XGBoost.
- How trees choose splits; entropy versus Gini.
- When scaling matters and when it does not.
- How to handle missing values, outliers, high cardinality, and leakage.
- Why cross-validation can still be wrong for time series or grouped data.
Senior Answer Signals
- Start with the business cost of false positives/negatives.
- Choose splits that mimic production.
- Keep preprocessing inside the fitted pipeline.
- Explain calibration when probabilities drive decisions.
- Compare simple baseline before complex model.
| Question | Concise Answer Anchor |
|---|---|
| Precision vs recall? | Precision controls false positives; recall controls false negatives; choose from business cost. |
| ROC-AUC vs PR-AUC? | PR-AUC is more informative for rare positives; ROC-AUC can look optimistic. |
| Data drift vs concept drift? | Input distribution changes versus relationship between input and target changes. |
| Statistical vs practical significance? | A small effect can be statistically real but not worth shipping. |
| Offline vs online metric? | Offline enables fast iteration; online validates real behavior and system effects. |
Rapid Fire
- Confidence interval and bootstrap.
- Type I/II error and power.
- A/B-test sample ratio mismatch.
- Class imbalance and threshold tuning.
- Model calibration and Brier score.
LLM Evaluation Bridge
- Why LLM-as-judge can be biased.
- How to create a golden set.
- Pairwise versus pointwise judging.
- How to measure inter-rater agreement.
- How to gate releases on quality and latency.
Core Questions
- Backpropagation and the chain rule.
- Vanishing/exploding gradients and mitigations.
- ReLU, GELU, sigmoid, softmax: where and why.
- SGD versus Adam/AdamW; what weight decay changes.
- Batch norm versus layer norm.
- Dropout behavior in training versus inference.
- Learning-rate warmup, schedules, gradient clipping, accumulation.
Debugging Questions
- Loss becomes NaN: what do you inspect first?
- Train loss falls but validation worsens: what now?
- GPU out-of-memory: which levers preserve quality?
- Training is slow: data loader, precision, kernels, batching, profiling.
- How do mixed precision and loss scaling work?
Questions
- TF-IDF/BM25 versus dense embeddings.
- Word2Vec CBOW versus skip-gram; static versus contextual embeddings.
- Cosine, dot product, and Euclidean distance.
- Tokenization: BPE, WordPiece, SentencePiece, unknown/rare words.
- Why embedding anisotropy matters.
- Bi-encoder versus cross-encoder.
- NER, classification, sequence labeling, and generation metrics.
Applied Follow-Ups
- How do you pick an embedding model for finance?
- How do multilingual embeddings change evaluation?
- Why can cosine similarity retrieve irrelevant text?
- How do you detect embedding/index drift?
- When is BM25 the correct answer?
Architecture Questions
- Derive scaled dot-product attention: softmax(QKᵀ/√d)V.
- Why divide by √d?
- Why multiple attention heads?
- Encoder-only versus decoder-only versus encoder-decoder.
- Causal masks, padding masks, and cross-attention.
- Absolute, sinusoidal, RoPE, and ALiBi positions.
- Residual connections, layer norm, and feed-forward blocks.
Scaling / Inference
- Why attention is quadratic in sequence length.
- KV cache: what it stores and its memory cost.
- Prefill versus decode; why decode is memory-bound.
- Continuous batching and paged attention.
- Quantization trade-offs: weights, activations, KV cache.
- Mixture-of-Experts benefits and routing challenges.
| Stage | Purpose | Data / Objective | Main Risk |
|---|---|---|---|
| Pretraining | Learn language/world patterns | Large corpus; next-token prediction | Cost, contamination, unsafe knowledge |
| Continued pretraining | Adapt domain language | Unlabeled domain corpus | Forgetting and domain overfit |
| SFT | Teach instruction behavior | Prompt-response examples | Data quality and imitation limits |
| RLHF | Optimize human preferences | Preference/reward model + RL | Complexity and reward hacking |
| DPO | Preference alignment directly | Chosen/rejected pairs | Preference-data bias |
Rapid Fire
- Temperature, top-k, top-p, repetition penalty.
- Distillation versus quantization versus pruning.
- Catastrophic forgetting.
- Data contamination and benchmark leakage.
Decision Questions
- Prompting vs RAG vs fine-tuning.
- Open model vs hosted API.
- Small specialized model vs large general model.
- When not to use an LLM.
SQL / Data
- Window functions, CTEs, joins, deduplication, top-N per group.
- Query plan, indexes, partition pruning, clustering.
- Batch versus streaming and exactly-once semantics.
- Idempotency, backfills, late data, schema evolution.
- Data quality checks and lineage.
MLOps / LLMOps
- Experiment tracking, registry, reproducibility, and rollback.
- Feature/data/model/prompt version compatibility.
- Shadow, canary, and A/B deployment.
- Monitoring drift, quality, cost, latency, and safety.
- How LLMOps differs from classical MLOps.
Threats
- Direct and indirect prompt injection.
- Data exfiltration through tools or retrieved context.
- Over-privileged agents and confused-deputy attacks.
- Training-data poisoning and unsafe dependencies.
- PII/PHI leakage, retention, and regional compliance.
Controls
- Least privilege, scoped credentials, allowlisted tools, policy checks.
- Separate instructions from untrusted data; sanitize and label context.
- Input/output validation, DLP, encryption, audit logs.
- Human approval for consequential actions.
- Red-team evals, incident response, and kill switches.
Naive RAG uses only dense vector search (bi-encoder embeddings + ANN). It captures semantic similarity well but fails on exact keywords, entity names, numeric IDs, or ticker symbols.
Hybrid RAG combines dense vector search with sparse BM25 (TF-IDF based keyword matching). The two result lists are merged using Reciprocal Rank Fusion (RRF): each document gets a score of Σ 1/(k + rank_i) across all rankers, and documents are re-sorted by this fused score.
| Dimension | Naive (Dense) | Hybrid (Dense + BM25) |
|---|---|---|
| Semantic queries | Excellent | Excellent |
| Exact keyword/ID | Misses | Catches |
| Financial codes | Poor | Good |
| Setup complexity | Low | Medium |
| Recall improvement | Baseline | +15–40% typical (corpus-dependent; verify on your own eval set) |
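The RRF fusion step described above can be sketched in plain Python. This is a minimal illustration, not a library API: the function name, the sample document IDs, and the constant k = 60 (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit of each ranker contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from a sparse and a dense retriever.
bm25_hits = ["doc_ticker", "doc_a", "doc_b"]
dense_hits = ["doc_a", "doc_c", "doc_ticker"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that appear in both lists rise to the top even if neither ranker placed them first, which is why RRF works without calibrating the two score scales against each other.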

    from langchain.retrievers import EnsembleRetriever
    from langchain_community.retrievers import BM25Retriever
    from langchain_community.vectorstores import OpenSearchVectorSearch

    # Sparse retriever (keyword matching)
    bm25 = BM25Retriever.from_documents(docs, k=10)

    # Dense retriever; opensearch_url is required (placeholder endpoint shown)
    vs = OpenSearchVectorSearch.from_documents(
        docs, embeddings, opensearch_url="https://localhost:9200"
    )
    dense = vs.as_retriever(search_kwargs={"k": 10})

    # Hybrid: 40% BM25 + 60% semantic, merged via weighted RRF
    hybrid = EnsembleRetriever(
        retrievers=[bm25, dense],
        weights=[0.4, 0.6],  # tune per domain
    )

Chunking directly controls what the retriever can find. The wrong chunk size either splits context across boundaries or retrieves too much noise.
| Strategy | How it works | Best for | Tradeoff |
|---|---|---|---|
| Fixed-size | Splits every N tokens with optional overlap | Simple docs, fast prototype | Splits mid-sentence |
| Sentence splitter | Splits at sentence boundaries | General prose | Uneven chunk sizes |
| Recursive character | Tries \n\n → \n → ". " in order | Most text documents (LangChain default) | Slight overhead |
| Semantic chunking | Groups sentences with cosine similarity shift | Dense research docs | Slow; needs embedding at ingest |
| Parent-child | Small child chunks indexed; parent context served to LLM | Long reports, contracts | Larger index size |
| Late chunking | Embed full doc, then chunk — preserves context in embeddings | Context-sensitive passages (jina-embeddings-v3) | Newer, less tooling |
Parent-child retrieval (ParentDocumentRetriever) with 256-token child chunks and 1024-token parents: retrieve small chunks for high precision, then inject the parent window into the LLM for full context.
Always add chunk overlap (10–20% of chunk size) so sentences at boundaries aren't orphaned. Enrich every chunk with metadata (source_file, page_number, created_at, entity_type); these become filterable fields in the vector DB.
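Fixed-size chunking with overlap and per-chunk metadata can be sketched as follows. This is an illustration only: whitespace-split words stand in for real tokenizer tokens, and the function name, the metadata keys beyond source_file, and the sample file name are assumptions.

```python
def chunk_with_overlap(
    tokens: list[str],
    chunk_size: int = 256,
    overlap: int = 32,
    source_file: str = "unknown",
) -> list[dict]:
    """Split a token list into overlapping windows and attach simple metadata."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "source_file": source_file,
            "start_token": start,
            "end_token": start + len(window),
        })
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# 600 stand-in tokens, 256-token chunks, 32-token (12.5%) overlap.
tokens = [f"t{i}" for i in range(600)]
chunks = chunk_with_overlap(tokens, chunk_size=256, overlap=32, source_file="report.pdf")
```

The last 32 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is fully present in at least one chunk.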
Initial retrieval uses a bi-encoder (query and document embedded separately) because it's fast — you embed the query once and do ANN lookup. But bi-encoders miss subtle relevance nuances.
A cross-encoder re-ranker takes (query, document) pairs and scores them jointly with full attention — far more accurate. You run it on just the top-20 retrieved results (not the full index), making the cost manageable.
| Stage | Model type | Speed | Accuracy | Scale |
|---|---|---|---|---|
| ANN retrieval | Bi-encoder | ~5ms | Good | Millions of docs |
| Re-ranking | Cross-encoder | ~80–200ms | Excellent | Top 20–50 only |
Use re-ranking when: precision matters over throughput (finance, healthcare, legal), corpus has many near-duplicate docs, or faithfulness is critical.

    from langchain.retrievers import ContextualCompressionRetriever
    from langchain_cohere import CohereRerank

    reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
    retriever = ContextualCompressionRetriever(
        base_compressor=reranker,
        base_retriever=hybrid_retriever,  # retrieve top-20 first
    )

RAG Fusion uses the LLM itself to generate multiple alternative query reformulations from the user's original query, runs each through the retriever independently, then fuses all result lists with RRF before generating the answer.
Standard hybrid RAG uses one query, two retrieval methods (dense + sparse), then fuses. RAG Fusion uses N query variations, one or more retrieval methods, then fuses — it solves vocabulary mismatch and query ambiguity at the query level.
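
    def rag_fusion(query: str, k: int = 5, rrf_k: int = 60) -> list[str]:
        """Return the top-k document IDs after fusing results for several query variants."""
        # 1. Generate query variations via the LLM, one per line.
        reply = llm.invoke(
            f"Generate 4 alternative versions of this query, one per line:\n{query}"
        )
        text = getattr(reply, "content", reply)  # chat models return a message object
        variations = [line.strip() for line in text.splitlines() if line.strip()]

        # 2. Retrieve for each variation and accumulate RRF scores (1-based ranks).
        scores: dict[str, float] = {}
        for q in [query] + variations:
            for rank, doc in enumerate(retriever.invoke(q), start=1):
                doc_id = doc.metadata["id"]
                scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rrf_k + rank)

        # 3. Sort by fused score and return the top-k IDs.
        return sorted(scores, key=scores.get, reverse=True)[:k]
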
RAG evaluation has two orthogonal dimensions: retrieval quality (did we get the right chunks?) and generation quality (did the LLM use them faithfully?).
| Metric | Formula (simplified) | Target | Tool |
|---|---|---|---|
| Context Precision | Relevant retrieved / Total retrieved | >0.8 | RAGAS |
| Context Recall | Facts in context / Total needed facts | >0.75 | RAGAS |
| Faithfulness | Claims supported by context / Total claims | >0.85 | RAGAS / LangSmith |
| Answer Relevancy | Mean cosine(question embedding, embeddings of questions regenerated from the answer) | >0.7 | RAGAS |
| Latency P95 | 95th pctile end-to-end ms | <2000ms | CloudWatch |
| Token cost / query | Input + output tokens × price | Track trend | LangSmith |

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    dataset = Dataset.from_dict({
        "question": questions,
        "answer": generated_answers,
        "contexts": retrieved_chunks_per_q,  # List[List[str]]
        "ground_truth": reference_answers,
    })

    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    print(results.to_pandas())

1. Grounding instructions in system prompt: Explicitly instruct the LLM: "Answer only from the provided context. If the answer is not in the context, say 'I don't have enough information.'" This alone can reduce hallucinations substantially (~40% is a rough figure; measure it on your own eval set).
2. Citations at generation time: Ask the LLM to cite the source chunk for every claim. You can then programmatically verify that each citation exists in the retrieved context.
3. Faithfulness guardrail: Run a RAGAS faithfulness check on a sample of production traffic. Alert if the score drops below threshold.
4. Confidence thresholding: If retrieval similarity scores are all below a threshold (e.g., cosine < 0.6), return "no relevant information found" rather than hallucinating.
5. NeMo Guardrails / Bedrock Guardrails: Add a post-generation layer that checks the answer doesn't reference topics outside the retrieved context.
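Two of the controls above, confidence thresholding and citation verification, can be sketched without any framework. The bracketed citation format ([chunk-id]), the function names, and the fallback message are assumptions for illustration.

```python
import re

NO_ANSWER = "No relevant information found."


def passes_retrieval_threshold(scores: list[float], min_score: float = 0.6) -> bool:
    """True if at least one retrieved chunk is similar enough to the query."""
    return any(score >= min_score for score in scores)


def unverified_citations(answer: str, retrieved_ids: set[str]) -> set[str]:
    """Return cited chunk IDs that were never retrieved (possible fabrication)."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return cited - retrieved_ids


def guarded_answer(answer: str, scores: list[float], retrieved_ids: set[str]) -> str:
    """Return the answer only if retrieval was confident and every citation checks out."""
    if not passes_retrieval_threshold(scores):
        return NO_ANSWER
    if unverified_citations(answer, retrieved_ids):
        return NO_ANSWER
    return answer
```

In practice the threshold should be calibrated per embedding model, because raw cosine scores are not comparable across models.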
GraphRAG (Microsoft, 2024) builds a knowledge graph from documents — entities become nodes, relationships become edges. Retrieval traverses the graph rather than (or in addition to) doing vector search.
It excels when the question requires connecting multiple entities across many documents ("What are all the risk factors shared by companies X and Y?") — a pattern that naive RAG misses because no single chunk contains the answer.
| Use case | Standard RAG | GraphRAG |
|---|---|---|
| Factual Q&A on single docs | ✅ Best | Overkill |
| Multi-hop reasoning across entities | Struggles | ✅ Best |
| Relationship queries ("who reports to X") | Misses | ✅ Best |
| Real-time index updates | ✅ Easy | Slow (graph rebuild) |
Multi-turn RAG needs to resolve references ("it", "the same document", "as you mentioned") to prior turns before retrieval — otherwise the query is incomplete.
- Query condensation: Pass last N messages + current query to LLM with instruction "Rewrite the final question as a standalone query." Then retrieve with the condensed query.
- ConversationalRetrievalChain: LangChain's legacy built-in pattern for this; newer releases recommend create_history_aware_retriever combined with create_retrieval_chain.
- Context window management: Summarise old turns with a summarisation LLM when conversation exceeds 20 turns, to keep token cost bounded.
- External memory stores: For long-running sessions, persist summaries to Redis or a DB keyed by session_id.
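Query condensation is mostly prompt construction. A minimal sketch follows, assuming history is a list of (role, text) pairs; the function name and the max_turns default are hypothetical. The returned string would be sent to whichever LLM does the rewriting.

```python
def build_condense_prompt(
    history: list[tuple[str, str]],
    question: str,
    max_turns: int = 6,
) -> str:
    """Build a prompt asking the LLM to rewrite the question as a standalone query."""
    recent = history[-max_turns:]  # bound token cost by keeping only the last N turns
    transcript = "\n".join(f"{role}: {text}" for role, text in recent)
    return (
        "Rewrite the final question as a standalone query.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final question: {question}\n"
        "Standalone query:"
    )


history = [
    ("user", "Summarise the 2023 annual report."),
    ("assistant", "The report covers revenue, risk factors, and outlook."),
]
prompt = build_condense_prompt(history, "What does it say about liquidity risk?")
```

Retrieval then runs on the LLM's rewritten query rather than on the raw follow-up, so a pronoun like "it" no longer reaches the retriever unresolved.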
| Dimension | Workflow (DAG) | Agent |
|---|---|---|
| Routing logic | Code decides | LLM decides |
| Predictability | High — deterministic | Low — emergent |
| Auditability | Full trace known ahead | Trace varies per run |
| Flexibility | Fixed paths only | Open-ended tasks |
| Failure modes | Step failure, timeout | Loops, hallucinated tool calls |
| Best for | Finance compliance, ETL, pipelines | Research, assistant, open-ended |
LangGraph is a low-level orchestration runtime for long-running, stateful agents. A StateGraph models work as nodes and edges; typed shared state flows through the graph and nodes return partial updates. Persistence enables durable execution, streaming, human-in-the-loop interrupts, and recovery. LangChain agents are a higher-level interface built on LangGraph; use LangGraph directly when you need explicit state and control.

    import operator
    from typing import Annotated, List, TypedDict

    from langgraph.checkpoint.memory import MemorySaver
    from langgraph.graph import END, StateGraph


    class AgentState(TypedDict):
        messages: Annotated[List, operator.add]  # reducer: append new messages
        retrieved: List[str]
        answer: str
        iteration: int


    def retrieve_node(state: AgentState) -> dict:
        docs = retriever.invoke(state["messages"][-1].content)
        return {"retrieved": [d.page_content for d in docs]}


    def generate_node(state: AgentState) -> dict:
        context = "\n\n".join(state["retrieved"])
        question = state["messages"][-1].content
        reply = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")  # llm: a chat model
        # Store plain text so the router below can inspect it.
        return {"answer": reply.content, "iteration": state.get("iteration", 0) + 1}


    def should_continue(state: AgentState) -> str:
        if state["iteration"] >= 3:  # hard loop budget
            return "end"
        if "insufficient" in state["answer"].lower():
            return "retrieve"
        return "end"


    graph = StateGraph(AgentState)
    graph.add_node("retrieve", retrieve_node)
    graph.add_node("generate", generate_node)
    graph.set_entry_point("retrieve")
    graph.add_edge("retrieve", "generate")
    graph.add_conditional_edges(
        "generate", should_continue, {"retrieve": "retrieve", "end": END}
    )

    # Checkpointing persists state across turns and enables human-in-the-loop pauses.
    # MemorySaver is in-process only; use a SQLite or Postgres saver for durability.
    memory = MemorySaver()
    app = graph.compile(checkpointer=memory, interrupt_before=["generate"])

ReAct interleaves reasoning and action so a model can use observations from tools before choosing the next step. In production, do not depend on exposing the model's private reasoning trace. Log auditable state transitions, tool calls, arguments, results, policy decisions, and concise model-provided rationales instead.
Function/tool calling is the typed application interface: the model emits a structured tool request, the application validates and authorizes it, executes the tool, then returns the result. ReAct is a reasoning-and-action pattern; tool calling is an execution contract, and they can be used together.
| Pattern | Transparency | Latency | Best for |
|---|---|---|---|
| ReAct | Observable actions and observations | Usually multi-step | Adaptive tool-using tasks |
| Function calling | Typed tool request and result | Often lower | Reliable application/tool integration |
| Plan-and-Execute | Explicit plan and task state | Varies | Long, decomposable tasks |
1. Explicit budgets: Set graph-step, per-tool retry, wall-clock, token, and cost limits. Force a safe terminal state when any budget is exhausted.
2. Tool output validation: Wrap every tool in a Pydantic model. If the tool returns an unexpected schema, raise a ToolException; the agent sees the error and can self-correct (max 1 retry per tool).
3. Structured outputs for routing: Force the routing decision via JSON schema (with_structured_output). This eliminates free-text routing that the LLM might misformat.
4. Human-in-the-loop for side-effecting actions: Any tool that writes to a DB, calls an API with side effects, or sends a message gets an interrupt_before checkpoint in LangGraph, which requires explicit human approval.
5. Deterministic policy layer: Treat all model and tool output as untrusted. Enforce identity, authorization, egress, argument validation, effect receipts, and audit outside the model.
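The explicit-budget control can be sketched as a small guard around the agent loop. The class and function names, the default limits, and the (done, tokens_used, result) step contract are assumptions for illustration, not LangGraph APIs.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunBudget:
    max_steps: int = 8
    max_tokens: int = 20_000
    max_seconds: float = 30.0
    steps: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def exhausted_reason(self) -> Optional[str]:
        """Return why the run must stop, or None if there is budget left."""
        if self.steps >= self.max_steps:
            return "step budget exhausted"
        if self.tokens >= self.max_tokens:
            return "token budget exhausted"
        if time.monotonic() - self.started_at >= self.max_seconds:
            return "wall-clock budget exhausted"
        return None

    def record(self, tokens_used: int) -> None:
        self.steps += 1
        self.tokens += tokens_used


def run_agent(step_fn: Callable[[], tuple], budget: RunBudget) -> dict:
    """Run step_fn until it reports done or a budget forces a safe terminal state."""
    while (reason := budget.exhausted_reason()) is None:
        done, tokens_used, result = step_fn()
        budget.record(tokens_used)
        if done:
            return {"status": "ok", "result": result}
    return {"status": "stopped", "reason": reason}
```

The budget is checked before each step, so a looping agent performs at most max_steps steps and then returns a structured "stopped" result instead of raising.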
1. Microservice isolation: Each agent runs as a separate AWS Lambda function or ECS container. Agents communicate through an SQS queue, decoupling their lifecycles and enabling independent scaling.
2. Async fan-out: The orchestrator dispatches sub-tasks via asyncio.gather, so all sub-agents run concurrently. Only the final merge step waits.
3. Result caching: Sub-agent results for repeated sub-tasks are cached in Redis (TTL: 5 min for volatile data, 1hr for reference data).
4. Circuit breaker: If a sub-agent fails 3× in 60s, the orchestrator marks it unavailable and falls back to a degraded path rather than propagating failure.
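The circuit-breaker rule (3 failures within 60 seconds) can be sketched with a sliding window of failure timestamps. The class name and the injectable clock are assumptions; this simplified version has no half-open probing state and closes again once old failures age out of the window.

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.failures: deque = deque()

    def _prune(self) -> None:
        """Drop failures that fell out of the sliding window."""
        cutoff = self.clock() - self.window_seconds
        while self.failures and self.failures[0] <= cutoff:
            self.failures.popleft()

    @property
    def is_open(self) -> bool:
        self._prune()
        return len(self.failures) >= self.max_failures

    def call(self, fn: Callable, fallback: Callable):
        """Call fn unless the breaker is open; use the degraded path on failure."""
        if self.is_open:
            return fallback()
        try:
            return fn()
        except Exception:
            self.failures.append(self.clock())
            return fallback()
```

While the breaker is open the failing sub-agent is not called at all, which protects both the caller's latency and the struggling dependency.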
MCP (Model Context Protocol) is an open client-server protocol for connecting AI applications to reusable tools, resources, and prompts. It reduces one-off integration code, but it does not remove the need to trust the server, authenticate users, authorize every action, validate outputs, and obtain approval for high-impact effects.
As of June 14, 2026: the stable specification is 2025-11-25. The 2026-07-28 release candidate is available for testing, with final release scheduled for July 28, 2026. MCP connects applications to context/tools; A2A focuses on agent-to-agent discovery and communication.
| Layer | Use it for | Do not confuse it with |
|---|---|---|
| Tool calling | A model requests a typed application function | Cross-vendor integration discovery |
| MCP | A host connects to reusable tool/resource/prompt servers | Autonomous agent-to-agent delegation |
| A2A | Agents advertise capabilities and coordinate tasks across systems | Direct database/tool access |
Security answer: protocol compatibility is not authorization. Authenticate identities, apply least privilege, validate schemas, constrain egress, record effect receipts, and require approval based on risk.
| Need | Candidate | Decision signal |
|---|---|---|
| Fast high-level agent | LangChain agents, OpenAI Agents SDK, Google ADK | Provider/tool fit, tracing, evals, team familiarity |
| Explicit durable orchestration | LangGraph or custom state machine | Persistence, interrupts, recovery, deterministic control |
| Agent harness | Deep Agents or an internal harness | Planning, filesystem/subagents, long-running task ergonomics |
| Managed production runtime | Bedrock AgentCore, Vertex AI Agent Engine, or equivalent | IAM, networking, observability, compliance, regional support |
Strong answer: start from the task and operating constraints. Prototype the simplest bounded workflow, add model-driven autonomy only where it improves measured task success, and choose the platform that minimizes undifferentiated operations without trapping critical business logic.
1. Semantic caching (biggest win): Cache (query_embedding, response) pairs in Redis. On each new query, compute embedding similarity; if cosine > 0.92 with a cached query, return the cached response. Typically serves 30–50% of traffic from cache at near-zero latency.
2. Streaming responses: Use SSE/streaming so the user sees the first token in ~400ms instead of waiting for the full response. Perceived latency drops dramatically.
3. Async concurrent retrieval: Run BM25 and dense retrieval concurrently with asyncio.gather. If retrieval and the LLM call can overlap, pipeline them.
4. Prompt compression: Use LLMLingua to compress long retrieved contexts before sending them to the LLM; this reduces input tokens by 20–40%, a direct latency reduction.
5. Model routing: Route classification and simple extraction to the smallest evaluated low-latency model tier; reserve a stronger reasoning tier for complex generation. Re-benchmark by provider and region because model names, pricing, and latency change.
6. Connection pooling: Reuse HTTP connections to the Bedrock endpoint. AWS SDK connection pool + keep-alive saves 80–120ms per cold connection.
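The semantic-cache lookup can be sketched in memory. A production version would keep entries in Redis or a vector index rather than scanning a Python list; the class name and the toy two-dimensional embeddings are assumptions.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (query_embedding, response)

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the response of the most similar cached query, if similar enough."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, embedding: list[float], response: str) -> None:
        self.entries.append((list(embedding), response))


cache = SemanticCache(threshold=0.92)
cache.put([1.0, 0.0], "cached answer")
```

A threshold that is too low returns answers to questions the user did not ask, so it is worth evaluating false-hit rate alongside hit rate.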
| Metric | Tool | Alert threshold | Why it matters |
|---|---|---|---|
| Latency P95 / P99 | CloudWatch | P95 > 2s | Direct UX impact |
| Error rate (5xx) | CloudWatch Alarms | > 1% | System stability |
| Token cost / hour | LangSmith + custom | > 30% spike | Runaway costs |
| Faithfulness score | RAGAS online eval | < 0.80 | Hallucination proxy |
| Context precision | RAGAS | < 0.70 | Retrieval degradation |
| Cache hit rate | Redis metrics | < 20% (unexpected drop) | Cost/latency efficiency |
| Prompt injection attempts | Custom classifier | Any detection | Security |
| Embedding drift | Scheduled job | Cosine shift > 0.15 vs baseline | Model / data drift |
Prompts are code. Changing a prompt is a deployment. Without versioning, you can't roll back a bad prompt change that's causing hallucinations.
1. Store prompts in LangSmith Hub or a config store (SSM Parameter Store / Git): Every prompt has a version hash. Production always pins a specific version.
2. Evaluation gate in CI: A PR that changes a prompt triggers an eval pipeline. RAGAS scores are computed on a golden test set. Merge only if faithfulness ≥ 0.82 and answer_relevancy ≥ 0.72.
3. Shadow testing: The new prompt runs on 10% of traffic in shadow mode (outputs logged, not served to users). Compare metrics against the current prompt and promote only if they are better.
4. Instant rollback: Update the SSM parameter to the prior version hash. No redeploy required.
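The CI evaluation gate reduces to comparing metric scores against pinned thresholds. A minimal sketch, using the thresholds quoted above; the function name and the shape of the scores dictionary are assumptions about how the eval pipeline reports results.

```python
THRESHOLDS = {"faithfulness": 0.82, "answer_relevancy": 0.72}


def gate(scores: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            # A missing metric fails closed rather than silently passing.
            failures.append(f"{metric}: missing")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures
```

A CI job would call gate() on the golden-set scores and exit non-zero when the returned list is non-empty, which blocks the merge.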
| Technique | Cost reduction | Implementation effort |
|---|---|---|
| Semantic caching | 30–50% fewer LLM calls | Medium (Redis + GPTCache) |
| Model routing (smart/cheap) | 40–70% on routing tasks | Medium (LLM router) |
| Bedrock batch inference | 50% discount vs on-demand | Low (async jobs only) |
| Prompt compression (LLMLingua) | 20–40% input token reduction | Medium |
| Reduce top-K retrieval | 5–15% input token reduction | Low (tune k from 20 → 5) |
| Context window right-sizing | Variable | Low |
Traditional APM (Datadog, New Relic) tracks: latency, error rate, CPU, memory — all mechanical metrics. LLM observability adds a semantic layer: does the system say the right things?
- Traces: Full chain trace — input query → retrieval results → LLM prompt sent → response. LangSmith captures all of this per-run.
- Semantic metrics: Faithfulness, relevancy, hallucination rate. These need an evaluator (usually an LLM judge calibrated against human labels) rather than mechanical counters.
- Token-level cost attribution: Which component of your chain uses the most tokens? LangSmith shows cost breakdown per node.
- User feedback loops: Thumbs up/down signals feed back into your golden test set for future eval runs.
Model drift in LLMs shows up as: answer quality degradation (faithfulness drops), output format changes (model update broke JSON parsing), or latency shifts (new model version).
1. Golden test set regression: Run a versioned, representative test set on every deployment. Gate release using task-specific, statistically meaningful tolerances calibrated to business risk.
2. Embedding and traffic drift: Track query/topic distributions, retrieval success, score distributions, and labelled outcomes by version. Investigate material changes relative to calibrated baselines rather than one universal cosine threshold.
3. Knowledge freshness: For time-sensitive domains, add a "freshness" metadata filter. Prefer recently indexed documents. Surface "as of [date]" disclaimers in the response.
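Problem: Full fine-tuning of a 7B model updates all ~7 billion weights. With mixed-precision Adam this costs roughly 16 bytes per parameter (FP16 weights and gradients plus FP32 optimizer states), or ~112GB of GPU RAM before activations, which exceeds any single GPU and requires a cluster.
LoRA insight: The weight update matrix ΔW during fine-tuning has intrinsically low rank. Instead of updating W directly, decompose the update: $\Delta W = AB$, where $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$. Only A and B are trained; W stays frozen.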
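[Figure: frozen weight matrix W (16.7M parameters) next to trainable factors A (65K parameters) and B (65K parameters); the low-rank update is added without replacing W.]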
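The parameter counts above are consistent with a square 4096 × 4096 projection and rank 16. Those dimensions are an assumption inferred from the numbers, and the helper below is illustrative arithmetic only.

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Compare a full d x k weight update with its rank-r factorisation."""
    full = d * k  # parameters in W (and in a full update ΔW)
    a = d * r     # parameters in A (d x r)
    b = r * k     # parameters in B (r x k)
    return {
        "full": full,
        "A": a,
        "B": b,
        "trainable": a + b,
        "ratio": (a + b) / full,
    }


counts = lora_param_counts(d=4096, k=4096, r=16)
# full = 16,777,216 (~16.7M); A = B = 65,536 (~65K); trainable share = 0.78%
```

Doubling r doubles the trainable parameters but leaves the frozen matrix untouched, which is why rank is the main capacity knob in LoRA.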

    from peft import LoraConfig, TaskType, get_peft_model

    config = LoraConfig(
        r=16,                                 # rank
        lora_alpha=32,                        # scaling = alpha / r = 2
        target_modules=["q_proj", "v_proj"],  # attention projections
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(base_model, config)
    model.print_trainable_parameters()
    # For a Llama-2-7B-style base (32 layers, hidden size 4096) this trains
    # 64 modules x 16 x (4096 + 4096) = 8,388,608 parameters, about 0.12% of ~6.7B.

QLoRA = Quantised LoRA. It quantises the base model to 4-bit NF4 (NormalFloat4) while keeping LoRA adapter weights in BF16. A 13B model at 4-bit takes ~6.5GB instead of ~26GB.
| Method | Base model precision | GPU RAM (13B) | Quality vs full FT |
|---|---|---|---|
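| Full fine-tuning | BF16 | ~208GB (weights, gradients, and Adam states at ~16 bytes/param) | 100% (baseline) |
| LoRA | BF16 | ~26GB frozen weights, plus small adapter and optimizer overhead | ~98% |
| QLoRA | 4-bit NF4 | ~6.5GB frozen weights, plus small adapter and optimizer overhead | ~97% |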
QLoRA also introduces double quantisation (quantise the quantisation constants) and paged optimisers (offload optimizer states to CPU RAM on memory spikes). These together enable fine-tuning a 65B model on a single A100 80GB.
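The weight-memory figures follow from parameter count times bits per parameter. A quick check, counting frozen weights only (activations, adapters, optimizer states, and quantisation constants are ignored here); the helper name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


bf16_13b = weight_memory_gb(13e9, 16)  # 13B weights in BF16
nf4_13b = weight_memory_gb(13e9, 4)    # 13B weights in 4-bit NF4
nf4_65b = weight_memory_gb(65e9, 4)    # 65B weights in 4-bit NF4
```

This gives 26 GB and 6.5 GB for the 13B model, and about 32.5 GB for a 65B model in 4-bit, which is why the 65B base weights fit on one 80GB GPU with room left for adapters and activations.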
| Criteria | Use RAG | Use Fine-tuning |
|---|---|---|
| Knowledge updates | Frequent (daily/weekly) | Stable domain knowledge |
| Data availability | Large unstructured corpus | Representative, high-quality labelled examples; required volume is task-dependent |
| Output style/format | Generic format OK | Specific tone, JSON schema, brand voice |
| Explainability | Source-grounded citations | Opaque — no citation |
| Latency | Adds retrieval overhead | No retrieval step |
| Cost | Low upfront, per-query retrieval cost | GPU training cost upfront |
RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model from human preference pairs, then uses PPO to optimise the LLM against that reward model. Complex, unstable, requires large infra.
DPO (Direct Preference Optimisation) bypasses the explicit reward-model-plus-PPO pipeline. Given (prompt, chosen_response, rejected_response) pairs, it directly updates the model to prefer the chosen response. It is operationally simpler than classic RLHF, but quality and stability still depend on data, objective, hyperparameters, and evaluation.
| Method | Reward model | Stability | Data needed |
|---|---|---|---|
| RLHF / PPO | Required | Fragile | Large |
| DPO | Not needed | Stable | Moderate |
| ORPO | Not needed | Evaluate for the task | Task-dependent |
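The DPO objective for a single preference pair can be written in a few lines. Inputs are summed log-probabilities of each response under the trained policy and a frozen reference model; the function name, argument names, and beta = 0.1 are assumptions for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); computed in a numerically stable form.
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one, with the reference model acting as an implicit regulariser instead of a separate reward model.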
1. Format: Use an instruction-following format, {"instruction": "...", "input": "...", "output": "..."} (Alpaca format), or a multi-turn chat format for conversational models.
2. Quality before scale: Start with a small, curated, representative dataset; measure the gain on a held-out eval set, inspect failures, then add targeted examples. Deduplicate and use calibrated human/model review rather than trusting one judge.
3. Diversity: Cover all intended use cases. An uneven distribution pushes the model to overfit the majority class.
4. Contamination check: Ensure the test set has no overlap with training data. Use exact and near-duplicate detection.
Chain-of-Thought (CoT) prompting encourages intermediate reasoning on multi-step tasks. It can improve performance, but production systems should not rely on exposing a model's private reasoning. Ask for a concise rationale, assumptions, calculations, citations, or verification artifacts that can be audited.
| Variant | How to trigger | Best for |
|---|---|---|
| Zero-shot reasoning | Ask for careful analysis plus a concise, checkable answer | Baseline to evaluate, not a guaranteed win |
| Few-shot reasoning | Provide representative worked examples and expected artifacts | Domain-specific consistency |
| Tree of Thought (ToT) | Prompt to explore N paths, evaluate each, select best | Complex planning, search problems |
| ReAct | Thought/Action/Observation loop | Tool-using agents |
| Self-consistency | Sample multiple solutions and aggregate/verify | Cases where added cost is justified; never the only high-stakes control |

    system = """You are a financial analyst. Before answering any question:
    1. State the key facts given in the context.
    2. Identify any missing information.
    3. Show the calculations, citations, or checks needed to verify the result.
    4. State the final answer and concise rationale clearly.
    If you are uncertain, say so explicitly."""

Prompt engineering = crafting the instruction/template text. It's about what you say.
Context engineering = dynamically constructing the entire model input — which retrieved documents to include, how to summarise prior conversation, which tool outputs to pass, how to structure memory, and where to place instructions. It is the broader discipline of managing a model's available context budget intelligently; the exact window varies by model and provider.
- Primacy + recency: LLMs attend most to content at the very start and very end of the context. Place the most critical instructions in both positions.
- Structured delimiters: Use XML tags (<context>, <instructions>, <examples>) to cleanly separate sections; this reduces instruction-following errors.
- Conversation compression: Summarise turns older than 10 with a cheap LLM. This keeps token cost bounded without losing the thread.
- Lost in the middle: Long contexts suffer from "lost in the middle" — relevant info in the middle of a 100K context gets underweighted. Front-load the most important chunks.
1. Provider-native schema-constrained output: Prefer the current API's Structured Outputs / JSON Schema or typed tool-calling feature when supported. This constrains shape more strongly than plain JSON mode.
2. Application validation: Parse into Pydantic or an equivalent schema, enforce semantic/business constraints, reject unknown fields, and retry only bounded, repairable failures.
3. Fallbacks: Plain JSON mode guarantees syntax, not business correctness. Grammar/constrained-decoding libraries can help with self-hosted models, but still validate after generation.

    from typing import Literal

    from langchain_anthropic import ChatAnthropic
    from pydantic import BaseModel, Field


    class RiskAssessment(BaseModel):
        # Literal enforces the allowed values instead of only describing them.
        risk_level: Literal["low", "medium", "high", "critical"]
        key_factors: list[str] = Field(description="Top 3 risk factors")
        recommendation: str


    llm = ChatAnthropic(model="current_evaluated_model")  # placeholder: pin the model you evaluated
    structured_llm = llm.with_structured_output(RiskAssessment)
    result = structured_llm.invoke("Assess the credit risk for...")
    # Then apply business validation and bounded error handling.

Prompt injection occurs when untrusted content influences model behavior against the application's intent. It cannot be solved reliably by one prompt, delimiter, classifier, or blocklist; design the surrounding system so a successful injection still cannot perform an unauthorized action.
- Separate instructions from untrusted data: mark provenance and prevent retrieved/user/tool content from becoming durable policy or memory without review.
- Deterministic authorization: the model may propose an action, but code validates identity, permission, arguments, data scope, and business rules.
- Least privilege and approvals: use scoped credentials, egress allowlists, read-only defaults, idempotency, and human confirmation for high-impact actions.
- Defense in depth: input/output classifiers and guardrails are useful signals, then add adversarial evals, logging, anomaly detection, kill switches, and incident response.
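A minimal sketch of the deterministic-authorization idea: the model only proposes a call, and plain code decides. The policy table, tool names and scope strings are hypothetical, chosen only to make the checks concrete.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict


# Hypothetical policy table: required scope and approval flag per tool.
POLICY = {
    "get_invoice": {"scope": "billing:read", "needs_approval": False},
    "issue_refund": {"scope": "billing:write", "needs_approval": True},
}


def authorize(call: ToolCall, user_scopes: set, user_tenant: str,
              approved: bool = False) -> str:
    """Decide in code whether a model-proposed call may run."""
    rule = POLICY.get(call.tool)
    if rule is None:
        return "deny: unknown tool"
    if rule["scope"] not in user_scopes:
        return "deny: missing scope"
    # Data scope comes from the authenticated session, never from model text.
    if call.args.get("tenant_id") != user_tenant:
        return "deny: cross-tenant argument"
    if rule["needs_approval"] and not approved:
        return "pending: human approval required"
    return "allow"
```

Even if injected text convinces the model to propose `issue_refund` for another tenant, the call fails the scope or tenant check before anything executes.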
| Component | Purpose | Changes per request? | Token budget |
|---|---|---|---|
| System prompt | Role, rules, output format, persona | No (static) | 500–2000 tokens |
| Few-shot examples | Demonstrate expected format/style | Rarely | 500–3000 tokens |
| RAG context | Current factual knowledge for this query | Yes (per query) | 2000–8000 tokens |
| Conversation history | Prior turns context | Yes (grows) | 500–10000 tokens |
| User message | The actual query | Yes | 50–500 tokens |
1. Ingestion: S3 PUT event → SQS → Lambda or container worker (parse, normalize, chunk, attach provenance) → an evaluated Bedrock embedding model → OpenSearch Serverless or the selected supported vector store. Pin model/version and plan re-indexing.
2. Query path: API Gateway → orchestrator → embed query → hybrid retrieval with tenant filters → evaluated reranker when it improves quality enough to justify latency/cost → top evidence into an evaluated Bedrock generation model → validate, cite, and stream.
3. Security: IAM roles per service (least privilege). Bedrock Guardrails for PII redaction. Tenant isolation through mandatory server-side tenant filters, or document-level security where the chosen store supports it, so users only see their org's documents. All queries logged to S3 for audit.
4. Monitoring: CloudWatch custom metrics for faithfulness, latency P95, token cost. Alarm on any metric breaching its threshold.
Bedrock Knowledge Bases is AWS's managed RAG capability for connecting supported data sources, parsing/chunking content, creating or using supported vector stores, retrieving evidence, reranking where supported, and integrating retrieval with generation. Exact sources, stores, models, and features vary by region and configuration.
| Dimension | Bedrock KB (managed) | Custom RAG |
|---|---|---|
| Delivery effort | Usually faster; managed integrations and operations | More engineering and operational ownership |
| Customisation | Configurable within currently supported sources, stores, models, and workflows | Full control over every retrieval and serving stage |
| Hybrid search | Yes — verify region and data-store support | Yes (full control) |
| Re-ranking | Built-in (Amazon Rerank) | Any reranker |
| Cost at scale | Benchmark total managed-service cost | Can optimize deeply, but include engineering and operations |
| Dimension | Lambda | ECS / Fargate |
|---|---|---|
| Execution model | Event/request-driven functions with platform limits | Long-running container services and tasks |
| Startup | Cold starts depend on runtime, package, networking, and configuration | Keep desired tasks warm; task startup still matters during scaling |
| Duration/resources | Bounded by current Lambda quotas | Broader task sizing and duration control; verify Fargate/EC2 limits |
| Streaming | Supported patterns, with integration-specific constraints | Native application HTTP/WebSocket patterns |
| Cost model | Usage-based function execution | Provisioned task resources while running |
Rule of thumb: use Lambda for bounded, event-driven orchestration and transformation when its current quotas fit. Use ECS/Fargate or EKS for long-running agents, custom runtimes, connection-heavy streaming, background workers, or sustained workloads. Verify today's quotas and benchmark both.
Bedrock Guardrails is a managed safety layer that intercepts LLM inputs and outputs. It operates independently of the model — wraps any Bedrock-hosted model.
- Topic blocking: Define topics the model must not discuss (e.g., competitor products, investment advice). Guardrail intercepts and returns a pre-defined safe response.
- PII redaction: Automatically detects and redacts (or masks) PII in both input and output — names, SSNs, credit card numbers.
- Grounding check: Compares the generated response against the retrieved context. Flags responses that introduce facts not present in the context (hallucination detection).
- Word filters: Block specific words, phrases, or regex patterns in input/output.
Step Functions provides a managed, visual workflow orchestrator. For deterministic multi-step GenAI pipelines, it's a strong alternative to LangGraph — especially when you need:
- Built-in retry logic with exponential backoff on each step.
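- Long-running workflows (days): a single Lambda invocation is capped at 15 minutes, while Standard workflows can run for up to a year.
- Human approval steps via the .waitForTaskToken integration pattern: the workflow pauses until a human approves, then resumes.
- Parallel fan-out via the Map state: process 1000 documents in parallel, each calling a Lambda. Use Distributed Map mode when the fan-out exceeds inline Map concurrency limits.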
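The three features above can be seen together in one state machine definition. The sketch below builds the Amazon States Language document as a Python dict; the function names `summarise-doc` and `notify-approver` and the retry values are hypothetical, and an inline Map is used to keep the example small.

```python
import json

# Hypothetical Lambda names; field names follow the Amazon States Language.
definition = {
    "StartAt": "ProcessDocuments",
    "States": {
        "ProcessDocuments": {
            "Type": "Map",
            "ItemsPath": "$.documents",
            "MaxConcurrency": 10,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "Summarise",
                "States": {
                    "Summarise": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {"FunctionName": "summarise-doc", "Payload.$": "$"},
                        # Exponential backoff: 2s, 4s, 8s between attempts.
                        "Retry": [{
                            "ErrorEquals": ["States.TaskFailed"],
                            "IntervalSeconds": 2,
                            "MaxAttempts": 3,
                            "BackoffRate": 2.0,
                        }],
                        "End": True,
                    }
                },
            },
            "Next": "HumanApproval",
        },
        "HumanApproval": {
            "Type": "Task",
            # The workflow pauses here until the task token is returned.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "notify-approver",
                "Payload": {"taskToken.$": "$$.Task.Token"},
            },
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

The approver's backend resumes the workflow by calling the Step Functions `SendTaskSuccess` (or `SendTaskFailure`) API with the token it received.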
| Option | Choose When | You Own | Interview Trade-off |
|---|---|---|---|
| Bedrock | Managed access to supported foundation/custom models; fastest product delivery | Application, prompts, evals, governance configuration | Less runtime/inference-engine control |
| SageMaker AI endpoint | Custom model serving, MLOps integration, managed endpoints | Model packaging/configuration and endpoint operations | More control and more operational work |
| HyperPod / EKS / EC2 | Specialized training/serving, custom kernels/runtime, maximum control | Capacity, orchestration, scaling, resilience, cost utilization | Highest flexibility and operational burden |
2026 positioning: AgentCore is the AWS platform layer for securely deploying and operating agents built with different frameworks and models. Discuss runtime isolation, enterprise tool/data connectivity, identity/access controls, tracing, debugging, and evaluation rather than treating an agent as only an LLM loop.
Use It When
- You need managed production agent runtime and governance.
- Agents connect to enterprise systems with controlled authentication.
- Tracing, debugging, evaluation, and operational consistency matter.
Custom Runtime When
- Runtime behavior is highly differentiated.
- You need unusual isolation, portability, or multi-cloud control.
- Platform feature/region constraints do not meet requirements.
Identity & Data
- Separate accounts/environments; least-privilege IAM roles and scoped agent/tool permissions.
- KMS encryption, Secrets Manager, tenant-level authorization, data classification and retention.
- Guardrails/DLP, immutable audit trail, prompt/response logging policy.
Network & Operations
- Private subnets and VPC endpoints/PrivateLink where supported; restrict egress.
- WAF/API Gateway throttling, CloudTrail, Config, Security Hub, CloudWatch/X-Ray.
- Threat-model model/vendor calls, tools, RAG data, agents, and human approvals.
Guardrails can be applied consistently around model interactions to filter harmful content, protect sensitive information, assess grounding, and enforce use-case policies. Automated Reasoning checks can detect logical issues and unstated assumptions against formalized policies, but return findings in detect mode; your application decides whether to serve, revise, clarify, or escalate.
Also discuss sustainability and business impact: avoid unnecessary large-model calls, measure value, and retire low-value workloads.
- Ingest: Cloud Storage events → Pub/Sub/Eventarc → Cloud Run or Dataflow workers; use Document AI where layout/OCR matters. Make jobs idempotent and dead-letter failures.
- Retrieve: Choose Vertex AI RAG Engine for managed orchestration, Vector Search for high-scale ANN, or AlloyDB/pgvector when relational joins and transactions dominate. Store tenant and ACL metadata with every chunk.
- Generate: Call an evaluated Vertex AI model, enforce grounded citations and structured output, then validate before serving through Cloud Run/API Gateway.
- Operate: IAM service accounts, VPC Service Controls/Private Service Connect where applicable, CMEK, Cloud Logging/Monitoring/Trace, offline golden sets and sampled online evaluation.
| Choice | Best Fit | Trade-off |
|---|---|---|
| Agent Engine | Managed deployment and operations for custom agents; sessions, memory, code execution and governance | Platform constraints and regional feature checks |
| Gemini Enterprise Agent Platform | Enterprise discovery, connected agents and governed employee workflows | Less freedom than a fully custom product runtime |
| Cloud Run/GKE custom | Maximum framework, portability and runtime control | You own isolation, scaling, tracing, state and upgrades |
Interview answer: start with compliance, integration, latency, portability and operating-model requirements; then choose. Do not select an agent platform merely because the workflow calls an LLM.
| Service | Choose When | Probe |
|---|---|---|
| RAG Engine | You want managed ingestion/retrieval integration with Vertex AI | Supported sources, regions, quotas and customization |
| Vector Search | Large-scale, low-latency ANN is central | Index update strategy, filtering and cost |
| AlloyDB/pgvector | Vectors must live beside relational data and transactions | Scale ceiling and query plan behavior |
| Agent Search | Enterprise search/discovery is the product | Connectors, relevance controls and permissions |
Cross-question: Prove the choice with corpus size, update rate, filter selectivity, recall@k, P95 latency, data residency, team skills and total cost.
| Runtime | Use It For | Watch-outs |
|---|---|---|
| Cloud Run | Stateless APIs, streaming gateways, workers and portable containers | Cold starts, request limits, concurrency and downstream quotas |
| Cloud Run functions | Small event handlers and glue logic | Keep complex orchestration out of tiny handlers |
| GKE | Custom serving, GPUs, sidecars, service mesh and deep scheduling control | Cluster operations and utilization |
Separate the latency-sensitive serving path from asynchronous ingestion/evaluation. Scale each independently and put Pub/Sub between bursty producers and workers.
- Managed endpoint: prefer when managed autoscaling, model lifecycle, monitoring and IAM integration outweigh runtime customization.
- Cloud Run/GKE: prefer for custom inference engines, unusual dependencies, portability or tight hardware control.
- Decision evidence: quality benchmark, tokens/sec, first-token latency, concurrency, accelerator utilization, availability, region, safety and cost per successful task.
Prevent
- Dedicated service accounts and least-privilege IAM.
- VPC Service Controls, Private Service Connect/private access and restricted egress where supported.
- CMEK, Secret Manager, DLP, tenant authorization and tool allowlists.
Detect & Respond
- Cloud Audit Logs, Security Command Center, Cloud Logging and alerting.
- Trace prompts, retrieval, tools and policy decisions with redaction.
- Human approval for high-impact actions; incident kill switch and credential rotation.
Threat model: prompt injection, data exfiltration, poisoned retrieval, excessive agency, insecure tool arguments, cross-tenant leakage and model/vendor outages.
1. Publish immutable ingestion/evaluation events with schema version, tenant, correlation ID and idempotency key.
2. Use Dataflow when transformations are high-volume, streaming, windowed or need replay; use Cloud Run workers for simpler task queues.
3. Land sanitized traces and evaluation facts in BigQuery; partition by event date, cluster by tenant/model/version, and enforce retention/access policies.
4. Monitor backlog age, dead letters, duplicate rate, processing latency, quality drift and evaluation cost.
| Store | Primary Role | Example |
|---|---|---|
| BigQuery | Analytical warehouse | Trace analytics, offline evals, cohorts and cost reporting |
| AlloyDB | Relational operational data with PostgreSQL compatibility | Conversation metadata, app transactions and vector-relational queries |
| Spanner | Globally consistent, horizontally scalable operational data | Multi-region agent state requiring strong consistency |
State your access patterns, consistency needs, region topology, transaction boundaries, retention and expected growth before naming a database.
- Build/deploy: Cloud Build → Artifact Registry → Cloud Deploy or controlled infrastructure pipeline; promote immutable application, prompt, model-policy and dataset versions.
- Pre-production gates: unit/contract/security tests plus golden-set quality, latency and cost thresholds; canary before full rollout.
- Runtime: Cloud Logging/Monitoring/Trace with correlation IDs across retrieval, model and tools. Alert on SLOs, quotas, safety, groundedness, tool failures and spend.
- Rollback: independently roll back prompt, model route, index alias and application version.
| Concern | AWS Example | GCP Example |
|---|---|---|
| Managed foundation models | Amazon Bedrock | Vertex AI |
| Managed agent operations | Bedrock AgentCore | Vertex AI Agent Engine / Gemini Enterprise Agent Platform |
| Serverless containers/functions | ECS/Fargate, Lambda | Cloud Run, Cloud Run functions |
| Messaging/streaming | SQS/SNS/Kinesis | Pub/Sub |
| Vector/search | OpenSearch, Bedrock Knowledge Bases | Vector Search, RAG Engine, Agent Search |
| Warehouse/analytics | Redshift/Athena | BigQuery |
Senior answer: map capabilities, then re-evaluate IAM, networking, regional availability, quotas, reliability, operating skills and cost. A service-name translation is not an architecture migration.
HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph. Higher layers have few nodes with long-range "highway" connections; lower layers get progressively denser with short-range connections — mimicking how road networks work (motorways → main roads → local streets).
Search: Starts at an entry point in the top layer, greedily walks toward the query vector, then descends to lower layers for precision. Gives O(log n) average search time with high recall.
| Algorithm | Speed | Recall | Memory | Best for |
|---|---|---|---|---|
| HNSW | Fastest | Highest | High (graph) | Low-latency production |
| IVF-Flat | Fast | Medium | Low | Large-scale, budget memory |
| IVF-PQ | Fast | Medium | Very low | Billion-scale (with compression) |
| Flat (exact) | Slow O(n) | Perfect | Lowest | Eval benchmarks, <100K docs |
Key HNSW parameters: ef_construction (build quality, more = slower build but better graph), M (connections per node, more = better recall but more memory), ef_search (search time recall trade-off).
Metadata filtering restricts ANN search to a subset of vectors matching a filter condition (e.g., doc_type == "earnings_report" AND date >= 2026-01-01).
Three approaches with different tradeoffs:
- Pre-filtering (filter then search): Narrow to matching documents first, then ANN search over that subset. High precision but if the subset is small, HNSW degrades toward brute force.
- Post-filtering (search then filter): Run ANN over full index, then discard non-matching results. Fast but can return fewer than k results if many are filtered out.
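- ACORN / filtered HNSW: Modern approach in which metadata is stored alongside vectors in the graph and traversal is filter-aware, so the search avoids both the small-subset problem of pre-filtering and the short result lists of post-filtering. Engines such as Pinecone and Weaviate ship their own filter-aware variants; verify current behavior per product.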
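The difference between the first two approaches is easy to see with an exact search standing in for ANN. This is an illustrative sketch: `top_k`, `pre_filter`, `post_filter` and the item layout are assumptions, not any vector database's API.

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query, items, k):
    """Exact nearest neighbours by cosine similarity (stand-in for ANN)."""
    return sorted(items, key=lambda it: cosine(query, it["vec"]), reverse=True)[:k]


def pre_filter(query, items, k, pred):
    # Filter first, then search only the matching subset.
    return top_k(query, [it for it in items if pred(it)], k)


def post_filter(query, items, k, pred):
    # Search everything, then drop non-matching hits: may return fewer than k.
    return [it for it in top_k(query, items, k) if pred(it)]


items = [
    {"id": 1, "tenant": "a", "vec": [1.0, 0.0]},
    {"id": 2, "tenant": "b", "vec": [0.9, 0.1]},
    {"id": 3, "tenant": "b", "vec": [0.8, 0.2]},
    {"id": 4, "tenant": "a", "vec": [0.0, 1.0]},
]
only_a = lambda it: it["tenant"] == "a"
pre_ids = [it["id"] for it in pre_filter([1.0, 0.0], items, 2, only_a)]
post_ids = [it["id"] for it in post_filter([1.0, 0.0], items, 2, only_a)]
```

With `k=2`, pre-filtering returns both tenant `a` items, while post-filtering returns only one because the second-nearest overall hit belongs to tenant `b` and is discarded.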
Use keyword type (not text) for exact-match filtering. Prefer pre-filtering when the filter is highly selective (e.g., tenant_id) and post-filtering when the filter removes only a small share of candidates.
| DB | Best for | Hybrid search | Self-host? | Notes |
|---|---|---|---|---|
| OpenSearch | Search-heavy workloads, AWS integration, BM25+vector | Yes | Yes | Strong search feature set; benchmark relevance and operations |
| Pinecone | Managed vector search with minimal operations | Yes (sparse+dense) | No | Benchmark cost, filters, tenancy, and region fit |
| Qdrant | Vector-first workloads and payload filtering | Yes | Yes | Managed and self-hosted choices; benchmark your workload |
| Weaviate | Multi-tenancy, schema-rich | Yes (BM25+vector) | Yes | Strong for SaaS multi-tenant products |
| pgvector | Postgres-native joins, transactions, and vector search | Combine with Postgres FTS | Yes | Can scale significantly with tuning/partitioning; benchmark recall, latency, and ops |
| Chroma | Local, single-node, or distributed/Cloud vector workloads | Product-dependent | Yes | Choose deployment mode by scale and operational needs |
Storing a 1536-dim float32 embedding takes 6KB. For 10M documents that's 60GB. Quantisation compresses vectors to reduce memory at the cost of some recall accuracy.
Product Quantisation (PQ) splits the 1536-dim vector into M sub-vectors (e.g., M=64, each 24-dim). Each sub-vector is mapped to its nearest centroid in a codebook of 256 centroids. Result: 64 bytes instead of 6144 bytes — 96× compression. Distance computation uses a precomputed lookup table, keeping it fast.
Binary quantisation: Convert each float to a single bit (positive=1, negative=0). 32× compression. Works surprisingly well for high-dimensional embeddings with cosine similarity.
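The arithmetic above, plus a toy binary quantiser, as a quick check. The helper names are hypothetical, and the Hamming-distance comparison is the usual companion to sign-bit quantisation rather than something the text prescribes.

```python
DIM, N_DOCS = 1536, 10_000_000

float32_bytes = DIM * 4   # 4 bytes per dimension
pq_bytes = 64             # M=64 sub-vectors, 1 byte each (256 centroids)
binary_bytes = DIM // 8   # 1 bit per dimension


def total_gb(bytes_per_vector: int) -> float:
    """Raw vector storage for the whole corpus, in decimal gigabytes."""
    return bytes_per_vector * N_DOCS / 1e9


def binarize(vec: list) -> int:
    """Pack sign bits into one integer: positive -> 1, otherwise 0."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits; smaller means more similar."""
    return bin(a ^ b).count("1")
```

These figures cover raw vectors only; graph links, codebooks and metadata add to the real footprint, so benchmark memory on the actual index.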
- Upsert by document ID: Every document has a stable ID. Reprocess and upsert when the document changes. All major vector DBs support upsert (Pinecone, OpenSearch, Qdrant).
- Chunk deletion on update: When a document is updated, delete all chunks with parent_doc_id == document.id, then re-ingest. Track chunk-to-document mapping in a separate metadata store (DynamoDB/RDS).
- Event-driven ingestion: S3 Event Notifications or DynamoDB Streams → SQS → Lambda ingestion pipeline. Near-real-time index updates without polling.
- Versioned indexes: For zero-downtime updates on large corpora, build the new index alongside the old one, then do an atomic alias swap (OpenSearch index aliases).
- Ingestion: S3 → SQS (decouple) → Lambda fleet (parse PDF/DOCX/XLSX, hierarchical chunk: parent 1024t / child 256t, 50t overlap) → an evaluated Bedrock embedding model → OpenSearch Serverless (kNN + BM25 fields). Metadata: doc_type, entity_id, created_at, source. Row-level security by entity_id.
- Data model: OpenSearch index with both embedding: knn_vector(dimension) and content: text (for BM25). Store the evaluated embedding dimension and model version in the index schema. Separate metadata fields as keyword type. DynamoDB tracks chunk-to-document mapping for updates/deletions.
- Retrieval: Hybrid BM25 + kNN with evaluated fusion → an evaluated reranker when useful → evidence budget sent to the generation model. Add a version-aware semantic cache only after measuring correctness, staleness, latency, and hit rate.
- Generation: An evaluated Bedrock reasoning/generation model. System prompt enforces: cite source for every claim, flag if answer not in context. Bedrock Guardrails for PII + topic blocking. Response streaming via Lambda Response Streaming.
- Evaluation: versioned offline set plus risk-based production sampling. Calibrate retrieval, groundedness, task success, safety, latency, and cost gates against human review and business impact; do not present generic RAGAS thresholds as universal.
- Monitoring: CloudWatch: latency P95 < 1.5s, error rate < 0.5%, cost per query. LangSmith: full trace per request. X-Ray for distributed tracing across Lambda chains.
- Security: IAM roles per service. Bedrock Guardrails PII redaction. Entity-level index partitioning (user A cannot retrieve user B's documents). All queries audit-logged to S3.
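For the hybrid retrieval step, one common fusion choice is reciprocal rank fusion (RRF). The sketch below assumes RRF is the fusion being evaluated, which the design above does not specify; `k=60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d1", "d2", "d3"]   # keyword ranking
knn_hits = ["d3", "d1", "d4"]    # vector ranking
fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Documents found by both retrievers (`d1`, `d3`) rise above those found by only one.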
1. Architecture: Supervisor pattern in LangGraph: IntentClassifier node (routes query type) → either KnowledgeAgent (RAG over support docs), ActionAgent (CRM/order system APIs), or EscalationAgent (human handoff).
2. Intent classification: An evaluated low-latency model tier classifies intent: information_request | account_action | complaint | escalation. Select it against an intent test set and explicit latency/cost SLO rather than relying on a model-name assumption.
3. Escalation triggers: (a) User explicitly asks for human, (b) agent confidence < threshold for 2 turns, (c) complaint category detected, (d) max turns (10) reached. LangGraph uses interrupt_before at the escalation node.
4. Context handoff: Full conversation summary + entity extraction (customer ID, issue category, sentiment score) passed to human agent dashboard via Amazon Connect + DynamoDB.
5. Memory: Short-term in LangGraph state. Long-term in DynamoDB keyed by customer_id (preferences, past issues), injected as context on the next session.
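The escalation triggers are deterministic, so they can live in plain code outside the model. A minimal sketch, assuming a hypothetical `TurnState` record and an illustrative `CONFIDENCE_THRESHOLD` that would need calibration on labelled conversations:

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_TURNS = 10               # from trigger (d)
CONFIDENCE_THRESHOLD = 0.6   # assumed value; calibrate before use


@dataclass
class TurnState:
    turn_count: int
    intent: str  # information_request | account_action | complaint | escalation
    recent_confidences: list = field(default_factory=list)  # most recent last
    user_asked_for_human: bool = False


def should_escalate(state: TurnState) -> Optional[str]:
    """Return the escalation reason, or None to keep the agent in the loop."""
    if state.user_asked_for_human or state.intent == "escalation":
        return "user_request"
    last_two = state.recent_confidences[-2:]
    if len(last_two) == 2 and all(c < CONFIDENCE_THRESHOLD for c in last_two):
        return "low_confidence"
    if state.intent == "complaint":
        return "complaint"
    if state.turn_count >= MAX_TURNS:
        return "max_turns"
    return None
```

Returning a reason rather than a boolean lets the handoff payload and the dashboards record why each conversation was escalated.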
1. Data collection: Risk-based, privacy-reviewed production sampling to an asynchronous evaluation queue. Oversample rare, high-impact, low-confidence, and newly changed slices.
2. Judge layer: Use deterministic checks where possible and one or more evaluated judge models for rubric-based criteria. Calibrate judges against blinded human labels, monitor agreement/bias, and version judge prompts/models.
3. Aggregation: Daily Lambda aggregates scores → writes to RDS → visualised in Grafana dashboard.
4. CI gate: Every code, prompt, model, retriever, or policy change triggers a representative regression suite. Fail or require review when calibrated quality/safety tolerances or SLOs are breached.
5. Feedback loop: Low-scoring samples surfaced to human reviewers → added to golden test set → improves future evals.
- Capacity: derive concurrent streams from QPS × average duration; size connection pools and rate limits against provider quotas, not only CPU.
- Security: per-tenant keys/policies, secret isolation, egress allowlists, audit trail and prompt/response retention controls.
- Cross-question answers: prevent duplicate billable calls with an invocation ledger, idempotency key and explicit handling of ambiguous timeouts. Route fallback only to providers that satisfy the required schema/tool capability contract, then validate output. Test residency with policy-as-code, regional routing tests, blocked cross-region egress, and audit-log evidence.
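The capacity bullet is Little's law in disguise. A small sketch with hypothetical helper names; the `headroom` factor and the token-quota model are assumptions to adapt to your provider's actual limits.

```python
import math


def concurrent_streams(qps: float, avg_duration_s: float, headroom: float = 1.5) -> int:
    """Little's law: in-flight requests = arrival rate x average time in system."""
    return math.ceil(qps * avg_duration_s * headroom)


def max_qps_for_quota(tokens_per_minute_quota: int, avg_tokens_per_request: int) -> float:
    """Sustained QPS a tokens-per-minute quota allows, ignoring burst rules."""
    return tokens_per_minute_quota / avg_tokens_per_request / 60
```

For example, 50 QPS with 4-second average streams means 200 streams in flight before headroom, while a 600,000 tokens-per-minute quota at 2,000 tokens per request supports only 5 QPS. In that case the quota, not the connection pool, is the binding limit.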
1. Control plane: agent definitions, prompt/model/tool versions, policy, evaluations, rollout and audit.
2. Data plane: isolated runtime executes a bounded state machine; tool gateway authenticates, authorizes, validates arguments and records effects.
3. Memory: separate immutable event history, short-term working state and curated long-term memory with consent, expiry and deletion.
4. Safety: risk-tier tools; read-only may auto-run, money/data mutations require deterministic checks and human approval. Never trust model text as authorization.
5. Operate: trace every reasoning-independent event, tool call, policy decision and approval; cap steps/tokens/time and provide kill switches.
Pipeline
- Upload → malware scan → immutable object store.
- OCR/layout/table extraction → normalized document graph.
- Rules + model extraction → schema validation → confidence routing.
- Human review for low-confidence/high-value fields.
Production Concerns
- Idempotent jobs, page-level retries and lineage to source coordinates.
- Per-field precision/recall, review rate and business-value metrics.
- PII isolation, retention, tenant keys and deletion propagation.
- Versioned extractors and replayable raw documents.
API/data model: POST /documents, GET /jobs/{id}, webhook completion; Document, Page, Element, Extraction, Evidence, ReviewDecision and ModelVersion.
- Path: WebRTC/media gateway → streaming ASR → dialogue/agent runtime → tools/RAG → streaming TTS. Keep regional session state and a durable event summary.
- Latency: budget each stage; stream partial transcripts and speech; use VAD and speculative work carefully. Measure time-to-first-audio and interruption-stop latency.
- Barge-in: cancel TTS and downstream work, advance the turn epoch, and reject late results from the old epoch.
- Reliability: reconnect token, session checkpoint, provider fallback, bounded silence/timeouts and graceful transfer to human.
Cross-question: How do you prevent a partially heard confirmation from triggering a payment? Require explicit deterministic confirmation before high-impact tools.
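The barge-in rule ("advance the turn epoch, reject late results") reduces to a counter comparison. A minimal sketch with a hypothetical `TurnGuard` class:

```python
class TurnGuard:
    """Track the current turn epoch and reject results from cancelled turns."""

    def __init__(self) -> None:
        self.epoch = 0

    def start_turn(self) -> int:
        # Tag all work started for this turn with the current epoch.
        return self.epoch

    def barge_in(self) -> None:
        # User interrupted: everything tagged with an older epoch is now stale.
        self.epoch += 1

    def accept(self, result_epoch: int) -> bool:
        # Late TTS audio or tool results from an old epoch must be dropped.
        return result_epoch == self.epoch
```

The same check protects the payment case: a confirmation captured under an old epoch is rejected, so the high-impact tool only runs after a fresh, explicit confirmation in the current turn.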
- Router inputs: task type, complexity, modality, context size, tenant policy, region, safety tier, live health, measured quality, latency and cost.
- Policy: deterministic hard constraints first; learned or rules-based ranking second; fallback must preserve schema/tool capability and safety.
- Cache: key includes normalized intent plus tenant, policy, prompt, model family, knowledge/index version and safety context. Never share across authorization boundaries.
- Evaluate: counterfactual shadow traffic, quality-regret, cache precision, hit rate, cost per accepted task and tail latency.
Key principle: the model proposes a tool call; deterministic systems decide whether and how it executes. Treat tool output as untrusted input before returning it to an agent.
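from __future__ import annotations
from typing import AsyncIterator, Protocol
# GenerationRequest, GenerationResult and TokenEvent are the canonical types, defined elsewhere.
class ModelProvider(Protocol):
    async def generate(self, req: GenerationRequest) -> GenerationResult: ...
    # Plain def: implementations are async generators, which return the iterator directly.
    def stream(self, req: GenerationRequest) -> AsyncIterator[TokenEvent]: ...
class Gateway:
    def __init__(self, router, providers, policies, telemetry):
        self.router, self.providers = router, providers
        self.policies, self.telemetry = policies, telemetry
    async def generate(self, req):
        checked = self.policies.validate(req)  # reject or normalise before routing
        route = self.router.choose(checked)    # provider plus provider-specific request
        return await self.providers[route.provider].generate(route.request)
Use canonical request/result types and provider adapters. Keep routing, retry, safety, telemetry and provider calls separate. Preserve provider-specific capabilities through an explicit capability contract rather than leaking arbitrary dictionaries everywhere.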
Cross-question answers: Record each invocation ID and outcome so an ambiguous timeout is reconciled before retry. Surface streaming failures as typed terminal events while preserving already emitted tokens and trace IDs. Test adapters with provider contract tests, deterministic fake providers, recorded fixtures and a small gated live integration suite.
Document(id, tenant_id, source_uri, content_hash, version)
IngestionJob(id, document_id, version, state, attempt, lease_until)
Chunk(id, document_id, version, ordinal, text_hash, metadata)
RECEIVED -> PARSED -> CHUNKED -> EMBEDDED -> INDEXED -> ACTIVE
       \-> FAILED_RETRYABLE | FAILED_FINAL
- Use (tenant_id, source_uri, content_hash) or an explicit idempotency key to deduplicate.
- Workers claim jobs with leases; every transition uses compare-and-set. Retries resume from durable artifacts.
- Build a new document version, verify counts/quality, atomically switch active version, then garbage-collect old chunks.
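Lease-based claiming and compare-and-set transitions can be sketched against an in-memory store. `JobStore` is a hypothetical stand-in for a database with conditional updates; a real implementation would perform each check and write as a single atomic conditional operation.

```python
import time
from typing import Optional

ORDER = ["RECEIVED", "PARSED", "CHUNKED", "EMBEDDED", "INDEXED", "ACTIVE"]


class JobStore:
    """In-memory stand-in for a database with conditional updates (assumption)."""

    def __init__(self) -> None:
        self.jobs: dict = {}

    def add(self, job_id: str) -> None:
        self.jobs[job_id] = {"state": "RECEIVED", "owner": None, "lease_until": 0.0}

    def claim(self, job_id: str, worker: str, lease_s: float,
              now: Optional[float] = None) -> bool:
        """Take the job only if nobody else holds an unexpired lease."""
        now = time.time() if now is None else now
        job = self.jobs[job_id]
        if job["owner"] not in (None, worker) and job["lease_until"] > now:
            return False
        job.update(owner=worker, lease_until=now + lease_s)
        return True

    def advance(self, job_id: str, worker: str, expected: str,
                now: Optional[float] = None) -> bool:
        """Compare-and-set: step forward only from `expected`, under a live lease."""
        now = time.time() if now is None else now
        job = self.jobs[job_id]
        if expected == ORDER[-1]:
            return False  # ACTIVE is terminal on the success path
        if job["state"] != expected or job["owner"] != worker or job["lease_until"] <= now:
            return False
        job["state"] = ORDER[ORDER.index(expected) + 1]
        return True
```

A worker whose lease expired mid-step fails its `advance` call and must stop, which is what keeps a slow worker and its replacement from both writing the same transition.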
ToolSpec(name, version, input_schema, required_scopes, risk_tier, timeout)
Invocation(id, actor, agent, tool, args_hash, status, effect_receipt)
execute(invocation):
    authenticate -> authorize -> schema_validate -> policy_check
    -> approve_if_required -> run_isolated -> validate_output -> audit
Use Registry/Repository for discovery, Command for invocations, Policy object for authorization, and an idempotency store for mutations. The executor receives short-lived scoped credentials; the LLM never receives a reusable secret.
| Entity | Purpose | Important Fields |
|---|---|---|
| Conversation | User-facing durable container | tenant, participants, policy, created_at |
| Turn/Event | Append-only source of truth | sequence, role/type, content_ref, trace_id, timestamp |
| Session | Runtime/checkpoint state | epoch, state, lease, expires_at |
| Memory | Curated reusable fact | subject, fact, evidence, consent, confidence, expiry |
Use optimistic concurrency on sequence/epoch. Summaries are derived views, never the sole source of truth. Support deletion and retention across events, embeddings, caches and analytics copies.
- Rate limit: hierarchical token buckets for tenant, model/provider and global capacity; account for requests and estimated tokens.
- Retry: only retry transient errors within the request deadline; exponential backoff with jitter; honor provider retry hints.
- Circuit breaker: CLOSED → OPEN after threshold → HALF_OPEN probes → CLOSED on recovery. Scope by provider/region/model capability.
- Correctness: attach invocation IDs and record outcomes so a timeout does not automatically become a duplicate billable/action call.
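The circuit-breaker state machine above fits in a few lines. This is a simplified sketch: the class and parameter names are hypothetical, it counts consecutive failures only, and HALF_OPEN admits every caller rather than a single probe, which a production breaker would restrict.

```python
import time


class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` failures -> HALF_OPEN after `cooldown_s`."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is not open

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "HALF_OPEN"
        return "OPEN"

    def allow(self) -> bool:
        # OPEN rejects immediately; HALF_OPEN lets probes through.
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.opened_at = self.clock()  # (re)open and restart the cooldown
```

Keeping one breaker per provider/region/model-capability key, as the bullet suggests, stops a single failing route from tripping traffic that could still succeed elsewhere.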
class Evaluator(Protocol):
name: str
async def score(self, case: EvalCase, output: Output) -> Score: ...
EvalCase(id, input, expected, rubric, tags, dataset_version)
Run(id, system_version, dataset_version, seed, status)
Score(metric, value, rationale, evaluator_version, cost)
Support deterministic checks, retrieval metrics, human rubrics and model judges behind one interface. Run asynchronously with bounded concurrency; persist every version and raw artifact. Compare paired cases and confidence intervals, not only averages.
Two-stage lookup: exact canonical key first; semantic ANN lookup second. A candidate is reusable only if similarity passes an evaluated threshold and all hard dimensions match: tenant/ACL, locale, policy, prompt version, model capability, knowledge/index version and tool-state class.
- Store answer, evidence, creation/expiry, safety decision and provenance; encrypt and enforce tenant boundaries.
- Invalidate by version bump or event; use short TTL for volatile facts and never cache sensitive/high-impact actions by default.
- Measure cache precision with sampled replay, not just hit rate. A false hit can be more expensive than a miss.
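A minimal sketch of the two-stage lookup, using a linear scan where a real system would use an ANN index. `SemanticCache`, the `hard_key` tuple layout and the threshold value are hypothetical; expiry, encryption and invalidation are omitted for brevity.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold  # should come from evaluation, not a guess
        self.exact: dict = {}
        self.entries: list = []

    def put(self, hard_key: tuple, text: str, vec: list, answer: str) -> None:
        self.exact[(hard_key, text.strip().lower())] = answer
        self.entries.append((hard_key, vec, answer))

    def get(self, hard_key: tuple, text: str, vec: list):
        # Stage 1: exact canonical key.
        hit = self.exact.get((hard_key, text.strip().lower()))
        if hit is not None:
            return hit
        # Stage 2: semantic lookup, restricted to identical hard dimensions.
        best, best_sim = None, self.threshold
        for key, cached_vec, answer in self.entries:
            if key != hard_key:
                continue  # never reuse across tenant/policy/version boundaries
            sim = cosine(vec, cached_vec)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best
```

Putting tenant, prompt version and index version into `hard_key` means a version bump invalidates old entries implicitly: they simply stop matching.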
POST /v1/conversations/{id}/messages
Headers: Idempotency-Key, Last-Event-ID
Response: text/event-stream
events: accepted, retrieval, token, citation, completed | failed
cancel_event set -> stop provider stream -> persist terminal status
client disconnect -> bounded cleanup; do not leak provider connections
What the interviewer checks: async iteration and backpressure, cancellation, timeout budgets, idempotency, ordered event persistence, reconnect/resume, dependency injection, validation and tests for disconnect/error races.
Given an integer array and a target, return indices of the two numbers that add up to target. You may not use the same element twice.
Key insight: For each element n, we need target - n. Store each value's index in a hash map. On each step, check if the complement is already in the map.
def two_sum(nums: list[int], target: int) -> list[int]:
    seen = {}  # value → index
    for i, n in enumerate(nums):
        diff = target - n
        if diff in seen:
            return [seen[diff], i]
        seen[n] = i
    return []
# Example: two_sum([2,7,11,15], 9) → [0, 1]
Find the length of the longest sequence of consecutive integers. Must run in O(n) time.
Key insight: For each number, only start counting if it's the beginning of a sequence (n-1 is not in the set). This ensures each number is visited at most twice — O(n) overall.
def longest_consecutive(nums: list[int]) -> int:
    num_set = set(nums)
    best = 0
    for n in num_set:
        if n - 1 not in num_set:  # sequence start
            cur, length = n, 1
            while cur + 1 in num_set:
                cur += 1
                length += 1
            best = max(best, length)
    return best
# [100,4,200,1,3,2] → 4 (sequence 1,2,3,4)
Group strings that are anagrams of each other into sublists.
from collections import defaultdict
def group_anagrams(strs: list[str]) -> list[list[str]]:
    groups = defaultdict(list)
    for s in strs:
        key = tuple(sorted(s))  # e.g. 'eat' → ('a','e','t')
        groups[key].append(s)
    return list(groups.values())
# Alternative count-based key for lowercase a-z input; avoids the sort (needs: from collections import Counter)
key = tuple(Counter(s).get(c, 0) for c in 'abcdefghijklmnopqrstuvwxyz')
For each index i, compute the product of all elements except nums[i]. O(n) time, no division operator.
Insight: result[i] = prefix_product[i-1] * suffix_product[i+1]. Do two passes: left prefix in one array, multiply by right suffix on the fly.
def product_except_self(nums: list[int]) -> list[int]:
    n = len(nums)
    res = [1] * n
    # Left pass: res[i] = product of nums[0..i-1]
    prefix = 1
    for i in range(n):
        res[i] = prefix
        prefix *= nums[i]
    # Right pass: multiply by product of nums[i+1..n-1]
    suffix = 1
    for i in range(n - 1, -1, -1):
        res[i] *= suffix
        suffix *= nums[i]
    return res
Return the k most frequent elements. O(n) solution using bucket sort.
from collections import Counter

def top_k_frequent(nums: list[int], k: int) -> list[int]:
    freq = Counter(nums)
    # Bucket index = frequency value
    buckets = [[] for _ in range(len(nums) + 1)]
    for val, cnt in freq.items():
        buckets[cnt].append(val)
    res = []
    for i in range(len(buckets) - 1, -1, -1):
        for val in buckets[i]:
            res.append(val)
            if len(res) == k:
                return res
    return res  # fewer than k distinct values: return what was found instead of None
def length_of_longest_substring(s: str) -> int:
    seen = {}  # char → last index
    left = best = 0
    for right, ch in enumerate(s):
        if ch in seen and seen[ch] >= left:
            left = seen[ch] + 1  # shrink from left
        seen[ch] = right
        best = max(best, right - left + 1)
    return best

# "abcabcbb" → 3 ("abc")
Find the minimum window in string s that contains all characters of string t.
from collections import Counter

def min_window(s: str, t: str) -> str:
    need = Counter(t)
    window = {}
    have, total = 0, len(need)
    best = (float("inf"), 0, 0)  # (window length, left, right)
    left = 0
    for right, ch in enumerate(s):
        window[ch] = window.get(ch, 0) + 1
        if ch in need and window[ch] == need[ch]:
            have += 1
        while have == total:
            if (right - left + 1) < best[0]:
                best = (right - left + 1, left, right)
            window[s[left]] -= 1
            if s[left] in need and window[s[left]] < need[s[left]]:
                have -= 1
            left += 1
    l, r = best[1], best[2]
    return s[l:r+1] if best[0] != float("inf") else ""
def max_area(height: list[int]) -> int:
    left, right = 0, len(height) - 1
    best = 0
    while left < right:
        area = (right - left) * min(height[left], height[right])
        best = max(best, area)
        # Move the shorter wall inward
        if height[left] < height[right]:
            left += 1
        else:
            right -= 1
    return best
from collections import deque

def level_order(root) -> list[list[int]]:
    if not root:
        return []
    result, q = [], deque([root])
    while q:
        level = []
        for _ in range(len(q)):  # snapshot size = current level
            node = q.popleft()
            level.append(node.val)
            if node.left:
                q.append(node.left)
            if node.right:
                q.append(node.right)
        result.append(level)
    return result
def num_islands(grid: list[list[str]]) -> int:
    if not grid:  # guard: grid[0] would fail on an empty grid
        return 0
    rows, cols = len(grid), len(grid[0])
    count = 0

    def dfs(r, c):
        if (r < 0 or c < 0 or r >= rows or c >= cols
                or grid[r][c] != "1"):
            return
        grid[r][c] = "0"  # mark visited in-place
        for dr, dc in [(1,0),(-1,0),(0,1),(0,-1)]:
            dfs(r+dr, c+dc)

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "1":
                dfs(r, c)
                count += 1
    return count
Given n courses and prerequisites, determine if you can finish all courses. Equivalent to detecting a cycle in a directed graph.
def can_finish(n: int, prereqs: list[list[int]]) -> bool:
    adj = [[] for _ in range(n)]
    for a, b in prereqs:
        adj[b].append(a)
    # 0=unvisited, 1=in-stack (cycle), 2=done
    state = [0] * n

    def dfs(node):
        if state[node] == 1:
            return False  # cycle
        if state[node] == 2:
            return True  # already processed
        state[node] = 1
        for nei in adj[node]:
            if not dfs(nei):
                return False
        state[node] = 2
        return True

    return all(dfs(i) for i in range(n) if state[i] == 0)
def lca(root, p, q):
    if not root or root == p or root == q:
        return root
    left = lca(root.left, p, q)
    right = lca(root.right, p, q)
    # If found in both subtrees → current node is LCA
    if left and right:
        return root
    return left or right

def climb_stairs(n: int) -> int:
    if n <= 2:
        return n
    a, b = 1, 2
    for _ in range(3, n + 1):
        a, b = b, a + b  # rolling Fibonacci
    return b
Pattern recognition: Whenever f(n) = f(n-1) + f(n-2) or similar recurrence appears, recognise it as Fibonacci DP. Roll two variables instead of an array for O(1) space.
Directly used in diff algorithms, tokenisation comparison, and DNA sequence alignment.
def lcs(text1: str, text2: str) -> int:
    m, n = len(text1), len(text2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text1[i-1] == text2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

def knapsack(weights: list, values: list, W: int) -> int:
    n = len(weights)
    dp = [0] * (W + 1)
    for i in range(n):
        for w in range(W, weights[i] - 1, -1):  # reverse!
            dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
    return dp[W]
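A natural follow-up to the length-only version is recovering the subsequence itself, which is what diff-style tools need. The sketch below is one way to do it: fill the same table, then walk back from the bottom-right corner. The name `lcs_string` is hypothetical, and when several longest subsequences exist it returns just one of them.

```python
def lcs_string(text1: str, text2: str) -> str:
    """Return one longest common subsequence of text1 and text2."""
    m, n = len(text1), len(text2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text1[i - 1] == text2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtrack: a match contributes a character; otherwise move toward
    # the neighbour that holds the larger LCS length.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if text1[i - 1] == text2[j - 1]:
            out.append(text1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs_string("abcde", "ace"))
```

The backtrack adds only O(m + n) work on top of the O(m·n) table fill, but it needs the full table, so the rolling-row space optimisation no longer applies.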
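Key: With 1D DP, iterate the capacity w in reverse (from W down to the item's weight). This ensures each item is used at most once, because dp[w - weight] still holds the value from before the current item was considered. Forward iteration = unbounded knapsack (items can repeat).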
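To make the forward-versus-reverse point concrete, here is a minimal sketch of the unbounded variant. The function name and the sample data are illustrative assumptions, and item weights are assumed to be positive integers.

```python
def unbounded_knapsack(weights: list[int], values: list[int], W: int) -> int:
    """Max value for capacity W when each item may be picked any number of times."""
    dp = [0] * (W + 1)
    for wt, val in zip(weights, values):
        # Forward iteration: dp[w - wt] may already include this item,
        # so the same item can be counted again.
        for w in range(wt, W + 1):
            dp[w] = max(dp[w], dp[w - wt] + val)
    return dp[W]


if __name__ == "__main__":
    # Two copies of the weight-3 item plus two of the weight-1 item fill capacity 8.
    print(unbounded_knapsack([1, 3, 4], [15, 50, 60], 8))  # → 130
```

With the same data, the reverse-iteration 0/1 version can take each item only once and reaches 125 (all three items), so the loop direction alone changes the answer.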
Determine if string s can be segmented using words from a dictionary. Directly related to LLM tokenisation problems.
def word_break(s: str, words: list[str]) -> bool:
    word_set = set(words)
    dp = [False] * (len(s) + 1)
    dp[0] = True  # empty string is breakable
    for i in range(1, len(s) + 1):
        for j in range(i):
            if dp[j] and s[j:i] in word_set:
                dp[i] = True
                break
    return dp[len(s)]
import asyncio

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = anthropic.AsyncAnthropic()

class ChatRequest(BaseModel):
    query: str
    session_id: str = "default"

@app.post("/chat")
async def chat_stream(req: ChatRequest):
    async def generate():
        try:
            async with client.messages.stream(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system="You are a helpful GenAI assistant.",
                messages=[{"role": "user", "content": req.query}],
            ) as stream:
                # text_stream is an async iterator attribute, not a method
                async for text in stream.text_stream:
                    # SSE format; a chunk containing "\n" needs one "data:" line per line
                    yield f"data: {text}\n\n"
        except Exception as e:
            yield f"data: [ERROR] {str(e)}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")

# Batch concurrent calls
# single_call is assumed to be an async helper defined elsewhere that returns one completion string.
async def batch_process(queries: list[str]) -> list[str | BaseException]:
    tasks = [single_call(q) for q in queries]
    # return_exceptions=True: failed calls come back as exception objects instead of raising
    return await asyncio.gather(*tasks, return_exceptions=True)
import asyncio
import functools
import logging
import random
from typing import Type

def retry(
    max_tries: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[Type[Exception], ...] = (Exception,),
):
    def decorator(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return await func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_tries - 1:
                        raise
                    # Exponential backoff plus jitter
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                    logging.warning(f"Attempt {attempt+1} failed: {e}. Retry in {delay:.1f}s")
                    await asyncio.sleep(delay)
        return async_wrapper
    return decorator

# Usage
@retry(max_tries=3, base_delay=1.0)
async def call_llm(prompt: str) -> str:
    ...
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

r = redis.Redis(host="localhost", port=6379)
oai = OpenAI()
THRESHOLD = 0.92

def embed(text: str) -> np.ndarray:
    resp = oai.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cached_llm(query: str, llm_fn) -> str:
    q_emb = embed(query)
    # Scan cache for similar queries (linear in cache size; a vector index scales better)
    for key in r.scan_iter("cache:*"):
        raw = r.get(key)
        if raw is None:  # entry expired between scan and get
            continue
        entry = json.loads(raw)
        if cosine_sim(q_emb, entry["emb"]) >= THRESHOLD:
            return entry["response"]  # cache hit
    # Cache miss: call the LLM
    response = llm_fn(query)
    # Built-in hash() is salted per process for strings, so use a stable digest for the key
    key = "cache:" + hashlib.sha256(query.encode("utf-8")).hexdigest()
    r.setex(key, 3600, json.dumps({"emb": q_emb.tolist(), "response": response}))
    return response
| Pattern | When to use | Example in GenAI |
|---|---|---|
| Sync generator (`yield`) | CPU-bound iteration, large dataset processing | Streaming chunked document ingestion from S3 |
| Async generator (`async def` with `yield`) | Awaiting I/O in each iteration | Streaming LLM tokens to FastAPI response |
| `asyncio.Queue` | Multiple producers/consumers | Agent observation queue (multiple tools running) |
async def token_stream(prompt: str):
    """Async generator: yields tokens as they arrive"""
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        # text_stream is an async iterator attribute, not a method
        async for token in stream.text_stream:
            yield token  # caller sees each token immediately

# Consumer
async def main():
    async for token in token_stream("Explain RAG"):
        print(token, end="", flush=True)
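The asyncio.Queue row of the table has no example, so here is a minimal sketch of the "agent observation queue" idea: several tools run concurrently and push results onto one queue, and a single consumer reads them in completion order. The names `tool_worker` and `collect_observations` and the tool names are hypothetical, and `asyncio.sleep` stands in for real tool I/O.

```python
import asyncio


async def tool_worker(name: str, delay: float, queue: asyncio.Queue) -> None:
    """Hypothetical tool: waits for its (simulated) I/O, then publishes one observation."""
    await asyncio.sleep(delay)  # stand-in for a real tool or API call
    await queue.put((name, f"{name} finished"))


async def collect_observations(tools: dict[str, float]) -> list[tuple[str, str]]:
    """Run all tools concurrently and collect observations as they arrive."""
    queue: asyncio.Queue = asyncio.Queue()
    workers = [
        asyncio.create_task(tool_worker(name, delay, queue))
        for name, delay in tools.items()
    ]
    observations = []
    for _ in workers:  # exactly one observation per producer
        observations.append(await queue.get())
    await asyncio.gather(*workers)  # surface any worker exception
    return observations


if __name__ == "__main__":
    # The faster tool is observed first, regardless of dict order.
    print(asyncio.run(collect_observations({"search": 0.2, "calculator": 0.0})))
```

Compared with `asyncio.gather`, the queue lets the consumer act on each observation as soon as it lands instead of waiting for the slowest tool.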
At PwC, a US healthcare client's multi-agent clinical query system had P95 latency of 3.8 seconds — too slow for real-time clinical use. The system was seeing 200+ QPS during peak hospital hours.
I owned the end-to-end latency reduction initiative — profiling, identifying bottlenecks, proposing and implementing solutions within a 3-week sprint.
1. Profiled with LangSmith traces and found that 60% of latency came from redundant sequential sub-agent calls that could run in parallel.
2. Refactored to async fan-out using asyncio.gather: all sub-agents dispatch concurrently, and only the merge step waits.
3. Added a Redis semantic cache (GPTCache library) with cosine threshold 0.92, which served ~40% of traffic from cache at <10ms.
4. Routed intent classification to an evaluated low-latency model tier instead of the stronger reasoning tier: 3× cheaper and 5× faster for that subtask.
70% latency reduction — P95 from 3.8s → 1.1s. The system went live and the client reported a 60% efficiency uplift for clinical staff. I received PwC's Delivery Head Award for this sprint.
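If the interviewer digs into the fan-out step, a small sketch helps show the difference between sequential and concurrent dispatch. Everything here is illustrative: `call_sub_agent` and the agent names are hypothetical, and the sleep stands in for an LLM or tool call.

```python
import asyncio
import time


async def call_sub_agent(name: str, query: str) -> str:
    """Hypothetical sub-agent; the sleep simulates network-bound work."""
    await asyncio.sleep(0.1)
    return f"{name}: answer for {query!r}"


async def answer_sequential(query: str, agents: list[str]) -> list[str]:
    # Each call waits for the previous one: total time is roughly the sum.
    return [await call_sub_agent(a, query) for a in agents]


async def answer_fan_out(query: str, agents: list[str]) -> list[str]:
    # All calls start together: total time is roughly the slowest call.
    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(call_sub_agent(a, query) for a in agents))


if __name__ == "__main__":
    agents = ["labs", "medications", "guidelines"]
    for fn in (answer_sequential, answer_fan_out):
        start = time.perf_counter()
        asyncio.run(fn("recent potassium results", agents))
        print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")
```

Fan-out only helps when the sub-agent calls are independent; anything that needs another agent's output still has to wait for it before the merge step.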
At Cognizant's GenAI Lab, I was asked simply to "explore if LLMs can improve our client's enterprise search." No timeline, no team, no defined success metric.
Define the problem, build a prototype, benchmark it, and present a path to production — all independently.
1. Interviewed 3 internal stakeholders to learn the actual pain: keyword search missed semantic intent ("show me Q3 revenue analysis" found nothing).
2. Ran a 2-week spike: built a Pinecone-backed semantic search with fine-tuned sentence-transformers embeddings. Collected 200 test queries with relevance labels.
3. Benchmarked Recall@5 against the existing keyword search: 35% improvement. Presented the spike results with a 3-phase production roadmap.
Prototype became the foundation for a production semantic search shipped to the client — 35% accuracy improvement. Received Cognizant's Rising Star Award.
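Be ready to define the metric if asked. The sketch below uses one common definition of Recall@k (the fraction of labelled-relevant documents that appear in the top k results), averaged over queries. The function names and document IDs are illustrative assumptions, not the original benchmark code.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents found in the top-k retrieved IDs."""
    if not relevant:
        return 0.0  # convention chosen here: no labels means no credit
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average Recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3"], {"d2", "d9"}),  # one of two relevant docs found
        (["d4", "d5"], {"d4"}),              # the only relevant doc found
    ]
    print(mean_recall_at_k(runs, k=5))  # → 0.75
```

Running the same labelled queries through both the keyword baseline and the semantic retriever gives a like-for-like comparison.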
Early in a RAG project at PwC, I used fixed-size 512-token chunks with no overlap because it was the fastest to implement. The system went to QA.
QA caught that multi-page financial tables were being split across chunks — the retriever would get half a table, giving wrong answers. I spent 2 days re-chunking with a hierarchical splitter and adding 50-token overlap. If I'd spent 4 hours upfront analysing the document corpus structure, I would have picked the right chunking strategy immediately and avoided the rework.
What I do now: Always spend the first day profiling the document corpus (distribution of doc types, average length, presence of tables/headers) before writing a single line of chunking code.
Shows self-awareness without being self-deprecating. The mistake was reasonable and the lesson is concrete and technical — not vague.
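The overlap part of the fix in this story can be sketched in a few lines. This is a simplified illustration only: it treats a list of strings as tokens (a real pipeline would use the model's tokenizer), it does not implement the hierarchical splitter, and the name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reaches the end
            break
    return chunks


if __name__ == "__main__":
    tokens = [str(i) for i in range(10)]
    for chunk in chunk_with_overlap(tokens, size=4, overlap=1):
        print(chunk)
```

With `size=4, overlap=1`, the ten tokens become three chunks, and each chunk starts with the last token of the previous one. Overlap reduces boundary loss, but a table that spans pages still needs structure-aware splitting.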
At EPAM, my team spans India (IST), Poland (CEST, 3.5 hours behind IST), and the US East Coast (EST, 10.5 hours behind IST). We have critical weekly demos with a US Finance client.
1. Async-first documentation: Every design decision gets a Confluence ADR (Architecture Decision Record) written the same day. PRs have detailed descriptions so teammates in other zones can review without a live sync call.
2. Slack status tagging: I tag every update with [DONE], [BLOCKED: needs X], or [DECISION NEEDED: by EOD]. This tells teammates exactly what action, if any, they need to take.
3. Flexible overlap windows: 2 days/week I shift my working hours into the IST evening to get 2 hours of overlap with the US East Coast morning for client demo prep (6:30 PM IST is 8:00 AM EST).
4. Loom walkthroughs: For complex architecture changes, I record a 5-min Loom instead of a document. Faster to create, easier to consume across time zones.
Zero missed client demo milestones in 10 months. The Delivery Head Award (2025) cited "consistent, reliable cross-timezone delivery" as a specific reason.
"I'm Purnendu Das, a Senior GenAI Developer with 5 years building production AI systems — not demos, but real systems that clients use every day.
My core expertise is three things: first, RAG architectures that work at scale — hybrid retrieval, re-ranking, evaluation pipelines. Second, multi-agent systems using LangGraph for complex, stateful workflows. Third, LLMOps on AWS — deploying with Bedrock, Lambda, and OpenSearch, with proper monitoring and cost control.
At PwC I built a multi-agent platform for a healthcare client that reduced response latency by 70% and delivered a 60% efficiency improvement for clinical staff. At EPAM right now I'm architecting a Hybrid RAG system for a Tier-1 US Finance client, and we've achieved sub-second inference latency. Before that at Cognizant I founded and ran our GenAI research lab, shipping a semantic search product with a 35% accuracy improvement.
I work fully remote and have delivered consistently across India, Poland, and US time zones. I'm looking for a senior role where I can keep solving enterprise-scale AI problems with a high-quality engineering team."