Enterprise LLM Platforms — Azure AI Foundry, Amazon Bedrock, OpenAI

Three platforms show up in most large-scale GenAI projects: Azure AI Foundry, Amazon Bedrock, and the OpenAI Platform. Each takes a different shape (model marketplace, managed-services hub, or first-party API), and the testing implications differ accordingly.

Quick Comparison

Dimension	Azure AI Foundry	Amazon Bedrock	OpenAI Platform
Owner	Microsoft	AWS	OpenAI
Positioning	End-to-end agent / app dev platform	Managed model + agent serving on AWS	First-party API to OpenAI's frontier models
Model catalogue	1,800+ models (OpenAI, Mistral, Meta, Cohere, NVIDIA, DeepSeek, open-source via models-as-a-service)	Anthropic Claude, Meta Llama, Mistral, AI21, Cohere, Stability, Amazon Titan, Amazon Nova	OpenAI GPT, o-series reasoning, DALL·E, embeddings, audio (Whisper, TTS)
Managed RAG	Azure AI Search integration; "Prompt Flow" / "Foundry Agents"	Bedrock Knowledge Bases (S3 / Confluence / Salesforce → managed vector store)	Built into Assistants API ("File Search"); Responses API has hosted retrieval
Managed agents	Foundry Agent Service — orchestration, tool calling, state, MCP-compatible	Bedrock AgentCore + Bedrock Agents (Action Groups via Lambda)	Assistants API + Responses API; Agents SDK (open source)
Guardrails / safety	Azure AI Content Safety (Prompt Shields, Groundedness, Protected Material)	Bedrock Guardrails (denied topics, content filters, PII, prompt-attack, contextual grounding)	Moderation API; built-in policy enforcement; safety-trained models
Evaluation tooling	Azure AI Foundry Evaluators + AI Red Teaming Agent (PyRIT-powered)	Bedrock Model Evaluation (automatic + human); Bedrock Guardrail evaluation	Evals (open-source framework); built-in fine-tune eval; OpenAI dashboards
Observability	Foundry tracing (OpenTelemetry); App Insights integration	CloudWatch + Bedrock model invocation logs; Bedrock Studio traces	Dashboard, usage logs, run-step traces in Assistants/Responses
MCP support	Native — Foundry Agents speak MCP; Microsoft pushed MCP across Copilot Studio + AKS	Bedrock AgentCore supports MCP tool integration	MCP-compatible via Responses API + Agents SDK
Pricing model	Per-model token pricing + platform usage	Per-model token pricing; on-demand or provisioned throughput	Per-model token pricing; pricing tiers (standard / batch / fine-tune)
Best for	Microsoft-anchored enterprises; Office/Teams/Copilot extension	AWS-anchored enterprises; regulated workloads with strict data residency	Cutting-edge model access; OpenAI-specific features (o-series reasoning, GPT image, real-time API)
Watch-outs	Sprawl — many overlapping services as the platform evolves rapidly	Per-model regional availability; Knowledge-Base vector-store options vary	Single-vendor dependency; rate-limit ceilings at scale; less data-residency control

1. Azure AI Foundry

Microsoft's unified AI app + agent development platform. Combines what used to be Azure OpenAI Service, Azure AI Studio, and Azure AI Services into one portal and SDK surface.

What it is

Azure AI Foundry is positioned as the Microsoft platform for building, deploying, and operating production GenAI applications and agents. It bundles:

Model catalogue — 1,800+ models including OpenAI (exclusive Azure-hosted access to GPT-4o, GPT-4.1, o-series), Anthropic, Meta Llama, Mistral, Cohere, NVIDIA Nemotron, DeepSeek, plus open-source via "Models-as-a-Service"
Foundry Agent Service — managed agent runtime with tool calling, planning, state, and MCP-native tool integration
Foundry Local — run Foundry workloads on edge / on-prem
Prompt Flow — visual prompt/chain authoring with versioning
AI Red Teaming Agent — automated adversarial testing built on Microsoft's PyRIT (Public Preview as of 2025)
Content Safety — Prompt Shields (direct + indirect injection detection), Groundedness checks for RAG, Protected-Material detection (copyrighted content)
Tracing & evaluators — OpenTelemetry-based tracing; built-in evaluators for groundedness, relevance, coherence, fluency, similarity, F1, BLEU, ROUGE; custom evaluators
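The evaluator list above includes F1 and allows custom evaluators. As a rough illustration of what a token-overlap F1 measures, here is a minimal sketch using the common whitespace-token definition. It is an assumption for illustration, not Foundry's own implementation, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a shared token counts at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Precision is 2/3 and recall is 1, so the harmonic mean is 0.8.
score = token_f1("the cat sat", "the cat")
```

A function of this shape can be wrapped as a custom evaluator and tracked per release next to the built-in scores.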

Architecture flow

                 ┌────────────────────────────────────┐
                 │           Foundry Portal           │
                 │  (model catalogue, hub, projects)  │
                 └────────────────┬───────────────────┘
                                  │
        ┌─────────────────────────┼────────────────────────┐
        ▼                         ▼                        ▼
   Prompt Flow             Foundry Agents            Evaluators +
(chains/prompts)        (orchestration, MCP)        Red Team Agent
        │                         │                        │
        └─────────────┬───────────┴────────────────────────┘
                      ▼
     Azure OpenAI + Models-as-a-Service
                      │
                      ▼
Azure AI Search · Content Safety · App Insights

Testing implications (QE view)

Layer	What to Test
Model selection	Per-model behaviour can differ; eval suite should re-run on every model swap
Prompt Flow	Version control on flows; regression tests via Foundry Evaluators in CI
Foundry Agents	Trace-level tool-call assertions; MCP tool conformance; multi-step task completion
Content Safety	Configure Prompt Shields then test their bypass — multilingual, encoded, indirect injection through retrieved content
Groundedness	Validate RAG outputs cite retrieved context; faithfulness threshold per release
Red Teaming Agent	Run before each significant release; treat findings as regression tests
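For the Content Safety row, a bypass corpus usually starts from one known-bad payload and derives encoded rewrites of it. The sketch below generates a few such variants and reports which ones a detector misses. `is_blocked` is a hypothetical callable standing in for whatever wraps the real Prompt Shields call, and `naive_detector` is a deliberately weak placeholder.

```python
import base64
import codecs
from typing import Callable, Dict, List


def encoded_variants(payload: str) -> Dict[str, str]:
    """Return the payload alongside simple encoded rewrites of it."""
    return {
        "plain": payload,
        "base64": base64.b64encode(payload.encode("utf-8")).decode("ascii"),
        "rot13": codecs.encode(payload, "rot13"),
        "spaced": " ".join(payload),  # one space between every character
    }


def missed_variants(payload: str, is_blocked: Callable[[str], bool]) -> List[str]:
    """Names of the variants that the detector under test did not block."""
    return [
        name
        for name, text in encoded_variants(payload).items()
        if not is_blocked(text)
    ]


def naive_detector(text: str) -> bool:
    """Hypothetical stand-in for the real shield call: a keyword match."""
    return "ignore previous instructions" in text.lower()


misses = missed_variants("Ignore previous instructions", naive_detector)
```

Here the keyword filter catches only the plain form, so `misses` names the three encoded variants. Against a real shield, each miss becomes a regression test; multilingual and indirect-injection cases need their own corpora.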

Common project patterns

Copilot extensions — Word/Teams/Outlook plugins backed by Foundry Agents
Internal Q&A assistants — Foundry + Azure AI Search over SharePoint/Confluence
Customer-facing chat — usually fronted by content-safety policies + groundedness checks
Regulated workloads — Foundry's data-residency, encryption, and audit-logging features support an EU AI Act / GDPR compliance posture

Where it shines / where it stings

Shines: Microsoft 365 / Teams / Copilot integration; first-party access to frontier OpenAI models with enterprise SLA; strong identity (Entra ID) + governance story
Stings: Surface area sprawls quickly; product names change frequently (Azure OpenAI → AI Studio → AI Foundry); region-by-region feature parity gaps

2. Amazon Bedrock

AWS's managed service for foundation-model access and agentic AI on AWS infrastructure. Bedrock is the abstraction; under it sit specific model providers and a stack of managed primitives (Guardrails, Agents, Knowledge Bases).

What it is

A unified API to call a curated catalogue of foundation models — without managing infrastructure — plus a set of higher-level services:

Foundation Models — Claude (Anthropic), Llama (Meta), Mistral, Jurassic (AI21), Command (Cohere), Stable Diffusion (Stability), Titan (Amazon), Nova (Amazon's multimodal family, announced in December 2024)
Bedrock Guardrails — configurable policy layer: denied topics, content filters (hate/insults/sexual/violence/misconduct), word filters, PII filters, prompt-attack filter, contextual grounding for RAG
Bedrock Knowledge Bases — managed RAG over S3 / Confluence / Salesforce / SharePoint / web crawlers; vector store options (OpenSearch Serverless, Aurora PostgreSQL, Pinecone, Redis Enterprise)
Bedrock Agents — managed agentic orchestration with Action Groups (Lambda-backed tools), Knowledge-Base attachment, session state
Bedrock AgentCore (2025) — modular runtime for agent orchestration, memory, code interpreter, browser tool, observability, identity, gateway — MCP-compatible
Bedrock Model Evaluation — automatic and human-in-the-loop eval jobs for accuracy, robustness, toxicity
Bedrock Studio / Flows — visual builder for prompt chains and workflows
Provisioned Throughput — reserved capacity for latency / cost predictability

Architecture flow

                        ┌─────────────────────────┐
                        │       Bedrock API       │
                        │ (unified model access)  │
                        └────────────┬────────────┘
                                     │
        ┌────────────────────────────┼──────────────────────────┐
        ▼                            ▼                          ▼
Foundation Models             Bedrock Agents             Knowledge Bases
(Claude / Llama /               + AgentCore               (managed RAG)
 Mistral / Titan /                   │                          │
 Nova / etc.)                  Action Groups              Vector store
        │                     (Lambda tools)              (OpenSearch /
        └─────────────────────┬─────────────────────────── Aurora / Pinecone)
                              ▼
                     Bedrock Guardrails
               (content + safety + grounding)
                              │
                              ▼
             CloudWatch · Bedrock Studio traces

Testing implications (QE view)

Layer	What to Test
Guardrails — configuration	Each rule triggers on inputs it should, and not on inputs it shouldn't (false-positive control)
Guardrails — bypass	Multilingual, encoded payloads (Base64, ROT13), role-play, paraphrase, hypothetical framing, indirect injection through Knowledge-Base content
Action Groups	Lambda input/output schema; auth boundaries; idempotency; error handling
Knowledge Bases	Ranking quality on golden queries; indirect injection via poisoned documents (plant a malicious instruction in a doc and verify the agent refuses to execute it)
End-to-end Agent	Trace-level — which tools called, in what order, with what arguments; latency and cost budgets per workflow
AgentCore	Memory persistence; gateway policy enforcement; observability emission
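The end-to-end row asks for trace-level checks: which tools were called, in what order, with what arguments. A minimal sketch of such an assertion follows, assuming the trace has already been parsed into a list of calls. `ToolCall`, `assert_tool_sequence`, and the tool names are hypothetical and do not reflect the Bedrock trace schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToolCall:
    """One tool invocation extracted from an agent trace."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def assert_tool_sequence(
    trace: List[ToolCall],
    expected: List[Tuple[str, Dict[str, Any]]],
) -> None:
    """Assert the expected calls occur in order with matching arguments.

    Unrelated calls in between are tolerated, and each expected argument
    dict only has to be a subset of the arguments actually passed.
    """
    position = 0
    for name, required_args in expected:
        while position < len(trace):
            call = trace[position]
            position += 1
            args_match = all(
                key in call.arguments and call.arguments[key] == value
                for key, value in required_args.items()
            )
            if call.name == name and args_match:
                break
        else:
            # The trace ran out before this expected call was found.
            raise AssertionError(f"call {name}({required_args}) not found in order")


# Tool names and arguments below are invented for the example.
trace = [
    ToolCall("search_knowledge_base", {"query": "refund policy", "top_k": 5}),
    ToolCall("get_order", {"order_id": "A-1"}),
    ToolCall("issue_refund", {"order_id": "A-1", "amount": 20}),
]
assert_tool_sequence(
    trace,
    [("search_knowledge_base", {}), ("issue_refund", {"order_id": "A-1"})],
)
```

Matching on an ordered subsequence rather than exact equality keeps the test stable when the agent adds a harmless extra lookup, while still failing on reordering or wrong arguments.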

Common project patterns

Regulated finance / healthcare workloads — data stays in AWS regions; Guardrails enforce policy; auditable via CloudWatch
Internal knowledge assistants — Bedrock Agents + Knowledge Bases over S3-backed corpora
Multi-tenant SaaS — per-tenant Guardrails + isolated Knowledge Bases
Document-processing pipelines — Claude / Nova for extraction + Action Groups for downstream tooling

Where it shines / where it stings

Shines: AWS-native data plane (no cross-cloud egress); Claude access for enterprise customers; strong IAM + KMS + audit story; regional data residency
Stings: Region-by-region model availability varies; Knowledge-Base vector-store options have different cost/scale profiles; Bedrock pricing has many dimensions (input/output tokens, KB queries, agent invocations, guardrail policies)
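Because the bill has several dimensions, a per-workflow budget assertion needs them summed in one place. The sketch below does that with caller-supplied unit prices. The dimension names and every number are placeholders chosen for easy arithmetic, not real Bedrock prices.

```python
from typing import Dict


def workflow_cost(usage: Dict[str, float], unit_prices: Dict[str, float]) -> float:
    """Sum quantity times unit price over every billed dimension of one run."""
    unpriced = sorted(set(usage) - set(unit_prices))
    if unpriced:
        # Fail loudly rather than silently treating a dimension as free.
        raise KeyError(f"no unit price for: {', '.join(unpriced)}")
    return sum(quantity * unit_prices[dim] for dim, quantity in usage.items())


# Placeholder numbers only: these are NOT real Bedrock prices.
unit_prices = {
    "input_tokens": 0.001,
    "output_tokens": 0.004,
    "kb_queries": 0.25,
    "guardrail_units": 0.1,
}
usage = {
    "input_tokens": 2000,
    "output_tokens": 500,
    "kb_queries": 4,
    "guardrail_units": 10,
}

cost = workflow_cost(usage, unit_prices)  # 2.0 + 2.0 + 1.0 + 1.0
BUDGET = 7.0  # hypothetical per-run ceiling
assert cost <= BUDGET, f"workflow cost {cost:.2f} exceeds budget {BUDGET:.2f}"
```

Raising on an unpriced dimension is a deliberate choice: when a new billed dimension appears in usage data, the test breaks instead of under-reporting cost.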

Bedrock-specific testing references

See Randstad — Job Analysis (Randstad-AI-RedTeam-Lead/02-job-analysis.md) for deeper Bedrock-as-system-under-test framing — that role is built around this stack.

3. OpenAI Platform (Enterprise Scale)

Direct access to OpenAI's frontier models via OpenAI's own API and platform, with no cloud-provider abstraction layer. Popular for projects that need the latest model features fastest, or that prefer a single-vendor relationship.

What it is

OpenAI's hosted platform combining model access, an agent stack, and operational tooling:

Models — GPT-5 / GPT-4.1 / GPT-4o family for general LLM use; o-series (o3, o4-mini) for reasoning; embeddings (text-embedding-3); image gen (GPT image / DALL·E 3); audio (Whisper, GPT-4o-audio, TTS)
Responses API (2025) — successor to Chat Completions + Assistants; a single API surface supporting hosted tools, structured outputs, and multi-turn state
Assistants API — managed agent surface with hosted tools (File Search, Code Interpreter); OpenAI has announced its deprecation in favour of the Responses API
Agents SDK — Python framework for building multi-agent systems; pairs with the API
Evals — open-source evaluation framework; private + shared eval boards
Fine-tuning — supervised + preference (DPO) fine-tuning for selected models
Batch API — asynchronous batch processing at a 50% discount for non-interactive workloads
OpenAI Realtime API — bidirectional speech-to-speech with low latency
Enterprise features — SAML SSO, audit logs, data residency (EU + US), private endpoints, custom DPA
Moderation API — free content classification across harm categories
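Structured outputs are only useful to a test suite if the response is checked against the agreed contract. Below is a minimal, standard-library sketch of such a conformance check. `EXPECTED_FIELDS` is a hypothetical contract invented for the example; a real project would more likely validate against the JSON Schema it sends to the API.

```python
import json
from typing import Dict, List

# Hypothetical contract for a ticket-triage response: field name -> Python type.
EXPECTED_FIELDS: Dict[str, type] = {"category": str, "priority": int, "tags": list}


def conformance_errors(raw_response: str, expected: Dict[str, type]) -> List[str]:
    """Return a list of contract violations; an empty list means conformant."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]

    errors = []
    for name, expected_type in expected.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        is_bool_as_int = expected_type is int and isinstance(value, bool)
        if not isinstance(value, expected_type) or is_bool_as_int:
            errors.append(f"wrong type for {name}: got {type(value).__name__}")
    for name in payload:
        if name not in expected:
            errors.append(f"unexpected field: {name}")
    return errors


good = '{"category": "billing", "priority": 2, "tags": ["refund"]}'
bad = '{"category": "billing", "priority": "high", "extra": true}'
good_errors = conformance_errors(good, EXPECTED_FIELDS)
bad_errors = conformance_errors(bad, EXPECTED_FIELDS)
```

Returning every violation rather than stopping at the first makes failures easier to triage when a model swap changes several fields at once.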

Architecture flow

                ┌──────────────────────────────────┐
                │        OpenAI API surface        │
                │ (Responses · Assistants · Chat)  │
                └─────────────────┬────────────────┘
                                  │
         ┌────────────────────────┼─────────────────────────┐
         ▼                        ▼                         ▼
      Models                Hosted Tools                 Evals +
GPT-5 / o-series /          File Search /            Moderation API
  embed / image /        Code Interpreter /
       audio            Web Search (preview)
                                  │
                                  ▼
                   Agents SDK (Python) · Batch API
                                  │
                                  ▼
                    Enterprise: SSO · Logs · DPA

Testing implications (QE view)

Layer	What to Test
Model selection	Behaviour differs sharply between model families (GPT-4o vs o-series reasoning); rerun eval suites on every model swap
Responses / Assistants API	Multi-turn state persistence; hosted-tool invocation correctness; structured-output schema conformance
File Search (built-in RAG)	Retrieval quality on golden queries; ranking; citation correctness
Code Interpreter	Sandbox isolation; resource limits; output sanitisation
Moderation API	Coverage across harm categories; false-positive control on benign inputs
Rate limits	Tier-based limits; verify that clients handle HTTP 429 responses with exponential backoff, and include production resilience tests
Cost surfaces	Token costs can differ by 10× or more between mini and pro model tiers; set budget assertions per workflow
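The rate-limit row can be exercised without touching the network by injecting a fake request and a fake sleep. The sketch below shows capped exponential backoff with jitter and a resilience test around it. `RateLimited` is a hypothetical exception standing in for whatever the client raises on HTTP 429, and `call_with_backoff` is not part of any SDK.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical marker for an HTTP 429 response from the provider."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `request` on RateLimited with capped exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the 429 to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time up to the current ceiling.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


# Resilience test with a fake endpoint that rejects the first two calls.
attempts = {"count": 0}
recorded_sleeps: List[float] = []


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited()
    return "ok"


result = call_with_backoff(flaky_request, sleep=recorded_sleeps.append)
```

Passing `sleep` as a parameter keeps the test instant and lets it assert on the delays that would have been applied.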

Common project patterns

Cutting-edge prototypes — first-mover access to new models (o-series reasoning, GPT image, Realtime API)
Single-vendor SaaS startups — fastest path to production without cloud-platform overhead
Specialised use cases — Whisper for transcription, Realtime API for voice agents, image gen
Enterprises blending OpenAI direct + Azure OpenAI — Azure for production governance, OpenAI direct for evaluation of new features pre-Azure parity

Where it shines / where it stings

Shines: Day-zero access to OpenAI frontier models; cleanest API ergonomics; fastest iteration on new features; strong dev tooling (playground, dashboard, eval framework)
Stings: Single-vendor dependency (no model diversity at the platform layer); per-tier rate-limit ceilings can bite at scale; less data-residency control than Azure/Bedrock; pricing changes faster than enterprise procurement cycles

Picking Between Them (Project Sizing Heuristics)

Pick Azure AI Foundry when…

The org is Microsoft-anchored (Entra ID, Office 365, Teams, Copilot)
You need first-party OpenAI access with enterprise governance
You want one platform spanning models + agents + safety + eval
You run regulated EU workloads that benefit from Microsoft's data-residency commitments

Pick Amazon Bedrock when…

The org is AWS-anchored (data already in S3, identity in IAM)
You want Claude at enterprise scale
Data residency or air-gap requirements are tight
The agentic workload benefits from Bedrock Knowledge Bases + Lambda Action Groups

Pick OpenAI Platform when…

You need access to the latest frontier model the day it ships
Single-vendor dependency is acceptable
You're prototyping fast or building a SaaS where time-to-market dominates
Specialised features matter: o-series reasoning, Realtime API for voice, GPT image, advanced fine-tuning

Pick multiple when…

Critical workloads require model diversity for resilience or evaluation
Different teams have different cloud allegiances and need parallel tracks
You want to A/B test the same task across model families
Regulatory / commercial reasons mandate provider-redundancy (financial services often want at least two)
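When model diversity is there for resilience, the failover path itself needs a test. A minimal sketch of ordered failover across providers follows. The provider callables are stubs and the model identifiers are placeholders; real ones would wrap each platform's SDK call.

```python
from typing import Callable, List, Tuple

# (exact model identifier, callable that sends a prompt and returns text)
Provider = Tuple[str, Callable[[str], str]]


def complete_with_failover(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (model_id, completion) from the first success."""
    failures = []
    for model_id, call in providers:
        try:
            return model_id, call(prompt)
        except Exception as exc:  # deliberately broad: any provider error triggers failover
            failures.append(f"{model_id}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))


# Stub providers with placeholder identifiers.
def primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


served_by, text = complete_with_failover(
    "ping",
    [("primary-model-v1", primary), ("secondary-model-v1", secondary)],
)
```

Returning the identifier of the model that actually served the request matters for evaluation: results from a fallback model should not be attributed to the primary.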

Cross-Platform Testing Strategy

For QE work that spans more than one platform:

Provider-neutral eval harness — write tests against a thin abstraction so the same prompt/scenario runs on any of the three; libraries like LiteLLM normalise the API shape
Versioned model identifiers in every result — never log "Claude" or "GPT" without the exact model + version
Per-provider safety configuration — Guardrails (Bedrock) vs Content Safety (Azure) vs Moderation (OpenAI) have different shapes; test each native config plus a unified policy layer above
Latency / cost normalisation — report token-equivalent cost and p50/p95 latency on the same chart across providers
Drift across releases — model updates ship continuously; rerun the full eval suite on a schedule, not just on code changes
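The first, second, and fourth points above can be sketched together: scenarios run through a thin adapter, every result carries the exact model identifier, and latency is summarised the same way for every provider. The adapters here are stubs with placeholder identifiers; in practice each would wrap a real SDK call or a normalising library such as LiteLLM. `run_suite` and `percentile` are hypothetical names.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Adapter = Callable[[str], str]  # prompt -> completion text
Check = Callable[[str], bool]   # completion text -> pass/fail


@dataclass(frozen=True)
class Result:
    provider: str
    model_id: str  # exact model + version, never just a family name
    scenario: str
    passed: bool
    latency_s: float


def run_suite(
    adapters: Dict[Tuple[str, str], Adapter],
    scenarios: Dict[str, Tuple[str, Check]],
) -> List[Result]:
    """Run every scenario against every (provider, model_id) adapter."""
    results = []
    for (provider, model_id), adapter in adapters.items():
        for name, (prompt, check) in scenarios.items():
            started = time.perf_counter()
            output = adapter(prompt)
            elapsed = time.perf_counter() - started
            results.append(Result(provider, model_id, name, check(output), elapsed))
    return results


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Stub adapters stand in for real SDK calls; the model IDs are placeholders.
adapters: Dict[Tuple[str, str], Adapter] = {
    ("provider-a", "model-a-2025-01-01"): lambda prompt: "4",
    ("provider-b", "model-b-v2"): lambda prompt: "five",
}
scenarios: Dict[str, Tuple[str, Check]] = {
    "arithmetic": ("What is 2 + 2? Answer with a digit.", lambda out: out.strip() == "4"),
}

results = run_suite(adapters, scenarios)
p95_latency = percentile([r.latency_s for r in results], 95)
```

Keeping `Result` flat makes it easy to dump to CSV and chart p50/p95 per `model_id`; scheduled reruns for drift only need to call `run_suite` again and diff the pass rates.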

Cross-References

MCP fundamentals → mcp-servers-faq.md
MCP testing process → mcp-testing-roadmap.md
RAG / Agents / Agentic RAG context → rag-vs-agents-vs-agentic-rag.md
RAG eval metric layer → ragas-faq.md
Pytest-style AI test framework → deepeval-faq.md
Red-team toolchain by platform → commercial-llm-mcp-testing-tools.md
Red-team theory → red-blue-purple-team-ai-faq.md

Interview Sound-Bites

"The platform choice usually reflects the cloud strategy, not the AI strategy — Azure-shop, AWS-shop, or vendor-direct. The interesting QE question is the bit that's identical across all three: provider-neutral eval coverage, versioned model identifiers, and trace-level assertions on agent behaviour."
"Each platform has its own guardrails primitive — Bedrock Guardrails, Azure Content Safety, OpenAI Moderation. They're not interchangeable; they all need their own configuration coverage and their own bypass-testing corpus."
"For multi-vendor projects I'd standardise on a LiteLLM-style abstraction at the runtime layer and a unified evaluation harness on top. The platform-specific testing — Knowledge Base injection, Foundry Red Team agent runs, OpenAI rate-limit resilience — lives in dedicated suites."
"OpenAI direct is the fastest way to evaluate a new frontier model; Azure or Bedrock is the right place to operate it. Most mature enterprises end up with both — evaluation track and production track."