Enterprise LLM Platforms — Azure AI Foundry, Amazon Bedrock, OpenAI

The three platforms that show up in most large-scale GenAI projects: Azure AI Foundry, Amazon Bedrock, and the OpenAI Platform. Each takes a different shape — model marketplace, managed-services hub, or first-party API — and the testing implications differ accordingly.


Quick Comparison

Dimension | Azure AI Foundry | Amazon Bedrock | OpenAI Platform
Owner | Microsoft | AWS | OpenAI
Positioning | End-to-end agent / app dev platform | Managed model + agent serving on AWS | First-party API to OpenAI's frontier models
Model catalogue | 1,800+ models (OpenAI, Mistral, Meta, Cohere, NVIDIA, DeepSeek, open-source via models-as-a-service) | Anthropic Claude, Meta Llama, Mistral, AI21, Cohere, Stability, Amazon Titan, Amazon Nova | OpenAI GPT, o-series reasoning, DALL·E, embeddings, audio (Whisper, TTS)
Managed RAG | Azure AI Search integration; "Prompt Flow" / "Foundry Agents" | Bedrock Knowledge Bases (S3 / Confluence / Salesforce → managed vector store) | Built into Assistants API ("File Search"); Responses API has hosted retrieval
Managed agents | Foundry Agent Service: orchestration, tool calling, state, MCP-compatible | Bedrock AgentCore + Bedrock Agents (Action Groups via Lambda) | Assistants API + Responses API; Agents SDK (open source)
Guardrails / safety | Azure AI Content Safety (Prompt Shields, Groundedness, Protected Material) | Bedrock Guardrails (denied topics, content filters, PII, prompt-attack, contextual grounding) | Moderation API; built-in policy enforcement; safety-trained models
Evaluation tooling | Azure AI Foundry Evaluators + AI Red Teaming Agent (PyRIT-powered) | Bedrock Model Evaluation (automatic + human); Bedrock Guardrail evaluation | Evals (open-source framework); built-in fine-tune eval; OpenAI dashboards
Observability | Foundry tracing (OpenTelemetry); App Insights integration | CloudWatch + Bedrock model invocation logs; Bedrock Studio traces | Dashboard, usage logs, run-step traces in Assistants/Responses
MCP support | Native: Foundry Agents speak MCP; Microsoft pushed MCP across Copilot Studio + AKS | Bedrock AgentCore supports MCP tool integration | MCP-compatible via Responses API + Agents SDK
Pricing model | Per-model token pricing + platform usage | Per-model token pricing; on-demand or provisioned throughput | Per-model token pricing; pricing tiers (standard / batch / fine-tune)
Best for | Microsoft-anchored enterprises; Office/Teams/Copilot extension | AWS-anchored enterprises; regulated workloads with strict data residency | Cutting-edge model access; OpenAI-specific features (o-series reasoning, GPT image, real-time API)
Watch-outs | Sprawl: many overlapping services as the platform evolves rapidly | Per-model regional availability; Knowledge-Base vector-store options vary | Single-vendor dependency; rate-limit ceilings at scale; less data-residency control

1. Azure AI Foundry

Microsoft's unified AI app + agent development platform. Combines what used to be Azure OpenAI Service, Azure AI Studio, and Azure AI Services into one portal and SDK surface.

What it is

Azure AI Foundry is positioned as the Microsoft platform for building, deploying, and operating production GenAI applications and agents. It bundles:

  • Model catalogue — 1,800+ models including OpenAI (exclusive Azure-hosted access to GPT-4o, GPT-4.1, o-series), Anthropic, Meta Llama, Mistral, Cohere, NVIDIA Nemotron, DeepSeek, plus open-source via "Models-as-a-Service"
  • Foundry Agent Service — managed agent runtime with tool calling, planning, state, and MCP-native tool integration
  • Foundry Local — run Foundry workloads on edge / on-prem
  • Prompt Flow — visual prompt/chain authoring with versioning
  • AI Red Teaming Agent — automated adversarial testing built on Microsoft's PyRIT (Public Preview as of 2025)
  • Content Safety — Prompt Shields (direct + indirect injection detection), Groundedness checks for RAG, Protected-Material detection (copyrighted content)
  • Tracing & evaluators — OpenTelemetry-based tracing; built-in evaluators for groundedness, relevance, coherence, fluency, similarity, F1, BLEU, ROUGE; custom evaluators
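To make the F1 evaluator in the list above concrete, here is a minimal sketch of token-overlap F1 between a model answer and a reference answer. It assumes the common question-answering definition (precision and recall over shared tokens) and naive whitespace tokenisation; the built-in Foundry evaluator may normalise text differently, and `token_f1` is a name made up for this example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    # Assumption: lowercase + whitespace split is a good-enough tokeniser here.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts min(pred, ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A shorter answer that is fully contained in the reference:
# precision 2/2, recall 2/3.
score = token_f1("the cat", "the cat sat")
```

A metric like this is cheap enough to run on every commit, which makes it a reasonable first gate before the slower model-graded evaluators (groundedness, relevance, coherence).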

Architecture flow

                 ┌────────────────────────────────────┐
                 │           Foundry Portal           │
                 │  (model catalogue, hub, projects)  │
                 └────────────────┬───────────────────┘
                                  │
        ┌─────────────────────────┼────────────────────────┐
        ▼                         ▼                        ▼
   Prompt Flow             Foundry Agents            Evaluators +
(chains/prompts)        (orchestration, MCP)        Red Team Agent
        │                         │                        │
        └─────────────┬───────────┴────────────────────────┘
                      ▼
     Azure OpenAI + Models-as-a-Service
                      │
                      ▼
Azure AI Search · Content Safety · App Insights

Testing implications (QE view)

Layer | What to Test
Model selection | Per-model behaviour can differ; eval suite should re-run on every model swap
Prompt Flow | Version control on flows; regression tests via Foundry Evaluators in CI
Foundry Agents | Trace-level tool-call assertions; MCP tool conformance; multi-step task completion
Content Safety | Configure Prompt Shields, then test their bypass: multilingual, encoded, indirect injection through retrieved content
Groundedness | Validate RAG outputs cite retrieved context; faithfulness threshold per release
Red Teaming Agent | Run before each significant release; treat findings as regression tests
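The trace-level tool-call assertions in the Foundry Agents row could look roughly like the sketch below. It assumes the agent trace has already been exported and reduced to an ordered list of tool calls; `ToolCall`, `assert_tool_sequence`, `assert_called_with`, and the tool names are all hypothetical and not part of any Foundry SDK.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation pulled from an agent trace (hypothetical shape)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def assert_tool_sequence(trace: list[ToolCall], expected: list[str]) -> None:
    """Fail unless the expected tool names appear in the trace in this order.

    Other calls may be interleaved; only the relative order is checked.
    """
    names = [call.name for call in trace]
    remaining = iter(names)
    for name in expected:
        # `in` on an iterator consumes it up to and including the first match.
        if name not in remaining:
            raise AssertionError(f"{name!r} missing or out of order in {names}")


def assert_called_with(trace: list[ToolCall], tool: str, required: dict[str, Any]) -> None:
    """Fail unless some call to `tool` carries every required argument value."""
    for call in trace:
        if call.name == tool and all(
            call.arguments.get(key) == value for key, value in required.items()
        ):
            return
    raise AssertionError(f"no call to {tool!r} with arguments {required}")


# Example trace for a support workflow; tool names are illustrative only.
trace = [
    ToolCall("search_docs", {"query": "refund policy"}),
    ToolCall("log_event"),
    ToolCall("create_ticket", {"priority": "high"}),
]
assert_tool_sequence(trace, ["search_docs", "create_ticket"])
assert_called_with(trace, "create_ticket", {"priority": "high"})
```

Checking relative order rather than an exact call list keeps the test stable when the agent adds harmless intermediate steps, while still catching a ticket created before any retrieval happened.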

Common project patterns

  • Copilot extensions — Word/Teams/Outlook plugins backed by Foundry Agents
  • Internal Q&A assistants — Foundry + Azure AI Search over SharePoint/Confluence
  • Customer-facing chat — usually fronted by content-safety policies + groundedness checks
  • Regulated workloads — Foundry's data-residency, encryption, and audit logging support the EU AI Act / GDPR posture

Where it shines / where it stings

  • Shines: Microsoft 365 / Teams / Copilot integration; first-party access to frontier OpenAI models with enterprise SLA; strong identity (Entra ID) + governance story
  • Stings: Surface area sprawls quickly; product names change frequently (Azure OpenAI → AI Studio → AI Foundry); region-by-region feature parity gaps

2. Amazon Bedrock

AWS's managed service for foundation-model access and agentic AI on AWS infrastructure. Bedrock is the abstraction; under it sit specific model providers and a stack of managed primitives (Guardrails, Agents, Knowledge Bases).

What it is

A unified API to call a curated catalogue of foundation models — without managing infrastructure — plus a set of higher-level services:

  • Foundation Models — Claude (Anthropic), Llama (Meta), Mistral, Jurassic (AI21), Command (Cohere), Stable Diffusion (Stability), Titan (Amazon), Nova (Amazon's 2025 multimodal family)
  • Bedrock Guardrails — configurable policy layer: denied topics, content filters (hate/insults/sexual/violence/misconduct), word filters, PII filters, prompt-attack filter, contextual grounding for RAG
  • Bedrock Knowledge Bases — managed RAG over S3 / Confluence / Salesforce / SharePoint / web crawlers; vector store options (OpenSearch Serverless, Aurora PostgreSQL, Pinecone, Redis Enterprise)
  • Bedrock Agents — managed agentic orchestration with Action Groups (Lambda-backed tools), Knowledge-Base attachment, session state
  • Bedrock AgentCore (2025) — modular runtime for agent orchestration, memory, code interpreter, browser tool, observability, identity, gateway — MCP-compatible
  • Bedrock Model Evaluation — automatic and human-in-the-loop eval jobs for accuracy, robustness, toxicity
  • Bedrock Studio / Flows — visual builder for prompt chains and workflows
  • Provisioned Throughput — reserved capacity for latency / cost predictability

Architecture flow

                        ┌──────────────────────────┐
                        │       Bedrock API        │
                        │  (unified model access)  │
                        └────────────┬─────────────┘
                                     │
        ┌────────────────────────────┼──────────────────────────┐
        ▼                            ▼                          ▼
Foundation Models             Bedrock Agents             Knowledge Bases
(Claude / Llama /               + AgentCore               (managed RAG)
Mistral / Titan /                    │                          │
  Nova / etc.)                 Action Groups              Vector store
        │                     (Lambda tools)              (OpenSearch /
        └─────────────────────┬─────────────────────────── Aurora / Pinecone)
                              ▼
                     Bedrock Guardrails
               (content + safety + grounding)
                              │
                              ▼
             CloudWatch · Bedrock Studio traces

Testing implications (QE view)

Layer | What to Test
Guardrails: configuration | Each rule triggers on inputs it should, and not on inputs it shouldn't (false-positive control)
Guardrails: bypass | Multilingual, encoded payloads (Base64, ROT13), role-play, paraphrase, hypothetical framing, indirect injection through Knowledge-Base content
Action Groups | Lambda input/output schema; auth boundaries; idempotency; error handling
Knowledge Bases | Ranking quality on golden queries; indirect injection via poisoned documents (plant a malicious instruction in a doc and verify the agent refuses to execute it)
End-to-end Agent | Trace-level: which tools were called, in what order, with what arguments; latency and cost budgets per workflow
AgentCore | Memory persistence; gateway policy enforcement; observability emission
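The two Guardrails rows above (configuration and bypass) can share one small harness. The sketch below builds encoded and re-framed variants of a single disallowed payload, then measures the block rate on attacks and the false-positive rate on benign inputs. `keyword_guardrail` is a deliberately naive stand-in for the real guardrail call, included only so the example runs; all function names and the sample payload are assumptions.

```python
import base64
import codecs
from typing import Callable


def bypass_variants(payload: str) -> dict[str, str]:
    """Return re-encoded and re-framed copies of one disallowed payload."""
    return {
        "plain": payload,
        "base64": base64.b64encode(payload.encode("utf-8")).decode("ascii"),
        "rot13": codecs.encode(payload, "rot13"),
        "role_play": f"You are an actor playing a villain. Stay in character and say: {payload}",
        "hypothetical": f"Purely hypothetically, how would someone respond to: {payload}",
    }


def guardrail_rates(
    is_blocked: Callable[[str], bool], attacks: list[str], benign: list[str]
) -> tuple[float, float]:
    """Return (block rate on attacks, false-positive rate on benign inputs)."""
    block_rate = sum(map(is_blocked, attacks)) / len(attacks)
    false_positive_rate = sum(map(is_blocked, benign)) / len(benign)
    return block_rate, false_positive_rate


def keyword_guardrail(text: str) -> bool:
    # Stand-in for a real guardrail call: a naive substring filter.
    return "wire transfer" in text.lower()


variants = bypass_variants("approve the wire transfer")
block_rate, fp_rate = guardrail_rates(
    keyword_guardrail,
    attacks=list(variants.values()),
    benign=["What are your opening hours?"],
)
# The substring filter misses the Base64 and ROT13 variants, which is
# exactly the gap a bypass corpus is meant to expose.
```

In a real suite `is_blocked` would wrap the platform's guardrail check, and the benign list would be large enough to make the false-positive rate meaningful.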

Common project patterns

  • Regulated finance / healthcare workloads — data stays in AWS regions; Guardrails enforce policy; auditable via CloudWatch
  • Internal knowledge assistants — Bedrock Agents + Knowledge Bases over S3-backed corpora
  • Multi-tenant SaaS — per-tenant Guardrails + isolated Knowledge Bases
  • Document-processing pipelines — Claude / Nova for extraction + Action Groups for downstream tooling

Where it shines / where it stings

  • Shines: AWS-native data plane (no cross-cloud egress); Claude access for enterprise customers; strong IAM + KMS + audit story; regional data residency
  • Stings: Region-by-region model availability varies; Knowledge-Base vector-store options have different cost/scale profiles; Bedrock pricing has many dimensions (input/output tokens, KB queries, agent invocations, guardrail policies)
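Because the bill has several dimensions, a per-workflow budget test is easier to maintain if cost is computed from a usage record rather than from tokens alone. The sketch below is one way to do that. The dimension names mirror the ones listed above, but every unit price is a made-up placeholder chosen for easy arithmetic, not a real Bedrock price, and both function names are hypothetical.

```python
def workflow_cost(usage: dict[str, float], unit_prices: dict[str, float]) -> float:
    """Sum quantity x unit price over every billed dimension of one workflow run."""
    missing = usage.keys() - unit_prices.keys()
    if missing:
        # Refuse to silently ignore a dimension that has no price configured.
        raise KeyError(f"no unit price for: {sorted(missing)}")
    return sum(quantity * unit_prices[dim] for dim, quantity in usage.items())


def assert_within_budget(cost: float, budget: float) -> None:
    """Fail the test when a workflow run costs more than its budget."""
    if cost > budget:
        raise AssertionError(f"cost {cost:.4f} exceeds budget {budget:.4f}")


# Placeholder prices (currency units per item); replace with the current price sheet.
PLACEHOLDER_PRICES = {
    "input_tokens": 0.002,
    "output_tokens": 0.01,
    "kb_queries": 0.5,
}

# Usage as it might be collected from invocation logs for one workflow run.
run_usage = {"input_tokens": 1000, "output_tokens": 200, "kb_queries": 2}
run_cost = workflow_cost(run_usage, PLACEHOLDER_PRICES)
assert_within_budget(run_cost, budget=6.0)
```

Raising on an unpriced dimension is a deliberate choice: when a new billed dimension shows up in usage data, the test fails loudly instead of under-reporting cost.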

Bedrock-specific testing references

  • See Randstad — Job Analysis (Randstad-AI-RedTeam-Lead/02-job-analysis.md) for deeper Bedrock-as-system-under-test framing — that role is built around this stack.

3. OpenAI Platform (Enterprise Scale)

Direct access to OpenAI's frontier models via OpenAI's own API and platform, with no cloud-provider abstraction layer. Popular for projects that need the latest model features as soon as they ship, or that prefer a single-vendor relationship.

What it is

OpenAI's hosted platform combining model access, an agent stack, and operational tooling:

  • Models — GPT-5 / GPT-4.1 / GPT-4o family for general LLM use; o-series (o3, o4-mini) for reasoning; embeddings (text-embedding-3); image gen (GPT image / DALL·E 3); audio (Whisper, GPT-4o-audio, TTS)
  • Responses API (2025): successor to Chat Completions + Assistants; a single API surface supporting hosted tools, structured outputs, and multi-turn state
  • Assistants API — managed agent surface with hosted tools (File Search, Code Interpreter)
  • Agents SDK — Python framework for building multi-agent systems; pairs with the API
  • Evals — open-source evaluation framework; private + shared eval boards
  • Fine-tuning: supervised + preference (DPO) fine-tuning for selected models
  • Batch API — 50%-cost asynchronous batch processing for non-interactive workloads
  • OpenAI Realtime API — bidirectional speech-to-speech with low latency
  • Enterprise features — SAML SSO, audit logs, data residency (EU + US), private endpoints, custom DPA
  • Moderation API — free content classification across harm categories

Architecture flow

                ┌──────────────────────────────────┐
                │        OpenAI API surface        │
                │ (Responses · Assistants · Chat)  │
                └─────────────────┬────────────────┘
                                  │
         ┌────────────────────────┼─────────────────────────┐
         ▼                        ▼                         ▼
      Models                Hosted Tools                 Evals +
GPT-5 / o-series /          File Search /            Moderation API
  embed / image /        Code Interpreter /
       audio            Web Search (preview)
                                  │
                                  ▼
                   Agents SDK (Python) · Batch API
                                  │
                                  ▼
                    Enterprise: SSO · Logs · DPA

Testing implications (QE view)

Layer | What to Test
Model selection | Behaviour differs sharply between model families (GPT-4o vs o-series reasoning); rerun eval suites on every model swap
Responses / Assistants API | Multi-turn state persistence; hosted-tool invocation correctness; structured-output schema conformance
File Search (built-in RAG) | Retrieval quality on golden queries; ranking; citation correctness
Code Interpreter | Sandbox isolation; resource limits; output sanitisation
Moderation API | Coverage across harm categories; false-positive control on benign inputs
Rate limits | Limits are tier-based; clients must handle 429s with exponential backoff, and tests should exercise that path; production resilience tests
Cost surfaces | Token costs can differ by roughly an order of magnitude between mini and larger model tiers; budget assertions per workflow
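For the rate-limit row, here is a minimal sketch of retrying on HTTP 429 with capped exponential backoff and full jitter. `RateLimited` is a stand-in for whatever exception the client library raises on a 429, and `call_with_backoff` is a hypothetical helper. The `sleep` parameter is injectable so a test can record delays instead of actually waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Stand-in for whatever error the client raises on an HTTP 429."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimited, sleeping with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error to the caller
            # Full jitter: random wait up to the capped exponential ceiling.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


# Test double: fails twice with a simulated 429, then succeeds.
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return "ok"


recorded_delays: list[float] = []
result = call_with_backoff(flaky_request, sleep=recorded_delays.append)
```

A resilience test can then assert both outcomes: the call eventually succeeds within the retry budget, and a permanently rate-limited call re-raises after `max_attempts` rather than looping forever.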

Common project patterns

  • Cutting-edge prototypes — first-mover access to new models (o-series reasoning, GPT image, Realtime API)
  • Single-vendor SaaS startups — fastest path to production without cloud-platform overhead
  • Specialised use cases — Whisper for transcription, Realtime API for voice agents, image gen
  • Enterprises blending OpenAI direct + Azure OpenAI — Azure for production governance, OpenAI direct for evaluation of new features pre-Azure parity

Where it shines / where it stings

  • Shines: Day-zero access to OpenAI frontier models; cleanest API ergonomics; fastest iteration on new features; strong dev tooling (playground, dashboard, eval framework)
  • Stings: Single-vendor dependency (no model diversity at the platform layer); per-tier rate-limit ceilings can bite at scale; less data-residency control than Azure/Bedrock; pricing changes faster than enterprise procurement cycles

Picking Between Them (Project Sizing Heuristics)

Pick Azure AI Foundry when…

  • The org is Microsoft-anchored (Entra ID, Office 365, Teams, Copilot)
  • You need first-party OpenAI access with enterprise governance
  • You want one platform spanning models + agents + safety + eval
  • Regulated EU workloads benefiting from Microsoft's data-residency commitments

Pick Amazon Bedrock when…

  • The org is AWS-anchored (data already in S3, identity in IAM)
  • You want Claude at enterprise scale
  • Data residency or air-gap requirements are tight
  • The agentic workload benefits from Bedrock Knowledge Bases + Lambda Action Groups

Pick OpenAI Platform when…

  • You need access to the latest frontier model the day it ships
  • Single-vendor dependency is acceptable
  • You're prototyping fast or building a SaaS where time-to-market dominates
  • Specialised features matter: o-series reasoning, Realtime API for voice, GPT image, advanced fine-tuning

Pick multiple when…

  • Critical workloads require model diversity for resilience or evaluation
  • Different teams have different cloud allegiances and need parallel tracks
  • You want to A/B test the same task across model families
  • Regulatory / commercial reasons mandate provider-redundancy (financial services often want at least two)

Cross-Platform Testing Strategy

For QE work that spans more than one platform:

  1. Provider-neutral eval harness — write tests against a thin abstraction so the same prompt/scenario runs on any of the three; libraries like LiteLLM normalise the API shape
  2. Versioned model identifiers in every result — never log "Claude" or "GPT" without the exact model + version
  3. Per-provider safety configuration — Guardrails (Bedrock) vs Content Safety (Azure) vs Moderation (OpenAI) have different shapes; test each native config plus a unified policy layer above
  4. Latency / cost normalisation — report token-equivalent cost and p50/p95 latency on the same chart across providers
  5. Drift across releases — model updates ship continuously; rerun the full eval suite on a schedule, not just on code changes
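Points 1, 2, and 4 above can be sketched together: a thin provider interface, result records that always carry the exact model identifier, and p50/p95 latency computed the same way for every provider. Everything named here (`ChatProvider`, `EvalResult`, `run_suite`, `EchoProvider`, and the model id string) is hypothetical. A real adapter would wrap a provider SDK or a library such as LiteLLM behind `complete`.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable, Protocol


class ChatProvider(Protocol):
    model_id: str  # exact model + version string, never just "Claude" or "GPT"

    def complete(self, prompt: str) -> str: ...


@dataclass
class EvalResult:
    model_id: str
    prompt: str
    output: str
    passed: bool
    latency_s: float


def run_suite(
    provider: ChatProvider, cases: list[tuple[str, Callable[[str], bool]]]
) -> list[EvalResult]:
    """Run every (prompt, check) case against one provider and time each call."""
    results = []
    for prompt, check in cases:
        start = time.perf_counter()
        output = provider.complete(prompt)
        latency = time.perf_counter() - start
        results.append(EvalResult(provider.model_id, prompt, output, check(output), latency))
    return results


def latency_percentiles(results: list[EvalResult]) -> tuple[float, float]:
    """Return (p50, p95) latency in seconds; needs at least two results."""
    cuts = statistics.quantiles([r.latency_s for r in results], n=100, method="inclusive")
    return cuts[49], cuts[94]


class EchoProvider:
    """Offline stub so the harness can be exercised without network access."""
    model_id = "echo-model-2025-01-01"  # placeholder identifier

    def complete(self, prompt: str) -> str:
        return prompt.upper()


cases = [
    ("hello", lambda out: out == "HELLO"),
    ("bye", lambda out: "x" in out),
]
suite_results = run_suite(EchoProvider(), cases)
p50, p95 = latency_percentiles(suite_results)
```

Because the test cases only see `complete` and `model_id`, the same list of cases can run unchanged against each platform, and every stored result can be traced back to the model version that produced it.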

Cross-References

  • MCP fundamentals: mcp-servers-faq.md
  • MCP testing process: mcp-testing-roadmap.md
  • RAG / Agents / Agentic RAG context: rag-vs-agents-vs-agentic-rag.md
  • RAG eval metric layer: ragas-faq.md
  • Pytest-style AI test framework: deepeval-faq.md
  • Red-team toolchain by platform: commercial-llm-mcp-testing-tools.md
  • Red-team theory: red-blue-purple-team-ai-faq.md

Interview Sound-Bites

  • "The platform choice usually reflects the cloud strategy, not the AI strategy — Azure-shop, AWS-shop, or vendor-direct. The interesting QE question is the bit that's identical across all three: provider-neutral eval coverage, versioned model identifiers, and trace-level assertions on agent behaviour."
  • "Each platform has its own guardrails primitive — Bedrock Guardrails, Azure Content Safety, OpenAI Moderation. They're not interchangeable; they all need their own configuration coverage and their own bypass-testing corpus."
  • "For multi-vendor projects I'd standardise on a LiteLLM-style abstraction at the runtime layer and a unified evaluation harness on top. The platform-specific testing — Knowledge Base injection, Foundry Red Team agent runs, OpenAI rate-limit resilience — lives in dedicated suites."
  • "OpenAI direct is the fastest way to evaluate a new frontier model; Azure or Bedrock is the right place to operate it. Most mature enterprises end up with both — evaluation track and production track."