designing-real-world-ai-agents-workshop
MCP Server · Free
Hands-on workshop: Build a multi-agent AI system from scratch — Deep Research Agent + Writing Workflow served as MCP servers. Includes code, slides, and video.
Capabilities (12 decomposed)
Gemini-grounded iterative research with Google Search integration
Medium confidence: Executes multi-turn research workflows using the Google Gemini API with built-in Google Search grounding to retrieve factual, up-to-date information. The Deep Research Agent (src/research/server.py) implements a tool-use pattern in which Gemini invokes search tools iteratively, refines queries based on intermediate results, and persists findings to a structured research.md file. Supports YouTube transcript extraction when URLs are provided, enabling multi-modal source integration.
Uses Gemini's native Google Search grounding (not external RAG) combined with tool-use agents for iterative query refinement, reducing hallucination risk while maintaining real-time information access. YouTube transcript extraction is built-in, enabling multi-modal research without separate API calls.
Faster and more accurate than RAG-based research systems because it queries live search results directly rather than relying on static embeddings, and cheaper than multi-step LLM chains because grounding is native to Gemini's API.
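A minimal sketch of a single grounded Gemini call using the google-genai SDK; the model name, prompt, and citation handling are illustrative assumptions, not code from src/research/server.py:

```python
# Sketch only: one grounded search call; the workshop's agent refines
# queries across multiple such calls and persists results to research.md.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",  # assumed model; configure per your setup
    contents="Summarize recent developments in the MCP specification.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

print(response.text)
# Grounding metadata carries the web sources behind the answer.
meta = response.candidates[0].grounding_metadata
for chunk in meta.grounding_chunks or []:
    print(chunk.web.title, chunk.web.uri)
```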
MCP-based multi-agent orchestration with decoupled server architecture
Medium confidence: Implements a two-server MCP architecture (Deep Research Agent + LinkedIn Writer Agent) using the FastMCP framework, where each server exposes tools, resources, and prompts independently and communicates through the standardized MCP protocol. The architecture decouples research and writing concerns, allowing each agent to be developed, tested, and scaled independently while maintaining a unified interface. Configuration is managed via .mcp.json and environment variables, enabling runtime server discovery and tool registration.
Uses FastMCP framework to expose agents as standardized MCP servers rather than monolithic functions, enabling true decoupling where each agent (research, writing) has its own process, configuration, and tool registry. This pattern allows IDE integration (Claude Code, Cursor) without custom client code.
More modular and testable than LangChain agent chains because each agent is independently deployable and has explicit tool/resource contracts, and more flexible than REST-based agent APIs because MCP provides native IDE integration without custom UI.
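A minimal sketch of one of the two servers using FastMCP; the tool and resource names here are illustrative, not the repo's actual contracts:

```python
# Sketch of a decoupled MCP server in the style of the Deep Research Agent.
from fastmcp import FastMCP

mcp = FastMCP("deep-research")

@mcp.tool()
def research_topic(topic: str, max_iterations: int = 3) -> str:
    """Run iterative grounded research and return a markdown summary."""
    ...  # search loop and persistence would live here

@mcp.resource("research://findings")
def findings() -> str:
    """Expose the persisted research.md as an MCP resource."""
    return open("research.md", encoding="utf-8").read()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; registered via .mcp.json
```

The writing agent would run as a second, independent process with its own tool registry, which is what allows each server to be tested and scaled on its own.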
configuration management with environment variables and Pydantic Settings
Medium confidence: Centralizes configuration using Pydantic Settings models (src/research/config/, src/writing/config/) that load from environment variables and .env files, enabling environment-specific configuration without code changes. Configuration includes API keys, model parameters, evaluation thresholds, and server endpoints. Pydantic validation ensures type safety and provides helpful error messages for missing or invalid configuration.
Uses Pydantic Settings for type-safe, validated configuration with automatic environment variable loading. Configuration is centralized in dedicated config modules (src/research/config/, src/writing/config/), making it easy to add new configuration options without modifying agent code.
More robust than manual environment variable parsing because Pydantic validates types and provides helpful error messages, and more maintainable than hardcoded configuration because all settings are in one place.
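A sketch of what such a settings model can look like; the field names, env prefix, and defaults are assumptions rather than the repo's actual schema:

```python
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class ResearchSettings(BaseSettings):
    """Loads from the environment and .env, in the style of src/research/config/."""
    model_config = SettingsConfigDict(env_file=".env", env_prefix="RESEARCH_")

    gemini_api_key: str = Field(description="Google Gemini API key")  # required
    model_name: str = "gemini-2.5-flash"
    max_search_iterations: int = 3
    evaluation_threshold: float = 0.8

# Raises a descriptive ValidationError if gemini_api_key is unset or mistyped.
settings = ResearchSettings()
```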
structured research persistence and markdown-based knowledge representation
Medium confidence: Persists research findings to a structured markdown file (research.md) that serves as the knowledge base for the writing agent. The markdown format enables human readability while maintaining machine-parseable structure (headings, lists, citations). Research findings include source citations, timestamps, and iterative search history, creating an auditable record of how conclusions were reached. The writing agent reads this markdown to generate content, ensuring factual grounding.
Uses markdown as the primary knowledge representation format, enabling both machine parsing (for writing agent) and human inspection (for manual review). Includes source citations and search history, creating an auditable record of research methodology.
More transparent than vector databases because research is human-readable and manually editable, and more flexible than structured databases because markdown can accommodate unstructured notes and citations.
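A sketch of appending one finding to research.md; the section layout is an assumption, not the repo's actual format:

```python
from datetime import datetime, timezone
from pathlib import Path

def append_finding(path: Path, query: str, summary: str, sources: list[str]) -> None:
    """Append a timestamped, cited finding as a new markdown section."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    lines = [f"## {query}", f"_Searched: {stamp}_", "", summary, "", "### Sources"]
    lines += [f"- {url}" for url in sources]
    with path.open("a", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n\n")

append_finding(
    Path("research.md"),
    query="MCP adoption in IDEs",
    summary="Claude Code and Cursor both support MCP servers natively.",
    sources=["https://modelcontextprotocol.io"],
)
```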
evaluator-optimizer loop for iterative content refinement
Medium confidence: Implements a multi-iteration content generation and evaluation pattern in the LinkedIn Writer Agent (src/writing/server.py) where an LLM generates initial content, an evaluator (LLM-as-judge) scores it against quality criteria, and an optimizer refines it based on feedback. The loop continues until quality thresholds are met or max iterations are reached. Uses Opik for tracing and LLM-based evaluation metrics, enabling observable, measurable content quality improvement without human-in-the-loop.
Combines LLM-as-judge evaluation with iterative optimization in a closed loop, using Opik for full observability of each refinement cycle. Unlike simple prompt engineering, this pattern measures quality objectively and refines based on measurable feedback, not heuristics.
More reliable than single-pass LLM generation because it validates and refines output against explicit criteria, and more transparent than black-box content APIs because every iteration is traced and evaluated metrics are visible.
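Stripped of tracing and prompt details, the control flow reduces to a short loop. In this sketch, generate, evaluate, and refine are hypothetical stand-ins for the LLM calls in src/writing/server.py, and the threshold is an assumed value:

```python
QUALITY_THRESHOLD = 0.8  # assumed; the repo reads thresholds from configuration
MAX_ITERATIONS = 3

def write_post(research: str, profile: dict) -> str:
    draft = generate(research, profile)             # hypothetical: initial LLM draft
    for _ in range(MAX_ITERATIONS):
        score, feedback = evaluate(draft, profile)  # hypothetical: LLM-as-judge scoring
        if score >= QUALITY_THRESHOLD:
            break
        draft = refine(draft, feedback, profile)    # hypothetical: optimizer pass
    return draft
```

The cap on iterations bounds cost, while the explicit threshold makes "good enough" a measurable property rather than a vibe.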
AI image generation with Gemini Imagen integration
Medium confidence: Integrates Google Gemini's Imagen model for AI-generated images within the writing workflow, enabling automatic image creation to accompany generated LinkedIn posts. The image generation is triggered based on post content and writing profiles, with generated images persisted to the dataset directory. Supports prompt engineering for image generation based on post themes and audience preferences.
Integrates Imagen directly into the writing workflow as a native step, not a separate tool — image generation is triggered automatically based on post content and writing profiles, enabling end-to-end content creation without manual image selection.
More integrated than using external image APIs (DALL-E, Midjourney) because it's part of the same Gemini API ecosystem and can reference post content directly, and faster than manual image selection because generation is automated and parallelizable.
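A sketch of an Imagen call through the same google-genai SDK; the model id, prompt, and output path are assumptions:

```python
from google import genai
from google.genai import types

client = genai.Client()

result = client.models.generate_images(
    model="imagen-3.0-generate-002",  # assumed model id
    prompt="Flat illustration of two AI agents exchanging research notes",
    config=types.GenerateImagesConfig(number_of_images=1),
)

# Persist alongside the generated post, mirroring the dataset directory layout.
result.generated_images[0].image.save("datasets/post_image.png")
```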
dataset-driven evaluation with LLM-as-judge metrics
Medium confidence: Implements a structured dataset system (datasets/ directory) with batch evaluation scripts that process multiple content samples through the writing workflow and score them using LLM-as-judge metrics via Opik. The evaluation system measures quality across dimensions (clarity, engagement, relevance) and aggregates results for statistical analysis. Supports dataset versioning and comparison across model versions or writing profiles.
Combines structured dataset management with Opik-based LLM-as-judge evaluation, enabling systematic quality measurement across multiple samples with full traceability. Unlike ad-hoc evaluation, this pattern produces reproducible, comparable metrics across writing profiles and model versions.
More rigorous than manual spot-checking because it evaluates entire datasets systematically, and more transparent than black-box quality scores because each evaluation is traced in Opik with full iteration history visible.
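A sketch of a batch run through Opik's evaluate API; the dataset name, task wrapper, and metric choice are assumptions, and write_post is the hypothetical pipeline entry point sketched earlier:

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

client = opik.Opik()
dataset = client.get_dataset(name="linkedin-posts-v1")  # assumed dataset name

def evaluation_task(item: dict) -> dict:
    # Run one sample through the writing workflow; keys map to metric inputs.
    post = write_post(item["research"], item["profile"])
    return {"input": item["research"], "output": post}

evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],  # illustrative pick among Opik's judge metrics
)
```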
MCP tool and resource definition with schema-based routing
Medium confidence: Defines MCP tools and resources using FastMCP decorators (@mcp.tool, @mcp.resource) with JSON schema validation, enabling type-safe tool invocation and automatic schema generation. The research and writing servers expose distinct tool sets (search, research persistence, content generation, evaluation) with Pydantic-based input/output validation. MCP routers (src/research/routers/, src/writing/routers/) map tool invocations to application logic, decoupling tool definitions from implementation.
Uses FastMCP decorators with Pydantic models to automatically generate MCP tool schemas, eliminating manual JSON schema writing. Router pattern (src/research/routers/, src/writing/routers/) decouples tool definitions from implementation, enabling easy tool addition without modifying server core.
More maintainable than hand-written JSON schemas because Pydantic models are single source of truth, and more discoverable than REST APIs because MCP clients can introspect tool schemas at runtime without documentation.
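A sketch of a schema-backed tool with a router hand-off; the model fields and the handler name are illustrative:

```python
from fastmcp import FastMCP
from pydantic import BaseModel, Field

mcp = FastMCP("linkedin-writer")

class GeneratePostInput(BaseModel):
    topic: str = Field(description="Topic distilled from research.md")
    profile: str = Field(default="technical", description="Writing profile name")

@mcp.tool()
def generate_post(params: GeneratePostInput) -> str:
    """FastMCP derives the JSON schema from the Pydantic model, so MCP
    clients can introspect and validate calls at runtime."""
    # Hypothetical application-layer handler in the style of src/writing/routers/.
    return route_generate_post(params)
```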
prompt template system with writing profiles and context injection
Medium confidence: Implements a prompt template system (src/writing/profiles/) where writing profiles define tone, style, audience, and quality criteria as structured data, and prompt templates inject these profiles into system/user messages. The system uses Jinja2-style templating (or similar) to dynamically construct prompts based on profile attributes and research content. Profiles are versioned and can be A/B tested to measure impact on content quality.
Separates writing profiles (data) from prompt templates (logic), enabling non-technical users to create new writing styles by editing profile files without touching prompt code. Profiles are versioned and A/B testable, making it easy to measure impact of style changes on content quality.
More flexible than hard-coded prompts because profiles can be changed without code deployment, and more systematic than ad-hoc prompt engineering because profiles are versioned and evaluated quantitatively.
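A sketch of profile-to-prompt injection with Jinja2; the profile fields and template text are assumptions:

```python
from jinja2 import Template

# A writing profile is plain data, editable without touching prompt code.
profile = {
    "tone": "direct and practical",
    "audience": "ML engineers",
    "quality_criteria": ["clear hook", "one concrete example", "no hype"],
}

template = Template(
    "Write a LinkedIn post for {{ audience }} in a {{ tone }} tone.\n"
    "Quality criteria:\n"
    "{% for c in quality_criteria %}- {{ c }}\n{% endfor %}"
    "Source notes:\n{{ research }}"
)

prompt = template.render(
    **profile,
    research=open("research.md", encoding="utf-8").read(),
)
```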
end-to-end workflow orchestration from research to published content
Medium confidence: Orchestrates a complete workflow (src/research/server.py → src/writing/server.py) where research findings are automatically fed into the writing agent, which generates, evaluates, and refines content, then generates accompanying images. The workflow is exposed as a high-level skill (.claude/skills/research-and-write/SKILL.md) that can be invoked from Claude Code or Cursor with a single topic input. Workflow state is persisted to the filesystem (research.md, generated posts, images), enabling resumption and inspection at any stage.
Exposes the entire research-to-content pipeline as a single Claude Code skill, enabling non-technical users to run complex multi-agent workflows without understanding MCP or agent architecture. Filesystem-based state persistence allows inspection and manual intervention at any stage.
More complete than individual agent tools because it handles the full pipeline (research + writing + evaluation + images), and more accessible than custom orchestration code because it's exposed as a Claude Code skill with natural language invocation.
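Stripped of MCP plumbing, the pipeline is a sequence of checkpointed steps. In this sketch, research_topic, write_post, and generate_image are hypothetical wrappers around the corresponding agent tools, and the output layout is assumed:

```python
from pathlib import Path

def run_pipeline(topic: str, out_dir: Path = Path("output")) -> str:
    out_dir.mkdir(exist_ok=True)

    research = research_topic(topic)                  # Deep Research Agent
    (out_dir / "research.md").write_text(research)    # checkpoint: inspect or edit

    profile = {"name": "technical"}                   # assumed profile shape
    post = write_post(research, profile)              # evaluator-optimizer loop
    (out_dir / "post.md").write_text(post)            # checkpoint: review draft

    generate_image(post, out_dir / "post_image.png")  # Imagen step
    return post
```

Because every stage writes to disk, a failed or unsatisfying run can be resumed or manually corrected at the last good checkpoint.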
observability and tracing with Opik integration
Medium confidence: Integrates Opik for end-to-end tracing of agent workflows, capturing every LLM call, tool invocation, and evaluation metric. Opik traces are automatically generated for research iterations, content generation cycles, and evaluation steps, with links persisted in output metadata. The system enables post-hoc analysis of agent behavior, debugging of failed workflows, and measurement of cost/latency across workflow stages.
Provides native Opik integration throughout the codebase, automatically capturing traces for research iterations, content generation, and evaluation without manual instrumentation. Opik traces include LLM-as-judge evaluation metrics, enabling measurement of content quality alongside cost and latency.
More comprehensive than print-based debugging because it captures full trace context (model, parameters, latency, cost), and more actionable than generic LLM monitoring because it includes domain-specific metrics (evaluation scores, iteration counts).
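Opik's track decorator is the core primitive; this sketch shows the pattern on a hypothetical function:

```python
from opik import track

@track
def refine_draft(draft: str, feedback: str) -> str:
    """Each call becomes a span in the workflow trace; nested LLM calls
    made inside are recorded as child spans with model, parameters,
    latency, and token cost attached."""
    ...  # hypothetical refinement logic
```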
workflow test scripts and batch processing automation
Medium confidence: Provides Python scripts (scripts/test_research_workflow.py, batch dataset processing scripts) that automate end-to-end testing and evaluation of the multi-agent system. Scripts handle dataset loading, workflow invocation, result collection, and metric aggregation. Uses GNU Make (Makefile) for task orchestration, enabling developers to run complex workflows with simple commands (e.g., `make test-research`, `make evaluate-dataset`).
Combines Python scripts with Makefile-based task orchestration, enabling both programmatic control (for CI/CD) and simple command-line invocation (for developers). Scripts handle full workflow automation including dataset loading, result collection, and metric aggregation.
More accessible than custom Python orchestration because Make commands are simple and discoverable, and more flexible than hardcoded test suites because scripts are parameterized for different datasets and profiles.
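A sketch of a parameterized batch runner in the style of scripts/test_research_workflow.py; the flags, dataset shape, and output path are assumptions, and run_pipeline is the hypothetical entry point sketched above:

```python
import argparse
import json
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the workflow over a dataset")
    parser.add_argument("--dataset", type=Path, default=Path("datasets/topics.json"))
    parser.add_argument("--out", type=Path, default=Path("results.json"))
    args = parser.parse_args()

    results = []
    for item in json.loads(args.dataset.read_text()):
        post = run_pipeline(item["topic"])  # hypothetical pipeline entry point
        results.append({"topic": item["topic"], "post": post})
    args.out.write_text(json.dumps(results, indent=2))

if __name__ == "__main__":
    main()
```

A Makefile target can then wrap this so a single command such as `make evaluate-dataset` runs the whole batch.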
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with designing-real-world-ai-agents-workshop, ranked by overlap. Discovered automatically through the match graph.
Google Gemini API
Google's multimodal API — Gemini 2.5 Pro/Flash, 1M context, video understanding, grounding.
gemini-mcp-tool
MCP server that enables AI assistants to interact with Google Gemini CLI, leveraging Gemini's massive token window for large file analysis and codebase understanding
DeepView MCP
Enables IDEs like Cursor and Windsurf to analyze large codebases using Gemini's 1M context window.
Gemini 2.5 Pro
Google's most capable model with 1M context and native thinking.
ai.google.dev
https://gemini.google.com/ · Free/Paid
gemini-flow
rUv's Claude-Flow ported to the new Gemini CLI, transforming it into an autonomous AI development team.
Best For
- ✓ Teams building content pipelines requiring factual accuracy (journalism, technical writing, marketing)
- ✓ Developers implementing agentic systems that need grounded search without external RAG infrastructure
- ✓ Organizations automating research-to-content workflows at scale
- ✓ Teams building complex agentic workflows with multiple specialized agents
- ✓ Developers migrating from monolithic LLM applications to modular, composable architectures
- ✓ Organizations standardizing on MCP for AI tool integration across multiple products
- ✓ Teams deploying agents across multiple environments (dev, staging, prod)
- ✓ Developers managing sensitive configuration (API keys) securely
Known Limitations
- ⚠ Requires a Google Gemini API key with Google Search grounding enabled — not compatible with the free tier
- ⚠ YouTube transcript extraction is limited to publicly available transcripts; no support for age-restricted or private videos
- ⚠ Research depth is ultimately bounded by the Gemini context window; very large research topics may require chunking across calls
- ⚠ No built-in deduplication of search results across iterations — may retrieve redundant information
- ⚠ MCP protocol overhead adds ~50-100ms per tool invocation due to serialization and IPC
- ⚠ Debugging multi-server workflows requires tracing across process boundaries — standard debuggers are insufficient
Repository Details
Last commit: Apr 21, 2026