<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Experimental Forem</title>
    <description>The most recent home feed on Experimental Forem.</description>
    <link>https://experimental.forem.com</link>
    <atom:link rel="self" type="application/rss+xml" href="https://experimental.forem.com/feed"/>
    <language>en</language>
    <item>
      <title>Add PoW-skip + Lightning payments to any MCP server in 10 lines</title>
      <dc:creator>Zeke</dc:creator>
      <pubDate>Sun, 17 May 2026 18:25:40 +0000</pubDate>
      <link>https://experimental.forem.com/zekebuilds/add-pow-skip-lightning-payments-to-any-mcp-server-in-10-lines-1nac</link>
      <guid>https://experimental.forem.com/zekebuilds/add-pow-skip-lightning-payments-to-any-mcp-server-in-10-lines-1nac</guid>
      <description>&lt;p&gt;You built an MCP server. Now agents are hammering your premium tools for free and you've got no lever to pull.&lt;/p&gt;

&lt;p&gt;The boring fix is "add auth" — OAuth tokens, API keys, a whole user management system. But that's overkill for a tool that should just cost 21 sats per call.&lt;/p&gt;

&lt;p&gt;Here's the short fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you need
&lt;/h2&gt;

&lt;p&gt;Three packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @powforge/captcha-paymcp-provider @powforge/paymcp-l402-provider paymcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.npmjs.com/package/paymcp" rel="noopener noreferrer"&gt;paymcp&lt;/a&gt;&lt;/strong&gt; — decorator framework that wraps MCP tools with payment gates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.npmjs.com/package/@powforge/captcha-paymcp-provider" rel="noopener noreferrer"&gt;@powforge/captcha-paymcp-provider&lt;/a&gt;&lt;/strong&gt; — PoW-skip tier: agent solves SHA-256, no invoice needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.npmjs.com/package/@powforge/paymcp-l402-provider" rel="noopener noreferrer"&gt;@powforge/paymcp-l402-provider&lt;/a&gt;&lt;/strong&gt; — Lightning tier: agent pays a BOLT11 invoice via LNBits&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The 10-line integration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PayMCP&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;paymcp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CaptchaPowProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@powforge/captcha-paymcp-provider&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;LnbitsPaymentProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@powforge/paymcp-l402-provider&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nc"&gt;PayMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CaptchaPowProvider&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;captchaUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://captcha.powforge.dev&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LnbitsPaymentProvider&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;lnbitsUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LNBITS_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;lnbitsApiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LNBITS_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;satsAmount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop that right after you construct your &lt;code&gt;McpServer&lt;/code&gt;. Tag any tool with &lt;code&gt;{ _meta: { price: 1 } }&lt;/code&gt; and it's now gated.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PoW path (free, ~5-10s of CPU):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;createPayment&lt;/code&gt; fetches a SHA-256 challenge from the captcha server.&lt;/li&gt;
&lt;li&gt;The provider mines the nonce server-side — no round-trip to the client needed.&lt;/li&gt;
&lt;li&gt;Returns a &lt;code&gt;pow://&lt;/code&gt; URI encoding all params a PoW-capable MCP client SDK needs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getPaymentStatus&lt;/code&gt; submits the nonce to &lt;code&gt;/api/verify&lt;/code&gt; and returns &lt;code&gt;'paid'&lt;/code&gt; on confirm.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Lightning path (21 sats):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;createPayment&lt;/code&gt; mints a BOLT11 invoice via LNBits.&lt;/li&gt;
&lt;li&gt;Returns the invoice in the payment URL.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getPaymentStatus&lt;/code&gt; polls until the invoice is settled.&lt;/li&gt;
&lt;/ol&gt;
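&lt;p&gt;For intuition, here is roughly what that flow looks like against a stock LNBits instance. This is a hand-written sketch, not the provider's source: the endpoints are the standard LNBits REST API, but the function names are mine.&lt;/p&gt;

```javascript
// Sketch of the Lightning flow (not @powforge code). Assumes the
// standard LNBits REST API: POST /api/v1/payments mints an invoice,
// GET /api/v1/payments/:hash reports settlement.

async function createInvoice(lnbitsUrl, apiKey, sats, memo) {
  const res = await fetch(lnbitsUrl + '/api/v1/payments', {
    method: 'POST',
    headers: { 'X-Api-Key': apiKey, 'Content-Type': 'application/json' },
    body: JSON.stringify({ out: false, amount: sats, memo: memo }),
  });
  if (!res.ok) throw new Error('invoice creation failed: ' + res.status);
  // payment_request is the BOLT11 string the agent has to pay.
  return res.json(); // { payment_hash, payment_request, ... }
}

async function isSettled(lnbitsUrl, apiKey, paymentHash) {
  const res = await fetch(lnbitsUrl + '/api/v1/payments/' + paymentHash, {
    headers: { 'X-Api-Key': apiKey },
  });
  if (!res.ok) return false;
  const info = await res.json();
  return info.paid === true; // flips to true once the invoice settles
}

module.exports = { createInvoice, isSettled };
```

&lt;p&gt;The real provider also handles invoice expiry and ties the payment to the pending tool call; the sketch only shows the two HTTP round-trips.&lt;/p&gt;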

&lt;p&gt;paymcp tries the PoW provider first. If the calling agent doesn't support &lt;code&gt;pow://&lt;/code&gt; URIs, it falls through to the Lightning invoice. The agent picks whichever it can satisfy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why both tiers
&lt;/h2&gt;

&lt;p&gt;Some agents are compute-rich, sats-poor — they'd rather burn CPU cycles than need a wallet. Others are running in headless pipelines with a Lightning wallet already wired. Give them both options and you capture more traffic without managing two separate auth flows.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pow://&lt;/code&gt; URI scheme also means the payment proof travels in-band with the request — no session state, no cookies, no database lookup beyond the challenge ledger the captcha server already maintains.&lt;/p&gt;
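&lt;p&gt;As a concrete illustration of the in-band idea, a &lt;code&gt;pow://&lt;/code&gt; URI can be unpacked with the standard WHATWG URL class. The path and parameter names below are made up for the example; the real scheme's fields may differ.&lt;/p&gt;

```javascript
// Hypothetical pow:// URI, invented for illustration; the real scheme's
// field names may differ. The point: the challenge and difficulty travel
// inside the request itself, so the server keeps no session state.
const uri = new URL('pow://captcha.powforge.dev/challenge-abc123?difficulty=14');

const endpoint = uri.host;               // where to submit the nonce
const challenge = uri.pathname.slice(1); // the challenge identifier
const difficulty = Number(uri.searchParams.get('difficulty'));

console.log(endpoint, challenge, difficulty);
```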

&lt;h2&gt;
  
  
  Full example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;use strict&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;McpServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@modelcontextprotocol/sdk/server/mcp.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StdioServerTransport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@modelcontextprotocol/sdk/server/stdio.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PayMCP&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;paymcp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CaptchaPowProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@powforge/captcha-paymcp-provider&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;LnbitsPaymentProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@powforge/paymcp-l402-provider&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-mcp-server&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nc"&gt;PayMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CaptchaPowProvider&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;captchaUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://captcha.powforge.dev&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LnbitsPaymentProvider&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;lnbitsUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LNBITS_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;lnbitsApiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LNBITS_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;satsAmount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;premium_lookup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Premium data lookup — PoW-skip (free) or Lightning (21 sats)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Result for: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StdioServerTransport&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MCP server running&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Self-hosting the captcha server
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;captchaUrl&lt;/code&gt; above points to &lt;code&gt;captcha.powforge.dev&lt;/code&gt;, which handles challenge issuance and verification. You can self-host it too — it's &lt;code&gt;@powforge/captcha&lt;/code&gt; running as a Node.js server, and the whole thing is under 300 lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it costs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PoW path&lt;/strong&gt;: free for the agent, a few seconds of server CPU per call, and a round-trip to your captcha endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightning path&lt;/strong&gt;: 21 sats (or whatever &lt;code&gt;satsAmount&lt;/code&gt; you set) credited to your LNBits wallet.&lt;/li&gt;
&lt;li&gt;No external auth services, no API keys to rotate, no user database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The PoW path is also a natural rate limiter. Solving a difficulty-14 SHA-256 challenge takes roughly 5-10 seconds on a modern CPU — plenty of friction to discourage abuse, not so much that legitimate agents bail out.&lt;/p&gt;




&lt;p&gt;Source on npm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/@powforge/captcha-paymcp-provider" rel="noopener noreferrer"&gt;@powforge/captcha-paymcp-provider&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/@powforge/paymcp-l402-provider" rel="noopener noreferrer"&gt;@powforge/paymcp-l402-provider&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>bitcoin</category>
      <category>javascript</category>
      <category>api</category>
    </item>
    <item>
      <title>Debugging Multi-Agent Systems in TypeScript: From Flat Logs to Execution Trees</title>
      <dc:creator>chintanonweb</dc:creator>
      <pubDate>Sun, 17 May 2026 18:25:38 +0000</pubDate>
      <link>https://experimental.forem.com/chintanonweb/debugging-multi-agent-systems-in-typescript-from-flat-logs-to-execution-trees-1foo</link>
      <guid>https://experimental.forem.com/chintanonweb/debugging-multi-agent-systems-in-typescript-from-flat-logs-to-execution-trees-1foo</guid>
      <description>&lt;p&gt;AI agents are easy to demo when they follow a clean path: receive a task, call a tool, produce an answer, and finish successfully.&lt;/p&gt;

&lt;p&gt;They become much harder to reason about when multiple agents run together.&lt;/p&gt;

&lt;p&gt;In a real system, agents may plan, call tools, retry failures, make decisions from stale state, run in parallel, or touch the same resource from different paths. When something breaks, flat logs usually tell us what happened, but they rarely show why it happened.&lt;/p&gt;

&lt;p&gt;That is the debugging gap I wanted to explore.&lt;/p&gt;

&lt;p&gt;So I built a small TypeScript-based multi-agent incident-response simulator. The goal was simple: simulate a production incident where multiple agents diagnose and remediate infrastructure problems. The system had a diagnostic agent, database agent, network agent, scaling agent, and coordinator agent.&lt;/p&gt;

&lt;p&gt;On paper, the design looked reasonable.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;DiagnosticAgent&lt;/code&gt; analyzed the incoming incident. The &lt;code&gt;DatabaseAgent&lt;/code&gt; handled database-related issues. The &lt;code&gt;NetworkAgent&lt;/code&gt; managed load balancer or routing problems. The &lt;code&gt;ScalingAgent&lt;/code&gt; handled capacity decisions. The &lt;code&gt;CoordinatorAgent&lt;/code&gt; orchestrated everything and was responsible for avoiding conflicting actions.&lt;/p&gt;

&lt;p&gt;The architecture looked clean until the agents started working at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Problem With Flat Logs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the first version, the simulator emitted logs like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\[2:47:23\] DiagnosticAgent: High DB latency detected  
\[2:47:24\] DatabaseAgent: Initiating replica scale-up  
\[2:47:25\] DiagnosticAgent: Connection pool exhaustion detected  
\[2:47:26\] DatabaseAgent: Taking node-3 offline for maintenance  
\[2:47:27\] ScalingAgent: Database performance degraded, scaling up  
\[2:47:28\] NetworkAgent: Detected backend failures, restarting load balancer  
\[2:47:29\] CoordinatorAgent: Conflict detected  
\[2:47:32\] ERROR: Cluster quorum lost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These logs were useful, but only up to a point.&lt;/p&gt;

&lt;p&gt;They showed that the database agent scaled replicas. They showed that another agent also tried to scale. They showed that a node was taken offline. They showed that the coordinator noticed a conflict.&lt;/p&gt;

&lt;p&gt;But they did not clearly answer the important questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which agent made a decision from stale state?&lt;/li&gt;
&lt;li&gt;Did the coordinator run before or after the conflicting tool calls?&lt;/li&gt;
&lt;li&gt;Were the database and scaling agents truly running in parallel?&lt;/li&gt;
&lt;li&gt;Which exact tool call caused the final failure?&lt;/li&gt;
&lt;li&gt;Was the problem an LLM decision, a tool execution issue, or a coordination issue?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where normal logging started to feel too flat. The system behavior was no longer a simple list of events. It was a tree of decisions, tool calls, retries, and parallel branches.&lt;/p&gt;
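&lt;p&gt;A toy version of that tree makes the difference from flat logs obvious. The node shape here is my own, not agent-inspect's trace format; the step names mirror the simulator.&lt;/p&gt;

```javascript
// Toy execution tree (my own node shape, not agent-inspect's format).
// Children run inside their parent; sibling tools under one step may
// have run in parallel.
function node(name, kind, children) {
  return { name: name, kind: kind, children: children || [] };
}

const run = node('incident-response-coordinator', 'run', [
  node('diagnose-incident', 'step', [
    node('check-db-state', 'tool', []),
  ]),
  node('execute-remediation', 'step', [
    node('database-remediation', 'tool', []),
    node('network-remediation', 'tool', []),
    node('scaling-remediation', 'tool', []),
  ]),
  node('resolve-conflicts', 'step', []),
]);

// Indent by depth so ordering and nesting are visible at a glance,
// which is exactly what flat timestamped log lines hide.
function render(n, depth) {
  const d = depth || 0;
  const lines = ['  '.repeat(d) + n.kind + ': ' + n.name];
  for (const child of n.children) {
    lines.push(render(child, d + 1));
  }
  return lines.join('\n');
}

console.log(render(run));
```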

&lt;p&gt;That is when I tried &lt;code&gt;agent-inspect&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Adding Local Execution Tracing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;agent-inspect&lt;/code&gt; is a local-first execution tree debugger for TypeScript and Node.js AI agents. Instead of sending traces to a hosted dashboard, it writes local traces that can be inspected from the terminal.&lt;/p&gt;

&lt;p&gt;That local-first model is important during development. I did not want to set up a full observability platform just to understand one local agent run. I wanted something closer to a structured debugging layer between &lt;code&gt;console.log&lt;/code&gt; and production-grade observability.&lt;/p&gt;

&lt;p&gt;The first step was to wrap the coordinator flow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;inspectRun&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-inspect&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleIncident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Incident&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;inspectRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
   &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;incident-response-coordinator&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
   &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
     &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;diagnosis&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;diagnose-incident&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;diagnosticAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;incident&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
     &lt;span class="p"&gt;});&lt;/span&gt;

     &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;execute-remediation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;  
         &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;database-remediation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  
           &lt;span class="nx"&gt;databaseAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbIssues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
         &lt;span class="p"&gt;),&lt;/span&gt;  
         &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;network-remediation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  
           &lt;span class="nx"&gt;networkAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;networkIssues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
         &lt;span class="p"&gt;),&lt;/span&gt;  
         &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scaling-remediation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  
           &lt;span class="nx"&gt;scalingAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;scalingIssues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
         &lt;span class="p"&gt;),&lt;/span&gt;  
       &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;  
     &lt;span class="p"&gt;});&lt;/span&gt;

     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;resolve-conflicts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;resolveConflicts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
     &lt;span class="p"&gt;});&lt;/span&gt;  
   &lt;span class="p"&gt;},&lt;/span&gt;  
   &lt;span class="p"&gt;{&lt;/span&gt;  
     &lt;span class="na"&gt;traceDir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./.agent-inspect&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
   &lt;span class="p"&gt;}&lt;/span&gt;  
 &lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code did not need a full rewrite. The main change was adding meaningful boundaries around the work.&lt;/p&gt;

&lt;p&gt;The outer &lt;code&gt;inspectRun&lt;/code&gt; represented one agent run. The normal &lt;code&gt;step&lt;/code&gt; calls represented logical phases. The &lt;code&gt;step.tool&lt;/code&gt; calls marked operations that touched external systems or simulated infrastructure.&lt;/p&gt;

&lt;p&gt;Then I instrumented the database agent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DatabaseAgent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
 &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DbIssue&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;database-agent-execution&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
     &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbState&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;check-db-state&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getClusterState&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
     &lt;span class="p"&gt;});&lt;/span&gt;

     &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decision&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;decide-db-action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
         &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;  
           &lt;span class="p"&gt;{&lt;/span&gt;  
             &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
             &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
               &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Decide the safest database remediation action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
               &lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
               &lt;span class="nx"&gt;dbState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
             &lt;span class="p"&gt;}),&lt;/span&gt;  
           &lt;span class="p"&gt;},&lt;/span&gt;  
         &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
       &lt;span class="p"&gt;});&lt;/span&gt;  
     &lt;span class="p"&gt;});&lt;/span&gt;

     &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scale-up&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scale-database&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scaleUpReplicas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;targetCount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
       &lt;span class="p"&gt;});&lt;/span&gt;  
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;restart-node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;restart-node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;restartNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nodeId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
       &lt;span class="p"&gt;});&lt;/span&gt;  
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;no-op&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
       &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No safe database action selected&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
     &lt;span class="p"&gt;};&lt;/span&gt;  
   &lt;span class="p"&gt;});&lt;/span&gt;  
 &lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is not just the tracing. It is the naming.&lt;/p&gt;

&lt;p&gt;A trace is only useful if the steps describe the system in the same language engineers use during debugging. &lt;code&gt;check-db-state&lt;/code&gt;, &lt;code&gt;decide-db-action&lt;/code&gt;, &lt;code&gt;scale-database&lt;/code&gt;, and &lt;code&gt;restart-node&lt;/code&gt; are much more useful than generic messages like &lt;code&gt;running task&lt;/code&gt; or &lt;code&gt;tool call started&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Inspecting the Failed Run&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After running the simulator, I listed the local traces:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx agent-inspect list --dir ./.agent-inspect
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Then I inspected the failed run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx agent-inspect view &amp;lt;run-id&amp;gt; --dir ./.agent-inspect
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The execution tree made the issue much easier to understand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incident-response-coordinator                              \[47.2s\] ✗  
├─ diagnose-incident                                       \[3.1s\] ✓  
├─ execute-remediation                                     \[41.8s\] ✗  
│  ├─ database-remediation                                 \[23.2s\] ✓  
│  │  └─ database-agent-execution                          \[23.1s\] ✓  
│  │     ├─ check-db-state                                 \[0.4s\] ✓  
│  │     ├─ decide-db-action                               \[2.1s\] ✓  
│  │     ├─ scale-database                                 \[18.3s\] ✓  
│  │     ├─ check-db-state                                 \[0.3s\] ✓  
│  │     ├─ decide-db-action                               \[1.9s\] ✓  
│  │     └─ restart-node                                   \[0.3s\] ✓  
│  ├─ network-remediation                                  \[5.2s\] ✓  
│  └─ scaling-remediation                                  \[41.7s\] ✗  
│     └─ scaling-agent-execution                           \[41.6s\] ✗  
│        ├─ check-scaling-state                            \[0.3s\] ✓  
│        ├─ decide-scaling-action                          \[2.2s\] ✓  
│        └─ scale-database                                 \[39.1s\] ✗  
│           └─ Error: Operation timeout \- cluster in inconsistent state  
└─ resolve-conflicts                                       \[not reached\]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This view showed the problem more clearly than the logs.&lt;/p&gt;

&lt;p&gt;The database agent checked the state, decided to scale up, and started a database scaling operation. Then it checked state again and decided to restart a node. At the same time, the scaling agent also detected database pressure and started another scaling operation.&lt;/p&gt;

&lt;p&gt;Both agents were acting on the same resource. Both believed their action was valid. The coordinator was supposed to resolve conflicts, but the trace showed that &lt;code&gt;resolve-conflicts&lt;/code&gt; was never reached because the failure happened inside the parallel remediation step.&lt;/p&gt;

&lt;p&gt;That was the real bug.&lt;/p&gt;

&lt;p&gt;It was not simply a bad prompt. It was not only a database operation failure. It was a coordination bug caused by parallel agents acting on the same resource without a proper resource-level guard.&lt;/p&gt;
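&lt;p&gt;To make the overlap concrete, here is a toy reproduction (illustrative only, not the simulator's code): two agents read the same "no operation in progress" snapshot, then both act on the cluster.&lt;/p&gt;

```typescript
// Toy reproduction of the race (not the simulator's real code):
// both agents read a snapshot before either has acted, so both
// conclude it is safe to start a scaling operation.
const cluster = { scalingInProgress: false };
let scaleOps = 0;

async function agentScale(label: string) {
  // snapshot taken before the other agent acts
  const snapshot = { scalingInProgress: cluster.scalingInProgress };
  await new Promise((r) => setTimeout(r, 10)); // decision latency
  if (!snapshot.scalingInProgress) {
    cluster.scalingInProgress = true; // too late: the other agent also saw "false"
    scaleOps += 1;
    console.log(label, "started a scaling operation");
  }
}

async function main() {
  await Promise.all([agentScale("database-agent"), agentScale("scaling-agent")]);
  console.log("concurrent scaling operations:", scaleOps); // 2, not 1
}

main();
```

&lt;p&gt;Both checks pass because each agent decided against a snapshot taken before the other agent acted. That is exactly the shape of failure the execution tree exposed.&lt;/p&gt;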

&lt;h2&gt;
  
  
  &lt;strong&gt;Fixing the Coordination Model&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once the execution tree made the failure visible, the fix became much more direct.&lt;/p&gt;

&lt;p&gt;The first change was to add a state refresh guard. If the database cluster already had an operation in progress, the agent should wait for stable state before making another decision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DbIssue&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;database-agent-execution&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbState&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;check-db-state&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getClusterState&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
   &lt;span class="p"&gt;});&lt;/span&gt;

   &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dbState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hasInProgressOperations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wait-for-stability&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
       &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForStableState&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
     &lt;span class="p"&gt;});&lt;/span&gt;  
   &lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decideAndExecute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dbState&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
 &lt;span class="p"&gt;});&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second change was to protect critical operations with a lock.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scaleUpReplicas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;targetCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scale-database&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;acquireLock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;database-scaling&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

   &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;performScaleUp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;targetCount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
   &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
     &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
   &lt;span class="p"&gt;}&lt;/span&gt;  
 &lt;span class="p"&gt;});&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
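&lt;p&gt;The code above relies on an &lt;code&gt;acquireLock(name, ttlMs)&lt;/code&gt; helper that the article does not show. As an illustration under my own assumptions (not the project's actual implementation), a minimal in-process version with a TTL safety valve could look like this:&lt;/p&gt;

```typescript
// Hypothetical sketch of acquireLock(name, ttlMs): an in-process mutex.
// Callers for the same lock name queue behind each other; the TTL
// auto-releases the lock if a holder hangs.
const lockQueues = new Map(); // lock name -> tail of the wait queue

async function acquireLock(name: string, ttlMs: number) {
  const prev = lockQueues.get(name) || Promise.resolve();
  let unlock = () => {};
  const held = new Promise((resolve) => {
    unlock = () => resolve(undefined);
  });
  lockQueues.set(name, prev.then(() => held));
  await prev; // wait until earlier holders of this resource release
  const timer = setTimeout(unlock, ttlMs); // safety valve if release is never called
  return {
    release: async () => {
      clearTimeout(timer);
      unlock();
    },
  };
}
```

&lt;p&gt;For agents running in separate processes, the same idea would need an external store (a database row or a Redis key with expiry) instead of in-memory promises.&lt;/p&gt;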



&lt;p&gt;The third change was at the coordinator level. If multiple agents wanted to touch the same resource, the coordinator should not blindly run them in parallel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;execute-remediation-sequenced&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;targets&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;identifyResourceTargets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

 &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dbActions&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;database-remediation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  
     &lt;span class="nx"&gt;databaseAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbIssues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
   &lt;span class="p"&gt;);&lt;/span&gt;

   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;networkActions&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;network-remediation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  
     &lt;span class="nx"&gt;networkAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;networkIssues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
   &lt;span class="p"&gt;);&lt;/span&gt;

   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
     &lt;span class="nx"&gt;dbActions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
     &lt;span class="nx"&gt;networkActions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
   &lt;span class="p"&gt;};&lt;/span&gt;  
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;  
   &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;network-remediation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  
     &lt;span class="nx"&gt;networkAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;networkIssues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
   &lt;span class="p"&gt;),&lt;/span&gt;  
   &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scaling-remediation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  
     &lt;span class="nx"&gt;scalingAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handleIssue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;scalingIssues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
   &lt;span class="p"&gt;),&lt;/span&gt;  
 &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;  
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the fix, the trace looked different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incident-response-coordinator                              \[15.3s\] ✓  
├─ diagnose-incident                                       \[2.8s\] ✓  
├─ execute-remediation-sequenced                           \[11.2s\] ✓  
│  └─ database-remediation                                 \[8.4s\] ✓  
│     └─ database-agent-execution                          \[8.3s\] ✓  
│        ├─ check-db-state                                 \[0.3s\] ✓  
│        ├─ acquire-lock                                   \[0.1s\] ✓  
│        ├─ decide-db-action                               \[1.9s\] ✓  
│        ├─ scale-database                                 \[5.8s\] ✓  
│        └─ release-lock                                   \[0.1s\] ✓  
└─ resolve-conflicts                                       \[1.3s\] ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the kind of output I want during agent development.&lt;/p&gt;

&lt;p&gt;Not just “something failed,” but where it failed. Not just “the tool timed out,” but what sequence caused the timeout. Not just “agents ran in parallel,” but which branches actually overlapped.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters for AI Agent Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As agent systems become more common, debugging needs to move beyond raw logs.&lt;/p&gt;

&lt;p&gt;A single-agent workflow can often be debugged with a few log statements. But multi-agent systems introduce coordination problems. A bug may not live inside one function. It may live between two valid decisions that become unsafe when executed together.&lt;/p&gt;

&lt;p&gt;That is why execution trees are useful.&lt;/p&gt;

&lt;p&gt;They show the structure of the run. They show parent-child relationships. They separate normal logic from tool calls and LLM calls. They make retries, skipped steps, failed branches, and slow operations easier to reason about.&lt;/p&gt;
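&lt;p&gt;The core mechanism behind such a tree is small. As an illustration of the idea (not &lt;code&gt;agent-inspect&lt;/code&gt;'s real internals), a nested &lt;code&gt;step()&lt;/code&gt; wrapper can record the structure in a few lines:&lt;/p&gt;

```typescript
// Illustrative sketch of how nested step() calls can build an execution
// tree: each call records a node under the current parent, times the
// body, marks failures, and restores the parent when the body finishes.
type SpanNode = {
  name: string;
  ok: boolean;
  ms: number;
  children: SpanNode[];
};

const root: SpanNode = { name: "run", ok: true, ms: 0, children: [] };
let current = root;

async function step(name: string, body: () => any) {
  const node: SpanNode = { name, ok: true, ms: 0, children: [] };
  current.children.push(node);
  const parent = current;
  current = node; // children created inside body attach here
  const start = Date.now();
  try {
    return await body();
  } catch (err) {
    node.ok = false; // failed branches stay visible in the tree
    throw err;
  } finally {
    node.ms = Date.now() - start;
    current = parent;
  }
}
```

&lt;p&gt;This single-pointer version only handles sequential nesting; real tracing libraries track the current parent per async context (for example via &lt;code&gt;AsyncLocalStorage&lt;/code&gt; in Node.js) so that parallel branches attach to the right node.&lt;/p&gt;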

&lt;p&gt;This also changes how we think about observability.&lt;/p&gt;

&lt;p&gt;Production observability platforms are still important. Tools like LangSmith, Langfuse, OpenTelemetry-based pipelines, and APM platforms solve important team and production problems. But during local development, I often want something lighter. I want to run the agent, inspect the trace, make a change, and compare the result.&lt;/p&gt;

&lt;p&gt;That is the space where a local-first tool like &lt;code&gt;agent-inspect&lt;/code&gt; fits naturally.&lt;/p&gt;

&lt;p&gt;It is not trying to replace production monitoring. It is closer to a developer workflow tool for understanding agent behavior before it reaches production.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Practical Lessons From the Project&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first lesson is that flat logs hide structure. In a multi-agent workflow, order alone is not enough. You need to know which step belonged to which agent, which steps were siblings, and which operation blocked or failed.&lt;/p&gt;

&lt;p&gt;The second lesson is that not every agent bug is an LLM bug. In this simulator, the expensive failure came from tool coordination and stale state, not from a slow model call. Without tracing, it would have been easy to spend time tuning prompts while ignoring the actual failure path.&lt;/p&gt;

&lt;p&gt;The third lesson is that instrumentation can become living documentation. A well-named &lt;code&gt;step()&lt;/code&gt; call describes the architecture. When a new engineer reads the trace, they can understand the runtime behavior faster than reading scattered logs.&lt;/p&gt;

&lt;p&gt;The fourth lesson is that local-first debugging is still valuable. Not every debugging session needs a dashboard, collector, account, or cloud upload. Sometimes the fastest path is a local trace file and a terminal command.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The more I build with AI agents, the more I feel that debugging is becoming an architecture problem.&lt;/p&gt;

&lt;p&gt;It is not enough to know that an agent produced the wrong answer. We need to know what it planned, which tools it called, which state it observed, which branches ran in parallel, where retries happened, and what changed between two runs.&lt;/p&gt;

&lt;p&gt;For TypeScript and Node.js teams building agentic systems, &lt;code&gt;agent-inspect&lt;/code&gt; is a useful tool to explore that workflow. It gives you a lightweight way to turn agent runs into readable execution trees without committing to a hosted observability setup on day one.&lt;/p&gt;

&lt;p&gt;For my multi-agent incident-response simulator, the biggest value was simple: it turned a confusing wall of logs into a system I could reason about.&lt;/p&gt;

&lt;p&gt;And that is usually the first step toward making agent systems reliable.&lt;/p&gt;

&lt;p&gt;npm package: &lt;a href="https://www.npmjs.com/package/agent-inspect" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/agent-inspect&lt;/a&gt;&lt;br&gt;&lt;br&gt;
GitHub repo: &lt;a href="https://github.com/rajudandigam/agent-inspect" rel="noopener noreferrer"&gt;https://github.com/rajudandigam/agent-inspect&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AWS Cloud Practitioner Exam - The Difficult Parts - Part 2: Planning and Costs</title>
      <dc:creator>Cliff Claven</dc:creator>
      <pubDate>Sun, 17 May 2026 18:23:23 +0000</pubDate>
      <link>https://experimental.forem.com/c_claven_03c4a41605f86c8e4/aws-cloud-practitioner-exam-the-difficult-parts-part-2-planning-and-costs-2kdf</link>
      <guid>https://experimental.forem.com/c_claven_03c4a41605f86c8e4/aws-cloud-practitioner-exam-the-difficult-parts-part-2-planning-and-costs-2kdf</guid>
      <description>&lt;h2&gt;
  
  
  💰 Cost &amp;amp; Usage Report — The Billing Data Firehose
&lt;/h2&gt;

&lt;p&gt;Think of it as a massive CSV delivered to an S3 bucket with every single charge broken down by hour, resource, tag, and account. The most granular billing data AWS produces — built for analysts and BI tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Billing tools ranked by detail level:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pricing Calculator  →  estimate before you build (no real data)
Budgets             →  set thresholds, get alerts
Cost Explorer       →  charts/graphs of actual spend, up to 13 months back
Cost &amp;amp; Usage Report →  raw data firehose, most detailed of all ⬅ this one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;📋 Exam trigger words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"detailed cost breakdown per resource" · "feed billing data into a BI tool" → &lt;strong&gt;Cost &amp;amp; Usage Report&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
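&lt;p&gt;To make that "massive CSV" idea concrete, here is a minimal stdlib-only sketch that totals cost per service from CUR-shaped rows. The column names mirror real CUR columns, but the data and the aggregation are invented for illustration — real reports have hundreds of columns and usually get queried with Athena or a BI tool instead:&lt;/p&gt;

```python
import csv
import io
from collections import defaultdict

# Invented rows shaped like CUR columns (real reports are far wider).
cur_csv = """lineItem/ProductCode,lineItem/UsageStartDate,lineItem/UnblendedCost
AmazonEC2,2026-05-01T00:00:00Z,0.0416
AmazonEC2,2026-05-01T01:00:00Z,0.0416
AmazonS3,2026-05-01T00:00:00Z,0.0004
"""

# Sum unblended cost per service, one row per hourly line item.
cost_by_service = defaultdict(float)
for row in csv.DictReader(io.StringIO(cur_csv)):
    cost_by_service[row["lineItem/ProductCode"]] += float(row["lineItem/UnblendedCost"])

print(dict(cost_by_service))
```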




&lt;h2&gt;
  
  
  The 6 Pillars
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario signal&lt;/th&gt;
&lt;th&gt;Pillar&lt;/th&gt;
&lt;th&gt;One-liner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single point of failure, outage, recovery&lt;/td&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;Stay up, recover fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paying for unused resources, bill too high&lt;/td&gt;
&lt;td&gt;Cost Optimization&lt;/td&gt;
&lt;td&gt;Don't waste money&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual processes, inconsistent deployments&lt;/td&gt;
&lt;td&gt;Operational Excellence&lt;/td&gt;
&lt;td&gt;Run it well and keep improving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credentials exposed, no encryption&lt;/td&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Protect everything, always&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow for distant users, wrong instance type&lt;/td&gt;
&lt;td&gt;Performance Efficiency&lt;/td&gt;
&lt;td&gt;Use the right resource for the job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carbon footprint, energy, managed services&lt;/td&gt;
&lt;td&gt;Sustainability&lt;/td&gt;
&lt;td&gt;Minimize environmental impact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  AWS Service Scope: Global vs Regional vs Zonal
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Global&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IAM, Route 53, CloudFront, WAF, STS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regional&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3, RDS, EFS, Lambda, SQS, SNS, AWS Batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Zonal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;EC2 instances, EBS volumes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The trick:&lt;/strong&gt; EC2 feels regional but it's zonal — an instance lives in one AZ. EBS snapshots, however, are regional: they are stored in S3 and can be restored into any AZ in the Region.&lt;/p&gt;
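&lt;p&gt;A quick way to drill this: turn the table above into a lookup and self-test it. This is just a study helper built from the examples in this post, not an exhaustive or official classification:&lt;/p&gt;

```python
# Study helper: the scope table above as a dict (from this post's examples only).
SERVICE_SCOPE = {
    "IAM": "global", "Route 53": "global", "CloudFront": "global",
    "WAF": "global", "STS": "global",
    "S3": "regional", "RDS": "regional", "EFS": "regional", "Lambda": "regional",
    "SQS": "regional", "SNS": "regional", "AWS Batch": "regional",
    "EC2 instance": "zonal", "EBS volume": "zonal",
    # The trick: snapshots are regional even though the volumes they copy are zonal.
    "EBS snapshot": "regional",
}

print(SERVICE_SCOPE["EC2 instance"])   # zonal
print(SERVICE_SCOPE["EBS snapshot"])   # regional
```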




&lt;h2&gt;
  
  
  All 6 CAF Perspectives — Complete Master Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Perspective&lt;/th&gt;
&lt;th&gt;Owned by&lt;/th&gt;
&lt;th&gt;Focuses on&lt;/th&gt;
&lt;th&gt;Key capabilities&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Business&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CEO, CFO, COO&lt;/td&gt;
&lt;td&gt;Cloud investment drives business outcomes&lt;/td&gt;
&lt;td&gt;Strategy, portfolio, innovation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;People&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CHRO, HR leaders&lt;/td&gt;
&lt;td&gt;Culture, skills, organizational change&lt;/td&gt;
&lt;td&gt;Training, workforce, change management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CRO, Compliance&lt;/td&gt;
&lt;td&gt;Risk, compliance, investment decisions&lt;/td&gt;
&lt;td&gt;Portfolio management, data governance, risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CTO, Architects&lt;/td&gt;
&lt;td&gt;Architecture, infrastructure, tech standards&lt;/td&gt;
&lt;td&gt;IaC, networking, data architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CISO, Security engineers&lt;/td&gt;
&lt;td&gt;Protect everything, detect threats&lt;/td&gt;
&lt;td&gt;IAM, data protection, infrastructure protection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IT Operations, Support&lt;/td&gt;
&lt;td&gt;Run and support cloud day to day&lt;/td&gt;
&lt;td&gt;Incident mgmt, performance, patch management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Exam trick:&lt;/strong&gt; CAF is NOT just technical — Business and People perspectives are tested heavily.&lt;br&gt;
&lt;strong&gt;Application Portfolio Management&lt;/strong&gt; = Governance ← students routinely misfile this under Operations.&lt;/p&gt;
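&lt;p&gt;Flash-card style, the commonly confused capability-to-perspective mappings from the tables above can be drilled like this (the mapping contains only pairs stated in this post):&lt;/p&gt;

```python
# Capability -> CAF perspective, for the pairs called out in this post.
CAPABILITY_TO_PERSPECTIVE = {
    "Application Portfolio Management": "Governance",     # NOT Operations
    "Performance and Capacity Management": "Operations",
    "Patch Management": "Operations",
    "Incident Response": "Security",
    "Infrastructure as Code": "Platform",
    "Training": "People",
}

def quiz(capability):
    """Return the owning perspective, or 'unknown' for anything unmapped."""
    return CAPABILITY_TO_PERSPECTIVE.get(capability, "unknown")

print(quiz("Application Portfolio Management"))  # Governance
```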

&lt;h2&gt;
  
  
  CAF Security Perspective Capabilities
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Does what&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure Protection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Protects against external threats and unauthorized access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Identity and Access Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Controls who accesses what&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Protection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encryption, data security at rest and in transit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Threat Detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Identifies existing threats&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Incident Response&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Responds when breaches occur&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secures applications specifically&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  CAF Operations Perspective Capabilities
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Event management (AIOps)&lt;/li&gt;
&lt;li&gt;Incident and problem management&lt;/li&gt;
&lt;li&gt;Change and release management&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance and capacity management&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Configuration management&lt;/li&gt;
&lt;li&gt;Patch management&lt;/li&gt;
&lt;li&gt;Availability and continuity management&lt;/li&gt;
&lt;li&gt;Application management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trigger:&lt;/strong&gt; "meet SLAs" + "agreed-upon service levels" → Performance and Capacity Management&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Application Portfolio Management = Governance perspective, NOT Operations&lt;/p&gt;




&lt;h2&gt;
  
  
  Shared Responsibility Model
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS owns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Physical infrastructure, host OS patching, networking hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shared&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configuration management, patch management (guest OS = you), awareness &amp;amp; training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Customer owns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Guest OS, applications, data encryption, network traffic protection, Zone Security&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The one-word trick:&lt;/strong&gt; "host OS" = AWS. "Guest OS" = customer.&lt;/p&gt;




&lt;h2&gt;
  
  
  IAM Identities
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;IAM Concept&lt;/th&gt;
&lt;th&gt;CLI/Access Keys?&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;IAM User&lt;/td&gt;
&lt;td&gt;✅ Long-term credentials&lt;/td&gt;
&lt;td&gt;Common but not best practice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Role&lt;/td&gt;
&lt;td&gt;✅ Temporary credentials&lt;/td&gt;
&lt;td&gt;Best practice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Group&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Collection of users only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM Policy&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Not an identity — it's a permission document&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Pricing Calculator vs Cost Explorer
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Use When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing Calculator&lt;/td&gt;
&lt;td&gt;Planning/estimating &lt;strong&gt;before&lt;/strong&gt; you build&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Explorer&lt;/td&gt;
&lt;td&gt;Analyzing actual spend &lt;strong&gt;after&lt;/strong&gt; you've been running&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Trusted Advisor — 5 Categories (memorize exactly)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Cost Optimization&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Fault Tolerance&lt;/li&gt;
&lt;li&gt;Performance&lt;/li&gt;
&lt;li&gt;Service Limits&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Trap answers:&lt;/strong&gt; "Instance Usage", "Infrastructure", "Storage Capacity" — none of these are real categories.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Support Plans — Complete Feature Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Basic&lt;/th&gt;
&lt;th&gt;Business+&lt;/th&gt;
&lt;th&gt;Enterprise&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;More expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trusted Advisor checks&lt;/td&gt;
&lt;td&gt;Core only&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support API&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical Account Manager (TAM)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Well-Architected Reviews&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations Reviews&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure Event Management&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ extra fee&lt;/td&gt;
&lt;td&gt;✅ included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concierge billing support&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response time (critical)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;For workloads&lt;/td&gt;
&lt;td&gt;Dev/test&lt;/td&gt;
&lt;td&gt;Production&lt;/td&gt;
&lt;td&gt;Mission-critical&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; Business+ gets Infrastructure Event Management (IEM) for an extra fee but NOT Well-Architected or Operations Reviews → those require Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; If a question mentions Well-Architected Reviews OR Operations Reviews → Enterprise only&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Free vs What Costs Money
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;FREE&lt;/th&gt;
&lt;th&gt;COSTS MONEY&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VPCs&lt;/td&gt;
&lt;td&gt;EC2 instances (per hour)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subnets and route tables&lt;/td&gt;
&lt;td&gt;RDS instances (per hour)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM users, groups, roles, policies&lt;/td&gt;
&lt;td&gt;NAT Gateway (hourly + per GB processed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFormation&lt;/td&gt;
&lt;td&gt;Elastic IPs — even attached to running instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Organizations&lt;/td&gt;
&lt;td&gt;Data transfer OUT to internet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Groups and NACLs&lt;/td&gt;
&lt;td&gt;Data transfer BETWEEN regions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Console access&lt;/td&gt;
&lt;td&gt;Data transfer BETWEEN AZs (small fee)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inbound data transfer to AWS&lt;/td&gt;
&lt;td&gt;EBS volumes (per GB per month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 DELETE and CANCEL requests (most other S3 requests are billed)&lt;/td&gt;
&lt;td&gt;Load balancers (per hour + LCUs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS resolution within VPC&lt;/td&gt;
&lt;td&gt;Direct Connect (port hours + data transfer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch basic monitoring&lt;/td&gt;
&lt;td&gt;CloudWatch detailed monitoring and custom metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Biggest surprises:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elastic IPs cost money even when properly attached — AWS charges to discourage IPv4 hoarding&lt;/li&gt;
&lt;li&gt;Data transfer INTO AWS is free — you're never charged for uploads&lt;/li&gt;
&lt;li&gt;Data transfer BETWEEN AZs in the same region costs a small amount — factor it in when weighing multi-AZ designs&lt;/li&gt;
&lt;li&gt;VPCs themselves are free — you pay for what's inside them&lt;/li&gt;
&lt;li&gt;CloudFormation is free — you pay for resources it creates&lt;/li&gt;
&lt;/ul&gt;
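&lt;p&gt;The "hourly + per GB" billing shape (NAT Gateway, load balancers) is easy to estimate with back-of-envelope math. The rates below are illustrative placeholders, not current AWS pricing — always check the pricing page for your region:&lt;/p&gt;

```python
# Back-of-envelope NAT Gateway monthly cost. Rates are ASSUMED placeholders,
# not real AWS pricing; 730 approximates hours in a month.
HOURLY_RATE = 0.045   # USD per NAT Gateway hour (assumed)
PER_GB_RATE = 0.045   # USD per GB processed (assumed)

def nat_gateway_monthly_cost(gb_processed, hours=730):
    # Two cost components: time the gateway exists, plus data it processes.
    return hours * HOURLY_RATE + gb_processed * PER_GB_RATE

print(f"${nat_gateway_monthly_cost(100):.2f}")  # one gateway, 100 GB/month
```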

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>infrastructure</category>
      <category>learning</category>
    </item>
    <item>
      <title>Operational Hardening — Guardrails, Secrets Rotation &amp; SLO — FSx ONTAP S3AP Phase 12</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 17 May 2026 18:21:39 +0000</pubDate>
      <link>https://experimental.forem.com/aws-builders/operational-hardening-guardrails-secrets-rotation-slo-fsx-ontap-s3ap-phase-12-1k4o</link>
      <guid>https://experimental.forem.com/aws-builders/operational-hardening-guardrails-secrets-rotation-slo-fsx-ontap-s3ap-phase-12-1k4o</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Phase 12 hardens the Phase 11 event-driven pipeline for production: capacity guardrails, automated secrets rotation, SLO observability, and Persistent Store replay validated with zero event loss in tested scenarios.&lt;/p&gt;

&lt;p&gt;Phase 12 is not about adding another UC. It is about turning the Phase 11 event-driven pipeline into an operator-ready system: safe automation, credential rotation, forecast-based capacity operations, lineage, SLOs, and validated replay behavior.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Phase 12&lt;/strong&gt; of the FSx for ONTAP S3AP serverless pattern library. Building on &lt;a href="https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;Phase 10&lt;/a&gt; and &lt;a href="https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8"&gt;Phase 11&lt;/a&gt;, Phase 12 delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity Guardrails&lt;/strong&gt;: DRY_RUN/ENFORCE/BREAK_GLASS modes with DynamoDB tracking and CloudWatch EMF metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Rotation&lt;/strong&gt;: 4-step ONTAP fsxadmin auto-rotation via VPC Lambda on 90-day interval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic Monitoring&lt;/strong&gt;: CloudWatch Synthetics Canary with S3AP + ONTAP health checks (VPC constraints discovered)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity Forecasting&lt;/strong&gt;: Linear regression (stdlib only) with DaysUntilFull metric on daily EventBridge schedule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Lineage Tracking&lt;/strong&gt;: DynamoDB table with GSI for processing history and opt-in integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protobuf TCP Framing&lt;/strong&gt;: AUTO_DETECT/LENGTH_PREFIXED/FRAMELESS adaptive reader&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO Definition&lt;/strong&gt;: 4 SLO targets with CloudWatch Dashboard and alarm-based violation detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy Pipeline E2E&lt;/strong&gt;: NFS file creation → FPolicy → SQS delivery confirmed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Store Replay&lt;/strong&gt;: Fargate stop → file creation → restart → zero event loss in tested 5-event and 20-event scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Property-Based Testing&lt;/strong&gt;: 16 Hypothesis properties, 53 tests, 3 bugs discovered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Point Deep Dive&lt;/strong&gt;: Multi-layer authorization, IAM ARN format, VPC network constraints&lt;/li&gt;
&lt;/ul&gt;
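&lt;p&gt;To make the "linear regression (stdlib only)" forecasting bullet concrete, here is an illustrative sketch of a DaysUntilFull computation. This is not the project's actual module; it is a minimal least-squares fit over (day, used GB) samples, extrapolated to the day usage reaches capacity:&lt;/p&gt;

```python
# Illustrative stdlib-only forecast: fit used-GB-vs-day with least squares,
# then extrapolate to when the volume fills.
def days_until_full(samples, capacity_gb):
    """samples: list of (day_index, used_gb) pairs, last sample is 'today'."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope > 0:
        # Day index at which the fit line hits capacity, minus today's index.
        return (capacity_gb - intercept) / slope - samples[-1][0]
    return float("inf")  # flat or shrinking usage: never fills

# Growing 10 GB/day toward a 1000 GB volume.
print(days_until_full([(0, 500), (1, 510), (2, 520)], 1000))
```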

&lt;p&gt;&lt;strong&gt;Key metrics&lt;/strong&gt;: 59 files, 14,895 lines added · 116 unit tests + 53 property tests · 7 CloudFormation stacks deployed · 3 bugs found via property testing · Zero event loss in 5-event replay + 20-event burst tests · Secrets rotation: all 4 steps successful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Capacity Guardrails — DRY_RUN / ENFORCE / BREAK_GLASS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;FSx ONTAP supports automatic storage capacity expansion, but uncontrolled auto-scaling can lead to runaway costs. Operations teams need rate limiting, daily caps, and cooldown periods — with an emergency bypass for critical situations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A three-mode guardrail system backed by DynamoDB tracking and CloudWatch EMF metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph LR
    A[Auto-Expand Request] --&amp;gt; B{GuardrailMode?}
    B --&amp;gt;|DRY_RUN| C[Log + Allow&amp;lt;br/&amp;gt;fail-open on DDB error]
    B --&amp;gt;|ENFORCE| D[Check + Block&amp;lt;br/&amp;gt;fail-closed on DDB error]
    B --&amp;gt;|BREAK_GLASS| E[Bypass All Checks&amp;lt;br/&amp;gt;SNS Alert + Audit Log]
    C --&amp;gt; F[DynamoDB Tracking]
    D --&amp;gt; F
    E --&amp;gt; F
    F --&amp;gt; G[CloudWatch EMF Metrics]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior on Check Failure&lt;/th&gt;
&lt;th&gt;Behavior on DynamoDB Error&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DRY_RUN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Log warning, allow action&lt;/td&gt;
&lt;td&gt;Fail-open (allow)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ENFORCE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Block action, emit metric&lt;/td&gt;
&lt;td&gt;Fail-closed (deny)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BREAK_GLASS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip all checks&lt;/td&gt;
&lt;td&gt;SNS alert + audit log&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Core implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.guardrails&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CapacityGuardrail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GuardrailMode&lt;/span&gt;

&lt;span class="n"&gt;guardrail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CapacityGuardrail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Mode from GUARDRAIL_MODE env var
&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guardrail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_and_execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;action_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;volume_grow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requested_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;execute_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_grow_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;volume_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vol-abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action executed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action denied: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Reasons: rate_limit_exceeded | daily_cap_exceeded | cooldown_active
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Three safety checks (ENFORCE mode)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit&lt;/strong&gt;: Max 10 actions per day per action type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily cap&lt;/strong&gt;: Max 500 GB cumulative expansion per day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooldown&lt;/strong&gt;: 300-second minimum interval between actions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All thresholds are configurable via environment variables (&lt;code&gt;GUARDRAIL_RATE_LIMIT&lt;/code&gt;, &lt;code&gt;GUARDRAIL_DAILY_CAP_GB&lt;/code&gt;, &lt;code&gt;GUARDRAIL_COOLDOWN_SECONDS&lt;/code&gt;).&lt;/p&gt;
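&lt;p&gt;Reading those thresholds could look like the following sketch. The variable names and defaults match this post; the parsing helper itself is illustrative, not the project's actual code:&lt;/p&gt;

```python
import os

# Guardrail thresholds from environment variables, with this post's defaults.
def load_guardrail_config(env=os.environ):
    return {
        "rate_limit": int(env.get("GUARDRAIL_RATE_LIMIT", "10")),          # actions/day
        "daily_cap_gb": float(env.get("GUARDRAIL_DAILY_CAP_GB", "500")),   # GB/day
        "cooldown_seconds": int(env.get("GUARDRAIL_COOLDOWN_SECONDS", "300")),
    }

print(load_guardrail_config({}))  # empty env: falls back to the defaults above
```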

&lt;h3&gt;
  
  
  DynamoDB tracking schema
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Action type (e.g., &lt;code&gt;volume_grow&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Date (&lt;code&gt;YYYY-MM-DD&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;daily_total_gb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Cumulative GB expanded today&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;action_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Number of actions today&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;last_action_ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;ISO timestamp of last action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;actions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List&lt;/td&gt;
&lt;td&gt;Audit trail of all actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ttl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;30-day auto-expiry&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
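&lt;p&gt;Against that schema, the guardrail can record an action with a single atomic &lt;code&gt;UpdateItem&lt;/code&gt;. The sketch below only builds the request kwargs (no AWS call is made); the attribute names come from the table above, while the helper itself is illustrative:&lt;/p&gt;

```python
import time

# Build an atomic UpdateItem request for the guardrail tracking table:
# ADD increments the daily counters, SET refreshes timestamp and TTL.
def build_tracking_update(action_type, date_str, gb, now_iso):
    return {
        "Key": {"pk": {"S": action_type}, "sk": {"S": date_str}},
        "UpdateExpression": (
            "ADD daily_total_gb :gb, action_count :one "
            "SET last_action_ts = :ts, #ttl = :ttl"
        ),
        "ExpressionAttributeNames": {"#ttl": "ttl"},
        "ExpressionAttributeValues": {
            ":gb": {"N": str(gb)},
            ":one": {"N": "1"},
            ":ts": {"S": now_iso},
            ":ttl": {"N": str(int(time.time()) + 30 * 86400)},  # 30-day auto-expiry
        },
    }

req = build_tracking_update("volume_grow", "2026-05-17", 50.0, "2026-05-17T18:21:39Z")
print(req["UpdateExpression"])
```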

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-dynamodb-guardrails-table.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-dynamodb-guardrails-table.png" alt="DynamoDB Guardrails Table" width="800" height="1272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BREAK_GLASS production considerations
&lt;/h3&gt;

&lt;p&gt;In production, BREAK_GLASS should be treated as a temporary elevated operational state — time-bound, audited, and restricted to a small operator group. The Phase 12 implementation emits SNS alerts and DynamoDB audit logs on every BREAK_GLASS invocation. Additional hardening options for enterprise deployments include IAM condition keys to restrict who can set the mode, automatic revert to ENFORCE after a configurable TTL, and integration with change management approval workflows.&lt;/p&gt;
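&lt;p&gt;The "automatic revert after a configurable TTL" idea can be sketched as a small guard: BREAK_GLASS carries an expiry timestamp, and the effective mode falls back to ENFORCE once it passes. This is a hypothetical hardening sketch, not part of the Phase 12 implementation:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Hypothetical auto-revert: BREAK_GLASS is only honored until its expiry.
def effective_mode(stored_mode, break_glass_expiry, now=None):
    now = now or datetime.now(timezone.utc)
    if stored_mode == "BREAK_GLASS" and now >= break_glass_expiry:
        return "ENFORCE"  # elevated state expired: revert automatically
    return stored_mode

expiry = datetime(2026, 5, 17, 12, 0, tzinfo=timezone.utc)
print(effective_mode("BREAK_GLASS", expiry, now=expiry + timedelta(hours=1)))
```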




&lt;h2&gt;
  
  
  2. Secrets Rotation — ONTAP fsxadmin Auto-Rotation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;ONTAP management credentials (fsxadmin) stored in Secrets Manager need periodic rotation. Manual rotation is error-prone and creates compliance gaps.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A VPC-deployed Lambda implements the standard 4-step Secrets Manager rotation protocol, directly calling the ONTAP REST API to change the password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant SM as Secrets Manager
    participant Lambda as Rotation Lambda (VPC)
    participant ONTAP as FSx ONTAP REST API

    SM-&amp;gt;&amp;gt;Lambda: Step 1: createSecret
    Lambda-&amp;gt;&amp;gt;SM: Generate new password, store as AWSPENDING

    SM-&amp;gt;&amp;gt;Lambda: Step 2: setSecret
    Lambda-&amp;gt;&amp;gt;ONTAP: PATCH /api/security/accounts/{owner_uuid}/{name} (new password)
    ONTAP--&amp;gt;&amp;gt;Lambda: 200 OK

    SM-&amp;gt;&amp;gt;Lambda: Step 3: testSecret
    Lambda-&amp;gt;&amp;gt;ONTAP: GET /api/cluster (using new password)
    ONTAP--&amp;gt;&amp;gt;Lambda: 200 OK (cluster UUID returned)

    SM-&amp;gt;&amp;gt;Lambda: Step 4: finishSecret
    Lambda-&amp;gt;&amp;gt;SM: Promote AWSPENDING → AWSCURRENT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key design decisions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VPC deployment&lt;/strong&gt;: Lambda must be in the same VPC as the ONTAP management LIF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;90-day interval&lt;/strong&gt;: Configurable via CloudFormation parameter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Step 3 (&lt;code&gt;testSecret&lt;/code&gt;) verifies the new password works by calling the ONTAP cluster API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback safety&lt;/strong&gt;: If &lt;code&gt;testSecret&lt;/code&gt; fails, the old password remains as AWSCURRENT&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bugs discovered during live testing
&lt;/h3&gt;

&lt;p&gt;Three bugs were found and fixed during the actual rotation execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWSPENDING empty check&lt;/strong&gt;: &lt;code&gt;createSecret&lt;/code&gt; must handle the case where &lt;code&gt;get_secret_value(VersionStage='AWSPENDING')&lt;/code&gt; raises &lt;code&gt;ResourceNotFoundException&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;management_ip fallback&lt;/strong&gt;: The Lambda must support both &lt;code&gt;management_ip&lt;/code&gt; (new) and &lt;code&gt;ontap_mgmt_ip&lt;/code&gt; (legacy) keys in the secret JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster UUID validation&lt;/strong&gt;: &lt;code&gt;testSecret&lt;/code&gt; now validates the response contains a valid &lt;code&gt;uuid&lt;/code&gt; field, not just HTTP 200&lt;/li&gt;
&lt;/ol&gt;
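&lt;p&gt;Fixes 1 and 2 reduce to small defensive helpers. The sketch below mimics the boto3 &lt;code&gt;get_secret_value&lt;/code&gt; / &lt;code&gt;ResourceNotFoundException&lt;/code&gt; shapes without calling AWS; the secret-key names come from this post, the helpers themselves are illustrative:&lt;/p&gt;

```python
# Stand-in for botocore's ResourceNotFoundException (no AWS calls here).
class ResourceNotFoundException(Exception):
    pass

def pending_exists(get_secret_value, secret_id):
    """Fix 1: treat a missing AWSPENDING stage as 'no pending version yet'."""
    try:
        get_secret_value(SecretId=secret_id, VersionStage="AWSPENDING")
        return True
    except ResourceNotFoundException:
        return False

def management_ip(secret):
    """Fix 2: accept both the new key and the legacy key in the secret JSON."""
    ip = secret.get("management_ip") or secret.get("ontap_mgmt_ip")
    if ip is None:
        raise KeyError("secret has neither management_ip nor ontap_mgmt_ip")
    return ip

print(management_ip({"ontap_mgmt_ip": "198.51.100.10"}))  # legacy key still works
```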

&lt;h3&gt;
  
  
  Verification result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1 (createSecret): ✅ New password generated, stored as AWSPENDING
Step 2 (setSecret):    ✅ ONTAP password changed via REST API
Step 3 (testSecret):   ✅ New password validated (cluster UUID confirmed)
Step 4 (finishSecret): ✅ AWSPENDING promoted to AWSCURRENT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Operational note
&lt;/h3&gt;

&lt;p&gt;Rotating &lt;code&gt;fsxadmin&lt;/code&gt; affects every automation path that depends on the same credential. Production deployments should verify that all ONTAP REST clients read from Secrets Manager rather than caching passwords or storing out-of-band copies. Additionally, ONTAP management endpoints use self-signed TLS certificates by default — ensure the rotation Lambda's &lt;code&gt;urllib3&lt;/code&gt; or &lt;code&gt;requests&lt;/code&gt; configuration handles certificate verification appropriately (see &lt;code&gt;shared/ontap_client.py&lt;/code&gt; for the pattern used in this project).&lt;/p&gt;

&lt;p&gt;For production environments, consider using a dedicated ONTAP automation account with the minimum privileges required for FPolicy engine updates and health checks, rather than sharing &lt;code&gt;fsxadmin&lt;/code&gt; across all automation paths. This follows the principle of least privilege and limits the blast radius of credential compromise or rotation failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Synthetic Monitoring — CloudWatch Synthetics Canary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;The FPolicy pipeline depends on both S3 Access Point availability and ONTAP management API health. Passive monitoring (waiting for failures) is insufficient for production SLOs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A CloudWatch Synthetics Canary running every 5 minutes performs two health checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP Health Check&lt;/strong&gt;: REST API call to the management endpoint (VPC-internal)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Point Check&lt;/strong&gt;: ListObjectsV2 against the S3AP alias&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Critical finding: network-origin and endpoint configuration matter
&lt;/h3&gt;

&lt;p&gt;During deployment, the VPC-internal Canary could reach the ONTAP management API but timed out when calling the S3 Access Point alias.&lt;/p&gt;

&lt;p&gt;This should not be generalized as "VPC clients cannot access FSx ONTAP S3 Access Points." AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; support for both Internet-origin and VPC-origin access points. For VPC-origin access points, requests must arrive through a VPC endpoint (Gateway or Interface) in the bound VPC. For Internet-origin access points, requests must have a network path to the S3 service endpoint.&lt;/p&gt;

&lt;p&gt;In this Phase 12 environment (Internet-origin S3 AP), the operational fix was to split monitoring into two paths:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Observed requirement in this environment&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP REST API&lt;/td&gt;
&lt;td&gt;VPC-internal access to management LIF&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3AP health check&lt;/td&gt;
&lt;td&gt;Requires a network path consistent with the S3AP NetworkOrigin and endpoint policy&lt;/td&gt;
&lt;td&gt;⚠️ Timed out from the initial VPC Canary configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: the two resulting monitoring paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONTAP health: VPC-internal Canary (confirmed working, 88ms response)&lt;/li&gt;
&lt;li&gt;S3AP health: VPC-external Lambda or correctly routed S3AP client path (Phase 13 work)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is documented as a critical constraint in &lt;code&gt;docs/guides/s3ap-fsxn-specification.md&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canary runtime version lesson
&lt;/h3&gt;

&lt;p&gt;The template initially specified &lt;code&gt;syn-python-selenium-3.0&lt;/code&gt;, which was deprecated on 2026-02-03, and was updated to &lt;code&gt;syn-python-selenium-11.0&lt;/code&gt;. CloudWatch Synthetics runtimes are deprecated frequently — parameterize the runtime version or keep defaults current.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS builder lesson: VPC placement is a design choice
&lt;/h3&gt;

&lt;p&gt;A key takeaway from this Phase 12 discovery: placing a Lambda or Canary inside a VPC is not automatically "more secure" or "more correct." It changes the network path. When a Lambda function is &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc-internet.html" rel="noopener noreferrer"&gt;connected to a VPC&lt;/a&gt;, it loses default internet access — outbound traffic must route through a NAT Gateway or VPC endpoint. For each dependency, decide whether the function needs VPC-private access (e.g., ONTAP management LIF), internet-routed service access (e.g., Internet-origin S3AP), or a split-path design combining both.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ztvwmwi58ki19r00lr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9ztvwmwi58ki19r00lr.png" alt="Synthetics Canary" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Capacity Forecasting — Linear Regression with stdlib Only
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Reactive capacity alerts (disk full) cause outages. Proactive forecasting enables planned expansion before exhaustion.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A Lambda function running on a daily EventBridge schedule:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetches 30 days of FSx &lt;code&gt;StorageUsed&lt;/code&gt; metrics from CloudWatch&lt;/li&gt;
&lt;li&gt;Performs linear regression using only Python's &lt;code&gt;math&lt;/code&gt; module (zero external dependencies)&lt;/li&gt;
&lt;li&gt;Publishes &lt;code&gt;DaysUntilFull&lt;/code&gt; as a CloudWatch custom metric&lt;/li&gt;
&lt;li&gt;Sends SNS alert when forecast drops below threshold (default: 30 days)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Linear regression implementation (stdlib only)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;linear_regression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Least-squares linear regression using only math module.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Need at least 2 data points for regression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
        &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

    &lt;span class="n"&gt;denominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;denominator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;1e-10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_xy&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;
    &lt;span class="n"&gt;intercept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum_x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intercept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Edge cases handled
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;DaysUntilFull&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 2 data points&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;Insufficient data, no prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;slope ≤ 0 (shrinking/flat)&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;Never fills up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already over capacity&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Immediate alert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very low usage (0.03%)&lt;/td&gt;
&lt;td&gt;169,374&lt;/td&gt;
&lt;td&gt;Normal — far future prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
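
&lt;p&gt;The table above reduces to a single mapping from forecast inputs to the metric. A sketch — the function name and signature are illustrative, not the Lambda's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def days_until_full(growth_gb_per_day: float, current_gb: float,
                    capacity_gb: float, n_points: int) -&gt; int:
    """Map forecast inputs to the DaysUntilFull custom metric."""
    if n_points &lt; 2:
        return -1                   # insufficient data, no prediction
    if growth_gb_per_day &lt;= 0:
        return -1                   # flat or shrinking usage never fills up
    if current_gb &gt;= capacity_gb:
        return 0                    # already over capacity, alert immediately
    return int((capacity_gb - current_gb) / growth_gb_per_day)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sentinel value -1 (rather than a huge number) keeps the "never fills" cases visually distinct on the dashboard.&lt;/p&gt;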

&lt;h3&gt;
  
  
  Live verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"days_until_full"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;169374&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_usage_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_capacity_gb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1024.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"growth_rate_gb_per_day"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.006&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"forecast_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2490-02-06T06:26:42Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test environment has 0.03% usage — the prediction of 169,374 days is correct behavior. The alert threshold (30 days) ensures notifications only fire when action is genuinely needed.&lt;/p&gt;

&lt;p&gt;This is intentionally a lightweight linear forecast, not a full capacity planning model. It does not account for seasonality, workload bursts, or one-time cleanup events; operators should treat &lt;code&gt;DaysUntilFull&lt;/code&gt; as an early-warning signal, not an exact prediction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-lambda-capacity-forecast.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FYoshiki0705%2FFSx-for-ONTAP-S3AccessPoints-Serverless-Patterns%2Fmain%2Fdocs%2Fscreenshots%2Fmasked%2Fphase12-lambda-capacity-forecast.png" alt="Capacity Forecast Lambda" width="800" height="1105"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Data Lineage Tracking — DynamoDB with GSI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;When a file is processed through the pipeline, operators need to trace: which UC processed it, when, what outputs were generated, and whether it succeeded or failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;A DynamoDB table with a Global Secondary Index (GSI) provides three query patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    subgraph "DynamoDB: fsxn-s3ap-data-lineage"
        PK[PK: source_file_key&amp;lt;br/&amp;gt;SK: processing_timestamp]
        GSI[GSI: uc_id-timestamp-index&amp;lt;br/&amp;gt;PK: uc_id, SK: processing_timestamp]
    end

    Q1[Query by file] --&amp;gt;|PK lookup| PK
    Q2[Query by UC + time range] --&amp;gt;|GSI query| GSI
    Q3[Query by execution ARN] --&amp;gt;|Scan + filter| PK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For high-volume environments, consider adding a dedicated GSI on &lt;code&gt;step_functions_execution_arn&lt;/code&gt;. Phase 12 keeps execution-ARN lookup as scan+filter to avoid adding another index by default.&lt;/p&gt;
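
&lt;p&gt;The three patterns map to low-level DynamoDB client calls. A sketch of the request kwargs (pass to &lt;code&gt;client.query&lt;/code&gt; / &lt;code&gt;client.scan&lt;/code&gt;); table and index names come from the diagram above, while the helper names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;TABLE = "fsxn-s3ap-data-lineage"


def query_by_file(source_file_key: str) -&gt; dict:
    """Pattern 1 — PK lookup: all processing runs for one file."""
    return {
        "TableName": TABLE,
        "KeyConditionExpression": "source_file_key = :k",
        "ExpressionAttributeValues": {":k": {"S": source_file_key}},
    }


def query_by_uc(uc_id: str, start_ts: str, end_ts: str) -&gt; dict:
    """Pattern 2 — GSI query: one UC's activity within a time range."""
    return {
        "TableName": TABLE,
        "IndexName": "uc_id-timestamp-index",
        "KeyConditionExpression": "uc_id = :u AND processing_timestamp BETWEEN :a AND :b",
        "ExpressionAttributeValues": {
            ":u": {"S": uc_id}, ":a": {"S": start_ts}, ":b": {"S": end_ts},
        },
    }


def scan_by_execution_arn(arn: str) -&gt; dict:
    """Pattern 3 — scan + filter: fine at low volume, GSI candidate at scale."""
    return {
        "TableName": TABLE,
        "FilterExpression": "step_functions_execution_arn = :a",
        "ExpressionAttributeValues": {":a": {"S": arn}},
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Usage: &lt;code&gt;boto3.client("dynamodb").query(**query_by_file("/vol1/legal/contracts/deal-001.pdf"))&lt;/code&gt;.&lt;/p&gt;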

&lt;h3&gt;
  
  
  Integration helper (opt-in)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.lineage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LineageTracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LineageRecord&lt;/span&gt;

&lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LineageTracker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LineageRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source_file_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/vol1/legal/contracts/deal-001.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;processing_timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-16T14:30:45.123Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_functions_execution_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:states:...:execution:...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;uc_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;legal-compliance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://output-bucket/legal/reports/deal-001-analysis.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4523&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lineage_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Design principles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-blocking&lt;/strong&gt;: Write failures emit a warning log but never interrupt the main processing pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL&lt;/strong&gt;: 365-day auto-expiry via DynamoDB TTL (configurable via &lt;code&gt;LINEAGE_TTL_DAYS&lt;/code&gt; environment variable; regulated environments may require 7+ years — disable TTL and use S3 export for long-term retention)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opt-in&lt;/strong&gt;: UCs integrate by importing the helper — no mandatory coupling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PAY_PER_REQUEST&lt;/strong&gt;: No capacity planning needed for variable workloads&lt;/li&gt;
&lt;/ul&gt;
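
&lt;p&gt;The non-blocking and TTL principles fit in a few lines. A sketch under the assumption that &lt;code&gt;table&lt;/code&gt; is a boto3 DynamoDB Table resource; the helper names are illustrative, not the actual &lt;code&gt;shared.lineage&lt;/code&gt; internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging
import time

logger = logging.getLogger(__name__)


def ttl_epoch(days: int = 365) -&gt; int:
    """DynamoDB TTL attribute: expiry as epoch seconds (LINEAGE_TTL_DAYS default)."""
    return int(time.time()) + days * 86400


def record_lineage(table, item: dict) -&gt; bool:
    """Best-effort write: failures are logged, never raised into the pipeline."""
    try:
        table.put_item(Item={**item, "ttl": ttl_epoch()})
        return True
    except Exception:
        logger.warning(
            "lineage write failed for %s", item.get("source_file_key"), exc_info=True
        )
        return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;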

&lt;h3&gt;
  
  
  Future: compliance-grade lineage (v2)
&lt;/h3&gt;

&lt;p&gt;For regulated environments requiring tamper-evident audit trails, the following fields are candidates for a future &lt;code&gt;LineageRecord&lt;/code&gt; v2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;input_checksum&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 of source file for integrity verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;output_checksum&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 of generated output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fpolicy_sequence_number&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ONTAP-assigned sequence for ordering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;policy_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;FPolicy policy configuration version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;uc_template_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UC CloudFormation template version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;guardrail_mode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Active guardrail mode at processing time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retention_profile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retention class for compliance tiering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For long-term retention beyond DynamoDB TTL, consider S3 export with Object Lock (WORM) for immutable audit storage.&lt;/p&gt;
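
&lt;p&gt;A hedged sketch of what that export could look like — kwargs for &lt;code&gt;s3.put_object&lt;/code&gt; against a bucket created with Object Lock enabled. The key scheme and retention period are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import datetime
import json


def export_record_kwargs(bucket: str, record: dict, retain_years: int = 7) -&gt; dict:
    """Kwargs for s3.put_object(**...) into an Object Lock (WORM) bucket.

    COMPLIANCE mode means no principal, including root, can shorten retention.
    """
    until = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(
        days=365 * retain_years
    )
    key = f"lineage/{record['uc_id']}/{record['processing_timestamp']}.json"
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": json.dumps(record).encode(),
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": until,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;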




&lt;h2&gt;
  
  
  6. Protobuf TCP Framing — Adaptive Reader
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Phase 11 discovered that ONTAP's protobuf mode uses different TCP framing than XML mode. The existing &lt;code&gt;read_fpolicy_message()&lt;/code&gt; assumes a 4-byte big-endian length prefix wrapped in quote delimiters — which doesn't work for protobuf.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;An adaptive &lt;code&gt;ProtobufFrameReader&lt;/code&gt; that supports three framing modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    A[Incoming TCP Stream] --&amp;gt; B{FramingMode}
    B --&amp;gt;|AUTO_DETECT| C[Probe first 4 bytes]
    C --&amp;gt;|Valid uint32 length| D[LENGTH_PREFIXED]
    C --&amp;gt;|Otherwise| E[FRAMELESS]
    B --&amp;gt;|LENGTH_PREFIXED| D
    B --&amp;gt;|FRAMELESS| E
    D --&amp;gt; F[4-byte big-endian header → payload]
    E --&amp;gt; G[varint-delimited → payload]
    F --&amp;gt; H[Decoded Message]
    G --&amp;gt; H
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Three modes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Wire Format&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LENGTH_PREFIXED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4-byte big-endian length + payload&lt;/td&gt;
&lt;td&gt;XML mode (legacy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FRAMELESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;varint-delimited protobuf&lt;/td&gt;
&lt;td&gt;Protobuf mode (ONTAP 9.15.1+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AUTO_DETECT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Probe first bytes, then lock mode&lt;/td&gt;
&lt;td&gt;Unknown/mixed environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Auto-detection heuristic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_auto_detect_and_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Probe first 4 bytes to determine framing mode.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;peek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readexactly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;candidate_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!I&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peek&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;candidate_length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_max_message_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Valid length header → LENGTH_PREFIXED
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_detected_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LENGTH_PREFIXED&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readexactly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Not a valid length → FRAMELESS (varint-delimited)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_detected_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FRAMELESS&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;peek&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_read_varint_delimited&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
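
&lt;p&gt;The &lt;code&gt;FRAMELESS&lt;/code&gt; path relies on &lt;code&gt;_read_varint_delimited()&lt;/code&gt;, which is not shown above. Here is a buffer-based sketch of the protobuf base-128 varint decode it performs (the real reader consumes an asyncio stream rather than a &lt;code&gt;bytes&lt;/code&gt; buffer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def read_varint(buf: bytes, pos: int = 0) -&gt; tuple[int, int]:
    """Decode one protobuf base-128 varint; returns (value, next position).

    Each byte contributes 7 payload bits; the high bit flags continuation.
    """
    shift = result = 0
    while True:
        if pos &gt;= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte &amp; 0x7F) &lt;&lt; shift
        if not byte &amp; 0x80:         # continuation bit clear: done
            return result, pos
        shift += 7
        if shift &gt; 63:
            raise ValueError("varint too long")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A frameless reader calls this to obtain the payload length, then reads exactly that many bytes for the message body.&lt;/p&gt;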



&lt;h3&gt;
  
  
  Safety features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max message size enforcement&lt;/strong&gt; (default 1 MB): Prevents DoS via malformed length headers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FramingError exception&lt;/strong&gt;: Structured error with offset and raw data for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful EOF handling&lt;/strong&gt;: Returns &lt;code&gt;None&lt;/code&gt; on connection close without raising&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integration with existing FPolicy server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.integrations.protobuf_integration&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_fpolicy_reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_fpolicy_message_v2&lt;/span&gt;

&lt;span class="c1"&gt;# Environment variable PROTOBUF_FRAMING_MODE controls behavior:
# - Not set: legacy read_fpolicy_message() (backward compatible)
# - AUTO_DETECT / LENGTH_PREFIXED / FRAMELESS: use ProtobufFrameReader
&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_fpolicy_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;read_fpolicy_message_v2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 12 validates the adaptive reader with property-based tests and integration tests. Live ONTAP protobuf wire validation remains Phase 13 work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 13 protobuf validation scope
&lt;/h3&gt;

&lt;p&gt;The following questions will be confirmed with NetApp support during live wire validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact ONTAP protobuf framing format (length-prefixed vs varint-delimited)&lt;/li&gt;
&lt;li&gt;Message boundary behavior under high throughput&lt;/li&gt;
&lt;li&gt;Keep-alive behavior in protobuf mode vs XML mode&lt;/li&gt;
&lt;li&gt;Backward compatibility: can a single FPolicy server handle both XML and protobuf connections?&lt;/li&gt;
&lt;li&gt;Mixed-mode migration path (XML → protobuf transition without event loss)&lt;/li&gt;
&lt;li&gt;Maximum message size guidance from ONTAP side&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. SLO Definition — 4 Targets with CloudWatch Dashboard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Without defined SLOs, there's no objective measure of pipeline health. "It seems to be working" is not an operational posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;Four SLO targets covering the critical path of the event-driven pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SLO&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;SLO met when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Event Ingestion Latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;EventIngestionLatency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;P99 &amp;lt; 5,000 ms&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing Success Rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ProcessingSuccessRate_pct&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;gt; 99.5%&lt;/td&gt;
&lt;td&gt;GreaterThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reconnect Time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FPolicyReconnectTime_sec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 sec&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay Completion Time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReplayCompletionTime_sec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 300 sec (5 min)&lt;/td&gt;
&lt;td&gt;LessThanThreshold&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For success rate, the CloudWatch Alarm fires when the metric drops &lt;em&gt;below&lt;/em&gt; 99.5% (ComparisonOperator: &lt;code&gt;LessThanThreshold&lt;/code&gt;), even though the SLO target is expressed as "&amp;gt; 99.5%".&lt;/p&gt;
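
&lt;p&gt;As a sketch, that inversion looks like this in alarm configuration — a kwargs builder for boto3's &lt;code&gt;put_metric_alarm&lt;/code&gt;. The namespace, statistic, and period here are assumptions, not the project's actual values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def success_rate_alarm(topic_arn: str, namespace: str = "FSxN/S3AP") -&gt; dict:
    """Kwargs for cloudwatch.put_metric_alarm(**...).

    The SLO reads "&gt; 99.5%", so the alarm inverts it: fire when the metric
    drops below the threshold for 3 consecutive evaluation periods.
    """
    return {
        "AlarmName": "fsxn-s3ap-slo-success-rate",
        "Namespace": namespace,
        "MetricName": "ProcessingSuccessRate_pct",
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": 99.5,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;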

&lt;h3&gt;
  
  
  CloudWatch Dashboard
&lt;/h3&gt;

&lt;p&gt;The SLO dashboard combines all four metrics with threshold annotations, plus Synthetic Monitoring metrics (S3AP latency, ONTAP health):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.slo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SLO_TARGETS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate_slos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_dashboard_widgets&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate all SLOs programmatically
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_slos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cloudwatch_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;met&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VIOLATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slo_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (value=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, threshold=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate dashboard widget JSON for CloudFormation
&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_dashboard_widgets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Alarm-based violation detection
&lt;/h3&gt;

&lt;p&gt;Each SLO has a corresponding CloudWatch Alarm:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alarm Name&lt;/th&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Evaluation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-ingestion-latency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-success-rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-reconnect-time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-s3ap-slo-replay-completion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;3 consecutive periods&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All alarms route to the aggregated SNS topic for unified alerting. SLO violation runbooks (e.g., ingestion latency triage, replay slowness diagnosis, reconnect timeout response) are Phase 13 deliverables — defining SLOs without corresponding runbooks is only half the operational story.&lt;/p&gt;
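
&lt;p&gt;For a programmatic spot check of these alarm states, the response of a &lt;code&gt;describe_alarms&lt;/code&gt;-style call can be summarized into "which SLO alarms are not OK". A minimal sketch; &lt;code&gt;summarize_alarm_states&lt;/code&gt; is a hypothetical helper, not part of the project, and the response shape mirrors boto3's CloudWatch &lt;code&gt;describe_alarms&lt;/code&gt; output:&lt;/p&gt;

```python
# Sketch: summarize SLO alarm states from a CloudWatch describe_alarms-style
# response dict (shape mirrors boto3's cloudwatch.describe_alarms output).
# summarize_alarm_states is a hypothetical helper, not part of the project.

def summarize_alarm_states(response: dict) -> dict:
    """Map alarm name -> state and list any alarms not in OK."""
    states = {
        alarm["AlarmName"]: alarm["StateValue"]
        for alarm in response.get("MetricAlarms", [])
    }
    violated = sorted(name for name, state in states.items() if state != "OK")
    return {"states": states, "violated": violated}
```

&lt;p&gt;In practice this could be fed from &lt;code&gt;cloudwatch.describe_alarms(AlarmNamePrefix="fsxn-s3ap-slo-")&lt;/code&gt;; an empty &lt;code&gt;violated&lt;/code&gt; list means all four SLO alarms are in OK.&lt;/p&gt;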

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw6nv9al1lm7hzzq8exu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw6nv9al1lm7hzzq8exu.png" alt="SLO Dashboard" width="800" height="710"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  8. FPolicy Pipeline E2E Verification
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Unit tests validate individual components, but the full pipeline — NFS file creation → ONTAP FPolicy detection → TCP notification → FPolicy server → SQS delivery — must be verified end-to-end in a real environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  The verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant NFS as NFS Client (Bastion)
    participant ONTAP as FSx for ONTAP
    participant FP as FPolicy Server (Fargate)
    participant SQS as SQS Queue

    NFS-&amp;gt;&amp;gt;ONTAP: echo "test" &amp;gt; /mnt/fpolicy_vol/test.txt
    ONTAP-&amp;gt;&amp;gt;FP: NOTI_REQ (FILE_CREATE event)
    FP-&amp;gt;&amp;gt;FP: Parse event, extract metadata
    FP-&amp;gt;&amp;gt;SQS: SendMessage (JSON payload)
    SQS--&amp;gt;&amp;gt;SQS: Message available for consumers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Timeline (actual observed)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T+0s&lt;/td&gt;
&lt;td&gt;TCP connection test&lt;/td&gt;
&lt;td&gt;ONTAP → Fargate IP (10.0.128.98:9898)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+10s&lt;/td&gt;
&lt;td&gt;Session established&lt;/td&gt;
&lt;td&gt;NEGO_REQ → NEGO_RESP handshake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+12s&lt;/td&gt;
&lt;td&gt;KEEP_ALIVE starts&lt;/td&gt;
&lt;td&gt;2-minute interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+30s&lt;/td&gt;
&lt;td&gt;NFS file created&lt;/td&gt;
&lt;td&gt;&lt;code&gt;echo "test" &amp;gt; /mnt/fpolicy_vol/test_fpolicy_event.txt&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+31s&lt;/td&gt;
&lt;td&gt;NOTI_REQ received&lt;/td&gt;
&lt;td&gt;FPolicy server receives file creation event&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T+32s&lt;/td&gt;
&lt;td&gt;SQS delivery&lt;/td&gt;
&lt;td&gt;Event sent to SQS queue (FPolicy_Q)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  SQS message format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FILE_CREATE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"svm_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FSxN_OnPre"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volume_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vol1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/vol1/test_fpolicy_event.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10.0.128.98"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T08:45:32Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sequence_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IAM issue discovered and fixed
&lt;/h3&gt;

&lt;p&gt;The ECS task role's SQS policy used a Resource ARN pattern &lt;code&gt;arn:aws:sqs:...:fsxn-fpolicy-*&lt;/code&gt; that didn't match the actual queue name &lt;code&gt;FPolicy_Q&lt;/code&gt;. Fix: use explicit ARN or &lt;code&gt;*&lt;/code&gt; wildcard in the template.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: an IAM resource pattern that doesn't match the actual queue name fails silently; the template deploys cleanly, and the failure only surfaces as denied SQS calls at runtime. Either parameterize the queue ARN or use a broader resource pattern.&lt;/p&gt;
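
&lt;p&gt;This class of mismatch can be caught before deployment: IAM's &lt;code&gt;*&lt;/code&gt; wildcard behaves like shell globbing for this case, so a pre-deploy sanity check with Python's &lt;code&gt;fnmatch&lt;/code&gt; is a reasonable stand-in (the ARN values below are illustrative):&lt;/p&gt;

```python
# Sketch: pre-deploy check that the SQS queue ARN is actually covered by the
# IAM policy's Resource pattern. IAM's "*" wildcard behaves like shell
# globbing here, so fnmatchcase is a close-enough stand-in.
from fnmatch import fnmatchcase

policy_resource = "arn:aws:sqs:ap-northeast-1:123456789012:fsxn-fpolicy-*"
queue_arn = "arn:aws:sqs:ap-northeast-1:123456789012:FPolicy_Q"

if not fnmatchcase(queue_arn, policy_resource):
    print(f"WARNING: {queue_arn} is not covered by {policy_resource}")
```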

&lt;h3&gt;
  
  
  Event contract assumptions
&lt;/h3&gt;

&lt;p&gt;The FPolicy event pipeline should be treated as an at-least-once, out-of-order event stream. Consumers must assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate events can occur (especially during Persistent Store replay)&lt;/li&gt;
&lt;li&gt;Delivery order is not guaranteed (confirmed in Section 9)&lt;/li&gt;
&lt;li&gt;Consumers must be idempotent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;file_path + timestamp + sequence_number&lt;/code&gt; serves as an idempotency key candidate&lt;/li&gt;
&lt;li&gt;Replay events may arrive after newer events&lt;/li&gt;
&lt;li&gt;Schema versioning should be introduced before multi-UC production rollout&lt;/li&gt;
&lt;/ul&gt;
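
&lt;p&gt;A minimal consumer-side sketch of the dedup rule, using the idempotency key candidate above (in-memory set for illustration; a production consumer would back this with DynamoDB conditional writes or similar):&lt;/p&gt;

```python
# Sketch: consumer-side deduplication keyed on the idempotency key candidate
# (file_path + timestamp + sequence_number). The in-memory set is for
# illustration only; production would use a durable store.
def idempotency_key(event: dict) -> str:
    return f'{event["file_path"]}|{event["timestamp"]}|{event["sequence_number"]}'

def process_once(event: dict, seen: set) -> bool:
    """Return True if the event was processed, False if it was a duplicate."""
    key = idempotency_key(event)
    if key in seen:
        return False
    seen.add(key)
    # ... downstream processing goes here ...
    return True
```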




&lt;h2&gt;
  
  
  9. Persistent Store Replay Validation — Zero Event Loss in Tested Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Phase 11 configured Persistent Store on ONTAP but didn't validate replay completeness with real file operations during server downtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important prerequisite&lt;/strong&gt;: FPolicy Persistent Store is available for &lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;asynchronous non-mandatory policies&lt;/a&gt; only (ONTAP 9.14.1+). Synchronous and asynchronous mandatory configurations are not supported. Each SVM can have &lt;a href="https://docs.netapp.com/us-en/ontap-restapi/protocols_fpolicy_svm.uuid_persistent-stores_endpoint_overview.html" rel="noopener noreferrer"&gt;only one Persistent Store&lt;/a&gt;, and the same store can be used by multiple policies within that SVM.&lt;/p&gt;

&lt;h3&gt;
  
  
  The test procedure
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Stop Fargate task (ECS &lt;code&gt;stop-task&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Create 5 files via NFS during downtime (&lt;code&gt;replay-test-1.txt&lt;/code&gt; through &lt;code&gt;replay-test-5.txt&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Wait for ECS service auto-recovery (new task launch)&lt;/li&gt;
&lt;li&gt;Update ONTAP FPolicy engine IP to new task IP (disable → update → re-enable)&lt;/li&gt;
&lt;li&gt;Verify all 5 events arrive in SQS&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Events generated during downtime&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events replayed to SQS&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lost events&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay delivery order&lt;/td&gt;
&lt;td&gt;3, 1, 2, 5, 4 (non-sequential)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay completion time&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key observation: Out-of-order replay
&lt;/h3&gt;

&lt;p&gt;Persistent Store replays events in a &lt;strong&gt;non-sequential order&lt;/strong&gt; — not in the order they were created. This is expected behavior for asynchronous FPolicy. Downstream consumers must handle out-of-order delivery using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency&lt;/strong&gt;: Deduplicate by file path + timestamp + sequence number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp-based ordering&lt;/strong&gt;: Sort by event timestamp, not arrival order&lt;/li&gt;
&lt;/ul&gt;
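
&lt;p&gt;The timestamp-based ordering rule can be sketched as a sort keyed on event timestamp with sequence number as tiebreaker; field names follow the SQS message format shown earlier, and the ISO 8601 UTC timestamps compare correctly as plain strings:&lt;/p&gt;

```python
# Sketch: restore logical order for a batch of replayed events by sorting on
# (timestamp, sequence_number) rather than trusting arrival order.
# ISO 8601 UTC timestamps sort correctly as plain strings.
def reorder(events: list[dict]) -> list[dict]:
    return sorted(events, key=lambda e: (e["timestamp"], e["sequence_number"]))
```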

&lt;h3&gt;
  
  
  20-file burst validation
&lt;/h3&gt;

&lt;p&gt;Additionally, a 20-file burst test confirmed zero event loss under higher load:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Files Created&lt;/th&gt;
&lt;th&gt;Events Delivered&lt;/th&gt;
&lt;th&gt;Loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Replay (5 files)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Burst (20 files)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 replay storm metrics
&lt;/h3&gt;

&lt;p&gt;The 5-event and 20-event tests confirm basic replay correctness. Phase 13 will validate at scale (1000+ events) and measure ONTAP-side behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store volume usage before/after replay&lt;/td&gt;
&lt;td&gt;Capacity planning for the store volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events queued vs events replayed&lt;/td&gt;
&lt;td&gt;Completeness verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay throughput (events/sec)&lt;/td&gt;
&lt;td&gt;Performance baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay duration&lt;/td&gt;
&lt;td&gt;SLO calibration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-order distance&lt;/td&gt;
&lt;td&gt;Downstream buffer sizing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate events&lt;/td&gt;
&lt;td&gt;Idempotency requirement validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP EMS logs around disconnect/reconnect&lt;/td&gt;
&lt;td&gt;Root cause correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Phase 13 replay storm testing should vary not only event count, but also protocol (NFSv3/NFSv4.1/SMB), operation type (create/modify/delete/rename), downtime duration (5 min / 30 min / 2 hours), and file size distribution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational framing: event durability as RPO/RTO
&lt;/h3&gt;

&lt;p&gt;Operationally, Persistent Store replay behaves like an event-durability layer: the tested scenarios achieved zero event loss (event RPO = 0), while &lt;code&gt;ReplayCompletionTime_sec&lt;/code&gt; provides an RTO-like operational metric for how quickly queued events are delivered after FPolicy server reconnection.&lt;/p&gt;
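
&lt;p&gt;Both numbers fall out of the replay test data directly. A sketch under the assumption that each downtime event can be matched to its SQS delivery time (function and field names are illustrative, timestamps hypothetical):&lt;/p&gt;

```python
# Sketch: derive event RPO (lost events) and an RTO-like replay completion
# time from a replay test. created = ids of events written during downtime;
# delivered = event id -> SQS delivery timestamp observed after reconnect.
from datetime import datetime

ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"

def replay_metrics(created: set, delivered: dict, reconnect_at: str) -> dict:
    lost = created - set(delivered)
    last = max(datetime.strptime(t, ISO_FMT) for t in delivered.values())
    start = datetime.strptime(reconnect_at, ISO_FMT)
    return {
        "lost_events": len(lost),  # event RPO: 0 in the tested scenarios
        "replay_completion_sec": (last - start).total_seconds(),  # RTO-like
    }
```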

&lt;h3&gt;
  
  
  Phase 12 validation scope
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Phase 12 Assumption&lt;/th&gt;
&lt;th&gt;Production Consideration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SVM&lt;/td&gt;
&lt;td&gt;Single SVM validation&lt;/td&gt;
&lt;td&gt;Multi-SVM needs per-SVM policy and Persistent Store planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Volume&lt;/td&gt;
&lt;td&gt;Test volume&lt;/td&gt;
&lt;td&gt;Production volumes should be grouped by UC/event profile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;NFS-based E2E test&lt;/td&gt;
&lt;td&gt;NFSv3/NFSv4.1/SMB replay validation remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event types&lt;/td&gt;
&lt;td&gt;File create&lt;/td&gt;
&lt;td&gt;Modify/delete/rename validation remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy mode&lt;/td&gt;
&lt;td&gt;Async non-mandatory&lt;/td&gt;
&lt;td&gt;Required for Persistent Store (&lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;NetApp docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  10. Property-Based Testing — 16 Hypothesis Properties, 53 Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Example-based tests verify known scenarios but miss edge cases. For protocol parsers, guardrail logic, and data structures, we need generative exploration of the input space that goes well beyond hand-picked examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  The approach
&lt;/h3&gt;

&lt;p&gt;Using Python's &lt;a href="https://hypothesis.readthedocs.io/" rel="noopener noreferrer"&gt;Hypothesis&lt;/a&gt; library, we defined 16 properties across the Phase 12 modules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property Group&lt;/th&gt;
&lt;th&gt;Properties&lt;/th&gt;
&lt;th&gt;Tests&lt;/th&gt;
&lt;th&gt;Bugs Found&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Protobuf Frame Reader&lt;/td&gt;
&lt;td&gt;5 (round-trip, max size, EOF, multi-message, auto-detect)&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Guardrails&lt;/td&gt;
&lt;td&gt;4 (mode behavior, rate limit, daily cap, cooldown)&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Lineage&lt;/td&gt;
&lt;td&gt;3 (record/query round-trip, GSI consistency, TTL)&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLO Evaluation&lt;/td&gt;
&lt;td&gt;2 (threshold comparison, no-data handling)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Forecast&lt;/td&gt;
&lt;td&gt;2 (regression accuracy, edge cases)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Bugs discovered
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Protobuf reader&lt;/strong&gt;: &lt;code&gt;AUTO_DETECT&lt;/code&gt; mode failed when the first 4 bytes happened to form a valid-looking length that exceeded &lt;code&gt;max_message_size&lt;/code&gt;. Fix: treat oversized candidate lengths as a &lt;code&gt;FRAMELESS&lt;/code&gt; indicator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Guardrails&lt;/strong&gt;: &lt;code&gt;BREAK_GLASS&lt;/code&gt; mode didn't emit the &lt;code&gt;GuardrailBypass&lt;/code&gt; metric when DynamoDB tracking update failed. Fix: move metric emission before the tracking update call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLO evaluation&lt;/strong&gt;: When CloudWatch returned datapoints with identical timestamps (possible during metric aggregation), &lt;code&gt;max(datapoints, key=lambda dp: dp["Timestamp"])&lt;/code&gt; was non-deterministic. Fix: add secondary sort by value.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
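
&lt;p&gt;The third fix is a one-line change to the selection key. A sketch of the before/after behavior (the datapoint shape mirrors CloudWatch statistics output):&lt;/p&gt;

```python
# Sketch: deterministic "latest datapoint" selection when timestamps tie.
# Before: max(datapoints, key=lambda dp: dp["Timestamp"]) resolves ties by
# list position, which varies across API responses.
# After: a secondary key on the value makes the result stable.
def latest_datapoint(datapoints: list[dict]) -> dict:
    return max(datapoints, key=lambda dp: (dp["Timestamp"], dp["Value"]))
```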

&lt;h3&gt;
  
  
  Example property test
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;min_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nd"&gt;@settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_length_prefixed_round_trip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Property: LENGTH_PREFIXED encode → decode preserves all messages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;stream_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_make_length_prefixed_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_make_stream_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;frame_reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtobufFrameReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FramingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LENGTH_PREFIXED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_message_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame_reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_message&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;  &lt;span class="c1"&gt;# Round-trip property
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  11. S3 Access Point Deep Dive — Multi-Layer Auth and VPC Constraints
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The critical finding
&lt;/h3&gt;

&lt;p&gt;FSx for ONTAP S3 Access Points are &lt;strong&gt;not standard S3 endpoints&lt;/strong&gt;. They use the FSx data plane, which has different network routing characteristics than standard S3.&lt;/p&gt;

&lt;p&gt;In this pattern library, FSx for ONTAP S3 Access Points serve as an &lt;strong&gt;AWS service integration boundary&lt;/strong&gt;: they let serverless and analytics services (Lambda, Step Functions, Bedrock, Transfer Family) interact with ONTAP-resident file data through S3-compatible APIs — without requiring ONTAP to become a generic S3 bucket or moving data out of the file system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-layer authorization model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    Client[S3 API Client] --&amp;gt; IAM{Layer 1: IAM Policy}
    IAM --&amp;gt;|identity-based policy| AP{Layer 2: AP Resource Policy}
    AP --&amp;gt;|resource policy| FS{Layer 3: File System Identity}
    FS --&amp;gt;|UNIX UID or AD user| Volume[ONTAP Volume]

    IAM -.-&amp;gt;|❌ Denied| Block1[Access Denied]
    AP -.-&amp;gt;|❌ Denied| Block2[Access Denied]
    FS -.-&amp;gt;|❌ No permission| Block3[Access Denied]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-ap-manage-access-fsxn.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; this as a "dual-layer authorization model" combining IAM permissions with file system-level permissions. In practice, the request must pass through all applicable authorization layers — network origin check, VPC endpoint policy, access point resource policy, IAM identity policy, SCPs, and file system identity. An explicit Deny in any layer blocks access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correct IAM ARN format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:&amp;lt;ACCOUNT_ID&amp;gt;:accesspoint/fsxn-eda-s3ap"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:ap-northeast-1:&amp;lt;ACCOUNT_ID&amp;gt;:accesspoint/fsxn-eda-s3ap/object/*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Common mistake&lt;/strong&gt;: Using the S3AP alias (&lt;code&gt;xxx-ext-s3alias&lt;/code&gt;) as a bucket ARN. The alias is only valid as the &lt;code&gt;Bucket&lt;/code&gt; parameter in boto3 calls — IAM policies require the full access point ARN.&lt;/p&gt;
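
&lt;p&gt;A small guard catches the mistake before a policy ships; the format assumptions (access point ARNs contain &lt;code&gt;:accesspoint/&lt;/code&gt;, aliases end in &lt;code&gt;-s3alias&lt;/code&gt;) follow the examples in this section, and both helper names are illustrative:&lt;/p&gt;

```python
# Sketch: distinguish a full S3 access point ARN from its alias before using
# it as an IAM Resource. Format assumptions: ARNs look like
# "arn:aws:s3:<region>:<12-digit account>:accesspoint/...", aliases end in
# "-s3alias". Helper names are hypothetical.
import re

def is_access_point_arn(value: str) -> bool:
    return bool(re.match(r"^arn:aws:s3:[a-z0-9-]+:\d{12}:accesspoint/", value))

def looks_like_alias(value: str) -> bool:
    return value.endswith("-s3alias")
```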

&lt;h3&gt;
  
  
  VPC network constraint (environment-specific observation)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;th&gt;Observed Result&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → S3 AP (Internet-origin AP, via S3 Gateway Endpoint)&lt;/td&gt;
&lt;td&gt;⚠️ Timeout in this config&lt;/td&gt;
&lt;td&gt;Timed out with only the initial VPC/Gateway Endpoint path; Internet-origin AP required an internet-routed path (NAT Gateway or VPC-external Lambda) in this environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet → S3 AP (NetworkOrigin=Internet)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Routes correctly with valid IAM credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → S3 AP (VPC-origin AP, via VPC endpoint in bound VPC)&lt;/td&gt;
&lt;td&gt;Supported per AWS docs; not verified in Phase 12&lt;/td&gt;
&lt;td&gt;Requires VPC-origin AP and matching endpoint policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Lambda → ONTAP REST API&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Direct management LIF access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: This observation is specific to the Phase 12 environment configuration (Internet-origin S3 AP). AWS &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;documents&lt;/a&gt; that VPC-origin access points work with Gateway endpoints for traffic originating within the bound VPC. The network origin cannot be changed after creation — if VPC-internal access is required, create the access point with VPC origin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural implication for this pattern&lt;/strong&gt;: Since the existing S3 AP uses Internet origin, any Lambda or Canary that needs to access it must either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run outside VPC (with Internet access)&lt;/li&gt;
&lt;li&gt;Use NAT Gateway for outbound routing&lt;/li&gt;
&lt;li&gt;Be split into separate VPC-internal (ONTAP) and VPC-external (S3AP) functions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Write support and practical constraints
&lt;/h3&gt;

&lt;p&gt;FSx ONTAP S3 Access Points support &lt;code&gt;PutObject&lt;/code&gt;, &lt;code&gt;DeleteObject&lt;/code&gt;, multipart uploads (&lt;code&gt;CreateMultipartUpload&lt;/code&gt;, &lt;code&gt;UploadPart&lt;/code&gt;, &lt;code&gt;CompleteMultipartUpload&lt;/code&gt;), and other write operations — they are not read-only. The &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/access-points-for-fsxn-object-api-support.html" rel="noopener noreferrer"&gt;access point compatibility table&lt;/a&gt; documents the full list of supported S3 API operations.&lt;/p&gt;

&lt;p&gt;However, S3 Access Points are not full S3 buckets. Key constraints include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum upload size: 5 GB&lt;/li&gt;
&lt;li&gt;Only &lt;code&gt;FSX_ONTAP&lt;/code&gt; storage class&lt;/li&gt;
&lt;li&gt;Only SSE-FSX encryption&lt;/li&gt;
&lt;li&gt;No ACLs (except &lt;code&gt;bucket-owner-full-control&lt;/code&gt;), no Object Versioning, no Object Lock, no presigned URLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All access is governed by IAM policy, access point policy, and ONTAP file-system permissions (the multi-layer authorization model described above). In this pattern library, some workflows still use NFS/SMB for producer-side writes when file semantics, application compatibility, or operational constraints make that more appropriate.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Cross-Project Feedback — Template Hardening
&lt;/h2&gt;

&lt;p&gt;During Phase 12, the companion project &lt;a href="https://github.com/Yoshiki0705/fsxn-observability-integrations" rel="noopener noreferrer"&gt;fsxn-observability-integrations&lt;/a&gt; reviewed our CloudFormation templates and provided actionable feedback. All items were applied:&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Group: SourceSecurityGroupId over CIDR
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; (broad):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;SecurityGroupIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
    &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9898&lt;/span&gt;
    &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9898&lt;/span&gt;
    &lt;span class="na"&gt;CidrIp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; (precise):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;SecurityGroupIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
    &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FPolicyPort&lt;/span&gt;
    &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FPolicyPort&lt;/span&gt;
    &lt;span class="na"&gt;SourceSecurityGroupId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;FsxnSvmSecurityGroupId&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FPolicy TCP from FSxN SVM Security Group&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This limits inbound traffic to only the FSxN SVM's security group rather than the entire VPC CIDR — a significant security improvement for production deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  ONTAP CLI: Deprecated &lt;code&gt;vserver&lt;/code&gt; prefix
&lt;/h3&gt;

&lt;p&gt;ONTAP 9.11+ deprecates the &lt;code&gt;vserver&lt;/code&gt; prefix on FPolicy commands. Updated all templates and documentation (8 languages) to use the recommended format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deprecated (still works for backward compatibility)&lt;/span&gt;
vserver fpolicy policy external-engine create &lt;span class="nt"&gt;-vserver&lt;/span&gt; FSxN_OnPre ...

&lt;span class="c"&gt;# Recommended (ONTAP 9.11+)&lt;/span&gt;
fpolicy policy external-engine create &lt;span class="nt"&gt;-vserver&lt;/span&gt; FSxN_OnPre ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  KMS Decrypt: When it's needed (and when it's not)
&lt;/h3&gt;

&lt;p&gt;Added documentation clarifying SQS encryption behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SqsManagedSseEnabled: true&lt;/code&gt; → &lt;code&gt;kms:Decrypt&lt;/code&gt; is &lt;strong&gt;NOT&lt;/strong&gt; needed (encryption is transparent to consumers)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KmsMasterKeyId: alias/aws/sqs&lt;/code&gt; → &lt;code&gt;kms:Decrypt&lt;/code&gt; &lt;strong&gt;IS&lt;/strong&gt; needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our templates use &lt;code&gt;SqsManagedSseEnabled: true&lt;/code&gt;, so no KMS permissions are required for the Bridge Lambda's SQS consumer policy.&lt;/p&gt;
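
&lt;p&gt;The distinction maps to two mutually exclusive queue settings — an illustrative fragment (the logical resource name is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# SSE-SQS: keys managed by SQS itself; consumers need no kms:Decrypt
EventQueue:
  Type: AWS::SQS::Queue
  Properties:
    SqsManagedSseEnabled: true

# SSE-KMS alternative: AWS managed key; the consumer's IAM policy
# must then include kms:Decrypt on the key
# EventQueue:
#   Type: AWS::SQS::Queue
#   Properties:
#     KmsMasterKeyId: alias/aws/sqs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
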

&lt;h3&gt;
  
  
  EC2 AMI: Removed redundant Docker install
&lt;/h3&gt;

&lt;p&gt;ECS-optimized AMIs (&lt;code&gt;{{resolve:ssm:/aws/service/ecs/optimized-ami/...}}&lt;/code&gt;) already include Docker. Removed the unnecessary &lt;code&gt;yum install -y docker&lt;/code&gt; from UserData scripts.&lt;/p&gt;
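
&lt;p&gt;With the ECS-optimized AMI, UserData only needs to join the instance to the cluster — a minimal sketch, assuming an &lt;code&gt;EcsClusterName&lt;/code&gt; parameter exists in the template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;UserData:
  Fn::Base64: !Sub |
    #!/bin/bash
    # Docker and the ECS agent are preinstalled on ECS-optimized AMIs
    echo "ECS_CLUSTER=${EcsClusterName}" &gt;&gt; /etc/ecs/ecs.config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
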

&lt;h3&gt;
  
  
  Cpu/Memory: String type is intentional
&lt;/h3&gt;

&lt;p&gt;Fargate requires specific CPU/Memory combinations (e.g., 256 CPU → 512/1024/2048 Memory). Using String type with &lt;code&gt;AllowedValues&lt;/code&gt; provides better validation than Number type for this constrained parameter space.&lt;/p&gt;
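
&lt;p&gt;For example, String parameters constrained to valid Fargate sizes (values shown for the 256/512 CPU tiers only; parameter names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;TaskCpu:
  Type: String
  Default: "256"
  AllowedValues: ["256", "512"]
TaskMemory:
  Type: String
  Default: "512"
  # 256 CPU allows 512/1024/2048 MiB; 512 CPU allows 1024-4096 MiB
  AllowedValues: ["512", "1024", "2048", "4096"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that &lt;code&gt;AllowedValues&lt;/code&gt; validates each parameter independently; enforcing the exact CPU→Memory pairing additionally needs a template &lt;code&gt;Rules&lt;/code&gt; section or a deploy-time check.&lt;/p&gt;
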




&lt;h2&gt;
  
  
  13. What's Next — Phase 13 Outlook
&lt;/h2&gt;

&lt;p&gt;Phase 12 completes the operational hardening layer. The pipeline now meets its production-hardening baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Capacity guardrails preventing runaway auto-scaling&lt;/li&gt;
&lt;li&gt;✅ Automated secrets rotation on 90-day cycle&lt;/li&gt;
&lt;li&gt;✅ Proactive capacity forecasting with daily predictions&lt;/li&gt;
&lt;li&gt;✅ SLO-based observability with alarm-driven alerting&lt;/li&gt;
&lt;li&gt;✅ Data lineage tracking for audit and debugging&lt;/li&gt;
&lt;li&gt;✅ Zero-event-loss replay validated under Fargate restarts (5-event and 20-event scenarios)&lt;/li&gt;
&lt;li&gt;✅ Property-based testing catching real bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ownership boundary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Primary Owner&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shared event platform&lt;/td&gt;
&lt;td&gt;Platform / storage team&lt;/td&gt;
&lt;td&gt;FPolicy server, SQS queue, EventBridge bus, Persistent Store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP operations&lt;/td&gt;
&lt;td&gt;Storage team&lt;/td&gt;
&lt;td&gt;SVM, volume, FPolicy policy, Persistent Store capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security operations&lt;/td&gt;
&lt;td&gt;Security / platform team&lt;/td&gt;
&lt;td&gt;Secrets rotation, BREAK_GLASS approval, IAM policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload UC&lt;/td&gt;
&lt;td&gt;Application / data team&lt;/td&gt;
&lt;td&gt;Step Functions, UC routing rules, output destinations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Platform + workload teams&lt;/td&gt;
&lt;td&gt;SLO dashboard, UC-specific alarms, runbooks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Production Readiness Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Phase 12 Status&lt;/th&gt;
&lt;th&gt;Remaining Work&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Guardrails&lt;/td&gt;
&lt;td&gt;Verified (DRY_RUN/ENFORCE/BREAK_GLASS)&lt;/td&gt;
&lt;td&gt;Approval workflow optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Rotation&lt;/td&gt;
&lt;td&gt;4-step rotation verified&lt;/td&gt;
&lt;td&gt;Ensure all clients read from Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLO Dashboard&lt;/td&gt;
&lt;td&gt;Deployed, 4 alarms active&lt;/td&gt;
&lt;td&gt;Runbooks and alarm response automation in Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store Replay&lt;/td&gt;
&lt;td&gt;5-event + 20-event scenarios verified&lt;/td&gt;
&lt;td&gt;1000+ replay storm testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3AP Monitoring&lt;/td&gt;
&lt;td&gt;ONTAP health path verified&lt;/td&gt;
&lt;td&gt;Split S3AP health check (VPC-external)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protobuf Framing&lt;/td&gt;
&lt;td&gt;Property/integration tested&lt;/td&gt;
&lt;td&gt;Live ONTAP protobuf wire validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-account OAM&lt;/td&gt;
&lt;td&gt;Stack deployed conditionally&lt;/td&gt;
&lt;td&gt;Second-account validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production UC E2E&lt;/td&gt;
&lt;td&gt;Pipeline verified to SQS delivery&lt;/td&gt;
&lt;td&gt;Full &lt;code&gt;TriggerMode=EVENT_DRIVEN&lt;/code&gt; UC flow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Dashboard&lt;/td&gt;
&lt;td&gt;Not yet deployed&lt;/td&gt;
&lt;td&gt;Per-UC Lambda/Fargate/DynamoDB/Synthetics cost aggregation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 candidates
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Operational readiness&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Canary S3AP check separation&lt;/strong&gt;: Deploy VPC-external Lambda for S3 Access Point monitoring (resolving the VPC constraint discovered in Phase 12)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO violation runbooks&lt;/strong&gt;: Operational response procedures for each SLO alarm (ingestion latency, success rate, reconnect, replay)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay storm testing&lt;/strong&gt;: Generate 1000+ events during FPolicy server downtime, measure replay throughput and downstream throttling behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Enterprise deployment&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-account OAM validation&lt;/strong&gt;: Deploy &lt;code&gt;workload-account-oam-link.yaml&lt;/code&gt; in a second AWS account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared platform vs workload boundary&lt;/strong&gt;: Formalize ownership split between shared infrastructure (FPolicy server, SQS, EventBridge, guardrails, secrets rotation) and workload-specific resources (UC Step Functions, routing rules, output destinations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production UC end-to-end&lt;/strong&gt;: Deploy a UC template with &lt;code&gt;TriggerMode=EVENT_DRIVEN&lt;/code&gt; and verify the complete flow from NFS file creation through Step Functions execution to output generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Protocol and cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Protobuf live wire validation&lt;/strong&gt;: Confirm protobuf TCP framing with NetApp support and validate &lt;code&gt;AUTO_DETECT&lt;/code&gt; mode against real ONTAP protobuf traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization dashboard&lt;/strong&gt;: Aggregate Lambda/Fargate/DynamoDB costs per UC with CloudWatch cost metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Decision trees and operational guides&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decision trees&lt;/strong&gt;: S3AP NetworkOrigin selection, FPolicy server deployment (Fargate vs EC2), guardrail mode transition (DRY_RUN → ENFORCE → BREAK_GLASS), monitoring placement (VPC-internal vs VPC-external)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NetApp Partner Delivery Checklist&lt;/strong&gt;: ONTAP version, FPolicy mode, SVM/volume scope, protocol mix, S3AP NetworkOrigin, replay validation, runbook handover&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Cost model awareness
&lt;/h3&gt;

&lt;p&gt;While the cost dashboard is a Phase 13 deliverable, the following cost categories should inform design decisions now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Cost Type&lt;/th&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy server (Fargate/EC2)&lt;/td&gt;
&lt;td&gt;Fixed baseline&lt;/td&gt;
&lt;td&gt;Always-on listener&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;Fixed + per-GB&lt;/td&gt;
&lt;td&gt;Required if VPC Lambda needs Internet-origin S3AP access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Synthetics&lt;/td&gt;
&lt;td&gt;Per-canary-run&lt;/td&gt;
&lt;td&gt;5-minute interval = 8,640 runs/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch custom metrics + Logs&lt;/td&gt;
&lt;td&gt;Per-metric + per-GB ingested&lt;/td&gt;
&lt;td&gt;SLO metrics, FPolicy server logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB (lineage + guardrails)&lt;/td&gt;
&lt;td&gt;Per-request (PAY_PER_REQUEST)&lt;/td&gt;
&lt;td&gt;Event volume dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQS / EventBridge&lt;/td&gt;
&lt;td&gt;Per-message / per-event&lt;/td&gt;
&lt;td&gt;Event volume dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Store volume&lt;/td&gt;
&lt;td&gt;Per-GB provisioned&lt;/td&gt;
&lt;td&gt;Sized for max queued events during downtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Design decision for new deployments&lt;/strong&gt;: S3 Access Point NetworkOrigin is immutable after creation. Choose VPC-origin if all consumers are VPC-internal (enables Gateway/Interface endpoint access without NAT). Choose Internet-origin if consumers include external accounts or on-premises clients. This decision affects Canary architecture, Lambda VPC configuration, and cost (NAT Gateway vs. VPC endpoint).&lt;/p&gt;

&lt;h3&gt;
  
  
  NetworkOrigin decision table
&lt;/h3&gt;

&lt;p&gt;Based on &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/configuring-network-access-for-s3-access-points.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;, the following decision criteria apply:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose VPC-origin when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All consumers are Lambda/ECS/EC2 inside the same VPC&lt;/li&gt;
&lt;li&gt;Private connectivity is mandatory (no internet-routed path allowed)&lt;/li&gt;
&lt;li&gt;VPC endpoint policy is part of the security boundary&lt;/li&gt;
&lt;li&gt;Network restriction is built-in (cannot be accidentally misconfigured)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Internet-origin when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External accounts or on-premises clients need access&lt;/li&gt;
&lt;li&gt;Consumers are outside the bound VPC&lt;/li&gt;
&lt;li&gt;Internet-routed access with IAM controls is acceptable&lt;/li&gt;
&lt;li&gt;Multi-VPC access is needed without Transit Gateway/peering to a single bound VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;VPC-origin&lt;/th&gt;
&lt;th&gt;Internet-origin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network enforcement&lt;/td&gt;
&lt;td&gt;Built-in explicit Deny for non-VPC traffic&lt;/td&gt;
&lt;td&gt;Policy-based only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC endpoint required&lt;/td&gt;
&lt;td&gt;Yes (Gateway or Interface in bound VPC)&lt;/td&gt;
&lt;td&gt;Only if using &lt;code&gt;aws:SourceVpc&lt;/code&gt; conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-VPC access&lt;/td&gt;
&lt;td&gt;Via Interface endpoint + peering/TGW to bound VPC&lt;/td&gt;
&lt;td&gt;Via policy conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change access scope&lt;/td&gt;
&lt;td&gt;Must recreate access point&lt;/td&gt;
&lt;td&gt;Update policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises access&lt;/td&gt;
&lt;td&gt;Via Interface endpoint in bound VPC&lt;/td&gt;
&lt;td&gt;Direct with IAM credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost implication&lt;/td&gt;
&lt;td&gt;VPC endpoint (Gateway=free, Interface=hourly)&lt;/td&gt;
&lt;td&gt;NAT Gateway if VPC Lambda needs access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: This decision cannot be reversed. A PoC created with Internet-origin cannot be converted to VPC-origin for production — the access point must be deleted and recreated.&lt;/p&gt;
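
&lt;p&gt;The origin is fixed by whether a VPC configuration is supplied at creation time. A hedged sketch using the generic &lt;code&gt;s3control&lt;/code&gt; API (account ID and names are placeholders; the FSx-specific attachment parameters are omitted for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# VPC-origin: bound to one VPC, permanently
aws s3control create-access-point \
  --account-id 111122223333 \
  --name fsxn-ap-vpc \
  --vpc-configuration VpcId=vpc-0abc1234

# Internet-origin: omit --vpc-configuration (it cannot be added later)
aws s3control create-access-point \
  --account-id 111122223333 \
  --name fsxn-ap-internet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
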

&lt;h3&gt;
  
  
  Phase 12 readiness by workload type
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Phase 12 Ready?&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Controlled PoC / single-account&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;All core components verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low/moderate event volume (&amp;lt; 100 events/day)&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;20-event burst validated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DRY_RUN guardrail validation&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;Safe to deploy immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets rotation validation&lt;/td&gt;
&lt;td&gt;✅ Ready&lt;/td&gt;
&lt;td&gt;4-step rotation verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-volume replay storm (1000+ events)&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Throughput curve and store capacity not yet measured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-account production&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;OAM link deployed but second-account validation pending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strict SLO operations requiring runbooks&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Dashboard deployed, runbooks not yet written&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live protobuf production mode&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Wire validation with NetApp support pending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full &lt;code&gt;EVENT_DRIVEN&lt;/code&gt; UC end-to-end&lt;/td&gt;
&lt;td&gt;⏳ Phase 13&lt;/td&gt;
&lt;td&gt;Pipeline verified to SQS, Step Functions flow pending&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 13 runbook scope: first-response diagnostic bundle
&lt;/h3&gt;

&lt;p&gt;For SLO violations and FPolicy disconnects, Phase 13 runbooks will include the following ONTAP-side diagnostic commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# FPolicy status&lt;/span&gt;
fpolicy show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt; &lt;span class="nt"&gt;-fields&lt;/span&gt; policy-name,status
fpolicy policy external-engine show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;
fpolicy persistent-store show &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;

&lt;span class="c"&gt;# Connection and event state&lt;/span&gt;
fpolicy show-engine &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;
fpolicy show-passthrough-read-connection &lt;span class="nt"&gt;-vserver&lt;/span&gt; &amp;lt;SVM&amp;gt;

&lt;span class="c"&gt;# EMS logs for FPolicy events&lt;/span&gt;
event log show &lt;span class="nt"&gt;-messagename&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;fpolicy&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with AWS-side diagnostics (CloudWatch Logs, SQS message count, alarm state), this forms the complete first-response bundle for support escalation.&lt;/p&gt;
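
&lt;p&gt;The AWS-side counterpart can be sketched as a few CLI checks (the queue URL and log group name are placeholders for this deployment's resources):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# SQS backlog
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/111122223333/fsxn-fpolicy-events \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

# Alarms currently firing
aws cloudwatch describe-alarms --state-value ALARM \
  --query "MetricAlarms[].AlarmName"

# Recent FPolicy server logs
aws logs tail /ecs/fsxn-fpolicy-server --since 30m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
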




&lt;h2&gt;
  
  
  Deployed Infrastructure
&lt;/h2&gt;

&lt;p&gt;7 CloudFormation stacks deployed and verified:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-guardrails-table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;DynamoDB tracking table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-lineage-table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Data lineage DynamoDB + GSI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-slo-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;CloudWatch dashboard + 4 alarms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-oam-link&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Cross-account observability stack (conditional resources — live second-account OAM validation remains Phase 13)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-capacity-forecast&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Lambda + EventBridge schedule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-secrets-rotation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;VPC Lambda + rotation config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fsxn-phase12-synthetic-monitoring&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CREATE_COMPLETE&lt;/td&gt;
&lt;td&gt;Canary + alarm; ONTAP path verified, S3AP split-path monitoring remains Phase 13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcihlbv8m756gbdrr1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcihlbv8m756gbdrr1o.png" alt="CloudFormation Stacks" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Results Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unit Tests&lt;/td&gt;
&lt;td&gt;116&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Property Tests (Hypothesis)&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFormation Deployments&lt;/td&gt;
&lt;td&gt;7 stacks&lt;/td&gt;
&lt;td&gt;AWS integration&lt;/td&gt;
&lt;td&gt;✅ All CREATE_COMPLETE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda Invocations&lt;/td&gt;
&lt;td&gt;2 (forecast + rotation)&lt;/td&gt;
&lt;td&gt;AWS integration&lt;/td&gt;
&lt;td&gt;✅ Successful&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FPolicy E2E&lt;/td&gt;
&lt;td&gt;1 pipeline test&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Event delivered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay E2E&lt;/td&gt;
&lt;td&gt;5 events&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Zero loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-file burst&lt;/td&gt;
&lt;td&gt;20 events&lt;/td&gt;
&lt;td&gt;AWS manual verification&lt;/td&gt;
&lt;td&gt;✅ Zero loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bugs found (property testing)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Local (CI-reproducible)&lt;/td&gt;
&lt;td&gt;✅ All fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  NetApp-Specific Takeaways
&lt;/h2&gt;

&lt;p&gt;For NetApp users and partners evaluating this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FPolicy Persistent Store&lt;/strong&gt; works as the durability layer for asynchronous non-mandatory FPolicy policies (&lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html" rel="noopener noreferrer"&gt;NetApp docs&lt;/a&gt;), but replay behavior — including out-of-order delivery and throughput under load — must be validated under the customer's specific workload profile (file volume, protocol mix, event types).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Points for FSx for ONTAP&lt;/strong&gt; are not standard S3 buckets: they support &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/access-points-for-fsxn-object-api-support.html" rel="noopener noreferrer"&gt;selected S3 API operations&lt;/a&gt; including write operations (&lt;code&gt;PutObject&lt;/code&gt;, &lt;code&gt;DeleteObject&lt;/code&gt;, multipart uploads), but remain governed by ONTAP file-system permissions and have constraints (5 GB max upload, no presigned URLs, no Object Lock).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NetworkOrigin is a design-time decision&lt;/strong&gt;. Choose VPC-origin or Internet-origin based on where the consumers run. This cannot be changed after creation and affects VPC endpoint requirements, Lambda placement, monitoring architecture, and cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ONTAP-common vs AWS-specific&lt;/strong&gt;: FPolicy, Persistent Store, ONTAP REST API, and SVM/volume scoping are ONTAP-common patterns applicable to Cloud Volumes ONTAP and on-premises ONTAP. S3 Access Points, Secrets Manager rotation, SQS/EventBridge integration, and CloudWatch SLO dashboards are AWS-specific implementations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational readiness&lt;/strong&gt; requires more than event delivery: secrets rotation, SLOs, runbooks, lineage, and replay testing are all part of the production baseline. Phase 12 establishes this baseline; Phase 13 completes it with runbooks, storm testing, and protobuf wire validation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ONTAP portions of this pattern should be reviewed with the customer's NetApp operations team, especially FPolicy policy mode, Persistent Store capacity, SVM scope, protocol mix, and support escalation path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Phase 12 transforms the FPolicy event-driven pipeline from "functionally complete" to "operationally hardened." The capacity guardrails provide three-mode safety control for auto-scaling operations. Secrets rotation eliminates manual credential management. The SLO dashboard gives operations teams objective health metrics. And the Persistent Store replay validation — with zero event loss in the tested 5-event replay and 20-event burst scenarios — increases confidence that the pipeline can tolerate Fargate task restarts, while larger replay-storm testing (1000+ events) remains Phase 13 work.&lt;/p&gt;

&lt;p&gt;The property-based testing investment paid immediate dividends: 53 property tests surfaced 3 real bugs that example-based testing had missed. The S3 Access Point deep dive documented network-origin and endpoint configuration constraints that would otherwise surface as mysterious timeouts in production.&lt;/p&gt;

&lt;p&gt;With 14,895 lines of code across 59 files, 7 deployed stacks, 169 total tests, and validated end-to-end event delivery, Phase 12 delivers the operational maturity required for enterprise production workloads on FSx for ONTAP.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Previous phases&lt;/strong&gt;: &lt;a href="https://dev.to/yoshikifujiwara/fsx-for-ontap-s3-access-points-as-a-serverless-automation-boundary-ai-data-pipelines-ili"&gt;Phase 1&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/public-sector-use-cases-unified-output-destination-and-a-localization-batch-fsx-for-ontap-s3-2hmo"&gt;Phase 7&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/operational-hardening-ci-grade-validation-and-pattern-c-b-hybrid-fsx-for-ontap-s3-access-587h"&gt;Phase 8&lt;/a&gt; · &lt;a href="https://dev.to/yoshikifujiwara/production-rollout-vpc-endpoint-auto-detection-and-the-cdk-no-go-fsx-for-ontap-s3-access-3lni"&gt;Phase 9&lt;/a&gt; · &lt;a href="https://dev.to/aws-builders/fpolicy-event-driven-pipeline-multi-account-stacksets-and-cost-optimization-fsx-for-ontap-s3-5bd6"&gt;Phase 10&lt;/a&gt; · &lt;a href="https://dev.to/aws-builders/production-ready-fpolicy-event-pipeline-across-17-ucs-fsx-for-ontap-s3-access-points-phase-11-57p8"&gt;Phase 11&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>amazonfsxfornetappontap</category>
      <category>s3accesspoints</category>
    </item>
    <item>
      <title>All Agent Harnesses: The Live Comparison</title>
      <dc:creator>Hector Flores</dc:creator>
      <pubDate>Sun, 17 May 2026 18:21:35 +0000</pubDate>
      <link>https://experimental.forem.com/htekdev/all-agent-harnesses-the-live-comparison-1km5</link>
      <guid>https://experimental.forem.com/htekdev/all-agent-harnesses-the-live-comparison-1km5</guid>
      <description>&lt;p&gt;{/* LAST_UPDATED: 2026-07-03T12:00:00Z */}&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔴 LIVING ARTICLE&lt;/strong&gt; — This page is continuously maintained and updated as platforms ship new features. Bookmark it. Come back often.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Last updated: July 3, 2026&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why This Page Exists
&lt;/h2&gt;

&lt;p&gt;There are over a dozen platforms claiming to be the best way to build, run, and manage AI agents. Some are IDEs, some are cloud services, some are open-source libraries, and some are full autonomous coding environments. The terminology is a mess. Marketing pages all say "agent framework" but the products underneath are fundamentally different things.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://htek.dev/articles/agent-harnesses-controlling-ai-agents-2026/" rel="noopener noreferrer"&gt;multi-agent systems in production&lt;/a&gt; — 50+ agents running autonomously on cron schedules, managing everything from &lt;a href="https://htek.dev/articles/video-pipeline-with-fleet-mode/" rel="noopener noreferrer"&gt;content pipelines&lt;/a&gt; to &lt;a href="https://htek.dev/articles/copilot-home-assistant-ai-runs-my-household/" rel="noopener noreferrer"&gt;household logistics&lt;/a&gt;. That experience taught me something the comparison posts miss: &lt;strong&gt;the harness matters more than the model.&lt;/strong&gt; The right control plane turns a chatbot into a production system. The wrong one turns your codebase into a liability.&lt;/p&gt;

&lt;p&gt;This is my attempt to give you the definitive bird's-eye view. Every major agent harness, every feature set, head-to-head — with honest pros and cons for each. No ranking where my favorite conveniently wins. Just the facts, organized so you can make the right call for your situation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is an Agent Harness?
&lt;/h2&gt;

&lt;p&gt;Before comparing anything, we need to define what we're actually comparing. The industry uses "agent framework," "agent SDK," and "agent harness" interchangeably — but they're different things. &lt;a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents" rel="noopener noreferrer"&gt;Anthropic's engineering team&lt;/a&gt; nailed the distinction: the harness is the runtime container that wraps around an agent's execution.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Who Controls the Loop&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Harness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runtime container — lifecycle, governance, tool access, policy enforcement&lt;/td&gt;
&lt;td&gt;The platform&lt;/td&gt;
&lt;td&gt;GitHub Copilot, Bedrock Agents, Vertex AI Agent Builder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Programmable building blocks for composing agents in code&lt;/td&gt;
&lt;td&gt;The developer&lt;/td&gt;
&lt;td&gt;LangChain/LangGraph, CrewAI, AutoGen, Semantic Kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Thin client library binding your code to a vendor's harness&lt;/td&gt;
&lt;td&gt;The vendor's runtime&lt;/td&gt;
&lt;td&gt;OpenAI Agents SDK, Google ADK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Tool / Sandbox&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure component agents call into&lt;/td&gt;
&lt;td&gt;N/A — it's a tool&lt;/td&gt;
&lt;td&gt;E2B, Daytona, Modal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI assistant embedded in a code editor with agent capabilities&lt;/td&gt;
&lt;td&gt;The IDE vendor&lt;/td&gt;
&lt;td&gt;Cursor, Windsurf, JetBrains AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Autonomous Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully self-directed agent with its own cloud environment&lt;/td&gt;
&lt;td&gt;The agent itself&lt;/td&gt;
&lt;td&gt;Devin&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;p&gt;The key distinction: &lt;strong&gt;a harness owns the loop.&lt;/strong&gt; It decides whether a tool call executes, enforces budgets, manages context, and provides observability. A framework gives you the &lt;em&gt;building blocks&lt;/em&gt; to construct that loop yourself. An SDK connects you to someone else's loop. As &lt;a href="https://www.analyticsvidhya.com/blog/2025/12/agent-frameworks-vs-runtimes-vs-harnesses" rel="noopener noreferrer"&gt;Analytics Vidhya's taxonomy&lt;/a&gt; puts it: frameworks provide building blocks, runtimes execute workflows, harnesses enforce control.&lt;/p&gt;

&lt;p&gt;Why does this matter? Because if you're evaluating "agent platforms" without understanding these categories, you'll compare LangChain (a library you embed) against Bedrock Agents (a managed service you configure) and wonder why the feature lists look nothing alike. They're solving different problems at different layers.&lt;/p&gt;
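&lt;p&gt;The "owns the loop" idea is easier to see in code. Below is a deliberately minimal sketch in plain Python (no real framework): the agent only proposes tool calls, while the harness applies policy, spends budget, and records an audit trail.&lt;/p&gt;

```python
# Minimal sketch of a harness-owned loop: the harness checks policy and
# budget before every tool call; the "agent" only proposes actions.
from dataclasses import dataclass, field

@dataclass
class Harness:
    allowed_tools: set                             # policy: which tools may run
    budget: int                                    # max tool calls per run
    audit_log: list = field(default_factory=list)  # observability

    def run(self, proposed_calls, tools):
        for tool_name, args in proposed_calls:     # the agent proposes...
            if self.budget == 0:
                self.audit_log.append(("denied", tool_name, "budget exhausted"))
                break
            if tool_name not in self.allowed_tools:
                self.audit_log.append(("denied", tool_name, "policy"))
                continue
            self.budget -= 1                       # ...the harness disposes
            self.audit_log.append(("ok", tool_name, tools[tool_name](**args)))
        return self.audit_log

tools = {"add": lambda a, b: a + b, "delete_repo": lambda: "gone"}
harness = Harness(allowed_tools={"add"}, budget=2)
log = harness.run(
    [("add", {"a": 1, "b": 2}), ("delete_repo", {}), ("add", {"a": 3, "b": 4})],
    tools,
)
print(log)
```

&lt;p&gt;A framework hands you the pieces to build this loop; an SDK calls into someone else's copy of it running on their servers.&lt;/p&gt;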




&lt;h2&gt;
  
  
  Head-to-Head Comparison Tables
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Harnesses, IDE Agents &amp;amp; Autonomous Agents
&lt;/h3&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;GitHub Copilot (Extensions + CLI)&lt;/th&gt;
&lt;th&gt;OpenAI Agents SDK&lt;/th&gt;
&lt;th&gt;Anthropic Claude Code&lt;/th&gt;
&lt;th&gt;Amazon Bedrock Agents&lt;/th&gt;
&lt;th&gt;Google Vertex AI Agent Builder&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;th&gt;Windsurf / Codeium&lt;/th&gt;
&lt;th&gt;Devin&lt;/th&gt;
&lt;th&gt;JetBrains AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extensions API + MCP + function calling&lt;/td&gt;
&lt;td&gt;Function calling + hosted tools&lt;/td&gt;
&lt;td&gt;MCP protocol + Bash/file tools&lt;/td&gt;
&lt;td&gt;Action groups → Lambda/Step Functions&lt;/td&gt;
&lt;td&gt;Fulfillments + Vertex Extensions&lt;/td&gt;
&lt;td&gt;Built-in code/terminal tools&lt;/td&gt;
&lt;td&gt;Code search + editing tools&lt;/td&gt;
&lt;td&gt;Full dev environment tools&lt;/td&gt;
&lt;td&gt;IDE-native tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot instructions + repo context + conversation&lt;/td&gt;
&lt;td&gt;Thread-level + vector stores&lt;/td&gt;
&lt;td&gt;Project indexing + conversation&lt;/td&gt;
&lt;td&gt;Knowledge bases (OpenSearch/S3) + sessions&lt;/td&gt;
&lt;td&gt;Vertex AI Search + flow state&lt;/td&gt;
&lt;td&gt;Codebase index + session&lt;/td&gt;
&lt;td&gt;Codebase index + session&lt;/td&gt;
&lt;td&gt;Codebase index + persistent sessions&lt;/td&gt;
&lt;td&gt;Project index + conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent via CLI (task tool, background agents)&lt;/td&gt;
&lt;td&gt;Handoffs between agents, swarm patterns&lt;/td&gt;
&lt;td&gt;Sub-agents via tool use&lt;/td&gt;
&lt;td&gt;Orchestration via Step Functions&lt;/td&gt;
&lt;td&gt;Sub-agent routing via flows&lt;/td&gt;
&lt;td&gt;Single agent (opaque backend)&lt;/td&gt;
&lt;td&gt;Single agent&lt;/td&gt;
&lt;td&gt;Parallel Devins&lt;/td&gt;
&lt;td&gt;Single agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandboxing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker containers, Codespaces&lt;/td&gt;
&lt;td&gt;Developer-managed&lt;/td&gt;
&lt;td&gt;Bash sandbox, permission prompts&lt;/td&gt;
&lt;td&gt;Lambda/VPC isolation&lt;/td&gt;
&lt;td&gt;Cloud Functions/Cloud Run&lt;/td&gt;
&lt;td&gt;Local or remote containers&lt;/td&gt;
&lt;td&gt;Local environment&lt;/td&gt;
&lt;td&gt;Cloud VM per session&lt;/td&gt;
&lt;td&gt;Local or remote&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre/post tool hooks (hooks.json), extension allowlists, org policies&lt;/td&gt;
&lt;td&gt;Guardrails API, content filters&lt;/td&gt;
&lt;td&gt;Permission prompts, .claude files&lt;/td&gt;
&lt;td&gt;IAM + CloudTrail + CloudWatch&lt;/td&gt;
&lt;td&gt;IAM + Cloud Audit Logs&lt;/td&gt;
&lt;td&gt;User approval prompts&lt;/td&gt;
&lt;td&gt;User controls&lt;/td&gt;
&lt;td&gt;Admin controls&lt;/td&gt;
&lt;td&gt;Enterprise controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extensions + custom agents + skills&lt;/td&gt;
&lt;td&gt;Plugin system + tool definitions&lt;/td&gt;
&lt;td&gt;MCP servers (open protocol)&lt;/td&gt;
&lt;td&gt;Lambda action groups&lt;/td&gt;
&lt;td&gt;Webhooks + Extensions&lt;/td&gt;
&lt;td&gt;Limited plugin API&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;API integrations&lt;/td&gt;
&lt;td&gt;Plugin marketplace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VS Code, Visual Studio, JetBrains, Xcode, CLI&lt;/td&gt;
&lt;td&gt;None (API-first)&lt;/td&gt;
&lt;td&gt;VS Code extension, terminal&lt;/td&gt;
&lt;td&gt;None (API/console)&lt;/td&gt;
&lt;td&gt;None (console/API)&lt;/td&gt;
&lt;td&gt;Native (Cursor IDE)&lt;/td&gt;
&lt;td&gt;Native (Windsurf IDE)&lt;/td&gt;
&lt;td&gt;Cloud IDE (VSCode-based)&lt;/td&gt;
&lt;td&gt;Native (JetBrains IDEs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLI Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full CLI agent&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Claude Code CLI&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Slack/API&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud vs Local&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Both (local CLI + Codespaces + cloud agent)&lt;/td&gt;
&lt;td&gt;Cloud (OpenAI servers)&lt;/td&gt;
&lt;td&gt;Local-first + cloud&lt;/td&gt;
&lt;td&gt;Cloud (AWS)&lt;/td&gt;
&lt;td&gt;Cloud (GCP)&lt;/td&gt;
&lt;td&gt;Local + remote&lt;/td&gt;
&lt;td&gt;Local + remote&lt;/td&gt;
&lt;td&gt;Cloud only&lt;/td&gt;
&lt;td&gt;Local + remote&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free tier → $10/mo → $39/mo → Enterprise&lt;/td&gt;
&lt;td&gt;Pay-per-token + storage&lt;/td&gt;
&lt;td&gt;Free (Claude Code) + API costs&lt;/td&gt;
&lt;td&gt;Pay-per-token + AWS services&lt;/td&gt;
&lt;td&gt;Pay-per-token + GCP services&lt;/td&gt;
&lt;td&gt;Free → $20/mo → $40/mo → Enterprise&lt;/td&gt;
&lt;td&gt;Free → $15/mo → $60/mo → Enterprise&lt;/td&gt;
&lt;td&gt;$20/mo + $2.25/ACU → $500/mo teams&lt;/td&gt;
&lt;td&gt;Bundled with JetBrains subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extensions spec open, CLI proprietary&lt;/td&gt;
&lt;td&gt;SDK open source (MIT), runtime proprietary&lt;/td&gt;
&lt;td&gt;CLI open source, MCP open protocol&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Agent Frameworks
&lt;/h3&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;LangChain / LangGraph&lt;/th&gt;
&lt;th&gt;CrewAI&lt;/th&gt;
&lt;th&gt;AutoGen (Microsoft)&lt;/th&gt;
&lt;th&gt;Semantic Kernel (Microsoft)&lt;/th&gt;
&lt;th&gt;Google ADK&lt;/th&gt;
&lt;th&gt;Mastra&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decorators + schemas + any callable&lt;/td&gt;
&lt;td&gt;Tool decorators with role binding&lt;/td&gt;
&lt;td&gt;Function tools with type annotations&lt;/td&gt;
&lt;td&gt;Skills/functions (semantic + native)&lt;/td&gt;
&lt;td&gt;Tools with schema definitions&lt;/td&gt;
&lt;td&gt;TypeScript-first tool definitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Programmable (buffer, summary, vector, entity, graph)&lt;/td&gt;
&lt;td&gt;Shared crew memory + agent memory&lt;/td&gt;
&lt;td&gt;Conversation history + custom stores&lt;/td&gt;
&lt;td&gt;Vector store connectors + key-value&lt;/td&gt;
&lt;td&gt;Session state + Google Search grounding&lt;/td&gt;
&lt;td&gt;Explicit read/write memory with observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graph-based (nodes = agents, edges = flow)&lt;/td&gt;
&lt;td&gt;Crews with role-based orchestration&lt;/td&gt;
&lt;td&gt;Conversational groups (critic, coder, planner)&lt;/td&gt;
&lt;td&gt;Composable kernels (manual orchestration)&lt;/td&gt;
&lt;td&gt;Multi-agent with &lt;code&gt;AgentTool&lt;/code&gt; delegation&lt;/td&gt;
&lt;td&gt;Multi-agent message flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandboxing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Developer-managed (any environment)&lt;/td&gt;
&lt;td&gt;Developer-managed&lt;/td&gt;
&lt;td&gt;Developer-managed (Azure containers available)&lt;/td&gt;
&lt;td&gt;Developer-managed (.NET/Java/Python hosted)&lt;/td&gt;
&lt;td&gt;Developer-managed (GCP available)&lt;/td&gt;
&lt;td&gt;Developer-managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Callbacks, LangSmith tracing&lt;/td&gt;
&lt;td&gt;Callbacks, logging hooks&lt;/td&gt;
&lt;td&gt;Message inspection + Azure monitoring&lt;/td&gt;
&lt;td&gt;Azure IAM/RBAC integration + callbacks&lt;/td&gt;
&lt;td&gt;Google Cloud IAM + logging&lt;/td&gt;
&lt;td&gt;Built-in observability, metrics, logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very high — model-agnostic, 700+ integrations&lt;/td&gt;
&lt;td&gt;Moderate — growing ecosystem&lt;/td&gt;
&lt;td&gt;High — Microsoft ecosystem integration&lt;/td&gt;
&lt;td&gt;High — multi-language (C#, Java, Python, JS)&lt;/td&gt;
&lt;td&gt;Moderate — Google ecosystem&lt;/td&gt;
&lt;td&gt;High — TypeScript ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (any infra) + LangSmith cloud&lt;/td&gt;
&lt;td&gt;Self-hosted (Python apps)&lt;/td&gt;
&lt;td&gt;Self-hosted + Azure integration&lt;/td&gt;
&lt;td&gt;Self-hosted + Azure integration&lt;/td&gt;
&lt;td&gt;Self-hosted + GCP integration&lt;/td&gt;
&lt;td&gt;Self-hosted (Node.js)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (OSS) + LangSmith SaaS optional&lt;/td&gt;
&lt;td&gt;Free (OSS) + CrewAI Enterprise optional&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;





&lt;h2&gt;
  
  
  Every Harness, In Depth
&lt;/h2&gt;


&lt;h3&gt;
  
  
  GitHub Copilot (Extensions + CLI + Cloud Agent)
&lt;/h3&gt;

&lt;p&gt;GitHub Copilot isn't just autocomplete anymore — it's a &lt;a href="https://htek.dev/articles/github-copilot-cli-extensions-complete-guide/" rel="noopener noreferrer"&gt;full agent harness with extensions&lt;/a&gt;, &lt;a href="https://htek.dev/articles/agent-hooks-controlling-ai-codebase/" rel="noopener noreferrer"&gt;hooks for governance&lt;/a&gt;, and a &lt;a href="https://htek.dev/articles/copilot-cli-extensions-cookbook-examples/" rel="noopener noreferrer"&gt;CLI that runs autonomous agents&lt;/a&gt; in your terminal. The &lt;a href="https://htek.dev/articles/copilot-cli-extensions-revamp-slash-commands/" rel="noopener noreferrer"&gt;extensions system&lt;/a&gt; lets third-party services register as tools, and the &lt;a href="https://htek.dev/articles/hookflows-governed-git-for-ai-agents/" rel="noopener noreferrer"&gt;hooks.json governance layer&lt;/a&gt; gives organizations pre/post-tool interception that no other IDE agent offers.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.github.com/en/copilot/concepts/coding-agent/coding-agent" rel="noopener noreferrer"&gt;cloud coding agent&lt;/a&gt; can autonomously research a repository, create implementation plans, and submit pull requests — triggered directly from GitHub Issues. It runs in a secure cloud sandbox with full access to the repo context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deepest IDE integration — VS Code, Visual Studio, JetBrains, Xcode, Eclipse, and a standalone CLI&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://htek.dev/articles/github-copilot-cli-extensions-complete-guide/" rel="noopener noreferrer"&gt;Extension system&lt;/a&gt; lets any service become an agent tool — unique in the IDE space&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://htek.dev/articles/hookflows-governed-git-for-ai-agents/" rel="noopener noreferrer"&gt;hooks.json governance&lt;/a&gt; — pre/post tool call interception for enterprise policy enforcement&lt;/li&gt;
&lt;li&gt;CLI agent supports &lt;a href="https://htek.dev/articles/what-is-context-engineering-practical-guide-50-agents/" rel="noopener noreferrer"&gt;multi-agent patterns&lt;/a&gt; (background agents, task delegation, agent steering)&lt;/li&gt;
&lt;li&gt;Enterprise trust — SSO, audit logs, content exclusions, org-level policy, IP indemnity&lt;/li&gt;
&lt;li&gt;GitHub ecosystem integration — Actions, Issues, PRs, Codespaces, Security&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.github.com/en/copilot/customizing-copilot/using-model-context-protocol" rel="noopener noreferrer"&gt;MCP support&lt;/a&gt; for extensible tool discovery&lt;/li&gt;
&lt;li&gt;Free tier available, competitive pricing at every tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extension ecosystem is growing but younger than VS Code's plugin marketplace&lt;/li&gt;
&lt;li&gt;CLI agent requires local setup (though Codespaces solves this)&lt;/li&gt;
&lt;li&gt;Multi-agent patterns in CLI are powerful but require &lt;a href="https://htek.dev/articles/context-engineering-key-to-ai-development/" rel="noopener noreferrer"&gt;context engineering knowledge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cloud agent is newer and still maturing compared to the IDE and CLI experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Teams already in the GitHub ecosystem who want IDE + CLI + cloud agent coverage with enterprise governance. If you need agents that &lt;a href="https://htek.dev/articles/agentic-development-in-devops-complete-guide/" rel="noopener noreferrer"&gt;integrate with your entire DevOps workflow&lt;/a&gt; — from issue to PR to deployment — nothing else touches the integration depth.&lt;/p&gt;
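&lt;p&gt;To make pre/post interception concrete, here is the pattern sketched generically in Python. This is not the actual hooks.json schema, just the wrapping technique such a governance layer implements: a pre-hook can veto a tool call before it runs, and a post-hook can rewrite its output.&lt;/p&gt;

```python
# Generic pre/post tool-call interception (illustrative; not the real
# hooks.json format). The pre-hook can veto, the post-hook can rewrite.
def with_hooks(tool, pre=None, post=None):
    def wrapped(**kwargs):
        if pre and not pre(kwargs):   # pre-hook vetoes the call
            return {"blocked": True}
        result = tool(**kwargs)
        if post:
            result = post(result)     # post-hook rewrites the output
        return result
    return wrapped

def write_file(path, content):
    return {"wrote": path, "bytes": len(content)}

# Example policy: only allow writes inside the workspace, tag results audited.
guarded = with_hooks(
    write_file,
    pre=lambda kw: kw["path"].startswith("workspace/"),
    post=lambda r: {**r, "audited": True},
)

print(guarded(path="workspace/a.txt", content="hi"))  # runs, gets audited
print(guarded(path="/etc/passwd", content="x"))       # vetoed by pre-hook
```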






&lt;h3&gt;
  
  
  OpenAI Agents SDK
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/openai/openai-agents-python" rel="noopener noreferrer"&gt;OpenAI Agents SDK&lt;/a&gt; (which evolved from the Swarm research project) is a lightweight Python framework for building multi-agent workflows on OpenAI's infrastructure. It's MIT-licensed and surprisingly minimal — the core concept is agents with instructions, tools, and handoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely simple API — define agents, tools, and handoff rules in a few lines&lt;/li&gt;
&lt;li&gt;Native access to OpenAI's latest models (GPT-4o, o3, etc.) with minimal latency&lt;/li&gt;
&lt;li&gt;Built-in tracing and observability via the OpenAI dashboard&lt;/li&gt;
&lt;li&gt;Guardrails API for input/output validation&lt;/li&gt;
&lt;li&gt;Handoffs pattern makes multi-agent delegation intuitive&lt;/li&gt;
&lt;li&gt;Active development with &lt;a href="https://github.com/openai/openai-agents-python" rel="noopener noreferrer"&gt;26,000+ GitHub stars&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tightly coupled to OpenAI models — limited multi-provider support&lt;/li&gt;
&lt;li&gt;No IDE integration — purely API/code-first&lt;/li&gt;
&lt;li&gt;Sandboxing is your responsibility (no built-in execution isolation)&lt;/li&gt;
&lt;li&gt;Enterprise governance is limited to OpenAI's platform controls&lt;/li&gt;
&lt;li&gt;Relatively new — ecosystem is smaller than LangChain's&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Teams building custom AI applications on OpenAI's platform who want a clean, minimal SDK without the overhead of heavier frameworks.&lt;/p&gt;
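&lt;p&gt;The handoff pattern at the heart of the SDK can be sketched in plain Python. The &lt;code&gt;Agent&lt;/code&gt; and &lt;code&gt;run&lt;/code&gt; names below are illustrative stand-ins, not the SDK's API; in the real SDK the routing decision is made by the model, not a lambda.&lt;/p&gt;

```python
# The handoff pattern in miniature: a triage agent routes each message
# to a specialist, which then handles the task itself.
class Agent:
    def __init__(self, name, handle, route=None):
        self.name = name
        self.handle = handle  # what this agent does with a message
        self.route = route    # handoff rule: returns a target agent or None

def run(agent, message):
    # Follow handoffs until an agent keeps the task for itself.
    while agent.route is not None:
        target = agent.route(message)
        if target is None:
            break
        agent = target
    return agent.name + ": " + agent.handle(message)

billing = Agent("billing", lambda m: "refund issued")
support = Agent("support", lambda m: "ticket opened")
triage = Agent("triage", lambda m: "routed",
               route=lambda m: billing if "refund" in m else support)

print(run(triage, "I want a refund"))
print(run(triage, "my build is broken"))
```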






&lt;h3&gt;
  
  
  Anthropic Claude Code
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; is Anthropic's agentic coding tool — a CLI-first agent that reads your codebase, runs commands, and edits files. It's powered by Claude and uses the &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt; for extensible tool access. The CLI itself is &lt;a href="https://github.com/anthropics/claude-code" rel="noopener noreferrer"&gt;open source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLI-first design — excellent for terminal-native developers&lt;/li&gt;
&lt;li&gt;MCP protocol is open and vendor-neutral — any MCP server works as a tool&lt;/li&gt;
&lt;li&gt;Strong project understanding via codebase indexing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.claude&lt;/code&gt; files for project-level instructions and rules&lt;/li&gt;
&lt;li&gt;Sub-agent delegation via the &lt;code&gt;Task&lt;/code&gt; tool for parallel work&lt;/li&gt;
&lt;li&gt;Open source CLI with transparent tool execution&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/scheduled-tasks" rel="noopener noreferrer"&gt;Scheduled tasks&lt;/a&gt; for automated maintenance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic-model-only — can't use GPT-4o or Gemini through it&lt;/li&gt;
&lt;li&gt;No visual IDE (VS Code extension exists but it's CLI-in-editor)&lt;/li&gt;
&lt;li&gt;API costs can escalate quickly with heavy agentic usage (long context windows)&lt;/li&gt;
&lt;li&gt;Enterprise governance features are less mature than GitHub's or cloud providers'&lt;/li&gt;
&lt;li&gt;Permission system relies on user approval prompts — no org-level policy hooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Developers who live in the terminal and want a powerful, extensible coding agent with open protocols. MCP's vendor-neutral tool ecosystem is a genuine differentiator for teams building cross-platform integrations.&lt;/p&gt;
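&lt;p&gt;Under the hood, MCP is JSON-RPC 2.0: clients discover tools with &lt;code&gt;tools/list&lt;/code&gt; and invoke them with &lt;code&gt;tools/call&lt;/code&gt;. A minimal sketch of the client-side messages (the &lt;code&gt;search_docs&lt;/code&gt; tool name is hypothetical, and the transport and server side are omitted):&lt;/p&gt;

```python
# MCP traffic is JSON-RPC 2.0. tools/list and tools/call are real method
# names from the MCP spec; "search_docs" is a made-up tool, and the
# transport (stdio or HTTP) plus the server side are left out.
import json

def jsonrpc(method, params=None, msg_id=1):
    msg = {"jsonrpc": "2.0", "id": msg_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

list_req = jsonrpc("tools/list", msg_id=1)
call_req = jsonrpc("tools/call",
                   {"name": "search_docs", "arguments": {"query": "MCP"}},
                   msg_id=2)

print(list_req)
print(call_req)
```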






&lt;h3&gt;
  
  
  LangChain / LangGraph
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/langchain-ai/langchain" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; is the most widely adopted agent framework, with &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; adding stateful, graph-based orchestration for complex multi-agent workflows. Together they offer &lt;a href="https://python.langchain.com/docs/integrations/" rel="noopener noreferrer"&gt;700+ integrations&lt;/a&gt; covering every major model, vector store, and tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Largest ecosystem — 700+ integrations, massive community, extensive documentation&lt;/li&gt;
&lt;li&gt;LangGraph's graph-based orchestration is genuinely powerful for complex workflows&lt;/li&gt;
&lt;li&gt;Model-agnostic — swap between OpenAI, Anthropic, Google, open-source models freely&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.smith.langchain.com/" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt; provides production-grade tracing, evaluation, and monitoring&lt;/li&gt;
&lt;li&gt;Checkpointed workflows for long-running agents with state persistence&lt;/li&gt;
&lt;li&gt;Python and JavaScript SDKs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Steep learning curve — abstraction layers can feel over-engineered for simple use cases&lt;/li&gt;
&lt;li&gt;No built-in sandboxing or execution isolation (BYO infrastructure)&lt;/li&gt;
&lt;li&gt;No governance hooks at the platform level — you build your own policy layer&lt;/li&gt;
&lt;li&gt;Frequent breaking changes between major versions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.atlan.com/know/best-ai-agent-harness-tools-2026/" rel="noopener noreferrer"&gt;Enterprise adoption often requires significant custom engineering&lt;/a&gt; on top of the framework&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Teams building custom multi-agent applications that need maximum flexibility and model portability. If you're willing to invest in infrastructure, LangGraph's graph-based orchestration is best-in-class for complex stateful workflows.&lt;/p&gt;
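&lt;p&gt;Graph-based orchestration reduces to a simple idea: nodes transform shared state, and each node names its successor. A toy version in plain Python; LangGraph's &lt;code&gt;StateGraph&lt;/code&gt; formalizes the same loop with typed state, conditional edges, and checkpointing.&lt;/p&gt;

```python
# Graph orchestration in miniature: each node mutates the shared state
# dict and returns the name of the next node (None ends the run).
def plan(state):
    state["steps"] = ["draft", "review"]
    return "draft"

def draft(state):
    state["text"] = "v1"
    return "review"

def review(state):
    state["approved"] = state["text"] == "v1"
    return None  # terminal node

NODES = {"plan": plan, "draft": draft, "review": review}

def run_graph(entry, state):
    node = entry
    while node is not None:
        node = NODES[node](state)
    return state

final = run_graph("plan", {})
print(final)
```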






&lt;h3&gt;
  
  
  CrewAI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/crewAIInc/crewAI" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt; takes a role-based approach to multi-agent systems. You define "crews" of agents with specific roles, goals, and backstories, then orchestrate them through sequential or hierarchical task execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intuitive role-based abstraction — easy to conceptualize multi-agent collaboration&lt;/li&gt;
&lt;li&gt;Quick to prototype — get a working multi-agent system in minutes&lt;/li&gt;
&lt;li&gt;Growing ecosystem with pre-built tools and templates&lt;/li&gt;
&lt;li&gt;Good documentation and active community&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.crewai.com/" rel="noopener noreferrer"&gt;CrewAI Enterprise&lt;/a&gt; adds deployment, monitoring, and team management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less flexible than LangGraph for complex orchestration patterns&lt;/li&gt;
&lt;li&gt;Smaller integration ecosystem than LangChain&lt;/li&gt;
&lt;li&gt;Production hardening requires significant custom work&lt;/li&gt;
&lt;li&gt;No built-in sandboxing, governance, or policy enforcement&lt;/li&gt;
&lt;li&gt;Role/backstory abstraction can feel artificial for non-conversational use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Teams prototyping multi-agent systems who want an intuitive, role-based API. Great for research, content generation, and analysis workflows where agents play distinct specialist roles.&lt;/p&gt;
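&lt;p&gt;The crew model is easy to sketch: agents with roles run tasks sequentially, each seeing the accumulated results of earlier tasks. A stand-in version in plain Python (the &lt;code&gt;perform&lt;/code&gt; method fakes what would be an LLM call in the real framework):&lt;/p&gt;

```python
# Role-based crews in miniature: tasks run sequentially, and every task
# sees the results accumulated so far.
from dataclasses import dataclass

@dataclass
class RoleAgent:
    role: str

    def perform(self, task, context):
        # Stand-in for an LLM call; a real agent would prompt a model here.
        return f"[{self.role}] {task} (given {len(context)} prior results)"

def run_crew(assignments):
    context = []
    for agent, task in assignments:
        context.append(agent.perform(task, context))
    return context

crew = [
    (RoleAgent("researcher"), "gather sources"),
    (RoleAgent("writer"), "draft article"),
    (RoleAgent("editor"), "polish draft"),
]
results = run_crew(crew)
for line in results:
    print(line)
```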






&lt;h3&gt;
  
  
  Microsoft AutoGen
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/microsoft/autogen" rel="noopener noreferrer"&gt;AutoGen&lt;/a&gt; is Microsoft's framework for building scalable multi-agent conversational applications. It excels at patterns where agents debate, critique, and collaborate through structured conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rich multi-agent conversation patterns — critic, coder, planner, executor roles&lt;/li&gt;
&lt;li&gt;Deep Azure ecosystem integration (Azure OpenAI, Cognitive Search, Container Apps)&lt;/li&gt;
&lt;li&gt;Strong research foundation (from Microsoft Research)&lt;/li&gt;
&lt;li&gt;Code execution capabilities with Docker-based isolation&lt;/li&gt;
&lt;li&gt;Active community and &lt;a href="https://microsoft.github.io/autogen/" rel="noopener noreferrer"&gt;growing sample library&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API has undergone significant redesigns (AutoGen 0.4 → AgentChat) — migration friction&lt;/li&gt;
&lt;li&gt;Heavier abstraction than OpenAI Agents SDK for simple use cases&lt;/li&gt;
&lt;li&gt;Primarily Python — limited multi-language support&lt;/li&gt;
&lt;li&gt;Conversation-centric design doesn't fit all agent patterns&lt;/li&gt;
&lt;li&gt;Enterprise governance still requires custom Azure integration work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Research teams and enterprises in the Microsoft ecosystem building multi-agent conversational systems — code review agents, planning committees, or collaborative debugging workflows.&lt;/p&gt;
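&lt;p&gt;The critic/coder pattern can be shown as a toy loop: agents exchange messages until the critic approves or the turn budget runs out. The termination rule below is hard-coded for illustration; in AutoGen the model makes that judgment.&lt;/p&gt;

```python
# Critic/coder conversation in miniature: the coder revises until the
# critic approves (approval is hard-coded here for determinism).
def coder(history):
    attempt = len([m for m in history if m[0] == "coder"]) + 1
    return f"solution v{attempt}"

def critic(history):
    return "APPROVE" if history[-1][1] == "solution v2" else "revise"

def converse(max_turns=6):
    history = []
    for _ in range(max_turns):
        history.append(("coder", coder(history)))
        verdict = critic(history)
        history.append(("critic", verdict))
        if verdict == "APPROVE":
            break
    return history

log = converse()
print(log)
```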






&lt;h3&gt;
  
  
  Microsoft Semantic Kernel
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/microsoft/semantic-kernel" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt; is Microsoft's orchestration framework for building AI copilots and agents in enterprise applications. It bridges LLM capabilities with traditional application code through a plugin architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-language — C#, Java, Python, JavaScript support&lt;/li&gt;
&lt;li&gt;Tight Azure and Microsoft 365 integration (RBAC, managed identities, Entra ID)&lt;/li&gt;
&lt;li&gt;Plugin architecture makes it natural for enterprise "copilot" experiences&lt;/li&gt;
&lt;li&gt;Strong typing and enterprise patterns (.NET-first design)&lt;/li&gt;
&lt;li&gt;Good fit for building custom internal copilots on Microsoft stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-agent support is manual — less opinionated than AutoGen or CrewAI&lt;/li&gt;
&lt;li&gt;Not designed primarily as an agent framework — more of an orchestrator&lt;/li&gt;
&lt;li&gt;Smaller community than LangChain&lt;/li&gt;
&lt;li&gt;.NET-first design can feel awkward in Python-dominant AI ecosystem&lt;/li&gt;
&lt;li&gt;Less third-party model support compared to LangChain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Enterprise .NET/Java teams building internal copilots on Azure. If your stack is C# + Azure + Microsoft 365, Semantic Kernel is the natural choice for AI-augmented applications.&lt;/p&gt;






&lt;h3&gt;
  
  
  Amazon Bedrock Agents
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/bedrock/agents/" rel="noopener noreferrer"&gt;Amazon Bedrock Agents&lt;/a&gt; is AWS's fully managed agent harness. You configure agents declaratively — pick a model, define action groups (Lambda functions), attach knowledge bases (OpenSearch/S3), and Bedrock handles the runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True managed harness — no agent loop code to write; you configure and deploy&lt;/li&gt;
&lt;li&gt;Strongest infrastructure isolation — Lambda/VPC/IAM per tool&lt;/li&gt;
&lt;li&gt;Deep AWS service integration (S3, DynamoDB, Step Functions, CloudWatch)&lt;/li&gt;
&lt;li&gt;Enterprise-grade governance — IAM, CloudTrail, service control policies, VPC endpoints&lt;/li&gt;
&lt;li&gt;Knowledge bases with automated RAG patterns&lt;/li&gt;
&lt;li&gt;Multi-model support (Claude, Llama, Titan, Mistral via Bedrock)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS lock-in — tools must be Lambda/AWS services&lt;/li&gt;
&lt;li&gt;Declarative configuration limits flexibility for novel agent patterns&lt;/li&gt;
&lt;li&gt;Multi-agent orchestration is indirect (via Step Functions, not native)&lt;/li&gt;
&lt;li&gt;No IDE integration — API/console only&lt;/li&gt;
&lt;li&gt;Cost can be opaque (token costs + Lambda + storage + data transfer)&lt;/li&gt;
&lt;li&gt;Less community tooling compared to open-source frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; AWS-native enterprises that want a managed, governed agent runtime with minimal custom code. If your infrastructure is already on AWS and compliance requirements are strict, Bedrock Agents' built-in governance is a major advantage.&lt;/p&gt;
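&lt;p&gt;The declarative shape is the point: an agent definition is data, and each action resolves to a callable (in real Bedrock, a Lambda ARN behind an OpenAPI schema). A toy illustration with all names hypothetical:&lt;/p&gt;

```python
# Declarative wiring in miniature: the agent definition is plain data,
# and invoking an action is a lookup plus a call. All names hypothetical.
AGENT_DEF = {
    "model": "anthropic.claude-3-sonnet",
    "action_groups": {
        "orders": {"get_status": lambda order_id: f"order {order_id}: shipped"},
    },
}

def invoke_action(agent_def, group, action, **kwargs):
    # In Bedrock, this dispatch happens inside the managed runtime.
    return agent_def["action_groups"][group][action](**kwargs)

print(invoke_action(AGENT_DEF, "orders", "get_status", order_id="A1"))
```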






&lt;h3&gt;
  
  
  Google Vertex AI Agent Builder + ADK
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/agent-builder" rel="noopener noreferrer"&gt;Vertex AI Agent Builder&lt;/a&gt; is Google Cloud's managed harness, building on Dialogflow CX. The &lt;a href="https://cloud.google.com/agent-builder/agent-development-kit/overview" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt; is the open-source companion framework for building custom agents with multi-agent orchestration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managed harness with dialog management roots (Dialogflow CX) — great for conversational flows&lt;/li&gt;
&lt;li&gt;ADK is &lt;a href="https://github.com/google/adk-python" rel="noopener noreferrer"&gt;open source (Apache 2.0)&lt;/a&gt; with multi-agent support via &lt;code&gt;AgentTool&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Google Search grounding for real-time information access&lt;/li&gt;
&lt;li&gt;Vertex AI Search integration for enterprise RAG&lt;/li&gt;
&lt;li&gt;GCP governance — IAM, VPC Service Controls, Cloud Audit Logs&lt;/li&gt;
&lt;li&gt;Multi-model support via Vertex AI (Gemini, Claude, Llama, Mistral)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GCP lock-in for the managed harness&lt;/li&gt;
&lt;li&gt;Agent Builder's dialog-management heritage can feel constraining for code-centric agents&lt;/li&gt;
&lt;li&gt;ADK is newer and less battle-tested than LangChain/LangGraph&lt;/li&gt;
&lt;li&gt;Multi-agent patterns in ADK are still maturing&lt;/li&gt;
&lt;li&gt;Pricing complexity similar to AWS (token costs + GCP services)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; GCP-native enterprises building conversational agents or teams wanting an open-source framework (ADK) with optional managed deployment. The Dialogflow heritage makes it strong for customer-facing chatbots.&lt;/p&gt;






&lt;h3&gt;
  
  
  Cursor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cursor.com/" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; is an AI-native code editor (VS Code fork) with a built-in agent mode that can autonomously plan, write, and test code within your project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seamless agent-in-editor experience — no context switching&lt;/li&gt;
&lt;li&gt;Strong codebase understanding via semantic indexing&lt;/li&gt;
&lt;li&gt;Agent mode handles multi-step tasks (implement feature → write tests → debug)&lt;/li&gt;
&lt;li&gt;Active development with rapid feature iteration&lt;/li&gt;
&lt;li&gt;Growing user base and community&lt;/li&gt;
&lt;li&gt;Competitive free tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proprietary — limited extensibility beyond what Cursor provides&lt;/li&gt;
&lt;li&gt;No governance hooks for enterprise policy enforcement&lt;/li&gt;
&lt;li&gt;Agent is a black box — limited observability into decisions&lt;/li&gt;
&lt;li&gt;Multi-agent patterns not supported (single agent experience)&lt;/li&gt;
&lt;li&gt;Fork dependency on VS Code means extension compatibility lags&lt;/li&gt;
&lt;li&gt;No CLI agent capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Individual developers who want the smoothest AI-in-editor experience and are comfortable with a curated, opinionated tool. Less suitable for enterprises needing governance and policy control.&lt;/p&gt;






&lt;h3&gt;
  
  
  Windsurf / Codeium
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://codeium.com/windsurf" rel="noopener noreferrer"&gt;Windsurf&lt;/a&gt; is Codeium's AI-native IDE with agent capabilities including "Cascade" — a multi-step agentic flow that can understand context across your entire codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong codebase-wide context understanding&lt;/li&gt;
&lt;li&gt;Cascade flow feature for multi-step agentic work&lt;/li&gt;
&lt;li&gt;Competitive pricing with a generous free tier&lt;/li&gt;
&lt;li&gt;Fast completions with low latency&lt;/li&gt;
&lt;li&gt;Enterprise deployment options (on-prem inference, data locality)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller ecosystem and community than Cursor or VS Code + Copilot&lt;/li&gt;
&lt;li&gt;Limited extensibility — agent capabilities are vendor-controlled&lt;/li&gt;
&lt;li&gt;No governance hooks or enterprise policy framework&lt;/li&gt;
&lt;li&gt;Acquisition turbulence in 2025 (the announced OpenAI deal collapsed; Windsurf was subsequently acquired by Cognition) creates strategic uncertainty&lt;/li&gt;
&lt;li&gt;Multi-agent is not user-configurable&lt;/li&gt;
&lt;li&gt;No CLI support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Developers wanting a fast, capable AI IDE with good codebase understanding at a competitive price point. The on-prem inference option matters for teams with strict data locality requirements.&lt;/p&gt;






&lt;h3&gt;
  
  
  Devin
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://devin.ai/" rel="noopener noreferrer"&gt;Devin&lt;/a&gt; by Cognition is a fully autonomous AI software engineer that operates in its own cloud environment. It can plan, code, debug, and deploy with minimal human intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most autonomous agent — handles end-to-end tasks from plan to PR&lt;/li&gt;
&lt;li&gt;Own cloud environment with full dev tools (browser, terminal, IDE)&lt;/li&gt;
&lt;li&gt;Parallel Devins for concurrent work on multiple tasks&lt;/li&gt;
&lt;li&gt;Interactive planning for collaborative task scoping&lt;/li&gt;
&lt;li&gt;Devin Search and Wiki for codebase exploration and documentation&lt;/li&gt;
&lt;li&gt;Slack integration for conversational task delegation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive — &lt;a href="https://techcrunch.com/2025/04/03/devin-the-viral-coding-ai-agent-gets-a-new-pay-as-you-go-plan/" rel="noopener noreferrer"&gt;$20/mo entry then $2.25 per ACU ($500/mo for teams)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://venturebeat.com/programming-development/devin-2-0-is-here-cognition-slashes-price-of-ai-software-engineer-to-20-per-month-from-500" rel="noopener noreferrer"&gt;Reliability concerns — independent evaluations found low task completion rates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fully proprietary with no extensibility beyond provided integrations&lt;/li&gt;
&lt;li&gt;Cloud-only — can't run locally or air-gapped&lt;/li&gt;
&lt;li&gt;Opaque internals — limited observability into agent decisions&lt;/li&gt;
&lt;li&gt;No governance framework for enterprise policy enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; Teams with well-scoped, repetitive tasks that benefit from full autonomy (migrations, boilerplate generation, documentation). Use with supervision — it's powerful but not yet reliable enough for unsupervised production work on complex codebases.&lt;/p&gt;






&lt;h3&gt;
  
  
  JetBrains AI Assistant
&lt;/h3&gt;

&lt;p&gt;JetBrains AI is integrated into IntelliJ, PyCharm, WebStorm, and the full JetBrains IDE family, with an agent mode called Junie for autonomous multi-step coding tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native integration in the full JetBrains IDE family&lt;/li&gt;
&lt;li&gt;Junie agent mode for autonomous multi-step tasks&lt;/li&gt;
&lt;li&gt;Leverages JetBrains' deep code analysis (inspections, refactoring, type inference)&lt;/li&gt;
&lt;li&gt;On-prem inference options for sensitive environments&lt;/li&gt;
&lt;li&gt;Multi-model support (OpenAI, Anthropic, Google, local models)&lt;/li&gt;
&lt;li&gt;Bundled with JetBrains All Products Pack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JetBrains IDEs only — no VS Code, no CLI&lt;/li&gt;
&lt;li&gt;Agent capabilities are newer and less mature than Cursor or Copilot&lt;/li&gt;
&lt;li&gt;Limited extensibility for custom agent behaviors&lt;/li&gt;
&lt;li&gt;No governance/hooks framework comparable to Copilot's hooks.json&lt;/li&gt;
&lt;li&gt;Smaller AI-focused community compared to VS Code ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; JetBrains users who don't want to switch editors but want AI agent capabilities. The deep IDE integration (inspections, refactoring) gives it advantages in languages where JetBrains excels (Java, Kotlin, Python).&lt;/p&gt;






&lt;h3&gt;
  
  
  Mastra
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://mastra.ai/" rel="noopener noreferrer"&gt;Mastra&lt;/a&gt; is a TypeScript-first agent framework focused on observability and developer experience. It's designed for building multi-agent systems in Node.js applications with built-in visibility into agent behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TypeScript-native — first-class experience for Node.js/Next.js teams&lt;/li&gt;
&lt;li&gt;Built-in observability (metrics, logs, visualization of agent flows)&lt;/li&gt;
&lt;li&gt;Explicit memory model — developers see how and when memory is read/written&lt;/li&gt;
&lt;li&gt;Multi-agent message flows with clear debugging&lt;/li&gt;
&lt;li&gt;Growing ecosystem with modern developer ergonomics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TypeScript/Node.js only — no Python, C#, or Java support&lt;/li&gt;
&lt;li&gt;Newer and smaller community than LangChain or CrewAI&lt;/li&gt;
&lt;li&gt;No built-in sandboxing or governance&lt;/li&gt;
&lt;li&gt;Less battle-tested in production than established frameworks&lt;/li&gt;
&lt;li&gt;Limited model provider integrations compared to LangChain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Best for:&lt;/strong&gt; TypeScript teams building multi-agent applications who prioritize observability and debuggability. If your stack is Next.js/Node.js and you want to see exactly what your agents are doing, Mastra's visibility is a differentiator.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Governance Gap
&lt;/h2&gt;


&lt;p&gt;Here's what surprised me most when building this comparison: &lt;strong&gt;most agent platforms have no governance story at all.&lt;/strong&gt; Cursor, Windsurf, CrewAI, Devin — they all have "user clicks approve" and that's it. There's no programmatic policy layer, no pre-tool-call interception, no audit trail that an enterprise compliance team would accept.&lt;/p&gt;

&lt;p&gt;Only three platforms offer real governance primitives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Copilot&lt;/strong&gt; — &lt;a href="https://htek.dev/articles/hookflows-governed-git-for-ai-agents/" rel="noopener noreferrer"&gt;hooks.json&lt;/a&gt; with pre/post tool call interception + extension allowlists + org-level policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock Agents&lt;/strong&gt; — IAM + CloudTrail + service control policies + VPC endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Vertex AI Agent Builder&lt;/strong&gt; — IAM + Cloud Audit Logs + VPC Service Controls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The frameworks (LangChain, AutoGen, etc.) give you &lt;em&gt;hooks&lt;/em&gt; to build governance, but you're writing that layer yourself. That's fine for startups but a non-starter for regulated enterprises. If governance is a requirement — and in 2026, it should be — your shortlist gets very short very fast.&lt;/p&gt;

&lt;p&gt;I wrote about this gap in depth in my &lt;a href="https://htek.dev/articles/three-layers-your-ai-agent-is-missing/" rel="noopener noreferrer"&gt;three layers your AI agent is missing&lt;/a&gt; article, and built &lt;a href="https://github.com/htekdev/agent-harness" rel="noopener noreferrer"&gt;&lt;code&gt;@htekdev/agent-harness&lt;/code&gt;&lt;/a&gt; specifically to address it.&lt;/p&gt;





&lt;h2&gt;
  
  
  How to Choose
&lt;/h2&gt;


&lt;p&gt;Don't start with "which platform is best?" Start with &lt;strong&gt;"what am I building?"&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you're building...&lt;/th&gt;
&lt;th&gt;Start here&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A custom AI application (chatbot, RAG app, copilot)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;LangChain/LangGraph&lt;/strong&gt; or &lt;strong&gt;Semantic Kernel&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Maximum flexibility and model portability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI coding assistance in your editor&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GitHub Copilot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broadest IDE + CLI + cloud coverage with governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A quick AI coding setup, single-editor focus&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cursor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Most polished single-editor experience&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed, governed agents on AWS&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon Bedrock Agents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise governance out of the box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed, governed agents on GCP&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Vertex AI Agent Builder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise governance out of the box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A CLI-first agentic coding workflow&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Copilot CLI&lt;/strong&gt; or &lt;strong&gt;Claude Code&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Extensions/hooks vs MCP extensibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent prototypes with roles&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fastest time-to-prototype for role-based systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent conversational systems&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AutoGen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rich debate/critique/collaborate patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent graph-based orchestration&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best-in-class for stateful graph workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full autonomous task delegation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Devin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Highest autonomy level (with supervision)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal copilots on Microsoft stack&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Semantic Kernel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native .NET/Azure/M365 integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript-first agent apps&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mastra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best observability for Node.js agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimal multi-agent SDK&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Agents SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cleanest API with handoff pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;





&lt;h2&gt;
  
  
  Where Copilot Stands — Honest Assessment
&lt;/h2&gt;


&lt;p&gt;I use Copilot every day — it runs &lt;a href="https://htek.dev/articles/what-is-context-engineering-practical-guide-50-agents/" rel="noopener noreferrer"&gt;50+ agents managing my home&lt;/a&gt;, my &lt;a href="https://htek.dev/articles/agentic-video-editing-future/" rel="noopener noreferrer"&gt;content pipeline&lt;/a&gt;, and my development workflow. So let me be direct about where it leads and where it doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Copilot genuinely leads:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem breadth&lt;/strong&gt; — the only platform spanning IDE (all major editors), CLI, cloud agent, and API. Nobody else covers all four surfaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance&lt;/strong&gt; — hooks.json is unique. No other IDE agent gives you programmatic pre/post tool-call interception. For enterprises, this is a dealbreaker in Copilot's favor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensions&lt;/strong&gt; — the ability to turn any service into an agent tool via the extensions API is unique among IDE agents. Cursor and Windsurf are closed ecosystems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise trust&lt;/strong&gt; — IP indemnity, content exclusions, SSO, audit logs, org-level policy. GitHub spent years earning enterprise trust, and it shows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub integration&lt;/strong&gt; — Issues → cloud agent → PR → Actions → deploy. The full software lifecycle, automated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where others have edges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code's MCP protocol&lt;/strong&gt; is more open and portable than Copilot's extensions API. MCP works across vendors; Copilot extensions are GitHub-specific.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor's in-editor UX&lt;/strong&gt; is more polished for pure coding tasks. The diff/apply flow feels snappier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph's orchestration&lt;/strong&gt; is more flexible than Copilot CLI's multi-agent patterns for complex stateful workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bedrock and Vertex&lt;/strong&gt; offer stronger cloud-native governance for non-GitHub-centric enterprises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Devin's autonomy level&lt;/strong&gt; exceeds what any IDE agent currently attempts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a contest where one tool wins everything. It's a landscape where your constraints determine the right choice.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;


&lt;p&gt;The agent harness landscape in 2026 is where container orchestration was in 2016 — fragmented, fast-moving, and converging toward patterns that aren't fully standardized yet. The &lt;a href="https://www.cncf.io/blog/2026/01/23/the-autonomous-enterprise-and-the-four-pillars-of-platform-control-2026-forecast" rel="noopener noreferrer"&gt;CNCF's four pillars of platform control&lt;/a&gt; (golden paths, guardrails, safety nets, manual review) are emerging as the design principles every harness will eventually implement.&lt;/p&gt;

&lt;p&gt;My bet: by 2027, the distinction between "agent harness" and "agent framework" will dissolve. Frameworks will grow governance layers. Harnesses will expose programmable hooks. MCP or something like it will become the standard tool protocol. And the platforms that survive will be the ones that nailed the balance between developer autonomy and organizational control.&lt;/p&gt;

&lt;p&gt;Until then, choose based on what you actually need today. Use the comparison tables. Read the pros and cons. And remember: &lt;strong&gt;the best agent harness is the one your team can actually govern in production.&lt;/strong&gt;&lt;/p&gt;





&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;


&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents" rel="noopener noreferrer"&gt;Anthropic: Building Effective Harnesses for Long-Running Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/blog/2026/01/23/the-autonomous-enterprise-and-the-four-pillars-of-platform-control-2026-forecast" rel="noopener noreferrer"&gt;CNCF: The Four Pillars of Platform Control (2026 Forecast)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/copilot/building-copilot-extensions" rel="noopener noreferrer"&gt;GitHub Copilot Extensions Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/copilot/concepts/coding-agent/coding-agent" rel="noopener noreferrer"&gt;GitHub Copilot Cloud Agent Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openai/openai-agents-python" rel="noopener noreferrer"&gt;OpenAI Agents SDK (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview" rel="noopener noreferrer"&gt;Claude Code Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://langchain-ai.github.io/langgraph/" rel="noopener noreferrer"&gt;LangGraph Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.crewai.com/" rel="noopener noreferrer"&gt;CrewAI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/microsoft/autogen" rel="noopener noreferrer"&gt;Microsoft AutoGen (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/microsoft/semantic-kernel" rel="noopener noreferrer"&gt;Microsoft Semantic Kernel (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/bedrock/agents/" rel="noopener noreferrer"&gt;Amazon Bedrock Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/agent-builder" rel="noopener noreferrer"&gt;Google Vertex AI Agent Builder&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener noreferrer"&gt;Google ADK (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cursor.com/" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://codeium.com/windsurf" rel="noopener noreferrer"&gt;Windsurf / Codeium&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://devin.ai/" rel="noopener noreferrer"&gt;Devin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastra.ai/" rel="noopener noreferrer"&gt;Mastra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2025/12/agent-frameworks-vs-runtimes-vs-harnesses" rel="noopener noreferrer"&gt;Analytics Vidhya: Agent Frameworks vs Runtimes vs Harnesses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.atlan.com/know/best-ai-agent-harness-tools-2026/" rel="noopener noreferrer"&gt;Atlan: Best AI Agent Harness Tools 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/htekdev/agent-harness" rel="noopener noreferrer"&gt;@htekdev/agent-harness (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>aiagents</category>
      <category>agenticdevelopment</category>
      <category>github</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why your .NET 8 API needs a cache layer — and how to build it right with Redis/Valkey and tag invalidation</title>
      <dc:creator>fenixkit</dc:creator>
      <pubDate>Sun, 17 May 2026 18:18:44 +0000</pubDate>
      <link>https://experimental.forem.com/fenixkit/why-your-net-8-api-needs-a-cache-layer-and-how-to-build-it-right-with-redisvalkey-and-tag-53am</link>
      <guid>https://experimental.forem.com/fenixkit/why-your-net-8-api-needs-a-cache-layer-and-how-to-build-it-right-with-redisvalkey-and-tag-53am</guid>
      <description>&lt;p&gt;Caching is one of those things that sounds optional until your database starts getting hammered at scale, your response times creep up, and you realise you've been querying the same data hundreds of times per minute. This article covers why a cache layer matters, how to implement cache-aside properly with tag-based invalidation in .NET 8, how to handle Redis outages gracefully, and why Valkey is worth knowing about.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why bother with cache at all?
&lt;/h2&gt;

&lt;p&gt;The short answer: your database doesn't need to answer the same question twice.&lt;/p&gt;

&lt;p&gt;A typical read-heavy API hits the database for the same product list, the same user profile, the same category results — on every request. Each one is a network round trip, a query execution, and serialisation overhead. At low traffic it's fine. At scale it isn't.&lt;/p&gt;

&lt;p&gt;A cache layer puts the answer in Redis the first time, and returns it directly on every subsequent request — milliseconds, no database involved.&lt;/p&gt;

&lt;p&gt;The reasons people avoid it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;"It adds complexity"&lt;/em&gt; — only if you build it badly&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Cache invalidation is hard"&lt;/em&gt; — it is, but it doesn't have to be unpredictable&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Redis going down takes my API down"&lt;/em&gt; — only if you don't handle it properly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three are solvable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The cache-aside pattern
&lt;/h2&gt;

&lt;p&gt;Cache-aside is the simplest correct approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;On read&lt;/strong&gt; — check Redis first. Hit → return. Miss → query the database, populate Redis, return.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On write&lt;/strong&gt; — invalidate the relevant cache entries, then write to the database.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /api/products/abc123

  1. Check Redis  ──▶  HIT  ──▶  return cached JSON ✓
               └──▶  MISS ──▶  query database
                              └──▶  populate Redis ──▶  return ✓

PUT /api/products/abc123

  → invalidate cache entries for this product
  → write to database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple in theory. The problem is step 2 — &lt;em&gt;which&lt;/em&gt; cache entries do you invalidate?&lt;/p&gt;
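
&lt;p&gt;The read path fits in a few lines of .NET 8 with &lt;code&gt;StackExchange.Redis&lt;/code&gt; (a minimal sketch: the &lt;code&gt;CacheAside&lt;/code&gt; class and &lt;code&gt;GetOrSetAsync&lt;/code&gt; name are illustrative, not from a specific package):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Text.Json;
using StackExchange.Redis;

// Cache-aside read path: Redis first, database only on a miss, repopulate on the way out.
public sealed class CacheAside
{
    private readonly IDatabase _redis;

    public CacheAside(IConnectionMultiplexer mux) =&gt; _redis = mux.GetDatabase();

    public async Task&lt;T&gt; GetOrSetAsync&lt;T&gt;(string key, Func&lt;Task&lt;T&gt;&gt; loadFromDb, TimeSpan ttl)
    {
        RedisValue hit = await _redis.StringGetAsync(key);
        if (hit.HasValue)
            return JsonSerializer.Deserialize&lt;T&gt;((string)hit!)!;   // hit: no database involved

        T value = await loadFromDb();                               // miss: one database query
        await _redis.StringSetAsync(key, JsonSerializer.Serialize(value), ttl);
        return value;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A handler then reads through the helper, e.g. &lt;code&gt;GetOrSetAsync($"product:{id}", () =&gt; repo.FindAsync(id), TimeSpan.FromMinutes(5))&lt;/code&gt;.&lt;/p&gt;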




&lt;h2&gt;
  
  
  The invalidation problem
&lt;/h2&gt;

&lt;p&gt;If you cache by key only (&lt;code&gt;product:abc123&lt;/code&gt;), that's easy — delete that key on update. But most APIs cache more than that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paged lists — &lt;code&gt;product:paged:p1:s20&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Cursor pages — &lt;code&gt;product:cursor:start:20:fwd&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Filtered results — &lt;code&gt;product:category:Gaming&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you update a product, all of those &lt;em&gt;might&lt;/em&gt; be stale. You can't just delete one key.&lt;/p&gt;

&lt;p&gt;The naive solution is to expire everything with a short TTL. It works, but it means serving stale data for up to N minutes after every write, and it doesn't scale — at high write rates your cache is constantly cold.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tag-based invalidation
&lt;/h2&gt;

&lt;p&gt;A better approach: every cached entry is registered under one or more &lt;em&gt;tags&lt;/em&gt;. When you write, you invalidate by tag — wiping all entries associated with that tag at once.&lt;/p&gt;

&lt;p&gt;In Redis, a tag is a Set that holds the keys registered under it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;abc123              STRING   cached product JSON          TTL 5 min&lt;/span&gt;
&lt;span class="py"&gt;product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;paged:p1:s20        STRING   cached page JSON             TTL 5 min&lt;/span&gt;
&lt;span class="py"&gt;product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;category:Gaming     STRING   cached category list         TTL 5 min&lt;/span&gt;

&lt;span class="py"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;product                 SET      { paged keys, cursor keys }    no TTL&lt;/span&gt;
&lt;span class="py"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;product:abc123          SET      { "product:abc123" }            no TTL&lt;/span&gt;
&lt;span class="py"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;product:category:Gaming SET      { "product:category:..." }      no TTL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tag sets have no TTL — they are deleted when &lt;code&gt;InvalidateByTagAsync&lt;/code&gt; runs, leaving no orphaned entries.&lt;/p&gt;

&lt;p&gt;On every write, the repository wipes all matching tags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The update case&lt;/strong&gt; is worth calling out: when a product moves from &lt;code&gt;Electronics&lt;/code&gt; to &lt;code&gt;Gaming&lt;/code&gt;, you need to invalidate &lt;em&gt;both&lt;/em&gt; the old and new category cache. The solution is to union the tags from the original and the updated entity before invalidating — both category caches get wiped, no extra logic needed in your handler.&lt;/p&gt;
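
&lt;p&gt;Under the hood these are plain Redis Set operations. A sketch with &lt;code&gt;StackExchange.Redis&lt;/code&gt; (the method names mirror the article; the surrounding repository wiring is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Tag bookkeeping: each tag is a Redis SET whose members are cached keys.
public async Task RegisterTagsAsync(string cacheKey, params string[] tags)
{
    foreach (string tag in tags)
        await _redis.SetAddAsync($"tag:{tag}", cacheKey);         // SADD tag:product product:abc123
}

public async Task InvalidateByTagAsync(string tag)
{
    RedisKey tagKey = $"tag:{tag}";
    RedisValue[] members = await _redis.SetMembersAsync(tagKey);  // SMEMBERS: every key under this tag
    foreach (RedisValue member in members)
        await _redis.KeyDeleteAsync((string)member!);             // DEL each cached entry
    await _redis.KeyDeleteAsync(tagKey);                          // DEL the tag set itself
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For the update case, collect the tags from both the original and the updated entity, union them, and call &lt;code&gt;InvalidateByTagAsync&lt;/code&gt; once per distinct tag.&lt;/p&gt;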




&lt;h2&gt;
  
  
  Three levels of control
&lt;/h2&gt;

&lt;p&gt;Not everything needs automatic invalidation. A well-designed cache layer gives you three levels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Use for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automatic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Base repository calls &lt;code&gt;GetInvalidationTags&lt;/code&gt; on every write&lt;/td&gt;
&lt;td&gt;Standard CRUD — always on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tag-based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;_cache.InvalidateByTagAsync("product:category:Gaming")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Custom domain queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;_cache.InvalidateAsync("product:abc123")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Surgical single-key removal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You pick the right level per operation. Most of the time the automatic level handles everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Handling Redis outages — FailOpen vs FailClosed
&lt;/h2&gt;

&lt;p&gt;This is where most cache implementations go wrong. If Redis throws an exception and you let it propagate, your API returns 500s whenever the cache is unavailable — even though your data is perfectly fine in the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FailOpen&lt;/strong&gt; (recommended default): treat any Redis error as a cache miss. The request falls through to the database, succeeds, and returns normally. Redis being down is a performance degradation, not an outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FailClosed&lt;/strong&gt;: return an error when Redis is unavailable. Use this only when cache correctness is a hard requirement.&lt;/p&gt;

&lt;p&gt;For most APIs, FailOpen is the right default. Redis is a performance layer, not a source of truth.&lt;/p&gt;
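
&lt;p&gt;FailOpen is a try/catch at the cache boundary. A sketch (the exception types are real &lt;code&gt;StackExchange.Redis&lt;/code&gt; types; the method itself is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// FailOpen: any Redis failure is downgraded to a cache miss.
public async Task&lt;string?&gt; TryGetAsync(string key)
{
    try
    {
        RedisValue hit = await _redis.StringGetAsync(key);
        return hit.HasValue ? (string?)hit : null;
    }
    catch (RedisConnectionException)
    {
        return null;   // Redis down: caller falls through to the database
    }
    catch (RedisTimeoutException)
    {
        return null;   // a slow cache is treated the same as no cache
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;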




&lt;h2&gt;
  
  
  Making cache optional
&lt;/h2&gt;

&lt;p&gt;There are scenarios where you want to run without Redis entirely — local development or environments where you haven't provisioned a cache server yet.&lt;/p&gt;

&lt;p&gt;The clean solution is a no-op implementation of your cache interface that can be swapped in via config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// appsettings.json / .env&lt;/span&gt;
&lt;span class="n"&gt;Cache__Enabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="k"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When disabled: the cache interface resolves to a no-op, &lt;code&gt;IConnectionMultiplexer&lt;/code&gt; is never registered, and the Redis health check is omitted automatically. No code changes required anywhere else.&lt;/p&gt;
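
&lt;p&gt;In &lt;code&gt;Program.cs&lt;/code&gt; this is one branch at registration time (a sketch: &lt;code&gt;ICacheService&lt;/code&gt;, &lt;code&gt;RedisCacheService&lt;/code&gt; and &lt;code&gt;NoOpCacheService&lt;/code&gt; stand in for your own types, and &lt;code&gt;AddRedis&lt;/code&gt; comes from the &lt;code&gt;AspNetCore.HealthChecks.Redis&lt;/code&gt; package):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Program.cs: swap the cache implementation based on config.
bool cacheEnabled = builder.Configuration.GetValue&lt;bool&gt;("Cache:Enabled");

if (cacheEnabled)
{
    string conn = builder.Configuration["Cache:ConnectionString"]!;
    builder.Services.AddSingleton&lt;IConnectionMultiplexer&gt;(_ =&gt; ConnectionMultiplexer.Connect(conn));
    builder.Services.AddSingleton&lt;ICacheService, RedisCacheService&gt;();
    builder.Services.AddHealthChecks().AddRedis(conn);   // health check only when Redis is in play
}
else
{
    // No Redis at all: every Get is a miss, every Set is a no-op.
    builder.Services.AddSingleton&lt;ICacheService, NoOpCacheService&gt;();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;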




&lt;h2&gt;
  
  
  Valkey — the Redis fork worth knowing about
&lt;/h2&gt;

&lt;p&gt;In 2024, Redis moved off the permissive BSD licence to terms that are no longer OSI-approved open source. In response, the community forked Redis at version 7.2 under the Linux Foundation and created &lt;a href="https://valkey.io" rel="noopener noreferrer"&gt;Valkey&lt;/a&gt; — an open-source, community-maintained drop-in replacement.&lt;/p&gt;

&lt;p&gt;Valkey is wire-protocol compatible with Redis, and &lt;code&gt;StackExchange.Redis&lt;/code&gt; connects to it transparently — no client or application code changes needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.valkey.yml&lt;/span&gt;
&lt;span class="na"&gt;valkey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;valkey/valkey:7.2-alpine&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;valkey-server --requirepass ${CACHE_PASSWORD}&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6379:6379"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valkey:6379,password=yourpassword,protocol=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're happy with Redis 8, nothing changes. If you prefer a fully open-source stack, Valkey 7.2 is a transparent swap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;p&gt;The full pattern in a .NET 8 Minimal API:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read&lt;/strong&gt; — check Redis, miss falls through to the database, result populates Redis on return&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write&lt;/strong&gt; — union tags from old + new entity, invalidate, write to database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FailOpen&lt;/strong&gt; by default — Redis errors never surface as 500s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional&lt;/strong&gt; — disable via config, no-op swaps in automatically&lt;/li&gt;
&lt;/ol&gt;
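&lt;p&gt;A minimal sketch of the read path with FailOpen semantics, using &lt;code&gt;StackExchange.Redis&lt;/code&gt; directly (names and structure are illustrative, not the article's exact implementation):&lt;br&gt;
&lt;/p&gt;

```csharp
using System.Text.Json;
using StackExchange.Redis;

public static class CacheExtensions
{
    // Read-through helper: cache errors are swallowed (FailOpen), so Redis
    // being down degrades to "every request hits the database" instead of 500s.
    public static async Task&lt;T&gt; GetOrSetAsync&lt;T&gt;(
        this IDatabase cache,
        string key,
        Func&lt;Task&lt;T&gt;&gt; loadFromDb,
        TimeSpan ttl)
    {
        try
        {
            RedisValue hit = await cache.StringGetAsync(key);
            if (hit.HasValue)
                return JsonSerializer.Deserialize&lt;T&gt;(hit!)!;
        }
        catch (RedisException) { /* FailOpen: treat the error as a miss */ }

        T value = await loadFromDb();   // miss: fall through to the database

        try
        {
            await cache.StringSetAsync(key, JsonSerializer.Serialize(value), ttl);
        }
        catch (RedisException) { /* FailOpen: caching is best-effort */ }

        return value;
    }
}
```

The write path mirrors this shape, with the tag-union invalidation happening before the database write.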

&lt;p&gt;If you'd rather not wire all of this from scratch, I've packaged the full implementation into &lt;strong&gt;FenixKit&lt;/strong&gt; — .NET 8 Minimal API starter kits with the cache layer, tag invalidation, FailOpen, Valkey support, and health checks all included and pre-configured.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 &lt;a href="https://github.com/fenixkitdev/FenixKit-MongoDB-Redis" rel="noopener noreferrer"&gt;FenixKit-MongoDB-Redis&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;a href="https://github.com/fenixkitdev/FenixKit-MongoDB-Keycloak-Redis" rel="noopener noreferrer"&gt;FenixKit-MongoDB-Keycloak-Redis&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🌐 &lt;a href="https://fenixkit.dev" rel="noopener noreferrer"&gt;fenixkit.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dotnet</category>
      <category>csharp</category>
      <category>redis</category>
      <category>api</category>
    </item>
    <item>
      <title>Automate your Hugo CV deployment with GitHub Actions</title>
      <dc:creator>Ulrich VACHON</dc:creator>
      <pubDate>Sun, 17 May 2026 18:18:07 +0000</pubDate>
      <link>https://experimental.forem.com/ulrich/automate-your-hugo-cv-deployment-with-github-actions-16co</link>
      <guid>https://experimental.forem.com/ulrich/automate-your-hugo-cv-deployment-with-github-actions-16co</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;In this article we will see how to automate the build and deployment of a Hugo-based CV site hosted on GitHub Pages. No more running Hugo by hand, just git push and you're done.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The purpose of this article is not to introduce Hugo or GitHub Pages from scratch, but instead to explain how to wire them together with GitHub Actions to get a clean, automated deployment pipeline for a developer CV site.&lt;/p&gt;

&lt;p&gt;💡 My CV site is live at &lt;a href="https://reservoircode.net/" rel="noopener noreferrer"&gt;reservoircode.net&lt;/a&gt;, so feel free to use it as a reference!&lt;/p&gt;

&lt;p&gt;👍 You can take a look at the project here: &lt;a href="https://github.com/ulrich/ulrich.github.io" rel="noopener noreferrer"&gt;github.com/ulrich/ulrich.github.io&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The context
&lt;/h2&gt;

&lt;p&gt;My CV is a static site generated with &lt;strong&gt;&lt;a href="https://gohugo.io/" rel="noopener noreferrer"&gt;Hugo&lt;/a&gt;&lt;/strong&gt; and hosted on &lt;strong&gt;GitHub Pages&lt;/strong&gt;. The theme is &lt;a href="https://github.com/cowboysmall-tools/hugo-devresume-theme" rel="noopener noreferrer"&gt;hugo-devresume-theme&lt;/a&gt;, added as a git submodule. All the content is driven by a single &lt;code&gt;config.toml&lt;/code&gt; file: experiences, skills, languages, everything.&lt;/p&gt;

&lt;p&gt;Before setting up the automation, my workflow was manual:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hugo
git add &lt;span class="nb"&gt;.&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Bla bla bla"&lt;/span&gt;
git push origin master
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not great. Let's fix that 😃&lt;/p&gt;




&lt;h2&gt;
  
  
  Branch strategy
&lt;/h2&gt;

&lt;p&gt;The key idea is to separate sources from the generated output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Branch&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;src&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hugo sources: &lt;code&gt;config.toml&lt;/code&gt;, theme submodule, static assets...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;master&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generated HTML served by GitHub Pages&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You work on &lt;code&gt;src&lt;/code&gt;, a push triggers the build, and &lt;code&gt;master&lt;/code&gt; gets updated automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  The GitHub Actions workflow
&lt;/h2&gt;

&lt;p&gt;Create the file &lt;code&gt;.github/workflows/deploy.yml&lt;/code&gt; on your &lt;code&gt;src&lt;/code&gt; branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Hugo site&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;src&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;submodules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Hugo&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;peaceiris/actions-hugo@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;hugo-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.81.0'&lt;/span&gt;
          &lt;span class="na"&gt;extended&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hugo --source src&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;peaceiris/actions-gh-pages@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;github_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
          &lt;span class="na"&gt;publish_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./src/public&lt;/span&gt;
          &lt;span class="na"&gt;publish_branch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
          &lt;span class="na"&gt;cname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reservoircode.net&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;submodules: true&lt;/code&gt;&lt;/strong&gt;. The theme is a git submodule. Without this flag, the clone would be incomplete and the build would fail silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;extended: true&lt;/code&gt;&lt;/strong&gt;. This is critical. The theme uses SCSS with Hugo template variables injected at build time (like &lt;code&gt;primaryColor&lt;/code&gt;). Without the extended version of Hugo, the SCSS is not compiled: your custom colors are silently ignored and the theme falls back to its hardcoded defaults.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;cname&lt;/code&gt;&lt;/strong&gt;. If you use a custom domain, this line regenerates the &lt;code&gt;CNAME&lt;/code&gt; file on every deploy. Without it, the file gets wiped on each push and your domain stops resolving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;github_token&lt;/code&gt;&lt;/strong&gt;. Automatically provided by GitHub, no manual secret setup needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Some improvements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Setting a custom font color with SCSS
&lt;/h3&gt;

&lt;p&gt;After the first successful deploy, my custom blue color (&lt;code&gt;#53abe7&lt;/code&gt;) was replaced by the theme's default green (&lt;code&gt;#54B689&lt;/code&gt;). The root cause: standard Hugo cannot process SCSS. The theme's stylesheet contains Hugo template directives like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scss"&gt;&lt;code&gt;&lt;span class="nv"&gt;$theme-color-primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="nc"&gt;.Site.Params.primaryColor&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nt"&gt;default&lt;/span&gt; &lt;span class="s2"&gt;"#54B689"&lt;/span&gt; &lt;span class="p"&gt;}};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without Hugo Extended, this variable is never injected and the default value is used. Adding &lt;code&gt;extended: true&lt;/code&gt; to the workflow fixed it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting avatar image
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;assets/&lt;/code&gt; folder in Hugo is processed through a pipeline, and its paths get rewritten against the base path. The fix is to place static files under &lt;code&gt;static/&lt;/code&gt; instead: Hugo copies its contents as-is to the root of the generated site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; src/static/assets/images
&lt;span class="nb"&gt;cp &lt;/span&gt;my-photo.png src/static/assets/images/avatar.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Be careful of baseURL
&lt;/h3&gt;

&lt;p&gt;The site is served from &lt;code&gt;reservoircode.net&lt;/code&gt;. Hugo uses &lt;code&gt;baseURL&lt;/code&gt; to build all absolute paths for images, CSS, JS. Updating it fixed the remaining broken assets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;baseURL&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://reservoircode.net/"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Customizing the theme without touching it
&lt;/h2&gt;

&lt;p&gt;The theme's layout files live in &lt;code&gt;src/themes/devresume/layouts/partials/&lt;/code&gt;. If you modify them directly, your changes get wiped next time you update the submodule.&lt;/p&gt;

&lt;p&gt;Hugo has a clean override mechanism: any file placed under &lt;code&gt;src/layouts/partials/&lt;/code&gt; takes priority over the theme's version. So to customize &lt;code&gt;experience.html&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; src/layouts/partials
&lt;span class="nb"&gt;cp &lt;/span&gt;src/themes/devresume/layouts/partials/experience.html src/layouts/partials/experience.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then edit &lt;code&gt;src/layouts/partials/experience.html&lt;/code&gt; freely. Your version will always win.&lt;/p&gt;

&lt;p&gt;I used this to add a &lt;code&gt;stack&lt;/code&gt; field to each experience entry. In &lt;code&gt;config.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[[params.experience.list]]&lt;/span&gt;
&lt;span class="py"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Lead Developer / Senior Software Engineer"&lt;/span&gt;
&lt;span class="py"&gt;dates&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"02/2025 – Present"&lt;/span&gt;
&lt;span class="py"&gt;company&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Rout'in · Reservoir Code · Hybrid"&lt;/span&gt;
&lt;span class="py"&gt;stack&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Java 25, Spring Boot 3, React, AWS, Terraform, EKS"&lt;/span&gt;
&lt;span class="py"&gt;details&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
Tech Lead for a team of 3 to 4 developers on the **Mobility Pass** platform...
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the overridden partial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"item-content"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;{{ with .details }}{{ . | markdownify }}{{ end }}&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
    {{ with .stack }}
    &lt;span class="nt"&gt;&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;&lt;/span&gt;Stack :&lt;span class="nt"&gt;&amp;lt;/strong&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text-muted"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;{{ . }}&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
    {{ end }}
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Markdown in config.toml
&lt;/h2&gt;

&lt;p&gt;Since the theme uses &lt;code&gt;| markdownify&lt;/code&gt; in its templates, you can write Markdown directly in your &lt;code&gt;config.toml&lt;/code&gt; strings. Use triple quotes for multiline content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;details&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
Led integration with a **major French payment service provider**.

Ran bi-weekly coordination meetings with OPS teams.
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ Watch out for indentation. In Markdown, 4 leading spaces mean a code block. Keep your content flush left inside the &lt;code&gt;"""&lt;/code&gt; block.&lt;/p&gt;
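&lt;p&gt;For example (illustrative content, shown as two alternatives):&lt;br&gt;
&lt;/p&gt;

```toml
# Renders as a literal code block: the lines start with 4 spaces
details = """
    Led integration with a payment provider.
    """

# Renders as a normal paragraph: content flush left
details = """
Led integration with a payment provider.
"""
```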




&lt;h2&gt;
  
  
  Updating the theme submodule
&lt;/h2&gt;

&lt;p&gt;The theme is pinned to a specific commit. To pull the latest version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;src/themes/devresume
git checkout master
git pull origin master
&lt;span class="nb"&gt;cd&lt;/span&gt; ../../..
git add src/themes/devresume
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Update theme"&lt;/span&gt;
git push origin src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;git add src/themes/devresume&lt;/code&gt; step updates the commit pointer stored in your repo. Without it, the submodule stays pinned to the old version.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The setup is now clean: edit &lt;code&gt;config.toml&lt;/code&gt; on &lt;code&gt;src&lt;/code&gt;, push, done. GitHub Actions handles the Hugo build and deploys the result to &lt;code&gt;master&lt;/code&gt;, which GitHub Pages serves on the custom domain, &lt;code&gt;CNAME&lt;/code&gt; file included.&lt;/p&gt;

&lt;p&gt;The main lesson from this experience: Hugo Extended is not optional when your theme compiles SCSS at build time. And the branch separation between sources and output is the right model for GitHub Pages, even if it requires a small upfront setup.&lt;/p&gt;

&lt;p&gt;Have a good day ☀️&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: &lt;code&gt;hugo&lt;/code&gt; &lt;code&gt;github&lt;/code&gt; &lt;code&gt;devops&lt;/code&gt; &lt;code&gt;webdev&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hugo</category>
      <category>github</category>
      <category>resume</category>
      <category>career</category>
    </item>
    <item>
      <title>Designing Reliable Permission Models with Lean 4</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Sun, 17 May 2026 18:15:22 +0000</pubDate>
      <link>https://experimental.forem.com/shrsv/designing-reliable-permission-models-with-lean-4-33lc</link>
      <guid>https://experimental.forem.com/shrsv/designing-reliable-permission-models-with-lean-4-33lc</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Most authorization systems begin simple.&lt;/p&gt;

&lt;p&gt;Then reality happens.&lt;/p&gt;

&lt;p&gt;Over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more roles get added,&lt;/li&gt;
&lt;li&gt;exceptions accumulate,&lt;/li&gt;
&lt;li&gt;workflows become stateful,&lt;/li&gt;
&lt;li&gt;permissions become inherited,&lt;/li&gt;
&lt;li&gt;AI assistants start generating handlers and refactors,&lt;/li&gt;
&lt;li&gt;and eventually nobody is fully certain what combinations are actually possible anymore.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many discussions around “AI-generated code safety” become unsatisfying.&lt;/p&gt;

&lt;p&gt;People often talk about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better prompts,&lt;/li&gt;
&lt;li&gt;more tests,&lt;/li&gt;
&lt;li&gt;stronger reviews,&lt;/li&gt;
&lt;li&gt;static analysis,&lt;/li&gt;
&lt;li&gt;or safer languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those help.&lt;/p&gt;

&lt;p&gt;But there is another direction worth exploring:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if some critical invariants were not merely &lt;em&gt;tested&lt;/em&gt;, but mathematically enforced?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“the code probably works,”&lt;/li&gt;
&lt;li&gt;or “the tests passed,”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;but:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“certain invalid states are mechanically impossible.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the interesting promise behind Lean.&lt;/p&gt;

&lt;p&gt;And permission systems are one of the best places to start because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;humans understand them intuitively,&lt;/li&gt;
&lt;li&gt;they are security-critical,&lt;/li&gt;
&lt;li&gt;and they become surprisingly difficult to reason about once complexity grows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tutorial walks through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;installing Lean 4,&lt;/li&gt;
&lt;li&gt;understanding the core mathematical ideas,&lt;/li&gt;
&lt;li&gt;building a permission model,&lt;/li&gt;
&lt;li&gt;proving security invariants,&lt;/li&gt;
&lt;li&gt;intentionally breaking them,&lt;/li&gt;
&lt;li&gt;and seeing how Lean prevents unsafe changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not academic theorem proving.&lt;/p&gt;

&lt;p&gt;The goal is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;designing systems where important security assumptions become hard to accidentally violate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  1. Installing Lean 4
&lt;/h1&gt;

&lt;p&gt;Lean 4 is unusual because it is simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a programming language,&lt;/li&gt;
&lt;li&gt;a compiler,&lt;/li&gt;
&lt;li&gt;and a theorem prover.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install it using &lt;code&gt;elan&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux/macOS
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh &lt;span class="nt"&gt;-sSf&lt;/span&gt; | sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Verify installation:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lean &lt;span class="nt"&gt;--version&lt;/span&gt;
lake &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h1&gt;
  
  
  2. Install the VSCode Extension
&lt;/h1&gt;

&lt;p&gt;Install:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Lean 4”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;from the VSCode marketplace.&lt;/p&gt;

&lt;p&gt;This gives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;live proof checking,&lt;/li&gt;
&lt;li&gt;inline errors,&lt;/li&gt;
&lt;li&gt;theorem goals,&lt;/li&gt;
&lt;li&gt;and interactive feedback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This interactivity matters a lot.&lt;/p&gt;

&lt;p&gt;Lean is less like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writing static code,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and more like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;continuously negotiating with a mathematical verifier.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  3. Create a Lean Project
&lt;/h1&gt;

&lt;p&gt;Create a project with Mathlib support:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lake new VerifiedPermissions math
&lt;span class="nb"&gt;cd &lt;/span&gt;VerifiedPermissions
code &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Open:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VerifiedPermissions/Basic.lean

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This file will contain both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;executable programs,&lt;/li&gt;
&lt;li&gt;and mathematical proofs about those programs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That duality is the central idea behind Lean.&lt;/p&gt;
&lt;h1&gt;
  
  
  4. First Lean Program
&lt;/h1&gt;

&lt;p&gt;Replace the file contents with:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;greet&lt;/span&gt; (&lt;span class="n"&gt;name&lt;/span&gt; : &lt;span class="n"&gt;String&lt;/span&gt;) : &lt;span class="n"&gt;String&lt;/span&gt; :=
  &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="s"&gt;"Hello, {name}"&lt;/span&gt;

&lt;span class="k"&gt;#eval&lt;/span&gt; &lt;span class="n"&gt;greet&lt;/span&gt; &lt;span class="s"&gt;"world"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let’s unpack this carefully.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;def&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;greet&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;def&lt;/code&gt; means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;define a function or value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is ordinary programming.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;(name : String)&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;(&lt;span class="n"&gt;name&lt;/span&gt; : &lt;span class="n"&gt;String&lt;/span&gt;)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the function accepts a parameter called &lt;code&gt;name&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;whose type is &lt;code&gt;String&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lean is statically typed.&lt;/p&gt;

&lt;p&gt;But unlike many languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;types in Lean are deeply connected to logic itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That becomes important later.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;: String&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;: &lt;span class="n"&gt;String&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This declares:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the function returns a string.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So mathematically:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;greet : String → String

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;greet maps one string into another string.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Functions in Lean are treated very mathematically.&lt;/p&gt;
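&lt;p&gt;If the &lt;code&gt;greet&lt;/code&gt; definition is still in your file, you can ask Lean to display this type for you:&lt;br&gt;
&lt;/p&gt;

```lean
-- Ask Lean for the type of greet; the infoview shows a function
-- from String to String.
#check greet

-- An anonymous function has the same String → String type:
#check fun (name : String) =&gt; s!"Hello, {name}"
```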
&lt;h2&gt;
  
  
  &lt;code&gt;:=&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;:=

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;is defined as.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;#eval&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;#eval&lt;/span&gt; &lt;span class="n"&gt;greet&lt;/span&gt; &lt;span class="s"&gt;"world"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Actually runs the program.&lt;/p&gt;

&lt;p&gt;This is important because Lean is not just:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a proof notation system,&lt;/li&gt;
&lt;li&gt;or symbolic logic language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is executable.&lt;/p&gt;
&lt;h1&gt;
  
  
  5. A Small Verified Function
&lt;/h1&gt;

&lt;p&gt;Now replace the file with:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;increment&lt;/span&gt; (&lt;span class="n"&gt;x&lt;/span&gt; : &lt;span class="n"&gt;Nat&lt;/span&gt;) : &lt;span class="n"&gt;Nat&lt;/span&gt; :=
  &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;increment_is_larger&lt;/span&gt; (&lt;span class="n"&gt;x&lt;/span&gt; : &lt;span class="n"&gt;Nat&lt;/span&gt;) :
  &lt;span class="n"&gt;increment&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;
  &lt;span class="n"&gt;exact&lt;/span&gt; &lt;span class="n"&gt;Nat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lt_succ_self&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is where things become interesting.&lt;/p&gt;

&lt;p&gt;You are no longer just writing code.&lt;/p&gt;

&lt;p&gt;You are writing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code,&lt;/li&gt;
&lt;li&gt;and mathematical claims about the code.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  6. Understanding the Mathematics Line by Line
&lt;/h1&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;Nat&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="n"&gt;Nat&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;natural numbers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0, 1, 2, 3…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lean treats mathematics as native objects.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;increment&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;increment&lt;/span&gt; (&lt;span class="n"&gt;x&lt;/span&gt; : &lt;span class="n"&gt;Nat&lt;/span&gt;) : &lt;span class="n"&gt;Nat&lt;/span&gt; :=
  &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is an executable function.&lt;/p&gt;

&lt;p&gt;Nothing unusual yet.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;theorem&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;increment_is_larger&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This changes everything conceptually.&lt;/p&gt;

&lt;p&gt;You are no longer saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I hope this property holds.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You are saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This property must be proven.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And Lean will refuse to continue unless the proof is valid.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;(x : Nat)&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The theorem applies universally.&lt;/p&gt;

&lt;p&gt;Meaning:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For every natural number x

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“for tested examples,”&lt;/li&gt;
&lt;li&gt;not “for likely inputs,”&lt;/li&gt;
&lt;li&gt;but literally all possible values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the biggest conceptual differences from testing.&lt;/p&gt;

&lt;p&gt;Tests are existential:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;These cases worked.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Proofs are universal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All valid inputs satisfy this property.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
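&lt;p&gt;The contrast is easy to see in Lean itself (this assumes the &lt;code&gt;increment&lt;/code&gt; definition from section 5 is in scope):&lt;br&gt;
&lt;/p&gt;

```lean
-- A test exercises one input:
#eval increment 41        -- evaluates to 42

-- A proof covers every input:
example : ∀ x : Nat, increment x &gt; x :=
  fun x =&gt; Nat.lt_succ_self x
```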

&lt;h2&gt;
  
  
  &lt;code&gt;increment x &amp;gt; x&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="n"&gt;increment&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is the claim being proven.&lt;/p&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;increment always returns a larger number.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;:= by&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;:= &lt;span class="k"&gt;by&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This begins a proof block.&lt;/p&gt;

&lt;p&gt;You are now constructing evidence that the statement is true.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;exact&lt;/code&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="n"&gt;exact&lt;/span&gt; &lt;span class="n"&gt;Nat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lt_succ_self&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;use an existing theorem directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;Nat.lt_succ_self&lt;/code&gt; is a theorem already known to Lean:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x &amp;lt; x + 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So Lean verifies your theorem by reducing it to already-proven mathematics.&lt;/p&gt;
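&lt;p&gt;Assembled, the whole file looks roughly like this (assuming &lt;code&gt;increment&lt;/code&gt; was defined earlier as &lt;code&gt;x + 1&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;def increment (x : Nat) : Nat := x + 1

theorem increment_is_larger (x : Nat) : increment x &amp;gt; x := by
  exact Nat.lt_succ_self x

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;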
&lt;h1&gt;
  
  
  7. Breaking the Proof Intentionally
&lt;/h1&gt;

&lt;p&gt;Now change:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="n"&gt;increment&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;to:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="n"&gt;increment&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You now claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;increment makes numbers smaller.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lean immediately rejects this.&lt;/p&gt;

&lt;p&gt;This is the first important moment.&lt;/p&gt;

&lt;p&gt;The theorem is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;documentation,&lt;/li&gt;
&lt;li&gt;comments,&lt;/li&gt;
&lt;li&gt;or developer intent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is mechanically enforced logic.&lt;/p&gt;
&lt;h1&gt;
  
  
  8. Building a Permission Model
&lt;/h1&gt;

&lt;p&gt;Now we move toward authorization systems.&lt;/p&gt;

&lt;p&gt;Replace the file with:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;inductive&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Guest&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Admin&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h1&gt;
  
  
  9. Understanding &lt;code&gt;inductive&lt;/code&gt;
&lt;/h1&gt;

&lt;p&gt;This line introduces a very important mathematical idea.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;inductive&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This defines a finite set of possible values.&lt;/p&gt;

&lt;p&gt;Mathematically:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Role ∈ {Guest, User, Admin}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is powerful because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;impossible states cannot exist,&lt;/li&gt;
&lt;li&gt;invalid roles cannot appear accidentally,&lt;/li&gt;
&lt;li&gt;and all cases must be handled explicitly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This already improves reliability substantially.&lt;/p&gt;
&lt;h1&gt;
  
  
  10. Defining Permissions
&lt;/h1&gt;

&lt;p&gt;Now add:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;canDelete&lt;/span&gt; : &lt;span class="n"&gt;Role&lt;/span&gt; &lt;span class="o"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Bool&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Guest&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Admin&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This means:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;canDelete maps a Role into a boolean

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;or mathematically:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Role → Bool

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every role deterministically maps to a permission decision.&lt;/li&gt;
&lt;/ul&gt;
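&lt;p&gt;You can spot-check the mapping with &lt;code&gt;#eval&lt;/code&gt; (the comments show what Lean prints for these two cases):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;#eval canDelete Role.Guest   -- false
#eval canDelete Role.Admin   -- true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;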
&lt;h1&gt;
  
  
  11. Why This Is Safer Than It Looks
&lt;/h1&gt;

&lt;p&gt;Notice something subtle.&lt;/p&gt;

&lt;p&gt;Lean forces all role cases to be handled.&lt;/p&gt;

&lt;p&gt;If you later add:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Moderator&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Lean immediately complains that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;canDelete&lt;/code&gt; is incomplete.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is extremely valuable operationally.&lt;/p&gt;

&lt;p&gt;In many production systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;new authorization states get introduced,&lt;/li&gt;
&lt;li&gt;old logic silently becomes incomplete,&lt;/li&gt;
&lt;li&gt;edge cases appear months later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lean forces exhaustive handling.&lt;/p&gt;

&lt;p&gt;That alone prevents many categories of policy drift.&lt;/p&gt;
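&lt;p&gt;For illustration, a resolved version might look like this. Denying &lt;code&gt;Moderator&lt;/code&gt; is a policy choice I am making up here; Lean only insists that you decide:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;inductive Role
| Guest
| User
| Moderator
| Admin

def canDelete : Role → Bool
| Role.Guest     =&amp;gt; false
| Role.User      =&amp;gt; false
| Role.Moderator =&amp;gt; false  -- the new case must be handled explicitly
| Role.Admin     =&amp;gt; true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;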
&lt;h1&gt;
  
  
  12. Adding Security Invariants
&lt;/h1&gt;

&lt;p&gt;Now add:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;guests_cannot_delete&lt;/span&gt; :
  &lt;span class="n"&gt;canDelete&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Guest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;
  &lt;span class="n"&gt;rfl&lt;/span&gt;

&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;users_cannot_delete&lt;/span&gt; :
  &lt;span class="n"&gt;canDelete&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;
  &lt;span class="n"&gt;rfl&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h1&gt;
  
  
  13. Understanding &lt;code&gt;rfl&lt;/code&gt;
&lt;/h1&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="n"&gt;rfl&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this is true by direct reduction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lean computes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;canDelete Role.User
→ false

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So the theorem becomes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;false = false

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;which is trivially true.&lt;/p&gt;
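&lt;p&gt;The same one-line proof works for the positive case too; for example (a theorem name I am adding here, not from the original file):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;theorem admins_can_delete :
  canDelete Role.Admin = true := by
  rfl

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;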
&lt;h1&gt;
  
  
  14. Introducing a Security Bug
&lt;/h1&gt;

&lt;p&gt;Now simulate a future refactor.&lt;/p&gt;

&lt;p&gt;Change:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;to:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Immediately:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="n"&gt;users_cannot_delete&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;fails.&lt;/p&gt;

&lt;p&gt;This is where the practical value starts appearing.&lt;/p&gt;

&lt;p&gt;The proof acts like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a permanently active security assertion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;documentation,&lt;/li&gt;
&lt;li&gt;review guidelines,&lt;/li&gt;
&lt;li&gt;or tribal knowledge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An enforced invariant.&lt;/p&gt;
&lt;h1&gt;
  
  
  15. Why This Matters More with AI-Generated Code
&lt;/h1&gt;

&lt;p&gt;The interesting part is not tiny examples like this.&lt;/p&gt;

&lt;p&gt;The interesting part is what happens later when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI assistants generate handlers,&lt;/li&gt;
&lt;li&gt;rewrite permission logic,&lt;/li&gt;
&lt;li&gt;refactor workflows,&lt;/li&gt;
&lt;li&gt;or modify state transitions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem is no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Will the code compile?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Did the generated system preserve critical invariants?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Formal models become interesting because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;implementations can change repeatedly,&lt;/li&gt;
&lt;li&gt;while the invariants remain fixed and machine-checked.&lt;/li&gt;
&lt;/ul&gt;
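&lt;p&gt;As a small sketch of that idea: even if &lt;code&gt;canDelete&lt;/code&gt; is later rewritten in a different shape (this refactor is hypothetical), the same theorems keep guarding it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;-- A hypothetical refactor: same policy, different shape.
def canDelete (r : Role) : Bool :=
  match r with
  | Role.Admin =&amp;gt; true
  | _          =&amp;gt; false

-- The invariants stay fixed and machine-checked:
theorem guests_cannot_delete :
  canDelete Role.Guest = false := by rfl

theorem users_cannot_delete :
  canDelete Role.User = false := by rfl

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;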
&lt;h1&gt;
  
  
  16. What Lean Is Actually Buying
&lt;/h1&gt;

&lt;p&gt;Lean does not magically create bug-free software.&lt;/p&gt;

&lt;p&gt;What it can realistically provide is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;machine-checked invariants,&lt;/li&gt;
&lt;li&gt;exhaustive handling of states,&lt;/li&gt;
&lt;li&gt;prevention of silent policy drift,&lt;/li&gt;
&lt;li&gt;stronger guarantees around transitions,&lt;/li&gt;
&lt;li&gt;and continuous enforcement of critical assumptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a narrower claim than:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“formally verified applications.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But it is also much more practical.&lt;/p&gt;

&lt;p&gt;And for authorization-heavy systems, even small mechanically enforced guarantees can become surprisingly valuable over time.&lt;/p&gt;



&lt;p&gt;&lt;em&gt;AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, unlimited AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
 &lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt; &lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt; &lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt; &lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt; &lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;

&lt;p&gt;AI agents write code fast. They also &lt;em&gt;silently remove logic&lt;/em&gt;, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; fixes this.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and reviews every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;See It In Action&lt;/h2&gt;
&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;See git-lrc catch serious security issues such as leaked credentials, expensive cloud
operations, and sensitive material in log statements&lt;/p&gt;
&lt;/blockquote&gt;

  
    
    

&lt;p&gt;(Video: git-lrc-intro-60s.mp4)&lt;/p&gt;
    
  

  

  


&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why&lt;/h2&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;🤖 &lt;strong&gt;AI agents silently break things.&lt;/strong&gt; Code removed. Logic changed. Edge cases gone. You won't notice until production.&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Catch it before it ships.&lt;/strong&gt; AI-powered inline comments show you &lt;em&gt;exactly&lt;/em&gt; what changed and what looks wrong.&lt;/li&gt;
&lt;li&gt;🔁 &lt;strong&gt;Build a&lt;/strong&gt;…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Built an AI Pair Programmer for VS Code Because Copilot Felt Too Expensive for Many Developers</title>
      <dc:creator>Aakash</dc:creator>
      <pubDate>Sun, 17 May 2026 18:14:40 +0000</pubDate>
      <link>https://experimental.forem.com/theaakashsingh/i-built-an-ai-pair-programmer-for-vs-code-because-copilot-felt-too-expensive-for-many-developers-8b0</link>
      <guid>https://experimental.forem.com/theaakashsingh/i-built-an-ai-pair-programmer-for-vs-code-because-copilot-felt-too-expensive-for-many-developers-8b0</guid>
      <description>&lt;h1&gt;
  
  
  I Built an AI Pair Programmer for VS Code Because Copilot Felt Too Expensive for Many Developers
&lt;/h1&gt;

&lt;p&gt;Like many developers, I started using AI coding assistants daily.&lt;/p&gt;

&lt;p&gt;They genuinely improve productivity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;autocomplete&lt;/li&gt;
&lt;li&gt;debugging&lt;/li&gt;
&lt;li&gt;refactoring&lt;/li&gt;
&lt;li&gt;explaining complex code&lt;/li&gt;
&lt;li&gt;generating boilerplate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I kept running into the same problems:&lt;/p&gt;

&lt;p&gt;❌ Expensive subscriptions&lt;br&gt;
❌ Heavy IDE experiences&lt;br&gt;
❌ Complicated onboarding&lt;br&gt;
❌ Too many features I never used&lt;/p&gt;

&lt;p&gt;So I decided to build something simpler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing DevMind AI
&lt;/h2&gt;

&lt;p&gt;DevMind is an AI pair programmer built specifically for VS Code.&lt;/p&gt;

&lt;p&gt;The goal was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Make AI coding assistance fast, lightweight, and affordable for developers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Features
&lt;/h1&gt;
&lt;h2&gt;
  
  
  ⚡ Instant Autocomplete
&lt;/h2&gt;

&lt;p&gt;Low-latency inline completions directly inside VS Code.&lt;/p&gt;
&lt;h2&gt;
  
  
  💬 AI Chat Inside Editor
&lt;/h2&gt;

&lt;p&gt;Ask questions without leaving your coding flow.&lt;/p&gt;
&lt;h2&gt;
  
  
  🔧 Explain, Fix &amp;amp; Refactor
&lt;/h2&gt;

&lt;p&gt;Select code and instantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explain it&lt;/li&gt;
&lt;li&gt;fix bugs&lt;/li&gt;
&lt;li&gt;refactor functions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔐 Gmail OTP Sign-in
&lt;/h2&gt;

&lt;p&gt;No passwords.&lt;br&gt;
No complicated OAuth flow.&lt;br&gt;
Just verify your Gmail and start coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Live Usage Tracking
&lt;/h2&gt;

&lt;p&gt;Transparent request limits directly inside the editor.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why I Built It
&lt;/h1&gt;

&lt;p&gt;One thing I noticed:&lt;br&gt;
Many students and developers — especially in India — wanted AI coding tools but found current pricing difficult.&lt;/p&gt;

&lt;p&gt;So I wanted DevMind to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;accessible&lt;/li&gt;
&lt;li&gt;developer-friendly&lt;/li&gt;
&lt;li&gt;easy to start&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why pricing starts at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;₹199/month for Solo&lt;/li&gt;
&lt;li&gt;Free tier included forever&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sbcl327jfcimxznw5ot.png" alt=" " width="800" height="533"&gt;
&lt;/h2&gt;

&lt;h1&gt;
  
  
  Tech Stack
&lt;/h1&gt;

&lt;p&gt;Built using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VS Code Extension API&lt;/li&gt;
&lt;li&gt;AI model integrations&lt;/li&gt;
&lt;li&gt;Custom backend APIs&lt;/li&gt;
&lt;li&gt;Real-time autocomplete pipeline&lt;/li&gt;
&lt;li&gt;Gmail OTP authentication&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Current Features
&lt;/h1&gt;

&lt;p&gt;✅ Autocomplete&lt;br&gt;
✅ AI Chat&lt;br&gt;
✅ Explain Code&lt;br&gt;
✅ Bug Fixing&lt;br&gt;
✅ Refactoring&lt;br&gt;
✅ Multi-language support&lt;/p&gt;

&lt;p&gt;Languages supported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;JavaScript&lt;/li&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Go&lt;/li&gt;
&lt;li&gt;Java&lt;/li&gt;
&lt;li&gt;Rust&lt;/li&gt;
&lt;li&gt;C++&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and more.&lt;/p&gt;

&lt;h1&gt;
  
  
  What’s Next
&lt;/h1&gt;

&lt;p&gt;Currently working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better context awareness&lt;/li&gt;
&lt;li&gt;faster completions&lt;/li&gt;
&lt;li&gt;team collaboration&lt;/li&gt;
&lt;li&gt;smarter refactors&lt;/li&gt;
&lt;li&gt;project memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still early.&lt;br&gt;
Still improving every week.&lt;/p&gt;




&lt;h1&gt;
  
  
  Looking For Feedback
&lt;/h1&gt;

&lt;p&gt;Would genuinely love feedback from developers here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;onboarding&lt;/li&gt;
&lt;li&gt;UI/UX&lt;/li&gt;
&lt;li&gt;autocomplete quality&lt;/li&gt;
&lt;li&gt;pricing&lt;/li&gt;
&lt;li&gt;feature ideas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try DevMind here:&lt;br&gt;
👉 &lt;a href="https://devmind.singhjitech.com" rel="noopener noreferrer"&gt;https://devmind.singhjitech.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built by SinghJiTech from India 🇮🇳&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>How Autonomous AI Agents Are Reshaping Developer Workflows in 2026</title>
      <dc:creator>Smart picks</dc:creator>
      <pubDate>Sun, 17 May 2026 18:12:52 +0000</pubDate>
      <link>https://experimental.forem.com/smartpicksai/how-autonomous-ai-agents-are-reshaping-developer-workflows-in-2026-4ho6</link>
      <guid>https://experimental.forem.com/smartpicksai/how-autonomous-ai-agents-are-reshaping-developer-workflows-in-2026-4ho6</guid>
      <description>&lt;p&gt;Most developers spent 2023 and 2024 experimenting with AI-assisted code completion and chat interfaces. Those tools were useful—but they were also passive. You typed a prompt. The model responded. You did the rest.&lt;br&gt;
That model is being replaced by something fundamentally different. Autonomous AI agents don't wait for a prompt. They receive a goal, break it into subtasks, call the tools they need, track their own progress, and iterate until the job is done—or until they need human input. This shift from reactive generation to goal-driven execution is what separates agentic AI from everything that came before it.&lt;br&gt;
If you're building software in 2026, understanding this shift isn't optional. Agentic workflows are already running in production across engineering teams, DevOps pipelines, customer operations, and research functions. Here's what you actually need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic AI vs. Generative AI: The Real Difference
&lt;/h2&gt;

&lt;p&gt;Standard generative AI operates on a simple exchange: input goes in, output comes out. A chatbot summarizes a Slack thread. A code model suggests a function. The interaction is stateless and bounded. You stay in the loop for every decision.&lt;br&gt;
Agentic AI breaks that pattern. An autonomous AI agent operates in a continuous loop: it perceives the state of a task, reasons about what action to take next, executes that action through tools or APIs, observes the result, and updates its plan. This cycle repeats, sometimes dozens of times, until the agent reaches its goal or surfaces a blocker it can't resolve alone.&lt;br&gt;
The practical difference is significant. A generative model helps you write a unit test. An agentic system can read your failing CI run, identify the root cause across multiple files, write the fix, run the test locally via a code execution tool, and open a PR—without you touching the keyboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Building Blocks of Agentic AI Systems
&lt;/h2&gt;

&lt;p&gt;If you're planning to build or integrate agentic workflows, you need to understand the components that make them work. These aren't optional abstractions—they're the load-bearing structure of any production-grade agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Planning and Reasoning Loops
&lt;/h3&gt;

&lt;p&gt;An agent needs to decompose a goal into ordered steps. Modern agents often use patterns like ReAct (Reason + Act), where the model alternates between reasoning about the next step and actually executing it. More complex systems use multi-step planners or tree-of-thought approaches to handle tasks with branching logic. The key insight from Microsoft's agent architecture guidance is that you should match complexity to need—not every workflow requires multi-agent orchestration, and simpler single-agent-with-tools patterns are often the right default for enterprise use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory and Context Handling
&lt;/h3&gt;

&lt;p&gt;Agents need to track what they've already done. This happens at multiple levels: short-term working memory held in the context window, intermediate scratchpads for multi-step reasoning, and longer-term storage via vector databases or structured retrieval. Getting memory architecture wrong is one of the fastest ways to produce agents that loop, hallucinate resolved states, or lose track of task scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Calling
&lt;/h3&gt;

&lt;p&gt;Agents act on the world through tools: functions the model can invoke to read files, query databases, call APIs, run shell commands, search the web, or interact with third-party services. Tool calling is what gives agents their teeth. A model that can only produce text is a language model. A model that can call &lt;code&gt;git blame&lt;/code&gt;, &lt;code&gt;kubectl get pods&lt;/code&gt;, and a Jira API in sequence is an agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Orchestration
&lt;/h3&gt;

&lt;p&gt;Most real-world agentic systems involve more than one agent. An orchestrator routes tasks to specialized sub-agents—one for code generation, one for test execution, one for documentation. Patterns vary: sequential pipelines where each agent hands off to the next, concurrent execution where independent tasks run in parallel, and hierarchical structures where a supervisor agent manages a pool of workers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-Loop and Approval Steps
&lt;/h3&gt;

&lt;p&gt;This is where many early production deployments stumble. AWS's operational guidance is direct on this point: start with work where the agent's output is a recommendation that a human acts on. Move into higher-stakes autonomous execution only after you've established observability, tested edge-case handling, and defined clear escalation rules. Approval gates, where an agent pauses and surfaces a decision to a human before proceeding, aren't a weakness. They're a feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;You cannot improve what you cannot see. Production agents need structured logging of every tool call, reasoning step, and decision branch. Without this, debugging a failed workflow means reading through unstructured outputs and guessing. Good observability lets you identify where an agent went off-track, what data it used, and why it made a given choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Developer Use Cases in 2026
&lt;/h2&gt;

&lt;p&gt;Autonomous AI agents are not a future concept. Here's where engineering teams are actually deploying them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Coding assistance and code review:&lt;/strong&gt; Agents that read issue descriptions, locate the relevant codebase sections, propose a fix, and run lint and test checks before surfacing a draft PR. This compresses triage-to-PR time from hours to minutes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Issue triage and classification:&lt;/strong&gt; Agents connected to GitHub, Linear, or Jira that read incoming issues, classify severity, assign labels, route to the right team, and draft an initial response—without a human touching the ticket first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DevOps and infrastructure support:&lt;/strong&gt; Agents that monitor alerting systems, cross-reference runbooks, attempt known remediation steps, and escalate only when automated resolution fails. These are particularly effective for well-documented, repeatable incidents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internal tooling and research automation:&lt;/strong&gt; Agents that gather competitive intelligence, summarize technical documentation, draft internal RFCs, or compile release notes by reading merged PRs across a sprint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customer operations:&lt;/strong&gt; Support agents that handle tier-1 queries autonomously, pulling live order status, policy documents, or account data through tool calls, and escalating edge cases to human agents with a full context summary already prepared.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams thinking through where to start, reviewing AI agent implementation strategies can help prioritize use cases that are genuinely agent-shaped—meaning the task has a clear start and end, requires judgment across multiple tools, and produces output that can be evaluated objectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits Worth Taking Seriously
&lt;/h2&gt;

&lt;p&gt;The productivity case for autonomous AI agents is strong when implementation is done correctly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Throughput without headcount:&lt;/strong&gt; Agents can run 24/7 across multiple workflows simultaneously, handling volume that would otherwise require expanding a team.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster execution on well-defined tasks:&lt;/strong&gt; Work that involves pulling information from multiple systems, formatting it, and routing it somewhere is exactly where agents outperform humans on speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable automation:&lt;/strong&gt; Agent-based workflows scale horizontally. Adding a new ticket source or a new data format often means updating a tool or prompt, not rebuilding a pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Risks You Need to Account For
&lt;/h2&gt;

&lt;p&gt;Agentic AI introduces failure modes that don't exist in traditional software—and that standard monitoring won't catch.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hallucinations at action time.&lt;/strong&gt; A model that hallucinates in a chat interface produces a bad answer. A model that hallucinates during an agentic task might delete the wrong files, call the wrong API endpoint, or write a fix that passes tests but introduces a security regression. Stakes are higher when agents act.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reliability and error propagation.&lt;/strong&gt; Multi-step workflows amplify small errors. A wrong assumption in step two affects every downstream step. Without tight error handling and fallback logic, agents fail in opaque and sometimes damaging ways.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security and access control.&lt;/strong&gt; Agents that can call APIs and write to databases are attack surfaces. Prompt injection—where malicious content in a data source hijacks an agent's behavior—is a real threat that's still poorly understood in most production deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance and auditability.&lt;/strong&gt; Regulated industries need to document what decisions were made, who made them, and why. If your agent can't produce a clean audit trail, it probably can't operate in finance, healthcare, or legal workflows without significant additional tooling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best Practices for Developer Teams Getting Started
&lt;/h2&gt;

&lt;p&gt;Define "done" before you define the agent. If you can't describe what task completion looks like in objective terms, including how to handle edge cases, you're not ready to build an agent for that workflow.&lt;br&gt;
Start with reversible actions. The safest first agents operate in read-heavy, write-light modes. They summarize, recommend, and draft—rather than execute, commit, or send.&lt;br&gt;
Set iteration limits. Agents without guardrails loop. Cap the number of tool calls or reasoning steps per run and handle the timeout case explicitly.&lt;br&gt;
Log everything at the tool-call level. Text output alone isn't enough for debugging. Capture every tool invocation, input, and response.&lt;br&gt;
Treat human-in-the-loop as architecture, not afterthought. Approval steps should be first-class components of your agent design not bolt ons added after something breaks.&lt;/p&gt;
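&lt;p&gt;The iteration-limit practice above can be sketched as a capped loop with an explicit timeout branch. The step and completion functions are stand-ins for real tool calls:&lt;/p&gt;

```python
# Sketch of a capped agent loop: the run stops after MAX_STEPS tool calls
# and the timeout case is handled explicitly instead of looping forever.
MAX_STEPS = 5

def run_agent(step_fn, is_done):
    history = []
    for step in range(MAX_STEPS):
        observation = step_fn(step)
        history.append(observation)
        if is_done(observation):
            return {"status": "done", "steps": step + 1, "history": history}
    # Explicit timeout branch: surface it rather than swallowing it.
    return {"status": "step_limit_reached", "steps": MAX_STEPS, "history": history}

# An agent that never signals completion hits the cap instead of spinning.
result = run_agent(step_fn=lambda s: f"retry {s}", is_done=lambda obs: False)
```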

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are autonomous AI agents?
&lt;/h3&gt;

&lt;p&gt;Autonomous AI agents are goal-directed systems built on large language models that can plan, use tools, retrieve context, and execute multi-step workflows without continuous human input.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is agentic AI different from chatbots?
&lt;/h3&gt;

&lt;p&gt;Chatbots respond to individual prompts in isolation. Agentic AI systems maintain state across steps, use external tools, and operate independently toward a defined goal.&lt;/p&gt;

&lt;h3&gt;
  
  
  What programming frameworks support agentic AI development?
&lt;/h3&gt;

&lt;p&gt;LangGraph, AutoGen, CrewAI, and the Anthropic and OpenAI tool-use APIs are among the most widely used frameworks for building production AI agent workflows in 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the biggest risk in production AI agents?
&lt;/h3&gt;

&lt;p&gt;Error propagation and security vulnerabilities—particularly prompt injection—are the two most underestimated risks when moving agentic AI from prototype to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  How should a developer team start with AI agent workflows?
&lt;/h3&gt;

&lt;p&gt;Start with a single, well-scoped workflow where the inputs are structured, success is measurable, and actions are low-stakes or reversible. Build observability in from day one.&lt;/p&gt;

&lt;p&gt;Author Bio&lt;br&gt;
Smart Pick Team is the editorial team behind Smart Pick, a technology publication covering AI tools, developer workflows, and software infrastructure for builders and technical professionals. The Smart Pick Team tracks the practical side of AI adoption—cutting through hype to focus on what actually works in production environments across the US tech industry.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>tutorial</category>
      <category>security</category>
    </item>
    <item>
      <title>I make an app to help you make money while traveling</title>
      <dc:creator>fikuri</dc:creator>
      <pubDate>Sun, 17 May 2026 18:07:57 +0000</pubDate>
      <link>https://experimental.forem.com/fikuri/i-make-an-app-to-help-you-make-money-while-traveling-293d</link>
      <guid>https://experimental.forem.com/fikuri/i-make-an-app-to-help-you-make-money-while-traveling-293d</guid>
      <description>&lt;p&gt;This is a series of content I created for the Build with MeDo Hackathon at &lt;a href="https://medo.devpost.com" rel="noopener noreferrer"&gt;MeDo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In Indonesia, Jastip (concierge service) is very common. China has a similar culture called Daigou (buying on behalf). &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The key similarity between Daigou and Jastip is one simple thing: "Access". &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How Jastip works is basically someone from a different location with "access" to certain products offers them to others on social media. If people want to buy, the seller does not need to own the inventory or stock. They just need to be there at the store and ready to purchase. This Jastip service includes a service fee and sometimes delivery service. There are many kinds of products being sold, from snacks to luxury goods you cannot easily find in Indonesia. &lt;/p&gt;

&lt;p&gt;People in Indonesia take this to another level by making it a full side quest while travelling. They plan a trip for fun, and to help finance the travelling, they do a Jastip side quest with many products, often from cross-country travel. &lt;/p&gt;

&lt;p&gt;Then I found that China already has a similar culture with Daigou, but it is more focused on luxury items that are often hard to find in China or have some kind of rarity or reputation from the origin country (or so from what I read on the internet). &lt;/p&gt;

&lt;p&gt;The interesting thing about China and Indonesia's Jastip/Daigou culture is how they operate. In China, Daigou mostly operates on WeChat livestreaming, and the storefront is built inside WeChat using mini programs.&lt;br&gt;
On the other hand, in Indonesia, interaction and transaction usually happen on either WhatsApp groups or Instagram chat. It is very fragmented and causes multiple issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktkl6paklix6ng9e2i1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktkl6paklix6ng9e2i1r.png" alt="Jastip and Daigou flow diagram showing buyer, traveler, product access, payment, and delivery" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So to help my fellow Indonesians make money while travelling, I made an app to manage this kind of thing using medo.dev.&lt;/p&gt;


&lt;h2&gt;
  
  
  Show me the money
&lt;/h2&gt;

&lt;p&gt;To motivate you to read further, let me put a motivational image for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c32pvuw3ajpp98ef144.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c32pvuw3ajpp98ef144.png" alt="Money motivation image for the Jastip side income idea" width="377" height="815"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes, the money. With this app, you can not only plan for travel, but also make some money with Jastip.&lt;/p&gt;

&lt;p&gt;How, you may ask?&lt;/p&gt;
&lt;h3&gt;
  
  
  1. You can create a campaign
&lt;/h3&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9erldxzg7l883rec9gyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9erldxzg7l883rec9gyh.png" width="377" alt="Create a jastip campaign" height="812"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;When creating a campaign, you set the starting date and end date of your Jastip.&lt;br&gt;
But Jastip across countries is hard: you need to handle currency conversion, and I am also afraid of people doing a hit and run.&lt;/p&gt;

&lt;p&gt;Don't worry, this app handles the currency exchange automatically for you, and there is deposit payment for your buyer. They can use that deposit to buy from you later, but if they do not buy, you can redeem the deposit. &lt;/p&gt;

&lt;p&gt;What's next after creating a campaign, you may ask?&lt;/p&gt;
&lt;h3&gt;
  
  
  2. You can open the campaign and check
&lt;/h3&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rq0wdienacb0806gzj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rq0wdienacb0806gzj0.png" width="375" alt="Open campaign dashboard" height="814"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Talk with the AI assistant so you are better informed about weather, currency, pricing strategy, adding products if you already have something in mind, or just reading the news in case something happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Talk with our smart AI agent
&lt;/h3&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forkqmtki2w47m82egvea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forkqmtki2w47m82egvea.png" width="375" alt="Campaign AI assistant" height="817"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Our AI is not like ChatGPT or Gemini, where you need to give information about what you are doing, where, or when. It understands your Jastip context, and it is integrated deeply with MeDo's large language model.&lt;br&gt;
You can also take notes for important information from the AI and check them from the notes tab.&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m15a9zstvuhix1p8i3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m15a9zstvuhix1p8i3j.png" width="375" alt="AI notes tab" height="817"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Check who is already interested
&lt;/h3&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr5y2fii67iiqoyiemcs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr5y2fii67iiqoyiemcs.png" width="375" alt="Campaign member deposits" height="814"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Check your member tab to see who is already interested and has put down deposit money.&lt;br&gt;
They actually pay with their debit/credit card (using MeDo Stripe skills). It will be yours to claim later.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Talk with your buyer directly
&lt;/h3&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y994o308lvodalifg8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y994o308lvodalifg8l.png" width="378" alt="Buyer seller chat" height="811"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Instead of information scattered in WhatsApp and Instagram, you can interact directly here in the app. You can also send images if you want.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Lastly, download the invoice or just export everything into Excel for later
&lt;/h3&gt;

&lt;p&gt;You can enjoy your travel and manage the Jastip using our platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. And here's the buyer side of the app
&lt;/h3&gt;

&lt;p&gt;From the buyer side, the flow is browse campaign, join with deposit, send request, chat, and track the transaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzlqcelnqie7n6o58daz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzlqcelnqie7n6o58daz.png" alt="Buyer-side app flow showing campaign browsing, product request, confirmation, transactions, and messages" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Next: how did I make this app?
&lt;/h2&gt;

&lt;p&gt;And how did it only take me a week to do so...&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Register to &lt;a href="https://medo.dev" rel="noopener noreferrer"&gt;Medo.dev&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Ok, MeDo is a full-stack AI coding platform. It sounds very much like a buzzword, but trust me, it is real. You may already have heard about v0, Lovable, Bolt, or Replit, but MeDo is nothing like them: it is more complete, and you only need MeDo (and Stripe, apparently, for payment).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create new project
&lt;/h3&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjb2g20hmwxtosul0em4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjb2g20hmwxtosul0em4.png" width="800" alt="Create new MeDo project" height="282"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Next, you just wait until your idea is built
&lt;/h3&gt;

&lt;p&gt;Here's the fun part. Medo.dev is actually not frontend first. It is product first. It will create the full product spec and requirements before it even starts coding. Everything is included, from what should be built, how the folders and system are structured, and what is out of scope, which is very important to make your product align with your vision.&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1gf58ryhoo6cgumbdfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1gf58ryhoo6cgumbdfa.png" width="703" alt="MeDo product spec before build" height="828"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;You can edit it or not, it is up to you. Next, click generate the app and then wait.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. PLUGINSSS (apparently now they call it skills)
&lt;/h3&gt;

&lt;p&gt;The skill integrations in medo.dev blew my mind. In v0, Lovable, and Bolt, you are kind of forced to register for an outside backend service like Supabase. But medo.dev does this automagically, so you do not have to have a Supabase account and click connect or something like that. It just &lt;strong&gt;WORKS&lt;/strong&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;login just works&lt;/li&gt;
&lt;li&gt;crud just works&lt;/li&gt;
&lt;li&gt;upload image just works&lt;/li&gt;
&lt;li&gt;realtime chat app just works&lt;/li&gt;
&lt;li&gt;even payment using Stripe just &lt;strong&gt;WORKS&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it is not only the backend. Even LLM, image generation, video generation, text-to-speech, and speech-to-text are included. No need to juggle multiple providers, grab API keys, store them, then integrate them. IT JUST WORKS.&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F415myt0hzlljqr7pdfvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F415myt0hzlljqr7pdfvc.png" width="800" alt="MeDo skills and integrations" height="510"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;You can just pay using a single credit system for all of the services. No need to set up multiple payments or subscribe to multiple websites to get API keys anymore.&lt;/p&gt;

&lt;p&gt;I was really blown away by the depth of the integration, so I started to dig into the code. &lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylnfwso3j2wvwq4hpx88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylnfwso3j2wvwq4hpx88.png" width="800" alt="Digging into generated code" height="497"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The LLM skills
&lt;/h3&gt;

&lt;p&gt;So in my experience using the medo.dev LLM skills, the default is not really great, but you can just prompt it. Make sure to always prompt it to use the maximum output tokens.&lt;/p&gt;

&lt;p&gt;You can also ask it to call tools, using other skills as tool calls that the LLM can invoke. My biggest problem with the LLM skills is that it seems to be hardcoded to Gemini 2.5 Flash, while there are already newer models like 3 Flash or later.&lt;/p&gt;

&lt;p&gt;You can actually make an agent in the LLM skills, not just a placeholder. For example, I built this AI assistant to be able to talk to its own CRUD API, so it can change the title at the top on the fly in realtime by calling directly to the internal API. Mind = blown.&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm88mpzgtdvtuvvhc79ub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm88mpzgtdvtuvvhc79ub.png" width="374" alt="MeDo LLM skill setup" height="810"&gt;&lt;/a&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Some tips and tricks using MeDo
&lt;/h2&gt;

&lt;p&gt;Based on my few weeks using medo.dev, here are some tips and tricks that helped me boost productivity and save credits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basics
&lt;/h3&gt;

&lt;p&gt;MeDo gives you an integrated backend and frontend workspace. You can look at the code, but you can also access files, logs, and infrastructure directly inside the app.&lt;/p&gt;

&lt;p&gt;A single Fast Build prompt costs about &lt;strong&gt;15 credits&lt;/strong&gt;, while Deep Build costs &lt;strong&gt;30 credits&lt;/strong&gt;. As of writing this, new accounts usually get 300 credits on registration and 100 credits from daily login. That is around 10 Deep Build prompts or 20 Fast Build prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 1: Put related tasks in one prompt
&lt;/h3&gt;

&lt;p&gt;From my experience, credits are counted per message, not by how hard the problem is or how many tool calls it uses.&lt;/p&gt;

&lt;p&gt;There are limits if you ask for too many things at once, but if the tasks are related, it is usually better to put them in one prompt. This helps you get more value from each credit spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 2: When debugging, ask for logs and checks
&lt;/h3&gt;

&lt;p&gt;When something breaks, do not only ask MeDo to fix it. Ask it to add logs and checks first so it can understand the real problem.&lt;/p&gt;

&lt;p&gt;Example prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is the best approach to fix this [issue]? Please add logging and checks to help identify the issue.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This helped me a lot when a problem was not fixed in one run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 3: Ask for design suggestions before giving strict direction
&lt;/h3&gt;

&lt;p&gt;MeDo can be overly eager when you give it very strict design direction.&lt;/p&gt;

&lt;p&gt;In my case, I gave it a reference and told it exactly what I wanted, but it still tried to be "creative" and the result was frustrating.&lt;/p&gt;

&lt;p&gt;What worked better was asking for a few options first.&lt;/p&gt;

&lt;p&gt;Example prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Give me 3 design variants for this particular [problem]. The goal is to [specific goal], and I prefer it to look like [your preferences].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This way, MeDo gives you a few layout and style directions first. You can then choose one, adjust it, or ask for another variant.&lt;/p&gt;

&lt;p&gt;This saved me a lot of credits because I stopped forcing one direction through many follow-up prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip 4: When stuck, clear the context before sending a new prompt
&lt;/h3&gt;

&lt;p&gt;Sometimes you get that little annoying bug, whether it is frontend or backend. &lt;br&gt;
In my experience, it always helps to use the clear context button before sending another prompt. Clearing context can make the agent think and do the task better and faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foadxwtyz8mqqbysp0in8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foadxwtyz8mqqbysp0in8.png" alt="this is the button, don't mind my negative credits i overused it lol" width="560" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;this is the button, don't mind my negative credits i overused it lol&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So that's what I am building and how I am building it, with tips and tricks.&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xgwvn3ke999mwrfiwzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xgwvn3ke999mwrfiwzy.png" width="800" alt="Final jastip product" height="800"&gt;&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>builtwithmedo</category>
      <category>ai</category>
      <category>webdev</category>
      <category>ecommerce</category>
    </item>
    <item>
      <title>Build Autonomous AI Workflows With Claude Desktop</title>
      <dc:creator>ForgeWorkflows</dc:creator>
      <pubDate>Sun, 17 May 2026 18:05:56 +0000</pubDate>
      <link>https://experimental.forem.com/forgeflows/build-autonomous-ai-workflows-with-claude-desktop-37l9</link>
      <guid>https://experimental.forem.com/forgeflows/build-autonomous-ai-workflows-with-claude-desktop-37l9</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Is Not Your Prompts
&lt;/h2&gt;

&lt;p&gt;According to &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai" rel="noopener noreferrer"&gt;McKinsey's State of AI 2024 report&lt;/a&gt;, 72% of organizations now use AI in at least one business function, up from 50% in prior years. Most of them are doing it wrong. They open a chat window, type a prompt, read the response, copy it somewhere, and repeat the next morning. That is not infrastructure. That is a slightly faster version of doing the work yourself.&lt;/p&gt;

&lt;p&gt;The actual problem is not prompt quality. It is that most people treat a reasoning model as a vending machine: insert query, receive answer, walk away. Claude's desktop application, as of mid-2026, supports scheduled task execution and direct tool connections that change this entirely. The question is how to wire it up so the machine runs without you standing next to it.&lt;/p&gt;

&lt;p&gt;This article is the nine-step framework we use. No aspirational framing. Just the architecture, the constraint patterns that actually hold, and the places where this approach breaks down.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Architecture Works
&lt;/h2&gt;

&lt;p&gt;Think of Claude's desktop app as a local orchestration layer. It can hold a persistent context, fire on a schedule, call external tools via &lt;code&gt;MCP&lt;/code&gt; (Model Context Protocol) connections, and write its results to a destination you define. That is the full loop. The gap between "chatbot" and "infrastructure" is closing that loop so no human has to sit in the middle of it.&lt;/p&gt;

&lt;p&gt;The nine steps break into three phases. The first phase is definition: you decide what recurring decision or document the pipeline will handle, write a system prompt that encodes the rules, and define the exact format the LLM must return. The second phase is connection: you attach the tools the reasoning engine needs (a calendar API, a CRM read endpoint, a Slack webhook, a local file path) and verify each connection fires correctly in isolation before chaining them. The third phase is scheduling and validation: you set the recurrence, add a constraint block to the prompt, and build a lightweight check that confirms the response matches the expected shape before it touches anything downstream.&lt;/p&gt;

&lt;p&gt;The constraint block is where most builds fail. I spent a week trying to get a classifier to return exactly three sentences. The prompt said "EXACTLY 3 sentences. Not 2, not 4. Three." It still returned four. The fix was not better instructions. It was reframing the requirement as a hard technical constraint: "CRITICAL: This is a hard technical constraint enforced by automated validation. If you write 4 sentences, the output will be rejected. Count your sentences before responding." An LLM does not treat polite instructions the same way it treats system-level constraints. Every prompt we now ship uses emphatic constraint blocks for any hard formatting requirement. This pattern is documented in our &lt;a href="https://dev.to/methodology/bqs"&gt;Blueprint Quality Standard&lt;/a&gt;.&lt;/p&gt;
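&lt;p&gt;One way to back a constraint block with the automated validation it promises is a small check on the pipeline side. This is a sketch only: the sentence splitter is deliberately crude and would need hardening for abbreviations and decimals before real use:&lt;/p&gt;

```python
import re

def count_sentences(text):
    """Crude sentence splitter, for validation purposes only."""
    return len([s for s in re.split(r"[.!?]+\s*", text.strip()) if s])

def validate_summary(text, required=3):
    """The enforcement behind the constraint block: reject any response
    that is not exactly `required` sentences."""
    n = count_sentences(text)
    if n != required:
        raise ValueError(f"expected {required} sentences, got {n}")
    return text

validate_summary("Build failed. Root cause is a flaky test. Rerun after quarantining it.")
```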

&lt;p&gt;The tool connection layer deserves its own attention. Claude's &lt;code&gt;MCP&lt;/code&gt; protocol lets you expose local functions, REST endpoints, or file operations as callable tools. When the reasoning engine needs data, it calls the tool rather than asking you to paste it in. This is the difference between a pipeline that runs at 7 AM and one that waits for you to wake up. We have seen this pattern used effectively with n8n as the middleware layer: n8n handles the webhook ingestion and data transformation, then passes a clean payload to Claude for the reasoning step, then routes the result to its destination. The two tools complement each other rather than compete.&lt;/p&gt;
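&lt;p&gt;The hand-off pattern can be sketched as three plain functions. Every name here is a hypothetical stand-in: in practice n8n would do the ingestion and transformation, the reasoning step would be a Claude call, and the sink would be a real destination:&lt;/p&gt;

```python
# Sketch of the middleware pattern: ingest and normalize, reason, route.
def normalize(webhook_event):
    """Middleware step (n8n's role): turn a raw event into a clean payload."""
    return {"lead": webhook_event.get("name", "unknown"),
            "source": webhook_event.get("utm_source", "direct")}

def reason(payload):
    """Reasoning step (the LLM's role): classify the clean payload."""
    priority = "high" if payload["source"] == "referral" else "normal"
    return {"priority": priority, **payload}

def route(result, sink):
    """Routing step: write the result to its destination."""
    sink.append(result)
    return result

crm = []  # stand-in for a CRM field or Slack channel
event = {"name": "Ada", "utm_source": "referral"}
route(reason(normalize(event)), crm)
```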




&lt;h2&gt;
  
  
  The Nine Steps, Without the Padding
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Define one recurring decision.&lt;/strong&gt; Not "automate my work." Pick the specific thing you rewrite every Monday. A status summary, a lead triage note, a content brief. One thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Write the system prompt as a specification.&lt;/strong&gt; Include the role, the input format, the exact output format, and the constraint block for any hard requirements. Treat it like a function signature, not a conversation opener.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Identify every data dependency.&lt;/strong&gt; List every piece of information the reasoning step needs. If any of it lives behind an API or in a file, that dependency becomes a tool connection in step 5.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Define the output destination.&lt;/strong&gt; Where does the result go? A Notion page, a Slack channel, a CSV, a CRM field. Define this before you build anything. The destination determines the format constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Connect tools one at a time.&lt;/strong&gt; Add each &lt;code&gt;MCP&lt;/code&gt; tool connection individually and test it in isolation. A broken tool connection that fails silently will corrupt every run downstream. Verify the tool returns what you expect before wiring the next one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Run the full chain manually three times.&lt;/strong&gt; Before scheduling anything, trigger the complete pipeline by hand. Check that the reasoning layer uses the tool data correctly, that the constraint block holds, and that the result lands in the right destination in the right shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Add a validation step.&lt;/strong&gt; Write a simple check, either inside n8n or as a second Claude call, that confirms the response matches the expected format. If it does not match, the pipeline should alert you rather than silently write a malformed result to your CRM.&lt;/p&gt;
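&lt;p&gt;A minimal version of that check, assuming the pipeline emits JSON (the field names and allowed values here are hypothetical placeholders for your own schema):&lt;/p&gt;

```python
# Sketch of a step-7 validation gate: confirm the response matches the
# expected shape before it touches anything downstream. On failure the
# pipeline should alert, not write.
import json

REQUIRED_FIELDS = {"summary", "priority", "owner"}  # hypothetical schema
ALLOWED_PRIORITIES = {"low", "medium", "high"}

def validate_response(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) so the caller can route failures to an alert."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"
    if not isinstance(payload, dict):
        return False, "response is not a JSON object"
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if payload["priority"] not in ALLOWED_PRIORITIES:
        return False, "priority outside allowed values"
    return True, "ok"
```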

&lt;p&gt;&lt;strong&gt;Step 8: Set the schedule.&lt;/strong&gt; Claude's desktop scheduler accepts cron-style expressions. Set the recurrence to match the actual cadence of the decision, not the most frequent possible interval. Daily pipelines that run hourly create noise and cost.&lt;/p&gt;
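&lt;p&gt;One way to keep yourself honest is to enumerate the cadences you actually support and derive the cron expression from the cadence, rather than typing intervals ad hoc (the cadence names are our own convention; the cron strings are standard five-field minute/hour/day-of-month/month/day-of-week expressions):&lt;/p&gt;

```python
# Sketch: derive the recurrence from the decision's actual cadence rather
# than the most frequent possible interval. Cadence names are illustrative.
CADENCE_TO_CRON = {
    "weekday-morning": "0 7 * * 1-5",  # 07:00, Monday through Friday
    "weekly":          "0 7 * * 1",    # 07:00 every Monday
    "monthly":         "0 7 1 * *",    # 07:00 on the 1st of the month
}

def schedule_for(cadence: str) -> str:
    """Fail loudly on an unknown cadence instead of defaulting to 'hourly'."""
    if cadence not in CADENCE_TO_CRON:
        raise ValueError(f"unknown cadence: {cadence}")
    return CADENCE_TO_CRON[cadence]
```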

&lt;p&gt;&lt;strong&gt;Step 9: Monitor the first five runs manually.&lt;/strong&gt; Watch the logs. Check the destinations. The first week of a scheduled pipeline reveals edge cases that manual testing missed. Fix them before you stop watching.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Considerations
&lt;/h2&gt;

&lt;p&gt;This approach works well for decisions that are structurally repetitive: the inputs change, but the logic does not. Weekly reporting, lead scoring against a fixed rubric, content brief generation from a template, invoice categorization. Where it breaks down is anywhere the decision requires judgment that changes based on context you have not encoded. If your Monday status update sometimes needs to flag a political situation inside a client account, the pipeline will not know that unless you build a way to inject that context. Autonomous does not mean omniscient.&lt;/p&gt;

&lt;p&gt;There is also a cost consideration that most tutorials skip. A pipeline that calls a reasoning model on a schedule, with tool calls, runs up API usage whether or not the run produces anything useful. Before scheduling, calculate the expected token cost per run and multiply by the recurrence. A pipeline that runs 30 times a month at a non-trivial token count adds up. We have seen teams build schedules that are far more frequent than the underlying data actually changes, which means the LLM is reasoning over identical inputs repeatedly. Match the schedule to the data refresh rate, not to how often you wish you had the answer.&lt;/p&gt;
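&lt;p&gt;The arithmetic is simple enough to do before you ever set the schedule. The token counts and per-million-token prices below are illustrative placeholders; substitute your model's real rates:&lt;/p&gt;

```python
# Back-of-envelope cost check before scheduling a pipeline. All numbers in
# the example call are hypothetical; use your model's actual pricing.
def monthly_cost(runs_per_month: int,
                 input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Expected USD per month for a scheduled pipeline at a fixed token budget."""
    per_run = (input_tokens / 1e6) * usd_per_m_input \
            + (output_tokens / 1e6) * usd_per_m_output
    return runs_per_month * per_run

# 30 daily runs, 12k input tokens (prompt + tool payloads), 1k output tokens,
# at a hypothetical $3 / $15 per million input/output tokens:
cost = monthly_cost(30, 12_000, 1_000, 3.0, 15.0)
```

Run the same function at an hourly recurrence (720 runs a month) and the number makes the "match the schedule to the data refresh rate" argument for you.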

&lt;p&gt;For teams already using n8n for orchestration, the cleanest pattern is to keep Claude as the reasoning node inside a larger n8n chain rather than using Claude's desktop scheduler as the primary trigger. n8n gives you better error handling, retry logic, and branching than the desktop app's native scheduler. The &lt;a href="https://dev.to/blog/automating-business-claude-desktop-scheduled-tasks"&gt;Claude desktop scheduled tasks guide&lt;/a&gt; covers the native approach in detail; the n8n integration pattern is worth considering if you are already running other automations through that layer. You can browse the full catalog of pre-built automation pipelines at &lt;a href="https://dev.to/blueprints"&gt;ForgeWorkflows blueprints&lt;/a&gt; to see how we structure these reasoning nodes inside larger chains.&lt;/p&gt;

&lt;p&gt;One more constraint worth naming: the desktop app requires the machine to be running. If your laptop sleeps at 3 AM, the 3 AM schedule does not fire. For anything that needs guaranteed execution, the pipeline belongs on a server or inside a cloud orchestration layer, not on a local desktop. This is not a criticism of the tool. It is a deployment decision that the tutorials consistently omit.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the validation step, not the schedule.&lt;/strong&gt; Every build we have done where we set the schedule first and added validation later resulted in at least one bad run writing garbage to a live destination. Build the check before you automate the trigger. The order matters more than the individual components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version your system prompts like code.&lt;/strong&gt; When a scheduled pipeline starts returning unexpected results three weeks after launch, the first question is always "did the prompt change?" If you are editing the system prompt in place without version history, you cannot answer that question. Store prompts in a git repository or at minimum a dated document. We learned this the hard way on a pipeline that silently drifted over six iterations of "small tweaks."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the human override before you need it.&lt;/strong&gt; Every autonomous pipeline should have a documented way to pause it, override a single run, or inject context manually. Teams that skip this end up either fully trusting a pipeline they should not, or manually disabling it every time an edge case appears. The override mechanism is not a fallback. It is part of the design.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>automation</category>
      <category>workflowdesign</category>
      <category>aiinfrastructure</category>
    </item>
  </channel>
</rss>
