AI Native Lang

Versus LangGraph / Temporal: benchmark methodology

AINL’s public size and runtime benchmarks are reproducible from this repository. They are not a substitute for your production SLOs — but they give a repeatable way to compare authoring compactness, emit expansion, and post-compile execution cost.

This document names the commands and artifacts you need to run head-to-head comparisons (e.g. AINL source vs emitted LangGraph module vs hand-written Python baseline) without changing core code.

Size / token economics (authoring + emit footprint)

  • Human-readable: repository root BENCHMARK.md (tiktoken cl100k_base, viable vs legacy-inclusive).
  • Regenerate: make benchmark or make benchmark-ci (see docs/benchmarks.md).
  • Scripts: scripts/benchmark_size.py — profiles in tooling/artifact_profiles.json; modes include full_multitarget (includes langgraph + temporal wrapper emitters) and minimal_emit (planner-selected targets only).
  • Hybrid in IR: declare S hybrid langgraph / S hybrid temporal (see docs/HYBRID_GUIDE.md) so minimal_emit can include wrapper targets when you want apples-to-apples “deploy bundle” sizing.

Suggested comparison rows (fill from JSON / tables):

| Row | What you compare |
|-----|------------------|
| A | AINL strict-valid source (tk) |
| B | Same program `--emit langgraph` output (tk) |
| C | Same program `--emit temporal` output (tk) — sum of emitted files |
| D | Hand-written LangGraph / Temporal tutorial equivalent (tk) — your baseline |
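A minimal sketch of filling these rows, assuming the tiktoken `cl100k_base` encoding named in BENCHMARK.md; the row contents below are illustrative placeholders, not real repository files, and the sketch falls back to a crude chars/4 heuristic when tiktoken is not installed:

```python
"""Rough token-count comparison for the table above (sketch only)."""

def count_tokens(text: str) -> int:
    try:
        import tiktoken  # same encoding BENCHMARK.md states it uses
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except ImportError:
        # Crude approximation: ~4 characters per token for English/code.
        return max(1, len(text) // 4)

def compare(rows: dict[str, str]) -> dict[str, int]:
    """Map row label -> token count for each artifact's text."""
    return {label: count_tokens(text) for label, text in rows.items()}

if __name__ == "__main__":
    # Hypothetical contents; in practice read the real artifacts, e.g.
    # Path("program.ainl").read_text() vs the emitted LangGraph module.
    counts = compare({
        "A (AINL source)": "flow monitor { step check -> alert }",
        "B (emitted langgraph)": "from langgraph.graph import StateGraph ...",
    })
    for label, tk in counts.items():
        print(f"{label}: {tk} tk")
```

For row C, sum `count_tokens` over every emitted file; keeping one encoder for all rows is what makes the comparison apples-to-apples.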

Runtime (latency, RSS, optional reliability)

  • Results file: tooling/benchmark_runtime_results.json (regenerated by make benchmark when runtime bench is enabled in your env).
  • Script: scripts/benchmark_runtime.py — see docs/benchmarks.md for flags and CI notes.
  • Interpretation: measures deterministic graph execution after compile — not LLM inference. Use this to separate orchestration runtime from model cost.
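If you want a quick local cross-check of the same two metrics outside the benchmark harness, a Unix-only sketch (it uses the stdlib `resource` module) can time a child process and read its peak RSS; the command here is a stand-in for running an emitted graph module:

```python
"""Measure wall-clock latency and peak child RSS for one run (sketch)."""
import resource
import subprocess
import sys
import time

def run_once(cmd: list[str]) -> tuple[float, int]:
    """Return (latency_seconds, child_max_rss) for one child run.

    ru_maxrss is kilobytes on Linux and bytes on macOS, and is a
    cumulative max across all children, so compare runs on the same
    machine within the same process only."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    latency = time.perf_counter() - t0
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return latency, max(after, before)

if __name__ == "__main__":
    lat, rss = run_once([sys.executable, "-c", "pass"])
    print(f"latency={lat:.4f}s peak_child_rss={rss}")
```

This keeps the measurement consistent with the point above: it times deterministic post-compile execution, with no LLM inference in the loop.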

LLM generation benchmarks

  • docs/OLLAMA_EVAL.md — local Ollama and optional cloud models; use the same temperature and prompts when comparing success rate of “generate valid AINL” vs “generate valid Python graph code.”
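Success rates from small eval runs are noisy, so it helps to report an interval rather than a bare percentage. A stdlib-only sketch with a Wilson 95% interval; the tallies below are illustrative, not measured results:

```python
"""Compare generation success rates with a Wilson 95% interval (sketch)."""
import math

def wilson(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

if __name__ == "__main__":
    # Hypothetical tallies from running identical prompts at identical
    # temperature against both output formats (per docs/OLLAMA_EVAL.md).
    for label, ok, n in [("valid AINL", 18, 20), ("valid Python graph", 13, 20)]:
        lo, hi = wilson(ok, n)
        print(f"{label}: {ok}/{n} = {ok / n:.0%}  95% CI [{lo:.2f}, {hi:.2f}]")
```

If the intervals overlap heavily, run more samples before claiming one format is easier to generate.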

Migration / emit speed

  • Emit is local and CPU-bound; a rough wall-clock check:

    time python3 scripts/validate_ainl.py --strict examples/hybrid/temporal_durable_ainl/monitoring_durable.ainl --emit temporal -o /tmp/ainl_temporal_out
    time python3 scripts/validate_ainl.py --strict examples/hybrid/langgraph_outer_ainl_core/monitoring_escalation.ainl --emit langgraph -o /tmp/monitoring_langgraph.py
    

Record means on a quiet machine; commit methodology, not one-off magic numbers.
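One way to commit methodology rather than magic numbers is to keep the raw samples next to the summary stats. A stdlib sketch; the command is a placeholder for the emit invocations above:

```python
"""Time a command N times and emit a JSON methodology record (sketch)."""
import json
import statistics
import subprocess
import sys
import time

def bench(cmd: list[str], runs: int = 5) -> dict:
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - t0)
    return {
        "cmd": cmd,
        "runs": runs,
        "samples_s": samples,  # commit these, not a single hand-picked number
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples) if runs > 1 else 0.0,
        "python": sys.version.split()[0],  # environment is part of methodology
    }

if __name__ == "__main__":
    print(json.dumps(bench([sys.executable, "-c", "pass"]), indent=2))
```

Committing the JSON record (samples, run count, interpreter version) lets anyone re-run the same check and see whether their spread matches yours.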

Honest boundaries

  • LangGraph and Temporal excel at their runtime guarantees and ecosystems; AINL’s claim is portable authoring + strict compile + multi-target emit, not “faster worker RPCs than Temporal.”
  • n8n / CrewAI / prompt-loop frameworks serve different personas; compare on determinism, audit, token recurring cost, and compile guarantees for operational graphs, not on visual DSL features.

See also