Claims and reproducible evidence

This document maps public-facing statements in this repository to defensible, reproducible artifacts. It does not replace narrative reports or field analyses; it shows where numbers can be re-derived locally.

Quick rerun (updates sections in BENCHMARK.md):

python scripts/benchmark_token_savings.py
python scripts/benchmark_compile_once_run_many.py
python scripts/benchmark_authoring_density.py

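Tokenizer: tiktoken cl100k_base unless a doc explicitly states otherwise. cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo; GPT-4o-class models tokenize with o200k_base, so counts are approximate rather than billing-exact for those models.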
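
A minimal sketch of how a token count under this tokenizer can be reproduced by hand. The helper name `count_tokens` and the `Encoder` protocol are hypothetical and are not taken from the benchmark scripts; only `tiktoken.get_encoding("cl100k_base")` and `Encoding.encode` are real tiktoken calls.

```python
from typing import List, Optional, Protocol


class Encoder(Protocol):
    """Anything with an encode(text) method that returns a list of token ids."""

    def encode(self, text: str) -> List[int]: ...


def count_tokens(text: str, encoder: Optional[Encoder] = None) -> int:
    """Return the number of tokens in `text`.

    With no encoder supplied, tiktoken's cl100k_base encoding is loaded lazily.
    That path needs `pip install tiktoken`, and the first call may download
    the vocabulary file.
    """
    if encoder is None:
        import tiktoken  # third-party; imported here so the helper stays importable without it

        encoder = tiktoken.get_encoding("cl100k_base")
    return len(encoder.encode(text))


# Usage, assuming tiktoken is installed:
#   count_tokens(open("examples/benchmark/enterprise_monitor.ainl").read())
```

Passing the encoder in as a parameter keeps the counting logic separate from the tokenizer choice, which helps when a doc states a different encoding.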

Machine-readable outputs (`tooling/`)

These files are committed so diffs and CI can compare runs. They are the companion artifacts to BENCHMARK.md (human-readable tables and caveats).

| File | Produced by |
|------|-------------|
| tooling/token_savings_results.json | scripts/benchmark_token_savings.py |
| tooling/compile_once_run_many_results.json | scripts/benchmark_compile_once_run_many.py |
| tooling/authoring_density_results.json | scripts/benchmark_authoring_density.py |
| tooling/benchmark_size.json | scripts/benchmark_size.py (make benchmark) |
| tooling/benchmark_size_ci.json | CI / make benchmark-ci slice (same schema family, smaller run) |
| tooling/benchmark_runtime_results.json | scripts/benchmark_runtime.py |
| tooling/benchmark_runtime_ci.json | CI runtime slice |
| tooling/benchmark_manifest.json | Profile/mode registry consumed by size benchmarking; versioned config, not a “result” row |

Other JSON under tooling/ (for example artifact_profiles.json, mcp_exposure_profiles.json) support compiler and product defaults; they are not benchmark run outputs.
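
Because the result files are committed, two versions of the same file can be compared mechanically. The sketch below is one way to do that without knowing the schema: it walks both JSON trees and reports numeric leaves that moved by more than a tolerance. The function names and the 1% default tolerance are assumptions for illustration, not part of the repository's CI.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def numeric_leaves(node: Any, path: str = "") -> Iterator[Tuple[str, float]]:
    """Yield (path, value) for every int/float leaf in a parsed JSON tree."""
    if isinstance(node, bool):
        return  # bool is a subclass of int in Python; flags are not metrics
    if isinstance(node, (int, float)):
        yield path, float(node)
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from numeric_leaves(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from numeric_leaves(child, f"{path}[{index}]")


def drifted(old: Any, new: Any, rel_tol: float = 0.01) -> Dict[str, Tuple[float, float]]:
    """Return paths present in both trees whose values differ by more than rel_tol."""
    before = dict(numeric_leaves(old))
    after = dict(numeric_leaves(new))
    changes: Dict[str, Tuple[float, float]] = {}
    for path in before.keys() & after.keys():
        a, b = before[path], after[path]
        scale = max(abs(a), abs(b))
        if scale > 0 and abs(a - b) / scale > rel_tol:
            changes[path] = (a, b)
    return changes


# Usage with two checked-out copies of a results file (paths are placeholders):
#   old = json.load(open("old/token_savings_results.json"))
#   new = json.load(open("tooling/token_savings_results.json"))
#   print(drifted(old, new))
```

Keys that exist in only one of the two files are ignored here; a stricter check could report them separately.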

1. Orchestration tokens: compile-once vs prompt-loop (recurring jobs)

Claims elsewhere: “90–95% fewer tokens” vs prompt-loop agents on recurring monitors, digests, scheduled jobs; similar figures in OpenClaw bridge / cap-tuner docs for stable paths.

Evidence:

| Artifact | Role |
|----------|------|
| scripts/benchmark_compile_once_run_many.py | Simulates repeated runs; compares compiled AINL vs prompt-loop baselines; writes tooling JSON + injects BENCHMARK.md. |
| Example workloads | examples/benchmark/enterprise_monitor.ainl, examples/workflows/lead_enrichment.ainl, examples/workflows/support_ticket_router.ainl (+ scenarios such as price monitor / ETL QC / RSS digest in the script). |
| Hub | docs/benchmarks.md § Analytical orchestration-token economics. |

Honest scope: Savings are largest when most runs need no LLM (healthy polls, cache hits, deterministic routing). Workloads that invoke an LLM every run (e.g. classify + draft) still gain from IR routing but show smaller ratios — see scenario tables in BENCHMARK.md.
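
The scope note above can be made concrete with a small model. This is a simplified sketch, not the logic of scripts/benchmark_compile_once_run_many.py: it assumes the prompt loop spends a fixed number of tokens on every run, while the compiled graph spends tokens only on the share of runs that reach an LLM step. The function name, parameters, and example numbers are hypothetical.

```python
def orchestration_savings(
    llm_run_fraction: float,
    tokens_per_llm_run: int,
    prompt_loop_tokens_per_run: int,
) -> float:
    """Fraction of tokens saved by a compiled graph relative to a prompt loop.

    Model: the prompt loop pays `prompt_loop_tokens_per_run` on every run;
    the compiled graph pays `tokens_per_llm_run` only on the fraction of
    runs (`llm_run_fraction`) that actually invoke an LLM.
    """
    if not 0.0 <= llm_run_fraction <= 1.0:
        raise ValueError("llm_run_fraction must be between 0 and 1")
    if prompt_loop_tokens_per_run <= 0:
        raise ValueError("prompt_loop_tokens_per_run must be positive")
    compiled_tokens = llm_run_fraction * tokens_per_llm_run
    return 1.0 - compiled_tokens / prompt_loop_tokens_per_run


# Illustrative inputs only (not measured values):
mostly_quiet = orchestration_savings(0.05, 2000, 2000)  # LLM needed on 5% of runs
always_llm = orchestration_savings(1.0, 1200, 2000)     # LLM needed on every run
print(f"{mostly_quiet:.0%} vs {always_llm:.0%}")
```

Under these made-up inputs the quiet monitor saves about 95% and the every-run workload about 40%, which is the shape of the caveat; measured figures belong to BENCHMARK.md.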

2. Routing / classification: LLM-first vs compiled IR (“2–5×” class statements)

Claims elsewhere: Multi-step pipelines spend fewer tokens when routing lives in IR branches instead of repeated LLM orchestration.

Evidence:

| Artifact | Role |
|----------|------|
| scripts/benchmark_token_savings.py | Three-way comparison (vanilla LLM-first, hand-optimized Python, compiled AINL); routing-depth sensitivity; injects BENCHMARK.md. |
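
To see why routing depth matters, here is a rough model of the LLM-first vs compiled-IR comparison. It is a sketch under stated assumptions and not the benchmark script's method: LLM-first is assumed to make one routing call per level of depth before the final task call, and compiled IR is assumed to resolve branches without any LLM call. All names and numbers are hypothetical.

```python
def routing_token_ratio(depth: int, tokens_per_routing_call: int, task_tokens: int) -> float:
    """LLM-first tokens divided by compiled-IR tokens for one pipeline run.

    Model: LLM-first pays `tokens_per_routing_call` once per routing level
    plus `task_tokens` for the final call; compiled IR pays only `task_tokens`.
    """
    if depth < 0 or tokens_per_routing_call < 0:
        raise ValueError("depth and tokens_per_routing_call must be non-negative")
    if task_tokens <= 0:
        raise ValueError("task_tokens must be positive")
    llm_first_tokens = depth * tokens_per_routing_call + task_tokens
    return llm_first_tokens / task_tokens


# Illustrative inputs only: the ratio grows linearly with routing depth.
for depth in (0, 1, 2, 4):
    print(depth, routing_token_ratio(depth, tokens_per_routing_call=400, task_tokens=800))
```

In this model the ratio is 1.0 at depth 0 and rises by `tokens_per_routing_call / task_tokens` per level, so deeper routing trees favor compiled branches more strongly.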

3. Authoring density: AINL vs Python/TypeScript (“3–5×” class statements)

Claims elsewhere: Compact .ainl authoring vs equivalent imperative code generated or written for the same workflow.

Evidence:

| Artifact | Role |
|----------|------|
| scripts/benchmark_authoring_density.py | Token and line counts for paired programs; injects BENCHMARK.md. |
| Complex reference | examples/workflows/data_pipeline.ainl vs verbose baselines under benchmarks/handwritten_baselines/authoring_density/. |

Honest scope: Line-count ratios for complex graphs reach the 3–5× band vs LLM-style verbose Python; simple programs show lower token ratios — see interpretation block in BENCHMARK.md.
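
A density ratio of this kind is just the size of the verbose program divided by the size of the compact one under some measure. The sketch below uses non-blank line count as the default measure and accepts any other callable (for example a token counter); the function names are hypothetical and the sample strings are placeholders, not real AINL or baseline code.

```python
from typing import Callable


def code_lines(source: str) -> int:
    """Count lines that contain something other than whitespace."""
    return sum(1 for line in source.splitlines() if line.strip())


def density_ratio(
    compact: str,
    verbose: str,
    measure: Callable[[str], int] = code_lines,
) -> float:
    """Size of the verbose program divided by the compact one under `measure`."""
    compact_size = measure(compact)
    if compact_size == 0:
        raise ValueError("compact program is empty under this measure")
    return measure(verbose) / compact_size


# Placeholder sources standing in for a paired compact/verbose program.
compact_src = "step one\nstep two\n"
verbose_src = "import x\n\nsetup()\nstep_one()\ncheck()\nstep_two()\ncleanup()\n"
print(density_ratio(compact_src, verbose_src))
```

Swapping `measure` for a token counter gives the token-based ratio, which, as noted above, can differ noticeably from the line-based one on simple programs.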

4. Emit / artifact size (“~1.02×”, “minimal_emit”, viable subset)

Claims elsewhere: README / integration docs cite ~1.02× leverage on tokenizer-aligned viable subset workloads vs unstructured baselines.

Evidence:

| Artifact | Role |
|----------|------|
| scripts/benchmark_size.py | Primary size benchmark; profiles in tooling/artifact_profiles.json. |
| make benchmark / docs/benchmarks.md | Regenerates BENCHMARK.md size tables; separates viable subset vs legacy-inclusive. |

This metric is not the same as §1–3; do not mix emit size with orchestration LLM tokens.

5. Session bootstrap / bridge / startup context (“85–95%” class statements)

Claims elsewhere: Golden-path OpenClaw integration (docs/openclaw/AINL_INTEGRATION_GOLDEN.md), embedding pilots, startup clamps — session token footprint vs full memory dumps.

Evidence: Operational worksheets and live metering — docs/operations/TOKEN_AND_USAGE_OBSERVABILITY.md; bridge sizing docs/openclaw/AINL_AUTO_TUNER.md. This is a different surface from §1 (scheduled graph execution). Where both appear, they are complementary, not duplicate proofs of the same quantity.

6. Field analyses, consultant reports, and agent_reports

Role: Narrative validation, operator experience, and third-party write-ups (agent_reports/, AI_CONSULTANT_REPORT_APOLLO.md, etc.).

Relationship to §1–4: Useful context and quotes; auditable reproduction for headline economics should cite BENCHMARK.md and the scripts/benchmark_*.py family above.
