PRYAJA3 Evals

Golden benchmark gate

Offline deterministic eval results for policy, memory and research contracts. This is the foundation for nightly and shadow release gates.

Latest result

Latest JSON result loaded through the target API.

passing
/workspace/data/evals/pryaja3/latest.json
Total cases: 19
Passed: 19
Failed: 0
Success rate: 100%
Started: Apr 17, 2026, 08:36
Finished: Apr 17, 2026, 08:36
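
A release gate over these numbers can be sketched as follows. This is a minimal illustration, not the actual PRYAJA3 gate: the schema of latest.json is not shown on this page, so the field names `total`, `passed` and `failed`, and the helpers `load_result` and `gate_passes`, are assumptions made for the example.

```python
import json
from pathlib import Path


def load_result(path: str) -> dict:
    """Read a result file from disk (the schema is assumed, see above)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def gate_passes(result: dict) -> bool:
    """Return True only if at least one case ran and every case passed."""
    total = result["total"]    # hypothetical field name
    passed = result["passed"]  # hypothetical field name
    failed = result["failed"]  # hypothetical field name
    # An empty run must not count as a pass, so require total > 0.
    return total > 0 and failed == 0 and passed == total


# Summary values copied from the latest result shown on this page.
latest = {"total": 19, "passed": 19, "failed": 0}
print(gate_passes(latest))  # True
```

Requiring `total > 0` is a deliberate choice in this sketch: a run that silently loads zero cases would otherwise report no failures and slip through the gate.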

Suites

Each suite maps to a target product risk area.

agentic: 7 passed / 0 failed (7 cases)
Offline intent and guard checks for dynamic external workflows.
/home/gpt/pryaja2/packages/evals/golden/agentic.json
memory: 4 passed / 0 failed (4 cases)
Offline memory-write regression cases for durable vs episode-only behavior.
/home/gpt/pryaja2/packages/evals/golden/memory.json
research: 2 passed / 0 failed (2 cases)
Offline research-lane regression cases for query normalization and evidence artifact shape.
/home/gpt/pryaja2/packages/evals/golden/research.json
tool_policy: 3 passed / 0 failed (3 cases)
Offline policy-engine regression cases for canonical ToolPolicyDecision values.
/home/gpt/pryaja2/packages/evals/golden/tool_policy.json
zero_shot: 3 passed / 0 failed (3 cases)
Offline checks for task-first lightweight strategy on simple no-tool questions.
/home/gpt/pryaja2/packages/evals/golden/zero_shot.json
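
The per-suite counts should add up to the overall totals in the latest result. The sketch below checks that arithmetic using the numbers listed above; the `SUITES` mapping and the `summarize` helper are hypothetical names for this example and do not reflect how the eval runner stores its data.

```python
# (passed, failed) per suite, copied from the suite list on this page.
SUITES = {
    "agentic": (7, 0),
    "memory": (4, 0),
    "research": (2, 0),
    "tool_policy": (3, 0),
    "zero_shot": (3, 0),
}


def summarize(suites: dict) -> dict:
    """Aggregate per-suite (passed, failed) pairs into overall totals."""
    passed = sum(p for p, _ in suites.values())
    failed = sum(f for _, f in suites.values())
    total = passed + failed
    # Guard against division by zero when no suites are present.
    rate = 100.0 * passed / total if total else 0.0
    return {"total": total, "passed": passed, "failed": failed, "success_rate": rate}


print(summarize(SUITES))
```

For the five suites listed, this yields 19 total cases, 19 passed, 0 failed and a 100% success rate, matching the summary.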

Failures

Failure details are intentionally visible because evals are a release gate, not a vanity metric.

No failed cases in the latest result.