Coding Agents

Coding agent targets evaluate AI coding assistants and CLI-based agents. These targets require a `judge_target` to run LLM-based evaluators.

Agent providers receive a structured prompt document with two sections: a preread block listing files the agent must read, and the user query containing the eval input.

When an eval test includes `type: file` inputs, agent providers do not receive the file content inline. Instead, they receive:

  1. A preread block with `file://` URIs pointing to absolute paths on disk
  2. The user query with `<file: path="...">` reference tags

The agent is expected to read the files itself using its filesystem tools.
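In code terms, that preread contract amounts to something like the following sketch (`preread` is a hypothetical helper, not a harness API; a real agent would use its own filesystem tools):

```python
from pathlib import Path

def preread(paths: list[str]) -> dict[str, str]:
    """Read every listed file up front, or fail the way the prompt mandates."""
    contents = {}
    for p in paths:
        f = Path(p)
        if not f.is_file():
            # Mirrors the failure mode the prompt document specifies.
            raise SystemExit(f"ERROR: missing-file {f.name}")
        contents[p] = f.read_text()
    return contents
```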

This differs from LLM providers, which receive file content embedded directly in the prompt as XML:

```xml
<file path="src/example.ts">
// file content is inlined here
</file>
```
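For contrast, the inline form is trivial to produce. A hypothetical helper (the function name is illustrative, not part of the harness) could render it like this:

```python
def inline_file(path: str, content: str) -> str:
    """Render a file in the inline XML form LLM providers receive."""
    return f'<file path="{path}">\n{content}\n</file>'

print(inline_file("src/example.ts", "// file content is inlined here"))
```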

Given an eval with a guideline file and a file input:

```yaml
input:
  - role: user
    content:
      - type: file
        value: ./src/example.ts
      - type: text
        value: Review this code
```

The agent receives a prompt like:

```text
Read all guideline files:
* [guidelines.md](file:///abs/path/guidelines.md).
Read all input files:
* [example.ts](file:///abs/path/src/example.ts).
If any file is missing, fail with ERROR: missing-file <filename> and stop.
Then apply system_instructions on the user query below.
[[ ## user_query ## ]]
<file: path="./src/example.ts">
Review this code
```

The preread block instructs the agent to read both guideline and input files before processing the query. If a `system_prompt` is configured on the target, it is passed separately via the provider SDK (not in the prompt document).
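From the agent's side, the file links in a prompt like the one above can be recovered with a simple pattern match. A sketch (the regex and helper name are illustrative, not a harness API):

```python
import re

# Matches markdown links whose URL uses the file:// scheme.
PREREAD_LINK = re.compile(r"\[[^\]]+\]\(file://([^)]+)\)")

def extract_preread_paths(prompt: str) -> list[str]:
    """Collect the absolute paths an agent is asked to preread."""
    return PREREAD_LINK.findall(prompt)

prompt = (
    "Read all guideline files:\n"
    "* [guidelines.md](file:///abs/path/guidelines.md).\n"
    "Read all input files:\n"
    "* [example.ts](file:///abs/path/src/example.ts).\n"
)
print(extract_preread_paths(prompt))
# ['/abs/path/guidelines.md', '/abs/path/src/example.ts']
```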

Claude agent:

```yaml
targets:
  - name: claude_agent
    provider: claude
    workspace_template: ./workspace-templates/my-project
    judge_target: azure_base
```

| Field | Required | Description |
| --- | --- | --- |
| `workspace_template` | No | Path to workspace template directory |
| `cwd` | No | Working directory (mutually exclusive with `workspace_template`) |
| `judge_target` | Yes | LLM target for evaluation |
Codex agent:

```yaml
targets:
  - name: codex_target
    provider: codex
    workspace_template: ./workspace-templates/my-project
    judge_target: azure_base
```

| Field | Required | Description |
| --- | --- | --- |
| `workspace_template` | No | Path to workspace template directory |
| `cwd` | No | Working directory (mutually exclusive with `workspace_template`) |
| `judge_target` | Yes | LLM target for evaluation |
Pi coding agent:

```yaml
targets:
  - name: pi_target
    provider: pi-coding-agent
    workspace_template: ./workspace-templates/my-project
    judge_target: azure_base
```

| Field | Required | Description |
| --- | --- | --- |
| `workspace_template` | No | Path to workspace template directory |
| `cwd` | No | Working directory (mutually exclusive with `workspace_template`) |
| `judge_target` | Yes | LLM target for evaluation |
VS Code:

```yaml
targets:
  - name: vscode_dev
    provider: vscode
    workspace_template: ${{ WORKSPACE_PATH }}
    judge_target: azure_base
```

| Field | Required | Description |
| --- | --- | --- |
| `executable` | No | Path to VS Code binary. Supports `${{ ENV_VAR }}` syntax or literal paths. Defaults to `code` (or `code-insiders` for the insiders provider). |
| `workspace_template` | Yes | Path to workspace template directory |
| `judge_target` | Yes | LLM target for evaluation |

Using a custom executable path:

```yaml
targets:
  - name: vscode_dev
    provider: vscode
    executable: ${{ VSCODE_CMD }}
    workspace_template: ${{ WORKSPACE_PATH }}
    judge_target: azure_base
```
VS Code Insiders:

```yaml
targets:
  - name: vscode_insiders
    provider: vscode-insiders
    workspace_template: ${{ WORKSPACE_PATH }}
    judge_target: azure_base
```

Same configuration as VS Code.

Evaluate any command-line agent:

```yaml
targets:
  - name: local_agent
    provider: cli
    command: 'python agent.py --prompt-file {PROMPT_FILE} --output {OUTPUT_FILE}'
    workspace_template: ./workspace-templates/my-project
    judge_target: azure_base
```

| Field | Required | Description |
| --- | --- | --- |
| `command` | Yes | Command to run. `{PROMPT}` is inline prompt text and `{PROMPT_FILE}` is a temp file path containing the prompt. |
| `workspace_template` | No | Path to workspace template directory |
| `cwd` | No | Working directory (mutually exclusive with `workspace_template`) |
| `judge_target` | Yes | LLM target for evaluation |
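A minimal `agent.py` compatible with the command above might look like the following sketch. It assumes the agent's answer is written to the `{OUTPUT_FILE}` path, and it only echoes the prompt size rather than calling a real model:

```python
"""Minimal stand-in for agent.py: reads {PROMPT_FILE}, writes {OUTPUT_FILE}."""
import sys
from pathlib import Path

def run(prompt_file: str, output_file: str) -> None:
    prompt = Path(prompt_file).read_text()
    # A real agent would call a model or run tools here; this stub echoes.
    Path(output_file).write_text(f"Received {len(prompt)} characters of prompt.")

if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked as: python agent.py --prompt-file <path> --output <path>
    flags = dict(zip(sys.argv[1::2], sys.argv[2::2]))
    run(flags["--prompt-file"], flags["--output"])
```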

For testing the evaluation harness without calling real providers:

```yaml
targets:
  - name: mock_target
    provider: mock
```