Command Line Interface Reference
The Reward Kit provides a command-line interface (CLI) for common operations like previewing evaluations, deploying reward functions, and running agent evaluations.
Installation
The CLI is installed automatically when you install the Reward Kit package:
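For example, assuming the package is published on PyPI under the name `reward-kit`:

```bash
pip install reward-kit
```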
You can verify the installation by running:
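For example, printing the CLI's help text confirms the `reward-kit` entry point is available on your PATH:

```bash
reward-kit --help
```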
Authentication Setup
Before using the CLI, set up your authentication credentials:
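A minimal sketch using the environment variables described later on this page (replace the placeholder values with your own):

```bash
export FIREWORKS_API_KEY="your_api_key"
# Optional: only needed if you use a non-default API base or account
export FIREWORKS_API_BASE="https://api.fireworks.ai"
export FIREWORKS_ACCOUNT_ID="your_account_id"
```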
Command Overview
The Reward Kit CLI supports the following main commands:
- `run`: Run a local evaluation pipeline using a Hydra configuration.
- `preview`: Preview evaluation results or re-evaluate generated outputs.
- `deploy`: Deploy a reward function as an evaluator.
- `agent-eval`: Run agent evaluations on task bundles.
- `list`: List existing evaluators (coming soon).
- `delete`: Delete an evaluator (coming soon).
Run Command (`reward-kit run`)
The `run` command is the primary way to execute local evaluation pipelines. It leverages Hydra for configuration, allowing you to define complex evaluation setups (including dataset loading, model generation, and reward application) in YAML files and easily override parameters from the command line.
Syntax
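The general shape of the command, using the options described below (directory and file names are placeholders):

```bash
reward-kit run --config-path conf --config-name run_my_eval
```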
or
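equivalently, as an assumption about the package's module layout, via the Python module entry point:

```bash
# Assumes the CLI module is reward_kit.cli; adjust if your installation differs
python -m reward_kit.cli run --config-path conf --config-name run_my_eval
```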
Key Options
- `--config-path TEXT`: Path to the directory containing your Hydra configuration files. (Required)
- `--config-name TEXT`: Name of the main Hydra configuration file (e.g., `run_my_eval.yaml`). (Required)
- `--multirun` or `-m`: Run multiple jobs (e.g., for sweeping over parameters). Refer to the Hydra documentation for multi-run usage.
- `--help`: Show the help message for the `run` command.
Hydra Overrides
You can override any parameter defined in your Hydra configuration YAML files directly on the command line. For detailed information on how Hydra is used, refer to the Hydra Configuration for Examples guide.
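For instance, a sketch of command-line overrides; the keys `generation.model_name` and `dataset.max_samples` are illustrative assumptions, so substitute the keys defined in your own YAML files:

```bash
# Override Hydra config values from the command line (keys shown are hypothetical)
reward-kit run --config-path conf --config-name run_my_eval \
  generation.model_name="accounts/fireworks/models/llama-v3p1-8b-instruct" \
  dataset.max_samples=10
```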
Examples
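A couple of illustrative invocations, assuming a `conf/` directory containing a `run_math_eval.yaml` configuration (both names are placeholders):

```bash
# Run the evaluation pipeline defined in conf/run_math_eval.yaml
reward-kit run --config-path conf --config-name run_math_eval

# Sweep over a parameter with Hydra multi-run (the key is hypothetical)
reward-kit run --config-path conf --config-name run_math_eval -m dataset.max_samples=5,50
```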
Output
The `run` command typically generates:
- A timestamped output directory (e.g., `outputs/YYYY-MM-DD/HH-MM-SS/`).
- Inside this directory:
  - `.hydra/`: Contains the full Hydra configuration for the run (for reproducibility).
  - Log files.
  - Result files, often including:
    - `<config_output_name>_results.jsonl` (e.g., `math_example_results.jsonl`): Detailed evaluation results for each sample.
    - `preview_input_output_pairs.jsonl`: Generated prompts and responses, suitable for use with `reward-kit preview`.
- Console output: a summary report is logged to the console, including:
  - Total samples processed.
  - Number of successful evaluations.
  - Number of evaluation errors.
  - Average, min, and max scores (if applicable).
  - Score distribution.
  - Details of the first few errors encountered.
Preview Command (`reward-kit preview`)
The `preview` command allows you to test reward functions with sample data. A primary use case is to inspect or re-evaluate the `preview_input_output_pairs.jsonl` file generated by the `reward-kit run` command. This lets you iterate on reward logic against a fixed set of model generations, or apply different metrics to the same outputs. You can also use it with manually created sample files.
Syntax
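The general shape of the command, using the options described below (paths are placeholders):

```bash
reward-kit preview \
  --metrics-folders "my_metric=path/to/metric_script_dir" \
  --samples path/to/samples.jsonl
```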
Options
- `--metrics-folders`: Specify local metric scripts to apply, in the format `name=path/to/metric_script_dir`. The directory should contain a `main.py` with a `@reward_function`.
- `--samples`: Path to a JSONL file containing sample conversations or prompt/response pairs. This is typically the `preview_input_output_pairs.jsonl` file from a `reward-kit run` output directory.
- `--remote-url`: (Optional) URL of a deployed evaluator to use for scoring instead of local `--metrics-folders`.
- `--max-samples`: Maximum number of samples to process (optional).
- `--output`: Path to save preview results (optional).
- `--verbose`: Enable verbose output (optional).
Examples
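Illustrative invocations; every path, metric name, and URL here is a placeholder:

```bash
# Re-score generations produced by an earlier `reward-kit run`
reward-kit preview \
  --metrics-folders "word_count=./metrics/word_count" \
  --samples ./outputs/2025-01-01/12-00-00/preview_input_output_pairs.jsonl \
  --max-samples 10 \
  --output preview_results.jsonl

# Score the same samples with a deployed evaluator instead of local metrics
reward-kit preview \
  --remote-url https://your-evaluator-endpoint.example.com \
  --samples ./outputs/2025-01-01/12-00-00/preview_input_output_pairs.jsonl
```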
Sample File Format
The samples file should be a JSONL (JSON Lines) file. If it is the output from `reward-kit run` (`preview_input_output_pairs.jsonl`), each line typically contains a "messages" list (including system, user, and assistant turns) and optionally a "ground_truth" field. If creating the file manually, a common format is:
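A minimal sketch of a manually created file, with one JSON object per line (the conversation content is purely illustrative):

```json
{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}
```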
Or, if you have ground truth for comparison:
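For example (again one JSON object per line; what `ground_truth` contains depends on what your metric expects):

```json
{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}], "ground_truth": "Paris"}
```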
Deploy Command
The `deploy` command deploys a reward function as an evaluator on the Fireworks platform.
Syntax
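The general shape of the command (angle brackets mark values you supply):

```bash
reward-kit deploy --id <evaluator-id> --metrics-folders "<name>=<path/to/metric_dir>"
```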
Options
- `--id`: ID for the deployed evaluator (required)
- `--metrics-folders`: Specify metrics to use in the format `name=path` (required)
- `--display-name`: Human-readable name for the evaluator (optional)
- `--description`: Description of the evaluator (optional)
- `--force`: Overwrite if an evaluator with the same ID already exists (optional)
- `--providers`: List of model providers to use (optional)
- `--verbose`: Enable verbose output (optional)
Examples
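An illustrative deployment; the ID, paths, and descriptive text are placeholders:

```bash
reward-kit deploy \
  --id my-quality-evaluator \
  --metrics-folders "quality=./my_metrics/quality" \
  --display-name "Response Quality Evaluator" \
  --description "Scores responses for clarity and correctness" \
  --force
```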
Common Workflows
Iterative Development Workflow
A typical development workflow using the CLI often starts with `reward-kit run`:

1. Configure: Set up your dataset and evaluation parameters in Hydra YAML files (e.g., `conf/dataset/my_data.yaml`, `conf/run_my_eval.yaml`). Define or reference your reward function logic.
2. Run: Execute the evaluation pipeline using `reward-kit run`. This generates model responses and initial scores.
3. Analyze & Iterate:
   - Examine the detailed results (`*_results.jsonl`) and the `preview_input_output_pairs.jsonl` from the output directory.
   - If iterating on reward logic, use `reward-kit preview` with the `preview_input_output_pairs.jsonl` and your updated local metric script.
   - Refine your reward function code or Hydra configurations.
4. Re-run: If configurations changed significantly or you need new model generations, re-run `reward-kit run`.
5. Deploy: Once satisfied with the evaluator's performance and configuration, deploy it with `reward-kit deploy` (a sketch follows after the note below).

(Note: The `--metrics-folders` for `deploy` should point to the finalized reward function script(s) you intend to deploy as the evaluator.)
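A sketch of that final step (the ID and path are placeholders):

```bash
reward-kit deploy \
  --id my-final-evaluator \
  --metrics-folders "my_metric=./final_metrics/my_metric"
```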
Comparing Multiple Metrics
You can preview multiple metrics to compare their performance:
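A sketch of what that might look like, assuming `--metrics-folders` accepts several `name=path` pairs (the exact multi-value syntax is an assumption; check `reward-kit preview --help`):

```bash
# Multi-value syntax for --metrics-folders is assumed; verify with --help
reward-kit preview \
  --metrics-folders "clarity=./metrics/clarity" "accuracy=./metrics/accuracy" \
  --samples ./outputs/2025-01-01/12-00-00/preview_input_output_pairs.jsonl
```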
Deployment with Custom Providers
You can deploy with specific model providers:
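A sketch, assuming `--providers` takes provider names directly; the value format here is an assumption, so check `reward-kit deploy --help` for the exact syntax:

```bash
# The --providers value format shown here is hypothetical
reward-kit deploy \
  --id my-evaluator \
  --metrics-folders "quality=./my_metrics/quality" \
  --providers fireworks
```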
Agent-Eval Command
The `agent-eval` command enables you to run agent evaluations using task bundles.
Syntax
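The general shape of the command (the path is a placeholder):

```bash
reward-kit agent-eval --task-dir path/to/task_bundle
```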
Options
Task Specification:
- `--task-dir`: Path to a task bundle directory containing `reward.py`, `tools.py`, etc.
- `--dataset` or `-d`: Path to a JSONL file containing task specifications.
Output and Models:
- `--output-dir` or `-o`: Directory to store evaluation runs (default: `./runs`).
- `--model`: Override the `MODEL_AGENT` environment variable.
- `--sim-model`: Override the `MODEL_SIM` environment variable for the simulated user.
Testing and Debugging:
- `--no-sim-user`: Disable the simulated user (use static initial messages only).
- `--test-mode`: Run in test mode without requiring API keys.
- `--mock-response`: Use a mock agent response (works with `--test-mode`).
- `--debug`: Enable detailed debug logging.
- `--validate-only`: Validate the task bundle structure without running an evaluation.
- `--export-tools`: Export tool specifications to a directory for manual testing.
Advanced Options:
- `--task-ids`: Comma-separated list of task IDs to run.
- `--max-tasks`: Maximum number of tasks to evaluate.
- `--registries`: Custom tool registries in the format `name=path`.
- `--registry-override`: Override all toolset paths with this registry path.
- `--evaluator`: Custom evaluator module path (overrides the default).
Examples
Note: The following examples use `examples/your_agent_task_bundle/` as a placeholder. Replace it with the actual path to your task bundle directory.
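Some illustrative invocations using the options described above:

```bash
# Validate the task bundle structure without running an evaluation
reward-kit agent-eval --task-dir examples/your_agent_task_bundle/ --validate-only

# Dry run without real API keys, using a mock agent response
reward-kit agent-eval --task-dir examples/your_agent_task_bundle/ --test-mode --mock-response

# Full evaluation with an explicit agent model and output directory
reward-kit agent-eval \
  --task-dir examples/your_agent_task_bundle/ \
  --model openai/gpt-4o-mini \
  --output-dir ./runs
```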
Task Bundle Structure
A task bundle is a directory containing the following files:
- `reward.py`: Reward function with the `@reward_function` decorator
- `tools.py`: Tool registry with tool definitions
- `task.jsonl`: Dataset rows with task specifications
- `seed.sql` (optional): Initial database state
See the Agent Evaluation guide for more details.
Environment Variables
The CLI recognizes the following environment variables:
- `FIREWORKS_API_KEY`: Your Fireworks API key (required for deployment operations)
- `FIREWORKS_API_BASE`: Base URL for the Fireworks API (defaults to `https://api.fireworks.ai`)
- `FIREWORKS_ACCOUNT_ID`: Your Fireworks account ID (optional, can be configured in `auth.ini`)
- `MODEL_AGENT`: Default agent model to use (e.g., `openai/gpt-4o-mini`)
- `MODEL_SIM`: Default simulation model to use (e.g., `openai/gpt-3.5-turbo`)
Troubleshooting
Common Issues
- Authentication Errors
  - Solution: Ensure `FIREWORKS_API_KEY` is correctly set.
- Metrics Folder Not Found
  - Solution: Check that the path exists and contains a valid `main.py` file.
- Invalid Sample File
  - Solution: Verify the sample file is in the correct JSONL format.
- Deployment Permission Issues
  - Solution: Use a production API key with deployment permissions or request additional permissions.
- Task Bundle Validation Errors
  - Solution: Ensure your task bundle has all required files.
- Model API Key Not Set
  - Solution: Set the `MODEL_AGENT` environment variable or use the `--model` parameter.
- Import Errors with Task Bundle
  - Solution: Check that the Python path is correct and the module can be imported.
Getting Help
For additional help, use the `--help` flag with any command:
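For example, each of the following prints the usage and options for the corresponding command:

```bash
reward-kit --help
reward-kit run --help
reward-kit preview --help
reward-kit deploy --help
reward-kit agent-eval --help
```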
Next Steps
- Explore the Developer Guide for conceptual understanding
- Try the Creating Your First Reward Function tutorial
- Learn about Agent Evaluation to create your own task bundles
- See Examples for practical implementations