Command Line Interface Reference

The Reward Kit provides a command-line interface (CLI) for common operations like previewing evaluations, deploying reward functions, and running agent evaluations.

Installation

When you install the Reward Kit, the CLI is automatically installed:

pip install reward-kit

You can verify the installation by running:

reward-kit --help

Authentication Setup

Before using the CLI, set up your authentication credentials:

# Set your API key
export FIREWORKS_API_KEY=your_api_key

# Optional: Set the API base URL (for development environments)
export FIREWORKS_API_BASE=https://api.fireworks.ai

Command Overview

The Reward Kit CLI supports the following main commands:

  • run: Run a local evaluation pipeline using a Hydra configuration.
  • preview: Preview evaluation results or re-evaluate generated outputs.
  • deploy: Deploy a reward function as an evaluator.
  • agent-eval: Run agent evaluations on task bundles.
  • list: List existing evaluators (coming soon).
  • delete: Delete an evaluator (coming soon).

Run Command (reward-kit run)

The run command is the primary way to execute local evaluation pipelines. It leverages Hydra for configuration, allowing you to define complex evaluation setups (including dataset loading, model generation, and reward application) in YAML files and easily override parameters from the command line.

Syntax

python -m reward_kit.cli run [options] [HYDRA_OVERRIDES...]

or

reward-kit run [options] [HYDRA_OVERRIDES...]

Key Options

  • --config-path TEXT: Path to the directory containing your Hydra configuration files. (Required)
  • --config-name TEXT: Name of the main Hydra configuration file (e.g., run_my_eval.yaml). (Required)
  • --multirun or -m: Run multiple jobs (e.g., for sweeping over parameters). Refer to the Hydra documentation for multi-run usage.
  • --help: Show help message for the run command.

Hydra Overrides

You can override any parameter defined in your Hydra configuration YAML files directly on the command line. For detailed information on how Hydra is used, refer to the Hydra Configuration for Examples guide.

Examples

# Basic usage, running an evaluation defined in examples/math_example/conf/run_math_eval.yaml
python -m reward_kit.cli run \
  --config-path examples/math_example/conf \
  --config-name run_math_eval.yaml

# Override the number of samples to process and the model name
python -m reward_kit.cli run \
  --config-path examples/math_example/conf \
  --config-name run_math_eval.yaml \
  evaluation_params.limit_samples=10 \
  generation.model_name="accounts/fireworks/models/mixtral-8x7b-instruct"

Output

The run command typically generates:

  • A timestamped output directory (e.g., outputs/YYYY-MM-DD/HH-MM-SS/) containing:
    • .hydra/: The full Hydra configuration for the run (for reproducibility).
    • Log files.
    • Result files, often including:
      • <config_output_name>_results.jsonl (e.g., math_example_results.jsonl): Detailed evaluation results for each sample (see the sketch after this list).
      • preview_input_output_pairs.jsonl: Generated prompts and responses, suitable for use with reward-kit preview.
  • A summary report logged to the console, including:
    • Total samples processed.
    • Number of successful evaluations.
    • Number of evaluation errors.
    • Average, min, and max scores (if applicable).
    • Score distribution.
    • Details of the first few errors encountered.
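
The results files are plain JSONL, so they can be inspected with standard tooling. The sketch below summarizes scores from a results file in Python; the score field name is an assumption rather than a documented schema, so check the keys in your own output first.

import json

# Sketch: summarize a results file from a `reward-kit run` output directory.
# The record schema is not fixed here; "score" is an assumed key name, so
# adjust it to match the fields in your own results file.
path = "outputs/YYYY-MM-DD/HH-MM-SS/math_example_results.jsonl"

scores = []
with open(path) as f:
    for line in f:
        record = json.loads(line)
        if "score" in record:  # assumed key; inspect your file first
            scores.append(record["score"])

print(f"Records with a score: {len(scores)}")
if scores:
    print(f"avg={sum(scores) / len(scores):.3f} min={min(scores)} max={max(scores)}")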

Preview Command (reward-kit preview)

The preview command lets you test reward functions with sample data. A primary use case is to inspect or re-evaluate the preview_input_output_pairs.jsonl file generated by the reward-kit run command, so you can iterate on reward logic against a fixed set of model generations or apply different metrics to the same outputs.

You can also use it with manually created sample files.

Syntax

reward-kit preview [options]

Options

  • --metrics-folders: Specify local metric scripts to apply, in the format “name=path/to/metric_script_dir”. Each directory should contain a main.py with a function decorated with @reward_function (a minimal sketch follows this list).
  • --samples: Path to a JSONL file containing sample conversations or prompt/response pairs. This is typically the preview_input_output_pairs.jsonl file from a reward-kit run output directory.
  • --remote-url: (Optional) URL of a deployed evaluator to use for scoring, instead of local --metrics-folders.
  • --max-samples: Maximum number of samples to process (optional)
  • --output: Path to save preview results (optional)
  • --verbose: Enable verbose output (optional)
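
For reference, a metric script is a small Python module. The sketch below shows what a minimal main.py might look like; the import path, the EvaluateResult and MetricResult types, and the evaluate signature are assumptions based on the decorator named above, so confirm them against the reward function authoring documentation.

# main.py -- minimal metric script sketch (assumed API: verify the import
# path, result types, and signature against the reward function authoring docs).
from typing import Any, Dict, List, Optional

from reward_kit import EvaluateResult, MetricResult, reward_function

@reward_function
def evaluate(
    messages: List[Dict[str, Any]],
    ground_truth: Optional[str] = None,
    **kwargs: Any,
) -> EvaluateResult:
    """Toy metric: score the assistant's last message by length."""
    reply = messages[-1]["content"] if messages else ""
    score = min(len(reply) / 500.0, 1.0)  # placeholder heuristic; replace with real logic
    return EvaluateResult(
        score=score,
        reason="Toy length-based score",
        metrics={"length": MetricResult(score=score, success=score > 0.5, reason="Length heuristic")},
    )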

Examples

# Previewing output from a `reward-kit run` command with a local metric
reward-kit preview \
  --samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
  --metrics-folders "my_custom_metric=./path/to/my_custom_metric"

# Previewing with multiple local metrics
reward-kit preview \
  --samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
  --metrics-folders "metric1=./metrics/metric1" "metric2=./metrics/metric2"

# Limit sample count
reward-kit preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl --max-samples 5

# Save results to file
reward-kit preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl --output ./results.json

Sample File Format

The samples file should be a JSONL (JSON Lines) file. If it’s the output from reward-kit run (preview_input_output_pairs.jsonl), each line typically contains a “messages” list (including system, user, and assistant turns) and optionally a “ground_truth” field. If creating manually, a common format is:

{"messages": [{"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a method of data analysis..."}]}

Or, if you have ground truth for comparison:

{"messages": [{"role": "user", "content": "Question..."}, {"role": "assistant", "content": "Model answer..."}], "ground_truth": "Reference answer..."}

Deploy Command (reward-kit deploy)

The deploy command deploys a reward function as an evaluator on the Fireworks platform.

Syntax

reward-kit deploy [options]

Options

  • --id: ID for the deployed evaluator (required)
  • --metrics-folders: Specify metrics to use in the format “name=path” (required)
  • --display-name: Human-readable name for the evaluator (optional)
  • --description: Description of the evaluator (optional)
  • --force: Overwrite if an evaluator with the same ID already exists (optional)
  • --providers: List of model providers to use (optional)
  • --verbose: Enable verbose output (optional)

Examples

# Basic deployment
reward-kit deploy --id my-evaluator --metrics-folders "clarity=./my_metrics/clarity"

# With display name and description
reward-kit deploy --id my-evaluator \
  --metrics-folders "clarity=./my_metrics/clarity" \
  --display-name "Clarity Evaluator" \
  --description "Evaluates responses based on clarity"

# Force overwrite existing evaluator
reward-kit deploy --id my-evaluator \
  --metrics-folders "clarity=./my_metrics/clarity" \
  --force

# Multiple metrics
reward-kit deploy --id comprehensive-evaluator \
  --metrics-folders "clarity=./my_metrics/clarity" "accuracy=./my_metrics/accuracy" \
  --display-name "Comprehensive Evaluator"

Common Workflows

Iterative Development Workflow

A typical development workflow with the CLI starts with reward-kit run:

  1. Configure: Set up your dataset and evaluation parameters in Hydra YAML files (e.g., conf/dataset/my_data.yaml, conf/run_my_eval.yaml). Define or reference your reward function logic.
  2. Run: Execute the evaluation pipeline using reward-kit run. This generates model responses and initial scores.
    python -m reward_kit.cli run --config-path ./conf --config-name run_my_eval.yaml
    
  3. Analyze & Iterate:
    • Examine the detailed results (*_results.jsonl) and the preview_input_output_pairs.jsonl from the output directory.
    • If iterating on reward logic, you can use reward-kit preview with the preview_input_output_pairs.jsonl and your updated local metric script.
    reward-kit preview \
      --samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
      --metrics-folders "my_refined_metric=./path/to/refined_metric"
    
    • Refine your reward function code or Hydra configurations.
  4. Re-run: If configurations changed significantly or you need new model generations, re-run reward-kit run.
  5. Deploy: Once satisfied with the evaluator’s performance and configuration:
    reward-kit deploy --id my-evaluator-id \
      --metrics-folders "my_final_metric=./path/to/final_metric" \
      --display-name "My Final Evaluator" \
      --description "Description of my evaluator" \
      --force
    
    (Note: The --metrics-folders for deploy should point to the finalized reward function script(s) you intend to deploy as the evaluator.)

Comparing Multiple Metrics

You can preview multiple metrics to compare their performance:

# Preview with multiple metrics
reward-kit preview \
  --metrics-folders \
  "metric1=./my_metrics/metric1" \
  "metric2=./my_metrics/metric2" \
  "metric3=./my_metrics/metric3" \
  --samples ./samples.jsonl

Deployment with Custom Providers

You can deploy with specific model providers:

# Deploy with custom provider
reward-kit deploy --id my-evaluator \
  --metrics-folders "clarity=./my_metrics/clarity" \
  --providers '[{"providerType":"anthropic","modelId":"claude-3-sonnet-20240229"}]'
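
If the inline JSON is awkward to quote in your shell, you can build the string with a short Python helper; the key names simply mirror the example above.

import json

# Build the --providers argument as a single JSON string.
providers = [{"providerType": "anthropic", "modelId": "claude-3-sonnet-20240229"}]
print(json.dumps(providers))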

Agent-Eval Command (reward-kit agent-eval)

The agent-eval command enables you to run agent evaluations using task bundles.

Syntax

reward-kit agent-eval [options]

Options

Task Specification:

  • --task-dir: Path to task bundle directory containing reward.py, tools.py, etc.
  • --dataset or -d: Path to JSONL file containing task specifications.

Output and Models:

  • --output-dir or -o: Directory to store evaluation runs (default: “./runs”).
  • --model: Override MODEL_AGENT environment variable.
  • --sim-model: Override MODEL_SIM environment variable for simulated user.

Testing and Debugging:

  • --no-sim-user: Disable simulated user (use static initial messages only).
  • --test-mode: Run in test mode without requiring API keys.
  • --mock-response: Use a mock agent response (works with --test-mode).
  • --debug: Enable detailed debug logging.
  • --validate-only: Validate task bundle structure without running evaluation.
  • --export-tools: Export tool specifications to directory for manual testing.

Advanced Options:

  • --task-ids: Comma-separated list of task IDs to run.
  • --max-tasks: Maximum number of tasks to evaluate.
  • --registries: Custom tool registries in format ‘name=path’.
  • --registry-override: Override all toolset paths with this registry path.
  • --evaluator: Custom evaluator module path (overrides default).

Examples

Note: The following examples use examples/your_agent_task_bundle/ as a placeholder. You will need to replace this with the actual path to your task bundle directory.

# Run agent evaluation with default settings, assuming MODEL_AGENT is set
export MODEL_AGENT=openai/gpt-4o-mini # Example model
reward-kit agent-eval --task-dir examples/your_agent_task_bundle/

# Use a specific dataset file from your task bundle
reward-kit agent-eval --dataset examples/your_agent_task_bundle/task.jsonl --task-dir examples/your_agent_task_bundle/

# Run in test mode (no API keys required)
reward-kit agent-eval --task-dir examples/your_agent_task_bundle/ --test-mode --mock-response

# Validate task bundle structure without running
reward-kit agent-eval --task-dir examples/your_agent_task_bundle/ --validate-only

# Use a custom model and limit to specific tasks
reward-kit agent-eval --task-dir examples/your_agent_task_bundle/ \
  --model anthropic/claude-3-opus-20240229 \
  --task-ids your_task.id.001,your_task.id.002

# Export tool specifications for manual testing
reward-kit agent-eval --task-dir examples/your_agent_task_bundle/ --export-tools ./tool_specs

Task Bundle Structure

A task bundle is a directory containing the following files:

  • reward.py: Reward function with @reward_function decorator
  • tools.py: Tool registry with tool definitions
  • task.jsonl: Dataset rows with task specifications
  • seed.sql (optional): Initial database state

See the Agent Evaluation guide for more details.
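
As a quick pre-flight check before invoking the CLI, you can confirm that a bundle directory has the expected layout with a few lines of Python. This simply mirrors the file list above; --validate-only remains the authoritative check.

from pathlib import Path

# Check that a task bundle directory contains the files listed above.
# seed.sql is optional, so it is only reported, not required.
bundle = Path("examples/your_agent_task_bundle")
required = ["reward.py", "tools.py", "task.jsonl"]
missing = [name for name in required if not (bundle / name).exists()]

if missing:
    print(f"Missing required files: {', '.join(missing)}")
else:
    print("All required files present.")
print("seed.sql present:", (bundle / "seed.sql").exists())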

Environment Variables

The CLI recognizes the following environment variables:

  • FIREWORKS_API_KEY: Your Fireworks API key (required for deployment operations)
  • FIREWORKS_API_BASE: Base URL for the Fireworks API (defaults to https://api.fireworks.ai)
  • FIREWORKS_ACCOUNT_ID: Your Fireworks account ID (optional, can be configured in auth.ini)
  • MODEL_AGENT: Default agent model to use (e.g., “openai/gpt-4o-mini”)
  • MODEL_SIM: Default simulation model to use (e.g., “openai/gpt-3.5-turbo”)

Troubleshooting

Common Issues

  1. Authentication Errors:

    Error: Authentication failed. Check your API key.
    

    Solution: Ensure FIREWORKS_API_KEY is correctly set.

  2. Metrics Folder Not Found:

    Error: Metrics folder not found: ./my_metrics/clarity
    

    Solution: Check that the path exists and contains a valid main.py file.

  3. Invalid Sample File:

    Error: Failed to parse sample file. Ensure it's a valid JSONL file.
    

    Solution: Verify the sample file is in the correct JSONL format.

  4. Deployment Permission Issues:

    Error: Permission denied. Your API key doesn't have deployment permissions.
    

    Solution: Use a production API key with deployment permissions or request additional permissions.

  5. Task Bundle Validation Errors:

    Error: Missing required files in task bundle: tools.py, reward.py
    

    Solution: Ensure your task bundle has all required files.

  6. Model API Key Not Set:

    Warning: MODEL_AGENT environment variable is not set
    

    Solution: Set the MODEL_AGENT environment variable or use the --model parameter.

  7. Import Errors with Task Bundle:

    Error: Failed to import tool registry from example.task.tools
    

    Solution: Check that the Python path is correct and the module can be imported.

Getting Help

For additional help, use the --help flag with any command:

reward-kit --help
reward-kit preview --help
reward-kit deploy --help
reward-kit agent-eval --help

Next Steps