# APPS Coding Example
This guide explains how to use the `reward-kit run` command to evaluate code generation models on a sample of the `codeparrot/apps` dataset. This example focuses on checking the parsability of generated Python code.
## Overview
- Dataset: A sample from `codeparrot/apps`, a dataset of programming problems and solutions. The specific dataset configuration used is `apps_full_prompts` (defined in `conf/dataset/apps_full_prompts.yaml`), which typically points to a pre-generated JSONL file.
- Task: Given a problem description (`question`), the model should generate a Python code solution.
- Reward Function: The evaluation uses `reward_kit.rewards.apps_coding_reward.evaluate_apps_solution`.
  - Functionality: In its current form for this example, this reward function performs a basic check to see if the generated Python code is parsable by Python’s `ast.parse` function. It scores `1.0` if the code is parsable and `0.0` otherwise (see the quick check after this list).
  - It does not execute the code or check for functional correctness against test cases in this simplified setup.
  - The `ground_truth_for_eval` field (derived from APPS’ `input_output` field) is available to the reward function but not utilized by this initial parsability check.
- System Prompt: A default system prompt is provided in the configuration (`examples/apps_coding_example/conf/run_eval.yaml`) to guide the model.
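The parsability criterion above can be reproduced from the shell. A minimal sketch, assuming a generated solution has been saved to a hypothetical `solution.py`:

```bash
# Mirror the reward function's check: does the code parse with ast.parse?
# (solution.py is a hypothetical file holding one generated solution.)
python -c "import ast, sys; ast.parse(open(sys.argv[1]).read())" solution.py \
  && echo "parsable -> score 1.0" \
  || echo "not parsable -> score 0.0"
```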
## Setup
- Environment: Ensure your Python environment is set up with `reward-kit` and its development dependencies installed. If you haven’t already, install them from the root of the repository (see the sketch after this list).
- API Key: The default configuration uses a Fireworks AI model (`accounts/fireworks/models/deepseek-v3-0324`) for code generation. Make sure your `FIREWORKS_API_KEY` is set in your environment or in a `.env` file in the project root.
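A minimal setup sketch; the `.[dev]` extras target is an assumption, so check the project’s installation docs for the exact target:

```bash
# From the repository root: editable install with development dependencies.
pip install -e ".[dev]"

# Fireworks API key, either exported in the shell or placed in a .env file
# at the repo root.
export FIREWORKS_API_KEY=your_key_here
```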
## Data Preparation (Informational)
The example typically uses a pre-generated sample of prompts from the `codeparrot/apps` dataset. The default run configuration (`run_eval.yaml`) references `apps_full_prompts`, which points to `development/CODING_DATASET.jsonl`.
If you wish to regenerate this sample or create a different one (this is for informational purposes, not required to run the example with defaults):
- The script `scripts/convert_apps_to_prompts.py` can convert the raw Hugging Face `codeparrot/apps` dataset into the JSONL format expected by the pipeline.
- The source dataset configuration for raw APPS data is defined in `conf/dataset/apps_source.yaml`.
- An example command to generate 5 samples from the ‘test’ split:
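The following is an illustrative invocation only; the flag names are assumptions, so check `python scripts/convert_apps_to_prompts.py --help` for the real interface:

```bash
# Hypothetical flags shown for illustration; consult the script's --help.
python scripts/convert_apps_to_prompts.py \
  --split test \
  --num-samples 5 \
  --output-file development/CODING_DATASET.jsonl
```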
## Running the Evaluation
The evaluation is configured in `examples/apps_coding_example/conf/run_eval.yaml`. This is the main configuration file used by Hydra.
To run the evaluation using the `reward-kit run` command:
- Ensure your virtual environment is activated:
- Execute the run command from the root of the repository:
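A sketch of both steps, assuming a `.venv` virtual environment and Hydra’s standard `--config-dir`/`--config-name` flags (the exact `reward-kit run` invocation may differ):

```bash
# 1. Activate your virtual environment (path may differ in your setup).
source .venv/bin/activate

# 2. From the repository root, run the evaluation. The flags follow Hydra
#    conventions and are an assumption about the exact CLI.
reward-kit run --config-dir examples/apps_coding_example/conf --config-name run_eval
```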
## Overriding Parameters
You can override parameters from the `run_eval.yaml` configuration directly from the command line. For example:
- Limit the number of samples for a quick test.
- Disable code generation to test the reward function with cached responses. If you have previously run the example and responses are cached (default cache dir: `outputs/generated_responses_cache_apps/`), you can disable new generation.
- Change the generation model.
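The corresponding Hydra-style overrides might look like the following. `generation.enabled` appears in this example’s configuration; the other key names (`evaluation.limit_samples`, `generation.model_name`) and the alternative model name are assumptions used to illustrate the override syntax:

```bash
# Limit the number of samples for a quick test (key name assumed):
reward-kit run --config-dir examples/apps_coding_example/conf --config-name run_eval \
  evaluation.limit_samples=2

# Disable generation and reuse cached responses:
reward-kit run --config-dir examples/apps_coding_example/conf --config-name run_eval \
  generation.enabled=false

# Change the generation model (key and model name are illustrative):
reward-kit run --config-dir examples/apps_coding_example/conf --config-name run_eval \
  generation.model_name=accounts/fireworks/models/llama-v3p1-8b-instruct
```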
## Expected Output
The `reward-kit run` command will:
- Load prompts based on the `apps_full_prompts` dataset configuration (typically from `development/CODING_DATASET.jsonl`).
- If `generation.enabled` is `true` (the default), generate code solutions using the configured model. Responses are cached (default: `outputs/generated_responses_cache_apps/`).
- Evaluate each generated solution using the `evaluate_apps_solution` reward function (checking for Python AST parsability).
- Print a summary of results to the console.
- Save detailed evaluation results to a JSONL file in a timestamped directory. The default output path is configured in `run_eval.yaml` as `./outputs/apps_coding_example/${now:%Y-%m-%d}/${now:%H-%M-%S}`. The results file will be named `apps_coding_example_results.jsonl` within that directory.
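Once a run completes, you can inspect the results file, for example:

```bash
# Peek at the first result record; the wildcards match the timestamped
# date/time directories described above.
head -n 1 outputs/apps_coding_example/*/*/apps_coding_example_results.jsonl
```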