Dataset configuration files live in `conf/dataset/` or within an example's `conf/dataset/` directory (e.g., `examples/math_example/conf/dataset/`). They are processed by the `reward_kit.datasets.loader` module using Hydra. There are two main types of dataset configurations: Base Datasets and Derived Datasets.
**Base Datasets.** Schema: `conf/dataset/base_dataset.yaml`; concrete example: `examples/math_example/conf/dataset/gsm8k.yaml`. Fields:
- **`_target_`** (Required): Must be `reward_kit.datasets.loader.load_and_process_dataset`. Example: `_target_: reward_kit.datasets.loader.load_and_process_dataset`
- **`source_type`** (Required): The type of data source. One of:
  - `"huggingface"`: For datasets hosted on the Hugging Face Hub.
  - `"jsonl"`: For local datasets in JSON Lines format.
  - `"fireworks"`: (Not yet implemented) For datasets hosted on Fireworks AI.

  Example: `source_type: huggingface`
- **`path_or_name`** (Required): For `huggingface`, the Hugging Face dataset name (e.g., `"gsm8k"`, `"cais/mmlu"`); for `jsonl`, the path to the `.jsonl` file (e.g., `"data/my_data.jsonl"`). Example: `path_or_name: "gsm8k"`
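As a sketch, the required fields above can be combined into a minimal JSONL-backed base dataset configuration (the file path here is illustrative, not a file shipped with the project):

```yaml
# Minimal base dataset config for a local JSON Lines file.
# path_or_name below is an illustrative path, not a real file.
_target_: reward_kit.datasets.loader.load_and_process_dataset
source_type: jsonl
path_or_name: "data/my_data.jsonl"
```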
- **`split`** (Optional): The dataset split to load (e.g., `"train"`, `"test"`, `"validation"`). If loading a Hugging Face `DatasetDict` or multiple JSONL files mapped via `data_files`, this selects the split after loading. Default: `"train"`. Example: `split: "test"`
- **`config_name`** (Optional): The dataset configuration name (e.g., `"main"` or `"all"` for `gsm8k`). Corresponds to the `name` parameter in Hugging Face's `load_dataset`. Default: `null`. Example: `config_name: "main"` (for `gsm8k`).
- **`data_files`** (Optional): Passed through to `datasets.load_dataset`. Can be a single file path, a list, or a dictionary mapping split names to file paths. Example: `data_files: {"train": "path/to/train.jsonl", "test": "path/to/test.jsonl"}`
- **`max_samples`** (Optional): Maximum number of samples to load (applied per split if a `DatasetDict` is loaded). If `null` or `0`, all samples are loaded. Default: `null`. Example: `max_samples: 100`
- **`column_mapping`** (Optional): A dictionary whose keys are the target column names (e.g., `"query"`, `"ground_truth"`) and whose values are the original column names in the source dataset. This mapping is applied by the `reward_kit.datasets.loader` module. Default: `{"query": "query", "ground_truth": "ground_truth", "solution": null}`. See `gsm8k.yaml` for a concrete example.
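The effect of a `column_mapping` can be sketched in plain Python. This is a simplified stand-in for the logic in `reward_kit.datasets.loader` (which operates on Hugging Face dataset columns, not single dicts); the helper name, the `question`/`answer` column names, and the treatment of a `null` source (skipping that target) are assumptions for illustration:

```python
# Simplified sketch: apply a column_mapping to one record.
# Assumption: a null (None) source column simply skips that target.
def apply_column_mapping(record, column_mapping):
    mapped = {}
    for target, source in column_mapping.items():
        if source is None:
            continue  # e.g. "solution": null in the default mapping
        if source in record:
            mapped[target] = record[source]
    return mapped

raw = {"question": "What is 2 + 2?", "answer": "4"}
mapping = {"query": "question", "ground_truth": "answer", "solution": None}
print(apply_column_mapping(raw, mapping))
# {'query': 'What is 2 + 2?', 'ground_truth': '4'}
```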
- **`preprocessing_steps`** (Optional): A list of fully qualified paths to preprocessing functions (e.g., `"reward_kit.datasets.loader.transform_codeparrot_apps_sample"`). These functions are applied to the dataset after loading and before column mapping. Default: `[]`. Example: `preprocessing_steps: ["my_module.my_preprocessor_func"]`
- **`hf_extra_load_params`** (Optional): A dictionary of extra keyword arguments passed to `datasets.load_dataset()` (e.g., `trust_remote_code: True`). Default: `{}`. Example: `hf_extra_load_params: {trust_remote_code: True}`
- **`description`** (Optional, Metadata): A human-readable description of the dataset. Example: `description: "GSM8K (Grade School Math 8K) dataset."`
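Putting the base dataset fields together, a configuration in the spirit of `examples/math_example/conf/dataset/gsm8k.yaml` might look like the following sketch (the exact values in that file may differ; only `_target_` and the documented defaults are taken from this page):

```yaml
# Illustrative base dataset config; field values other than _target_
# are example choices, not necessarily those in gsm8k.yaml.
_target_: reward_kit.datasets.loader.load_and_process_dataset
source_type: huggingface
path_or_name: "gsm8k"
config_name: "main"
split: "test"
max_samples: 100
description: "GSM8K (Grade School Math 8K) dataset."
```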
**Derived Datasets.** Schema: `examples/math_example/conf/dataset/base_derived_dataset.yaml`; concrete example: `examples/math_example/conf/dataset/gsm8k_math_prompts.yaml`. Fields:
- **`_target_`** (Required): Must be `reward_kit.datasets.loader.load_derived_dataset`. Example: `_target_: reward_kit.datasets.loader.load_derived_dataset`
- **`base_dataset`** (Required): Either the name of a base dataset configuration (e.g., `"gsm8k"`, which would load `conf/dataset/gsm8k.yaml`) or a full inline base dataset configuration object. Example: `base_dataset: "gsm8k"`
- **`system_prompt`** (Optional): When using `evaluation_format`, this prompt is added as a `system_prompt` field alongside `user_query`. Default: `null`. Example (from `gsm8k_math_prompts.yaml`): `"Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."`
- **`output_format`** (Optional): The output format of the derived records. One of:
  - `"evaluation_format"`: Converts dataset records to include `user_query`, `ground_truth_for_eval`, and optionally `system_prompt` and `id`. This is the standard format for many evaluation scenarios.
  - `"conversation_format"`: (Not yet implemented) Converts to a list of messages.
  - `"jsonl"`: Keeps records in a format suitable for direct JSONL output (typically implies minimal transformation beyond base loading and initial mapping).

  Default: `"evaluation_format"`. Example: `output_format: "evaluation_format"`
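The `evaluation_format` conversion can be sketched as a small record transform. This is a simplified stand-in for reward-kit's `convert_to_evaluation_format` (the real function's signature and extra behavior are not documented here); the helper name and parameters below are assumptions:

```python
# Simplified sketch of an "evaluation_format" conversion:
# maps query/ground_truth to user_query/ground_truth_for_eval,
# optionally attaching system_prompt and id.
def to_evaluation_format(record, system_prompt=None, record_id=None):
    out = {
        "user_query": record["query"],
        "ground_truth_for_eval": record["ground_truth"],
    }
    if system_prompt is not None:
        out["system_prompt"] = system_prompt
    if record_id is not None:
        out["id"] = record_id
    return out

row = {"query": "What is 2 + 2?", "ground_truth": "4"}
print(to_evaluation_format(row, system_prompt="Solve the problem.", record_id="gsm8k-0"))
```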
- **`transformations`** (Optional): A list of additional transformations to apply (see `loader.py`). Default: `[]`
- **`derived_column_mapping`** (Optional): A column mapping applied before the `output_format` conversion. It can override or extend the base dataset's `column_mapping`; keys are the new column names, values are column names from the loaded base dataset. Default: `{}`. In `gsm8k_math_prompts.yaml`, the mapped columns (`query`, `ground_truth`) are then used by `convert_to_evaluation_format` to create `user_query` and `ground_truth_for_eval`.
- **`derived_max_samples`** (Optional): Overrides `max_samples` from the base dataset configuration for the purpose of this derived dataset. Default: `null`. Example: `derived_max_samples: 5`
- **`description`** (Optional, Metadata): A human-readable description. Example: `description: "GSM8K dataset with math-specific system prompt in evaluation format."`
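Combining the derived dataset fields, a configuration in the spirit of `examples/math_example/conf/dataset/gsm8k_math_prompts.yaml` might look like this sketch (the exact values in that file may differ):

```yaml
# Illustrative derived dataset config; field values other than
# _target_ are example choices, not necessarily those in
# gsm8k_math_prompts.yaml.
_target_: reward_kit.datasets.loader.load_derived_dataset
base_dataset: "gsm8k"
system_prompt: "Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."
output_format: "evaluation_format"
derived_max_samples: 5
description: "GSM8K dataset with math-specific system prompt in evaluation format."
```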
The `reward_kit.datasets.loader` module uses Hydra to instantiate the configured `_target_` function (`load_and_process_dataset` or `load_derived_dataset`) with the parameters defined in the YAML.
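The `_target_` mechanic can be sketched without Hydra: a dotted path string is split into a module path and an attribute, the module is imported, and the attribute is called with the remaining config fields. This pure-stdlib stand-in omits what Hydra's `instantiate` adds on top (recursive instantiation of nested configs, partials, etc.):

```python
# Pure-stdlib sketch of resolving a Hydra-style _target_ string
# (e.g. "reward_kit.datasets.loader.load_and_process_dataset")
# to a callable. Demonstrated with a stdlib function.
import importlib

def resolve_target(dotted_path):
    module_path, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)

fn = resolve_target("math.sqrt")
print(fn(9))
# 3.0
```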