# Dataset Configuration Guide
This guide explains the structure and fields used in YAML configuration files for datasets within the Reward Kit. These configurations are typically located in `conf/dataset/` or within an example's `conf/dataset/` directory (e.g., `examples/math_example/conf/dataset/`). They are processed by the `reward_kit.datasets.loader` module using Hydra.
There are two main types of dataset configurations: Base Datasets and Derived Datasets.
## 1. Base Dataset Configuration
A base dataset configuration defines the connection to a raw data source and performs initial processing such as column mapping.

**Example files:** `conf/dataset/base_dataset.yaml` (schema), `examples/math_example/conf/dataset/gsm8k.yaml` (concrete example)
Key Fields:
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this dataset.
  - Typical value: `reward_kit.datasets.loader.load_and_process_dataset`
  - Example: `_target_: reward_kit.datasets.loader.load_and_process_dataset`
- `source_type` (Required)
  - Description: Defines the type of the data source.
  - Supported values:
    - `"huggingface"`: for datasets hosted on the Hugging Face Hub.
    - `"jsonl"`: for local datasets in JSON Lines format.
    - `"fireworks"`: (not yet implemented) for datasets hosted on Fireworks AI.
  - Example: `source_type: huggingface`
- `path_or_name` (Required)
  - Description: Identifier for the dataset.
    - For `huggingface`: the Hugging Face dataset name (e.g., `"gsm8k"`, `"cais/mmlu"`).
    - For `jsonl`: the path to the `.jsonl` file (e.g., `"data/my_data.jsonl"`).
  - Example: `path_or_name: "gsm8k"`
- `split` (Optional)
  - Description: Specifies the dataset split to load (e.g., `"train"`, `"test"`, `"validation"`). If a Hugging Face `DatasetDict` is loaded, or multiple JSONL files are mapped via `data_files`, this selects the split after loading.
  - Default: `"train"`
  - Example: `split: "test"`
- `config_name` (Optional)
  - Description: For Hugging Face datasets with multiple configurations (e.g., `"main"` or `"all"` for `gsm8k`). Corresponds to the `name` parameter of Hugging Face's `load_dataset`.
  - Default: `null`
  - Example: `config_name: "main"` (for `gsm8k`)
- `data_files` (Optional)
  - Description: Used for loading local files (such as JSONL or CSV) with Hugging Face's `datasets.load_dataset`. Can be a single file path, a list of paths, or a dictionary mapping split names to file paths.
  - Example: `data_files: {"train": "path/to/train.jsonl", "test": "path/to/test.jsonl"}`
- `max_samples` (Optional)
  - Description: Maximum number of samples to load from the dataset (or from each split if a `DatasetDict` is loaded). If `null` or `0`, all samples are loaded.
  - Default: `null`
  - Example: `max_samples: 100`
- `column_mapping` (Optional)
  - Description: A dictionary that renames columns from the source dataset to a standard internal format. Keys are the new standard names (e.g., `"query"`, `"ground_truth"`), and values are the original column names in the source dataset. This mapping is applied by `reward_kit.datasets.loader`.
  - Default: `{"query": "query", "ground_truth": "ground_truth", "solution": null}`
  - Example: see `gsm8k.yaml`, which maps GSM8K's source columns to the standard names.
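As an illustrative sketch (assuming the source dataset exposes `question` and `answer` columns, as GSM8K does), such a mapping could be written as:

```yaml
column_mapping:
  query: "question"        # standard name <- original source column
  ground_truth: "answer"
```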
- `preprocessing_steps` (Optional)
  - Description: A list of strings, where each string is a Python import path to a preprocessing function (e.g., `"reward_kit.datasets.loader.transform_codeparrot_apps_sample"`). These functions are applied to the dataset after loading and before column mapping.
  - Default: `[]`
  - Example: `preprocessing_steps: ["my_module.my_preprocessor_func"]`
- `hf_extra_load_params` (Optional)
  - Description: A dictionary of extra parameters passed directly to Hugging Face's `datasets.load_dataset()` (e.g., `trust_remote_code: True`).
  - Default: `{}`
  - Example: `hf_extra_load_params: {trust_remote_code: True}`
- `description` (Optional, Metadata)
  - Description: A brief description of the dataset configuration, for documentation purposes.
  - Example: `description: "GSM8K (Grade School Math 8K) dataset."`
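Putting these fields together, a complete base dataset configuration might look like the following sketch (modeled on the fields above; the actual `gsm8k.yaml` in the repository may differ in its exact values, and the `question`/`answer` source column names are an assumption):

```yaml
# conf/dataset/gsm8k.yaml (illustrative sketch)
_target_: reward_kit.datasets.loader.load_and_process_dataset
source_type: huggingface
path_or_name: "gsm8k"
config_name: "main"
split: "train"
max_samples: null
column_mapping:
  query: "question"        # assumed source column for the problem text
  ground_truth: "answer"   # assumed source column for the reference answer
description: "GSM8K (Grade School Math 8K) dataset."
```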
## 2. Derived Dataset Configuration
A derived dataset configuration references a base dataset and applies further transformations, such as adding system prompts, changing the output format, or applying different column mappings or sample limits.

**Example files:** `examples/math_example/conf/dataset/base_derived_dataset.yaml` (schema), `examples/math_example/conf/dataset/gsm8k_math_prompts.yaml` (concrete example)
Key Fields:
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this derived dataset.
  - Typical value: `reward_kit.datasets.loader.load_derived_dataset`
  - Example: `_target_: reward_kit.datasets.loader.load_derived_dataset`
- `base_dataset` (Required)
  - Description: A reference to the base dataset configuration to derive from. This can be the name of another dataset configuration file (e.g., `"gsm8k"`, which would load `conf/dataset/gsm8k.yaml`) or a full inline base dataset configuration object.
  - Example: `base_dataset: "gsm8k"`
- `system_prompt` (Optional)
  - Description: A string used as the system prompt. In the `evaluation_format`, this prompt is added as a `system_prompt` field alongside `user_query`.
  - Default: `null`
  - Example (`gsm8k_math_prompts.yaml`): `"Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."`
- `output_format` (Optional)
  - Description: Specifies the final format for the derived dataset.
  - Supported values:
    - `"evaluation_format"`: converts dataset records to include `user_query`, `ground_truth_for_eval`, and optionally `system_prompt` and `id`. This is the standard format for many evaluation scenarios.
    - `"conversation_format"`: (not yet implemented) converts records to a list of messages.
    - `"jsonl"`: keeps records in a format suitable for direct JSONL output (typically minimal transformation beyond base loading and initial mapping).
  - Default: `"evaluation_format"`
  - Example: `output_format: "evaluation_format"`
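As an illustration, a single record in `evaluation_format` might look like this (the field names come from the description above; the values are invented):

```json
{
  "id": "gsm8k_test_0",
  "system_prompt": "Solve the following math problem. Show your work clearly.",
  "user_query": "What is 2 + 3?",
  "ground_truth_for_eval": "5"
}
```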
- `transformations` (Optional)
  - Description: A list of additional transformation functions applied after the base dataset is loaded and the initial derived processing (such as system prompt addition) is done. (Currently not fully implemented in `loader.py`.)
  - Default: `[]`
- `derived_column_mapping` (Optional)
  - Description: A dictionary for column mapping applied after the base dataset is loaded and before the `output_format` conversion. It can override or extend the base dataset's `column_mapping`. Keys are new names; values are column names from the loaded base dataset.
  - Default: `{}`
  - Example: see `gsm8k_math_prompts.yaml`. Note: the mapped columns (`query`, `ground_truth`) are then used by `convert_to_evaluation_format` to create `user_query` and `ground_truth_for_eval`.
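A sketch of what such a mapping could look like (the exact entries in `gsm8k_math_prompts.yaml` may differ; this assumes the loaded base dataset exposes `question` and `answer` columns):

```yaml
derived_column_mapping:
  query: "question"        # consumed by convert_to_evaluation_format as user_query
  ground_truth: "answer"   # consumed as ground_truth_for_eval
```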
- `derived_max_samples` (Optional)
  - Description: Maximum number of samples for this derived dataset. If specified, it overrides any `max_samples` from the base dataset configuration for the purposes of this derived dataset.
  - Default: `null`
  - Example: `derived_max_samples: 5`
- `description` (Optional, Metadata)
  - Description: A brief description of this derived dataset configuration.
  - Example: `description: "GSM8K dataset with math-specific system prompt in evaluation format."`
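Combining these fields, a derived dataset configuration might look like the following sketch (modeled on `gsm8k_math_prompts.yaml`; the actual file may differ):

```yaml
# conf/dataset/gsm8k_math_prompts.yaml (illustrative sketch)
_target_: reward_kit.datasets.loader.load_derived_dataset
base_dataset: "gsm8k"
system_prompt: "Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."
output_format: "evaluation_format"
derived_max_samples: 5
description: "GSM8K dataset with math-specific system prompt in evaluation format."
```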
## How Configurations are Loaded
The `reward_kit.datasets.loader` module uses Hydra to:

1. Compose these YAML configurations.
2. Instantiate the appropriate loader function (`load_and_process_dataset` or `load_derived_dataset`) with the parameters defined in the YAML.
3. The loader functions then use these parameters to fetch data (e.g., from Hugging Face or local files), apply mappings, execute preprocessing steps, and format the data as requested.
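Conceptually, the column-mapping step described above works like the following simplified sketch. This is not the actual loader code (which operates on Hugging Face `Dataset` objects); `apply_column_mapping` is a hypothetical helper used here only to illustrate the key/value convention of `column_mapping`:

```python
def apply_column_mapping(record: dict, column_mapping: dict) -> dict:
    """Rename a source record's columns to the standard internal names.

    Keys of column_mapping are the standard names (e.g. "query");
    values are the original column names in the source dataset.
    A value of None means the standard column has no source counterpart.
    """
    mapped = {}
    for standard_name, source_name in column_mapping.items():
        if source_name is not None and source_name in record:
            mapped[standard_name] = record[source_name]
    return mapped

# A GSM8K-style source record and the mapping from the sketch above.
source_record = {"question": "What is 2 + 3?", "answer": "5"}
mapping = {"query": "question", "ground_truth": "answer", "solution": None}
print(apply_column_mapping(source_record, mapping))
# {'query': 'What is 2 + 3?', 'ground_truth': '5'}
```

Note that entries mapped to `null` in YAML arrive as `None` in Python and are simply skipped, matching the `solution: null` default shown earlier.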