Reward-Driven Summarizer - RFT on Fireworks

Introduction

In this demo, we will show how thoughtful reward‑function design can steer a language model toward producing clear, 50‑token summaries that balance brevity with relevance. Using Fireworks’ reinforcement fine‑tuning workflow, you’ll see how adjusting a few well‑chosen signals can transform raw model outputs into reliable digests suitable for news briefs, chat recaps, and study notes, revealing along the way why defeating reward hacking is central to building trustworthy summarizers.

Goals

Every summarizer will look different. Let’s set up some goals:
  • Use llama-v3p1-8b-instruct to balance speed and model intelligence
  • Summaries should be under 50 tokens
  • Summaries should capture relevant information within a much larger text

Why Reinforcement Fine-Tune?

Reinforcement Fine‑Tuning augments standard supervised training by adding a reward signal that scores each model output after it is generated. Instead of optimizing only for next‑token likelihood, the model learns from these scores, gradually preferring strategies that maximize the reward and discarding those that do not. Traditional supervised fine‑tuning simply teaches a model to imitate example summaries, but it never checks whether the finished output actually satisfies our broader goals, like striking the right balance between brevity and substance. Reinforcement Fine‑Tuning adds a feedback step after each summary is generated, letting us reward outputs that hit that balance and discourage ones that don’t. Because we can adjust this feedback on the fly, RFT gives us a practical steering mechanism: tweak the reward, observe how the model adapts, and quickly converge on summaries that are both concise and informative. For this sort of summarization task, that end‑to‑end feedback loop is essential; imitation alone can’t capture the nuanced trade‑offs we care about. For more information on RFT on the Fireworks platform and when to use it, take a look at our examples on Knowledge Distillation.

Setup & Utils

If you haven’t already, head to https://fireworks.ai/, make an account, and grab an API key - you’ll need one for this demo.
!pip install --upgrade fireworks-ai reward-kit rouge-score transformers torch
# Imports
from reward_kit import reward_function, EvaluateResult, MetricResult
from typing import List, Dict, Optional
from fireworks import LLM
from rouge_score import rouge_scorer
from transformers import AutoTokenizer, AutoModelForCausalLM  # used by the fluency reward later
import math, torch
import os

FIREWORKS_API_KEY = os.getenv("FIREWORKS_API_KEY")

# Set Up Client
llm = LLM(
  model="llama-v3p1-8b-instruct",
  id="my-deployment-id",
  deployment_type="on-demand", # Can only fine-tune a dedicated deployment
  precision="FP8",
  accelerator_type="NVIDIA_H100_80GB",
) 

Initial Test

Before we touch any fine-tuning or reward functions, we first run the task with an off‑the‑shelf model and record how its raw summaries perform. This baseline reveals the model’s natural tendencies—what it captures well, what it omits, and where it drifts from our goals. Let’s define a system prompt:
sys_prompt = """
Your job is to read a long document and produce a single, fluent English paragraph ≤ 50 GPT-2 tokens that captures the document’s four most important facts.

Rules you must obey for every response

1. Token limit – maximum 50 tokens
2. Importance – include the most critical factual points; leave out everything else.
3. No PII – never output emails, phone numbers, SSNs, or other personally identifying strings that may occur in the input.
4. Fluency – write clear, grammatical English in a single paragraph.
5. Output only the paragraph – no explanations, bullet lists, or metadata.

If the rules conflict, the priority is: Length > Coverage > No PII > Fluency.
"""
And try a sample document (I’m using a news article):
long_document = """
MONTEREY PARK (KABC) -- Authorities are investigating an apparent explosion at an LASD training facility in Monterey Park where at least three deputies were killed.

The incident was reported just before 7:30 a.m. Friday at what looked to be LASD's SEB compound, which houses the sheriff's department's special enforcement units and bomb squad.

It appears the Sheriff's Enforcement Bureau personnel were handling some kind of explosives when there was a blast, according to preliminary information from sources. Three deputies were killed in the incident.

The Los Angeles County Fire Department responded to the scene. It's unclear if there were any other injuries.

It is believed to have been an accident. More is expected soon from the sheriff.

There were no other details immediately available regarding this incident.

L.A. City Mayor Karen Bass confirmed that the LAPD bomb squad is responding to the scene and assisting with the incident.

Governor Gavin Newsom's Office said the governor has been briefed on the apparent explosion and that the Governor's Office of Emergency Services is in contact with LASD while closely monitoring the situation.

L.A. County Supervisor Kathryn Barger issued the following statement regarding the deadly incident:

"I am heartbroken to hear of the terrible tragedy that has unfolded today at an L.A. County Sheriff's Department facility. I am closely tracking the situation as we learn more about what occurred and the condition of those affected. My heart is heavy, and my thoughts are with the brave men and women of the Sheriff's Department during this difficult time. We stand with them and their families as they navigate the hours and days ahead."

L.A. County Supervisor Hilda Solis also issued a statement:

"I am deeply saddened by the tragic incident that occurred this morning at the Los Angeles County Sheriff's Department Biscailuz Training Academy in East Los Angeles. My heart goes out to the families, friends, and colleagues of the three individuals who lost their lives in what appears to have been a devastating explosion. I am in contact with Sheriff Robert Luna and closely monitoring the situation as we await further details. My thoughts are with all those grieving and the first responders who are on the scene."

The FBI and ATF responded to the scene, according to a post from U.S. Attorney General Pam Bondi posted on X.

"Our federal agents are at the scene and we are working to learn more. Please pray for the families of the sheriff's deputies killed," the post said.
"""
response = llm.chat.completions.create(
    messages=[
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": long_document}
    ],
    max_tokens=100, # can't set it to 50 as the model might just stop in the middle of a sentence

)

print(response.choices[0].message.content)

"An apparent explosion occurred at the Los Angeles County Sheriff's Department training facility in Monterey Park, resulting in the deaths of at least three deputies. The incident occurred during a training exercise involving explosives, and authorities are investigating the cause as an accident. The Los Angeles County Fire Department and LAPD bomb squad responded to the scene, with the FBI and ATF also arriving to assist. L.A. County Supervisors Kathryn Barger and Hilda Solis expressed their condolences to the families of the victims, while Governor"
Pretty clear that the “summary” is hardly concise and simply starts copying the input text after a little bit, even though we told it in the system prompt to limit itself to 50 tokens. Not what we want from a summary. To get around this, we’ll need to fine-tune our model. To understand the fundamentals of RFT and how the Fireworks platform makes it easy, check out our course on Knowledge Distillation. We’ll need to set up a reward function that gives the Fireworks trainer a signal on how good a certain response is. It’s our job to figure out what “good” means. Let’s get started!
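Since the system prompt defines the budget in GPT-2 tokens, it helps to have a quick way to measure it. Below is a minimal sketch using the GPT-2 tokenizer from transformers; the helper name gpt2_token_len is ours, and the reward functions later in this demo fall back to a simpler whitespace word count as a cheap proxy.
# Count GPT-2 tokens so we can quantify how far past the budget the baseline runs
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

def gpt2_token_len(text: str) -> int:
    return len(gpt2_tok.encode(text))

baseline_summary = response.choices[0].message.content
print(gpt2_token_len(baseline_summary))  # well above the 50-token budget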

Part 1: Teach brevity (Length Gate)

Our opening baseline is a binary “length‑only” reward: a summary earns full credit if it stays within the token budget and zero otherwise. This simple gate makes it crystal‑clear to the model that excess verbosity is unacceptable.
def token_len(txt: str) -> int:
    # Cheap proxy for the 50-token budget: count whitespace-separated words
    return len(txt.strip().split())

def extract_summary(msgs: List[Dict]) -> Optional[str]:
    # Return the content of the last assistant message that isn't a tool call
    for m in reversed(msgs):
        if m.get("role") == "assistant" and not m.get("tool_calls"):
            return m.get("content", "").strip()
    return None
@reward_function
def length_gate_only(
    messages:           List[Dict[str, str]],
    original_messages:  Optional[List[Dict[str, str]]] = None,
    **kwargs,
) -> EvaluateResult:

    summary = extract_summary(messages)

    if summary is None:
        return EvaluateResult(
            score   = 0.0,
            reason  = "parse error",
            metrics = {"token_len": MetricResult(0, False, "parse error")},
            error   = "parse_error",
        )

    tok_len = token_len(summary)
    if tok_len > 50:
        return EvaluateResult(
            score   = 0.0,
            reason  = f"length {tok_len} > 50 tokens",
            metrics = {"token_len": MetricResult(tok_len, False, str(tok_len))},
        )

    return EvaluateResult(
        score   = 1.0,
        reason  = f"length {tok_len} tokens (within limit)",
        metrics = {"token_len": MetricResult(tok_len, True,  str(tok_len))},
    )
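Before wiring this into training, a quick local sanity check helps confirm the gate behaves as intended (a sketch, assuming the decorated evaluator can still be called directly with a messages list and returns an EvaluateResult):
# Local sanity check: a short reply passes the gate, an over-long one scores 0
short_reply = [
    {"role": "system", "content": sys_prompt},
    {"role": "user", "content": long_document},
    {"role": "assistant", "content": "An explosion at an LASD training site killed three deputies."},
]
result = length_gate_only(messages=short_reply)
print(result.score, result.reason)  # expect 1.0: well under the 50-token limit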
Drop this evaluator into Fireworks’ RFT pipeline, point it at your dataset, and you’ll immediately force the model to tighten its summaries. Taking a look at a sample output, we see the following issue:
"An explosion at an LASD training site killed 3 deputies."
The model learned that it can output very short “summaries” and achieve very high rewards. We’ll need to iterate on our reward function again.

Part 2: Reward substance (ROUGE-L)

Once the model has learned that shorter is better, we need to remind it that substance still counts. The second evaluator rewards each summary according to how much of the source document’s wording it captures. A quick overlap measure—ROUGE‑L—is enough to push the policy toward mentioning the main ideas instead of trimming indiscriminately.
# One global Rouge scorer – re-use for speed
_ro = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_recall(pred: str, ref: str) -> float:
    # RougeScorer.score(target, prediction) expects the reference first;
    # recall then measures how much of the reference the summary covers
    return _ro.score(ref, pred)["rougeL"].recall

def extract_doc(orig: List[Dict]) -> Optional[str]:
    # The source document is the user turn just before the assistant reply
    return orig[-2].get("content", "").strip() if orig else None

@reward_function
def summary_reward_v2_doc(
    messages:          List[Dict[str, str]],
    original_messages: Optional[List[Dict[str, str]]] = None,
    **kwargs,
) -> EvaluateResult:

    summary = extract_summary(messages)
    doc     = extract_doc(original_messages)

    if summary is None or doc is None:
        return EvaluateResult(0.0, "parse error",
                              {"coverage": MetricResult(0, False, "parse")},
                              error="parse_error")

    if token_len(summary) > 50:
        tl = token_len(summary)
        return EvaluateResult(0.0, f"length {tl} > 50",
                              {"coverage": MetricResult(0, False, "too long"),
                               "token_len": MetricResult(tl, False, str(tl))})

    cov = rouge_recall(summary, doc)    # 0–1

    return EvaluateResult(round(cov, 4),
                          f"Rouge-L recall {cov:.2f}",
                          {"coverage": MetricResult(cov, cov > .7, f"{cov:.2f}"),
                           "token_len": MetricResult(token_len(summary), True, str(token_len(summary)))})
Running it through the Fireworks RFT pipeline shows that summaries regain essential details, an important counter-balance to the length gate we implemented earlier.
"Three LASD deputies dead explosion at SEB training place, maybe accident with explosives, unclear if more hurt, FBI ATF LAPD there, waiting sheriff talk more."
This reads much better than before, but it still reads like a bullet mash‑up—missing verbs, punctuation, and time context—so clarity and polish are next on the fix‑list.

Part 3: Focus on key facts (Bullet Recall)

Our third evaluator narrows the comparison window from the entire source document to a curated bullet list of key facts. Pure document‑level ROUGE can reward nonsense phrases that merely echo scattered words; by contrast, scoring against a focused checklist forces the model to mention the specific points humans actually care about. The downside is cost: generating high‑quality bullet lists requires annotation by humans or a much larger LLM. For example, a bullet‑point list for our news article might look like the following:
[
"An explosion occurred at the LASD Special Enforcement Bureau (SEB) training facility in Monterey Park around 7:30 a.m.",
"Three sheriff’s deputies were killed, reportedly while handling explosives; cause appears accidental.",
"FBI, ATF, LAPD bomb squad, and L.A. County Fire responded; further injuries are unconfirmed.",
"Officials including Governor Newsom and Supervisors Barger and Solis issued condolences; more details pending from Sheriff Luna.",
]
Let’s enhance our dataset by adding this list and start writing our reward function. We’ll keep the parts we’ve developed so far and build on them.
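For reference, a training example might then look something like the sketch below. The exact row format depends on how you prepare your dataset for the Fireworks RFT job; the key point is that the bullets ride along on the last original message, which is where the helper we define next reads them from.
# Hypothetical dataset row: the bullets travel with the user turn that holds the document
example_row = {
    "messages": [
        {"role": "system", "content": sys_prompt},
        {
            "role": "user",
            "content": long_document,
            "bullets": [
                "An explosion occurred at the LASD Special Enforcement Bureau (SEB) training facility in Monterey Park around 7:30 a.m.",
                "Three sheriff's deputies were killed, reportedly while handling explosives; cause appears accidental.",
                "FBI, ATF, LAPD bomb squad, and L.A. County Fire responded; further injuries are unconfirmed.",
                "Officials including Governor Newsom and Supervisors Barger and Solis issued condolences; more details pending from Sheriff Luna.",
            ],
        },
    ],
}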
def extract_bullets(orig: List[Dict]) -> Optional[List[str]]:
    # Bullets are carried on the last original message under a "bullets" key
    return orig[-1].get("bullets") if orig else None

@reward_function
def summary_reward_v3_bullets(
    messages:          List[Dict[str, str]],
    original_messages: Optional[List[Dict[str, str]]] = None,
    **kwargs,
) -> EvaluateResult:

    summary = extract_summary(messages)
    bullets = extract_bullets(original_messages)

    if summary is None or bullets is None:
        return EvaluateResult(0.0, "parse error",
                              {"coverage": MetricResult(0, False, "parse")},
                              error="parse_error")

    if token_len(summary) > 50:
        tl = token_len(summary)
        return EvaluateResult(0.0, f"length {tl} > 50",
                              {"coverage": MetricResult(0, False, "too long"),
                               "token_len": MetricResult(tl, False, str(tl))})

    joined = "\n".join(bullets)
    cov    = rouge_recall(summary, joined)

    return EvaluateResult(round(cov, 4),
                          f"Rouge-L recall {cov:.2f}",
                          {"coverage": MetricResult(cov, cov > .7, f"{cov:.2f}"),
                           "token_len": MetricResult(token_len(summary), True, str(token_len(summary)))})
Once again, let’s run it through our pipeline and get a sample result:
"Three LASD deputies died in a likely accidental blast at SEB facility. FBI, ATF, LAPD responded. Officials expressed condolences. Details from Sheriff awaited."
By rewarding matches to these distilled key facts, the model learns to deliver summaries that are short and on-point—no more empty verbiage, far fewer hallucinations. It looks a lot better than when we first started. We could reasonably stop here—the summaries are now short and reliably cover the must‑know facts—but let’s push one step further.

Advanced Reward: Polish style (Fluency)

With essentials and length under control, the last step is polish: we combine the bullet‑coverage score with a fluency bonus (low perplexity from a small GPT‑2 scorer). The two scores are blended (here via a geometric mean), so with reward-kit you can dial emphasis toward clarity or content with a one‑line change.
# GPT-2 tiny fluency model (load once)
_tok  = AutoTokenizer.from_pretrained("gpt2")
_gpt2 = AutoModelForCausalLM.from_pretrained("gpt2"); _gpt2.eval()

def fluency(text: str) -> float:
    with torch.no_grad():
        ids  = _tok(text, return_tensors="pt").input_ids
        loss = _gpt2(ids, labels=ids).loss.item()
    return max(0.0, min(1.0, 1 - (loss - 2) / 8))   # maps loss ≈2-10 → score 1-0
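A quick spot check of the helper (illustrative only; exact values depend on GPT-2's loss for each string):
# Telegraphic fragments tend to incur higher GPT-2 loss than fluent prose,
# so they map to a lower fluency score
print(fluency("Three deputies were killed in an explosion at an LASD training facility."))
print(fluency("Three LASD deputies dead explosion SEB training place maybe accident."))
# expect the first, fluent sentence to score higher than the second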

@reward_function
def summary_reward_final(
    messages:          List[Dict[str, str]],
    original_messages: Optional[List[Dict[str, str]]] = None,
    **kwargs,
) -> EvaluateResult:

    summary = extract_summary(messages)
    bullets = extract_bullets(original_messages)

    if summary is None or bullets is None:
        return EvaluateResult(0.0, "parse error",
                              {"coverage": MetricResult(0, False, "parse"),
                               "fluency" : MetricResult(0, False, "parse")},
                              error="parse_error")

    if token_len(summary) > 50:
        tl = token_len(summary)
        return EvaluateResult(0.0, f"length {tl} > 50",
                              {"coverage": MetricResult(0, False, "too long"),
                               "fluency" : MetricResult(0, False, "too long"),
                               "token_len": MetricResult(tl, False, str(tl))})

    cov = max(0.05, rouge_recall(summary, "\n".join(bullets)))
    fl  = max(0.05, fluency(summary))
    score = math.sqrt(cov * fl)

    return EvaluateResult(round(score, 4),
                          f"cov={cov:.2f}, flu={fl:.2f}",
                          {"coverage": MetricResult(cov, cov > .7, f"{cov:.2f}"),
                           "fluency" : MetricResult(fl,  fl > .7, f"{fl:.2f}"),
                           "token_len": MetricResult(token_len(summary), True, str(token_len(summary)))})
This blended signal nudges the model to mention every must‑know bullet and read naturally, giving us crisp, on‑topic summaries with human‑friendly flow—our final polish after the earlier length and coverage stages. Here’s an output:
"An explosion at LASD’s SEB facility killed three deputies during explosives training. FBI, ATF, and LAPD responded. Officials offered condolences, and further details are expected from Sheriff Luna as the investigation continues into the apparent accident."
Exactly 47 tokens! It names the location, casualties, training context, responding agencies, public response, and the pending investigation—all in polished, complete sentences with no filler.
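If you want to shift the emphasis rather than weight coverage and fluency equally, swap the geometric mean for a weighted one. The weights below are hypothetical knobs, not values we tuned:
# Weighted geometric mean: raise the exponent on the signal you care about more
W_COV, W_FLU = 0.7, 0.3   # hypothetical weights; keep them summing to 1

def blended_score(cov: float, fl: float) -> float:
    return (cov ** W_COV) * (fl ** W_FLU)

# score = blended_score(cov, fl)   # drop-in replacement for math.sqrt(cov * fl)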

Takeaways

By walking a plain language model through four reward tweaks—length gate, document overlap, key‑bullet focus, and a final fluency blend—we steered it into a dependable 50‑token summarizer. Each change showed, in minutes, how the model bends to whatever signal we supply, thanks to the lightweight evaluator‑swap workflow built into Fireworks’ RFT platform.
  1. A model follows its incentives, not your intentions. Define the right reward and you steer behaviour directly; leave gaps and the model finds them.
  2. Start simple, then layer complexity. A binary length check exposed verbosity problems instantly; later signals refined relevance and style.
  3. End‑to‑end feedback beats imitation alone. Rewarding the full output captures goals that token‑level training can’t touch.
The exercise also showed how quickly you can iterate when evaluators are first‑class citizens: swap one in, rerun, and immediately trace the effect. Keep that loop handy, keep the reward honest, and your models will do exactly what you ask—nothing more, nothing less. That’s the demo — let the summaries speak for themselves.