In this demo, we will show how thoughtful reward-function design can steer a language model toward clear, 50-token summaries that balance brevity with relevance. Using Fireworks' reinforcement-fine-tuning workflow, you'll see how a few well-chosen signals can transform raw model outputs into reliable digests suitable for news briefs, chat recaps, and study notes, revealing along the way why defeating reward hacking is central to building trustworthy summarizers.
Reinforcement Fine-Tuning (RFT) augments standard supervised training by adding a reward signal that scores each model output after it is generated. Instead of optimizing only for next-token likelihood, the model learns from these scores, gradually preferring strategies that maximize the reward and discarding those that do not.

Traditional supervised fine-tuning teaches a model to imitate example summaries, but it never checks whether the finished output actually satisfies our broader goals, like striking the right balance between brevity and substance. RFT adds a feedback step after each summary is generated, letting us reward outputs that hit that balance and discourage ones that don't. Because we can adjust this feedback on the fly, RFT gives us a practical steering mechanism: tweak the reward, observe how the model adapts, and quickly converge on summaries that are both concise and informative. For this sort of summarization task, that end-to-end feedback loop is essential; imitation alone can't capture the nuanced trade-offs we care about.

For more information on RFT on the Fireworks platform and when to use it, take a look at our examples on Knowledge Distillation.
Before we touch any fine-tuning or reward functions, we first run the task with an off-the-shelf model and record how its raw summaries perform. This baseline reveals the model's natural tendencies: what it captures well, what it omits, and where it drifts from our goals.

Let's define a system prompt:
```python
sys_prompt = """Your job is to read a long document and produce a single, fluent
English paragraph ≤ 50 GPT-2 tokens that captures the document's four most
important facts.

Rules you must obey for every response
1. Token limit – maximum 50 tokens
2. Importance – include the most critical factual points; leave out everything else.
3. No PII – never output emails, phone numbers, SSNs, or other personally
   identifying strings that may occur in the input.
4. Fluency – write clear, grammatical English in a single paragraph.
5. Output only the paragraph – no explanations, bullet lists, or metadata.

If the rules conflict, the priority is: Length > Coverage > No PII > Fluency."""
```
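Since the budget is defined in GPT-2 tokens rather than words, it helps to measure it the same way the prompt does. Here's a minimal sketch using the `tiktoken` library (our choice for illustration; any GPT-2 BPE tokenizer works):

```python
import tiktoken

# GPT-2 BPE encoding, matching the token budget stated in the system prompt
enc = tiktoken.get_encoding("gpt2")

def gpt2_token_len(text: str) -> int:
    """Count GPT-2 tokens the same way the 50-token budget is defined."""
    return len(enc.encode(text))

print(gpt2_token_len("Three deputies were killed in an explosion."))  # prints a small integer
```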
And try a sample document (I’m using a news article):
```python
long_document = """MONTEREY PARK (KABC) -- Authorities are investigating an apparent explosion at an LASD training facility in Monterey Park where at least three deputies were killed.

The incident was reported just before 7:30 a.m. Friday at what looked to be LASD's SEB compound, which houses the sheriff's department's special enforcement units and bomb squad.

It appears the Sheriff's Enforcement Bureau personnel were handling some kind of explosives when there was a blast, according to preliminary information from sources. Three deputies were killed in the incident.

The Los Angeles County Fire Department responded to the scene. It's unclear if there were any other injuries.

It is believed to have been an accident. More is expected soon from the sheriff.

There were no other details immediately available regarding this incident.

L.A. City Mayor Karen Bass confirmed that the LAPD bomb squad is responding to the scene and assisting with the incident.

Governor Gavin Newsom's Office said the governor has been briefed on the apparent explosion and that the Governor's Office of Emergency Services is in contact with LASD while closely monitoring the situation.

L.A. County Supervisor Kathryn Barger issued the following statement regarding the deadly incident:

"I am heartbroken to hear of the terrible tragedy that has unfolded today at an L.A. County Sheriff's Department facility. I am closely tracking the situation as we learn more about what occurred and the condition of those affected. My heart is heavy, and my thoughts are with the brave men and women of the Sheriff's Department during this difficult time. We stand with them and their families as they navigate the hours and days ahead."

L.A. County Supervisor Hilda Solis also issued a statement:

"I am deeply saddened by the tragic incident that occurred this morning at the Los Angeles County Sheriff's Department Biscailuz Training Academy in East Los Angeles. My heart goes out to the families, friends, and colleagues of the three individuals who lost their lives in what appears to have been a devastating explosion. I am in contact with Sheriff Robert Luna and closely monitoring the situation as we await further details. My thoughts are with all those grieving and the first responders who are on the scene."

The FBI and ATF responded to the scene, according to a post from U.S. Attorney General Pam Bondi posted on X.

"Our federal agents are at the scene and we are working to learn more. Please pray for the families of the sheriff's deputies killed," the post said."""
```
```python
# `llm` is assumed to be an OpenAI-compatible chat client pointed at
# Fireworks' inference endpoint.
response = llm.chat.completions.create(
    messages=[
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": long_document},
    ],
    # Can't set this to 50, as the model might just stop in the middle of a sentence.
    max_tokens=100,
)
print(response.choices[0].message.content)
```
"An apparent explosion occurred at the Los Angeles County Sheriff's Department training facility in Monterey Park, resulting in the deaths of at least three deputies. The incident occurred during a training exercise involving explosives, and authorities are investigating the cause as an accident. The Los Angeles County Fire Department and LAPD bomb squad responded to the scene, with the FBI and ATF also arriving to assist. L.A. County Supervisors Kathryn Barger and Hilda Solis expressed their condolences to the families of the victims, while Governor"
It's pretty clear that this "summary" is hardly concise: after a little bit it simply starts copying the input text, even though we told it in the system prompt to limit itself to 50 tokens. Not what we want from a summary.

To get around this, we'll need to fine-tune our model. To understand the fundamentals of RFT and how the Fireworks platform makes it easy, check out our course on Knowledge Distillation.

We'll need to set up a reward function that gives the Fireworks training kernel a signal on how good a certain response is. It's our job to figure out what "good" means. Let's get started!
Our opening reward is a binary, length-only gate: a summary earns full credit if it stays within the token budget and zero otherwise. This simple gate makes it crystal clear to the model that excess verbosity is unacceptable.
```python
from typing import Dict, List, Optional

def token_len(txt: str) -> int:
    # Rough whitespace proxy for the GPT-2 token budget; swap in
    # gpt2_token_len() from above for an exact count.
    return len(txt.strip().split())

def extract_summary(msgs: List[Dict]) -> Optional[str]:
    """Pull the final assistant message (the summary) out of a rollout."""
    for m in reversed(msgs):
        if m.get("role") == "assistant" and not m.get("tool_calls"):
            return (m.get("content") or "").strip()
    return None
```
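Building on these helpers, here's a minimal sketch of the binary gate itself, written as a plain Python function (the exact evaluator signature the Fireworks pipeline expects may differ):

```python
MAX_TOKENS = 50

def length_reward(msgs: List[Dict]) -> float:
    """Binary length gate: full credit within budget, zero otherwise."""
    summary = extract_summary(msgs)
    if not summary:
        return 0.0
    return 1.0 if token_len(summary) <= MAX_TOKENS else 0.0
```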
Drop this evaluator into Fireworks’ RFT pipeline, point it at your dataset, and you’ll immediately force the model to tighten its summaries. Taking a look at a sample output, we see the following issue:
"An explosion at an LASD training site killed 3 deputies."
The model learned that it can output very short “summaries” and achieve very high rewards. We’ll need to iterate on our reward function again.
Once the model has learned that shorter is better, we need to remind it that substance still counts. The second evaluator rewards each summary according to how much of the source document’s wording it captures. A quick overlap measure—ROUGE‑L—is enough to push the policy toward mentioning the main ideas instead of trimming indiscriminately.
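As a sketch of what that evaluator could look like, here's a version using the open-source `rouge-score` package (our choice for illustration; any ROUGE-L implementation would do), with the earlier length gate kept in place so brevity still matters:

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def coverage_reward(msgs: List[Dict], document: str) -> float:
    """ROUGE-L recall against the source document, gated by the length check."""
    summary = extract_summary(msgs)
    if not summary or token_len(summary) > MAX_TOKENS:
        return 0.0
    # Recall: how much of the document's wording the summary captures.
    # Absolute values are small against a long document, but the relative
    # signal is what pushes the policy toward mentioning the main ideas.
    return _scorer.score(document, summary)["rougeL"].recall
```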
Running it through the Fireworks RFT pipeline shows that summaries regain essential details, an important counterbalance to the brevity score we implemented earlier.
"Three LASD deputies dead explosion at SEB training place, maybe accident with explosives, unclear if more hurt, FBI ATF LAPD there, waiting sheriff talk more."
This reads much better than before, but it still reads like a bullet mash‑up—missing verbs, punctuation, and time context—so clarity and polish are next on the fix‑list.
Our third evaluator narrows the comparison window from the entire source document to a curated bullet list of key facts. Pure document-level ROUGE can reward nonsense phrases that merely echo scattered words; by contrast, scoring against a focused checklist forces the model to mention the specific points humans actually care about. The downside is cost: generating high-quality bullet lists requires annotation by humans or a much larger LLM.

For example, a bullet-point list for our running example might look like the following:
["An explosion occurred at the LASD Special Enforcement Bureau (SEB) training facility in Monterey Park around 7:30 a.m.","Three sheriff’s deputies were killed, reportedly while handling explosives; cause appears accidental.","FBI, ATF, LAPD bomb squad, and L.A. County Fire responded; further injuries are unconfirmed.","Officials including Governor Newsom and Supervisors Barger and Solis issued condolences; more details pending from Sheriff Luna.",]
Let's enhance our dataset by adding this list and start writing our reward function, keeping the parts we've developed so far and building on them; a sketch follows below.
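Here's a minimal sketch of that checklist-based evaluator, assuming each dataset row carries a `key_facts` list like the one above: it averages per-bullet ROUGE-L scores so the summary must touch every key fact, and it reuses the length gate from before.

```python
def keyfact_reward(msgs: List[Dict], key_facts: List[str]) -> float:
    """Average per-bullet ROUGE-L F1 against the curated key facts."""
    summary = extract_summary(msgs)
    if not summary or token_len(summary) > MAX_TOKENS:
        return 0.0
    scores = [
        _scorer.score(fact, summary)["rougeL"].fmeasure
        for fact in key_facts
    ]
    # Mean across bullets: scoring well requires mentioning every key fact,
    # not just echoing scattered words from the document.
    return sum(scores) / len(scores)
```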
Once again, let’s run it through our pipeline and get a sample result:
"Three LASD deputies died in a likely accidental blast at SEB facility. FBI, ATF, LAPD responded. Officials expressed condolences. Details from Sheriff awaited."
By rewarding matches to these distilled key facts, the model learns to deliver summaries that are short and on-point: no more empty verbiage, far fewer hallucinations. It looks a lot better than when we first started. We could reasonably stop here, since the summaries are now short and reliably cover the must-know facts, but let's push one step further.
With essentials and length under control, the last step is polish: we combine the bullet-coverage score with a fluency bonus (low perplexity under a tiny GPT-2 scorer). The reward is a weighted average, so you can dial the emphasis toward clarity or content with one line of code through the use of reward-kit.
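Here's a sketch of that blend, using Hugging Face `transformers` for the GPT-2 scorer and a perplexity-to-score squash of our own choosing (reward-kit wraps evaluators like this for the pipeline; the exact wiring is omitted here):

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

_tok = GPT2TokenizerFast.from_pretrained("gpt2")
_gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def fluency_score(text: str) -> float:
    """Map GPT-2 perplexity to (0, 1]: lower perplexity means higher score."""
    ids = _tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = _gpt2(ids, labels=ids).loss  # mean next-token cross-entropy
    ppl = math.exp(loss.item())
    return 1.0 / (1.0 + math.log(ppl))  # simple squash, chosen for illustration

def blended_reward(msgs: List[Dict], key_facts: List[str],
                   w_coverage: float = 0.7) -> float:
    """Weighted average of key-fact coverage and fluency, length-gated."""
    summary = extract_summary(msgs)
    if not summary or token_len(summary) > MAX_TOKENS:
        return 0.0
    coverage = keyfact_reward(msgs, key_facts)
    return w_coverage * coverage + (1 - w_coverage) * fluency_score(summary)
```

Adjusting `w_coverage` is the one-line dial mentioned above: raise it to favor content, lower it to favor polish.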
This blended signal nudges the model to mention every must-know bullet and read naturally, giving us crisp, on-topic summaries with human-friendly flow: our final polish after the earlier length and coverage stages. Here's an output:
"An explosion at LASD’s SEB facility killed three deputies during explosives training. FBI, ATF, and LAPD responded. Officials offered condolences, and further details are expected from Sheriff Luna as the investigation continues into the apparent accident."
Exactly 47 tokens! It names the location, casualties, training context, responding agencies, public response, and the pending investigation—all in polished, complete sentences with no filler.
By walking a plain language model through four reward tweaks—length gate, document overlap, key‑bullet focus, and a final fluency blend—we steered it into a dependable 50‑token summarizer. Each change showed, in minutes, how the model bends to whatever signal we supply, thanks to the lightweight evaluator‑swap workflow built into Fireworks’ RFT platform.
- A model follows its incentives, not your intentions. Define the right reward and you steer behaviour directly; leave gaps and the model finds them.
- Start simple, then layer complexity. A binary length check exposed verbosity problems instantly; later signals refined relevance and style.
- End-to-end feedback beats imitation alone. Rewarding the full output captures goals that token-level training can't touch.
The exercise also showed how quickly you can iterate when evaluators are first-class citizens: swap one in, rerun, and immediately trace the effect. Keep that loop handy, keep the reward honest, and your models will do exactly what you ask: nothing more, nothing less.

That's the demo. Let the summaries speak for themselves.