Self-Distillation Enables Continual Learning [pdf]

(arxiv.org)

20 points | by teleforce 3 hours ago

5 comments

  • ArchieScrivener an hour ago

    From Jan 2026.

    This is very interesting:

    "Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x , the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."

  • airstrike 2 hours ago

    Both title and abstract feel a little too confident, which ironically makes me more skeptical rather than less.

    I find the choice of the words "enable" in the title and "establishing" at the end of the abstract to be particularly jarring.

  • greesil an hour ago

    Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

    Gemini tells me it's the probability of the next token for an LLM. Okay then.
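
    That is a fair gloss, for what it's worth: in the RL framing the "state" is the token prefix, the "action" is the next token, and the policy is just the model's conditional distribution over the vocabulary. A rough sketch of that reading (the model choice is arbitrary, purely for illustration):

      # An LLM read as a policy pi(action | state):
      # state = token prefix, action = next token, policy = softmax over the vocab.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")   # any causal LM works here
      lm = AutoModelForCausalLM.from_pretrained("gpt2")

      state = tok("Call the weather API for", return_tensors="pt").input_ids
      logits = lm(state).logits[0, -1]              # scores for the next token
      policy = torch.softmax(logits, dim=-1)        # pi(a | s): distribution over tokens
      action = torch.multinomial(policy, 1)         # sampling = acting under the policy
      print(tok.decode(action))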

    • mountainriver 39 minutes ago

      What is this comment? It’s an RL paper; these are standard RL terms.

      • greesil 34 minutes ago

        It's a comment. On Hacker News. Not the RL subreddit, or whatever. I'm just amazed at the jargon. I'm sure it's useful, but one could just call it model output.