Self-Distillation Enables Continual Learning [pdf]

(arxiv.org)

20 points | by teleforce 3 hours ago

5 comments

  • ArchieScrivener an hour ago

    From Jan 2026.

    This is very interesting:

    "Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x , the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."

  • airstrike 2 hours ago

    Both title and abstract feel a little too confident, which ironically makes me more skeptical rather than less.

    I find the choice of the words "enable" in the title and "establishing" at the end of the abstract to be particularly jarring.

  • greesil an hour ago

    Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

    Gemini tells me it's the probability of the next token for an LLM. Okay then.
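
    That is a fair gloss, for what it's worth: in the RL framing the "state" is the token prefix, the "action" is the next token, and the policy is just the model's conditional distribution over the vocabulary. A rough sketch of that reading (the model choice is arbitrary, purely for illustration):

      # An LLM read as a policy pi(action | state):
      # state = token prefix, action = next token, policy = softmax over the vocab.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")   # any causal LM works here
      lm = AutoModelForCausalLM.from_pretrained("gpt2")

      state = tok("Call the weather API for", return_tensors="pt").input_ids
      logits = lm(state).logits[0, -1]              # scores for the next token
      policy = torch.softmax(logits, dim=-1)        # pi(a | s): distribution over tokens
      action = torch.multinomial(policy, 1)         # sampling = acting under the policy
      print(tok.decode(action))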

    • mountainriver 39 minutes ago

      What is this comment? It’s an RL paper; these are standard RL terms.

      • greesil 34 minutes ago

        It's a comment. On Hacker News. Not the RL subreddit, or whatever. I'm just amazed at the jargon. I'm sure it's useful, but one could just call it model output.