Show HN: Marlin-2B: a tiny VLM to extract structured information from videos

(huggingface.co)

4 points | by HappyPablo 8 hours ago

2 comments

teamcubitflow 7 hours ago

I'm surprised that kind of captioning came from a 2B model; glad the fine-tuning process actually shows a deliberate approach to making Qwen 3.5 into essentially a new model of its kind.

- HappyPablo 7 hours ago
  
  hey this is Shubham, yeah Qwen3.5VL is awesome and its training vocab is quite strong, so with the right data curation you can probably take it into a bunch of other narrow tasks, e.g. we're trying to fine-tune it to use SAM3 in a loop for segmentation tasks in videos