1 comment

  • vkaufmann 9 hours ago

    GPT-OSS-20B-Vision: First community VLM for GPT-OSS, trained on a single DGX Spark

    A couple weeks ago I shipped an MCP server (noapi-google-search-mcp) and people in the community challenged me to do something harder - build a VLM. So I bought a DGX Spark, flew to Dubai, and built the first vision-language model for GPT-OSS from a hotel room. Just a Spark, hotel WiFi and stubbornness.

    This is an early proof of concept at 22% of the planned training run - shipped now to show what's possible and to find compute partners to finish the job.

    What it does: Adds vision to GPT-OSS-20B. Takes an image + text prompt, generates coherent descriptions. Identifies objects, scenes, spatial relationships. Vision was trained directly into the model through QLoRA adaptation - the LLM learned to see, not just pass through visual tokens. All original text capabilities are fully preserved. Hallucinations present - expected at this training stage.
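
    Roughly, inference looks like this (a simplified sketch, assuming the checkpoint exposes a standard transformers image-text-to-text interface; the repo id is shortened here and class names may differ from the released code - see the model card for the exact loading steps):

      import torch
      from PIL import Image
      from transformers import AutoProcessor, AutoModelForImageTextToText

      repo_id = "vincentkaufmann/gpt-oss-20b-vision-preview"  # placeholder id

      processor = AutoProcessor.from_pretrained(repo_id)
      model = AutoModelForImageTextToText.from_pretrained(
          repo_id, torch_dtype=torch.bfloat16, device_map="auto"
      )

      image = Image.open("street_scene.jpg")
      prompt = "Describe the objects in this image and their spatial relationships."

      inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
      output_ids = model.generate(**inputs, max_new_tokens=256)
      print(processor.decode(output_ids[0], skip_special_tokens=True))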

    How it works: A SigLIP vision encoder feeds into the 20B MoE language model through a method I call PseudoDeepStack - extracting visual features from multiple encoder depths instead of just the final layer. Richer visual representations at zero additional inference cost.
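
    The gist of the multi-depth extraction, as a simplified sketch (the actual depth indices, fusion, and projector shape in the repo differ - these are illustrative):

      import torch
      import torch.nn as nn
      from PIL import Image
      from transformers import SiglipVisionModel, SiglipImageProcessor

      SIGLIP_ID = "google/siglip-so400m-patch14-384"
      DEPTHS = [-9, -6, -3, -1]   # illustrative choice of encoder depths
      LLM_HIDDEN = 2880           # GPT-OSS hidden size (shared by 20B and 120B)

      encoder = SiglipVisionModel.from_pretrained(SIGLIP_ID)
      image_processor = SiglipImageProcessor.from_pretrained(SIGLIP_ID)

      # Projector: concatenated multi-depth features -> LLM embedding space.
      projector = nn.Sequential(
          nn.Linear(encoder.config.hidden_size * len(DEPTHS), LLM_HIDDEN),
          nn.GELU(),
          nn.Linear(LLM_HIDDEN, LLM_HIDDEN),
      )

      pixel_values = image_processor(
          images=Image.open("example.jpg"), return_tensors="pt"
      ).pixel_values

      with torch.no_grad():
          out = encoder(pixel_values, output_hidden_states=True)

      # Patch tokens from several encoder depths, not just the final layer,
      # concatenated along the feature dimension: richer features with the same
      # number of visual tokens, so the LLM pays no extra cost at inference.
      feats = torch.cat([out.hidden_states[i] for i in DEPTHS], dim=-1)
      visual_tokens = projector(feats)   # (batch, num_patches, LLM_HIDDEN)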

    Key finding: Projector-only training (the standard approach for dense VLMs) fails completely on MoE architectures. The expert routing can't handle visual tokens it's never seen. QLoRA adaptation solves this.
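
    Concretely: instead of freezing the LM and training only the projector, the base model is loaded in 4-bit and LoRA adapters are attached so the LM itself can adapt to visual tokens. A simplified version (ranks and target modules here are illustrative, not the exact config from this run):

      import torch
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig
      from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

      bnb_config = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=torch.bfloat16,
      )

      lm = AutoModelForCausalLM.from_pretrained(
          "openai/gpt-oss-20b",
          quantization_config=bnb_config,
          device_map="auto",
      )
      lm = prepare_model_for_kbit_training(lm)

      lora_config = LoraConfig(
          r=16,
          lora_alpha=32,
          lora_dropout=0.05,
          bias="none",
          task_type="CAUSAL_LM",
          # Illustrative targets (attention projections). The point is that the
          # LM gets trainable parameters of its own, so the MoE routing can learn
          # to handle visual tokens instead of staying frozen behind a projector.
          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
      )
      lm = get_peft_model(lm, lora_config)
      lm.print_trainable_parameters()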

    The setup: Single NVIDIA DGX Spark GB10, hotel room in Dubai, Domino's pizza. No cluster, no team. ~3.5 days of training to this checkpoint.

    What's next: Finishing training with new hyperparameters based on what we learned from this run, scaling to GPT-OSS-120B (the same projector works since the two models share hidden dimensions), and benchmarking. Need compute to get there.

    Model + code + full model card: https://huggingface.co/vincentkaufmann/gpt-oss-20b-vision-pr...