Mamba-3

(together.ai)

130 points | by matt_d 4 days ago

19 comments

nl 5 hours ago

I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.

- PhilippGille 2 hours ago
  
  You mean Mercury 2, by Inception: https://openrouter.ai/inception/mercury-2
- cubefox 3 hours ago
  
  Mamba-3 is an architecture while diffusion is, I believe, a type of objective. So these are not mutually exclusive and therefore not comparable.
  
  - gyrovagueGeist an hour ago
    
    Not wrong, but I think it's more accurate to say:
    Mamba is an architecture for the middle layers of the network (the trunk) which assumes decoding takes place through an autoregressive sequence (popping out tokens in order). This is the SSM they talk about.
    Diffusion is an alternative to the autoregressive approach where decoding takes place through iterative refinement on a batch of tokens (instead of processing tokens one at a time and locking each one in while only looking forward). This can require different architectures for the trunk and the output heads, plus modifications to the objective to make the whole thing trainable. Could Mamba-like ideas be useful in diffusion networks? Maybe, but it's a different problem setup.
- jychang 3 hours ago
  
  That's completely different. That's like saying you want to compare the Nvidia 5090 GPU to the latest Call of Duty.
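
As a rough illustration of the distinction gyrovagueGeist describes above, here is a toy sketch of the two decoding styles. Everything in it is hypothetical: `toy_next_token` stands in for a real model, the vocabulary is made up, and the refinement loop only fills in masked positions, whereas real text diffusion models may also revise earlier choices. It says nothing about how Mamba-3 or Mercury 2 actually work.

```python
import random

VOCAB = ["a", "b", "c", "d"]  # made-up vocabulary for the sketch
MASK = "_"                    # placeholder for a not-yet-decoded position


def toy_next_token(prefix, rng):
    """Hypothetical stand-in for a model: ignores the prefix and samples."""
    return rng.choice(VOCAB)


def autoregressive_decode(length, seed=0):
    """Emit tokens left to right; each token is final once chosen."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(length):
        tokens.append(toy_next_token(tokens, rng))
    return tokens


def iterative_refinement_decode(length, steps=4, seed=0):
    """Start fully masked and commit a share of positions on each pass.

    Positions are filled in arbitrary order, not left to right.
    """
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        remaining_steps = steps - step
        # Ceiling division, so every position is filled by the final pass.
        k = -(-len(masked) // remaining_steps)
        for i in rng.sample(masked, k):
            tokens[i] = rng.choice(VOCAB)
    return tokens
```

The first loop needs one model call per token; the second needs one pass per refinement step regardless of length, which is where the different speed profile comes from.
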
jychang an hour ago

I'm not sure that I buy their conclusion that more compute during inference is good.
Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.
With a fused kernel, that means the GPU streams the tensors from VRAM, and does a bunch of compute on different conversations in the batch, at the same time.
If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users. Providers don't normally leave GPU cores idle during inference.
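
The trade-off in the comment above can be put into a back-of-the-envelope roofline model. This is only a sketch under strong assumptions: it counts weight reads and matmul FLOPs (about 2 FLOPs per parameter per sequence), ignores KV cache, activations, and kernel overheads, and the hardware numbers in the usage note are illustrative rather than the spec of any real GPU.

```python
def step_times(params, batch, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Return (compute_time, memory_time) in seconds for one decode step.

    Assumes the weights are read once per step and shared by the whole
    batch, and that each sequence costs ~2 FLOPs per parameter.
    """
    flops = 2 * params * batch
    bytes_moved = params * bytes_per_param
    return flops / peak_flops, bytes_moved / mem_bandwidth


def break_even_batch(peak_flops, mem_bandwidth, bytes_per_param=2):
    """Batch size at which compute time equals weight-streaming time."""
    # 2 * P * B / peak_flops == P * bytes_per_param / mem_bandwidth
    return bytes_per_param * peak_flops / (2 * mem_bandwidth)
```

For a hypothetical accelerator with 1e15 FLOP/s and 3e12 bytes/s, `break_even_batch(1e15, 3e12)` comes out at roughly 333. Under this simplified model, extra compute per token is hidden behind memory traffic below that batch size and costs throughput above it, so where a provider actually operates decides which side of the argument applies.
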

- zozbot234 an hour ago
  
  > Everyone groups all the requests into a batch, and the GPU computes them together.
  You're only saving on fetching read-only parameters, and not even on that if you're using MoE models where each inference in the batch might require a different expert (unless you rearrange batches so that sharing experts becomes more likely, but that's difficult since experts change per-token or even per-layer). Everything else - KV-cache, activations - gets multiplied by your batch size. You scale both compute and memory pressure by largely the same amount. Yes, GPUs are great at hiding memory fetch latency, but that applies also to n=1 inference.
- yorwba an hour ago
  
  Their latency measurements comparing Mamba-2 and Mamba-3 are done with a batch size of 128. It doesn't seem like Mamba-2 was compute-bound even at that batch size.
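
To put zozbot234's point about per-sequence memory into rough numbers, the sketch below separates the bytes shared across a batch (weights) from the bytes that scale with it (per-sequence state). The formulas are simplifications: the KV-cache size assumes a plain transformer with no cache compression, and `ssm_state_bytes` assumes one fixed-size recurrent state per layer, ignoring any extra buffers. All parameter names and shapes are hypothetical and not taken from Mamba-2 or Mamba-3.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Per-sequence KV cache: keys and values for every cached position."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


def ssm_state_bytes(layers, d_inner, d_state, bytes_per_elem=2):
    """Per-sequence recurrent state: fixed size, independent of context."""
    return layers * d_inner * d_state * bytes_per_elem


def bytes_per_decode_step(weight_bytes, per_seq_state_bytes, batch):
    """Weights are read once per step; per-sequence state once per sequence."""
    return weight_bytes + batch * per_seq_state_bytes
```

With made-up shapes, `kv_cache_bytes(32, 8, 128, 4096)` is 512 MiB per sequence while `ssm_state_bytes(32, 4096, 128)` is 32 MiB, and only the former grows with context length. How much of the per-step traffic is amortized by batching therefore depends heavily on the architecture and the context length.
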
robofanatic 5 hours ago

> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.
Why can’t they simply say:
Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.
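
For readers who have not met the terms in the quoted summary, a state space model's decode step can be sketched as a linear recurrence over a fixed-size state. The code below is a generic diagonal SSM written for illustration only; it is not Mamba-3's recurrence, and the names `a`, `b`, `c` are the usual textbook symbols rather than anything from the blog post.

```python
def ssm_step(state, x, a, b, c):
    """One decode step: h_t = a * h_{t-1} + b * x_t, then y_t = c . h_t.

    `a`, `b`, `c`, and `state` are equal-length lists (a diagonal SSM).
    """
    new_state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
    y = sum(c_i * h_i for c_i, h_i in zip(c, new_state))
    return new_state, y


def ssm_decode(inputs, a, b, c):
    """Run the recurrence over a sequence, carrying only the small state."""
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        state, y = ssm_step(state, x, a, b, c)
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    # Real-valued decay: the state halves at every step.
    print(ssm_decode([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
    # Complex-valued a: the state rotates instead of only decaying.
    print(ssm_decode([1.0, 0.0, 0.0, 0.0], a=[1j], b=[1.0], c=[1.0]))
```

The state has the same size at every step, so the per-token cost of decoding does not grow with sequence length. Using a complex `a` is one simple way to see why complex values let a state track rotations or periodic patterns.
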

- esquire_900 5 hours ago
  
  This is sort of what their first sentence states? Except your line implies that they are fast in training and inference, whereas they imply they are focusing on inference and dropping training speed for it.
  It's a nice opening as it is imo
  
  - cubefox 2 hours ago
    
    They don't say anything about dropping training speed.
- E-Reverance 5 hours ago
  
  The first sentence basically does though, no?
  
  - robofanatic 4 hours ago
    
    Of course, my only objection was the language. LLMs are now old enough to leave the jargon behind and talk in simple, easy-to-understand terms.
    
    - oersted 3 hours ago
      
      I’d argue the opposite: the terminology is fairly mainstream by now, and “inference” has a much more specific sense than “making predictions”.
- mufasachan 3 hours ago
  
  The blog is technical; technical terms in the TL;DR seem relevant to me.
- arendtio 3 hours ago
  
  I don't get the downvotes, as I had trouble understanding the intro as well. It seems it was written for a very specific audience.
  
  - qeternity 3 hours ago
    
    Yes, it is written for a specific audience.
    That is not a reason for snark.
    As other commenters have noted, it’s well written.
  - magicalhippo 2 hours ago
    
    > I don't get the downvotes
    Because the blog post is a technical one and the intro contains very common jargon, and the proposed alternative was wrong.
- camillomiller 3 hours ago
  
  I don’t know why you’re being downvoted. As a longtime editor, I think your version is immensely better. It looks like the original was probably not human-written.