> I work in AI and, to this day, I don't know what they mean by “world” in “world model”.
I have a PhD in ML and a B.S. in physics. What people in ML call a "world model" seems incredibly strange to me. With my physics hat on, a "world model" is pretty clear. It is "a physics." Mind you, there is not one physics; there are competing models, and we're just at a point in time where the models have converged up to quantum and gravity.
But "a physics" can be a model that describes any world, not just the one we live in. For ML models, this should be based on the data they're processing. Ideally we'd want this to be similar to our own, but if it is modeling a "world" where pi=3, then that's still "a physics".
The key point here is that a physics is a counterfactual description of the environment. You have to have language to formalize relationships between objects. In standard physics (and most of science) we use math[0], though we use several languages (different algebras, different groups, different branches, etc) of math to describe different phenomena. But the point is that an equation is designed to be the maximum compression of that description. I don't really care if you use numbers or symbols, what matters is if you have counterfactual, testable, consistent, and concise descriptions of "the world".
Oddly enough, there are a lot of physicists and former physicists that work in ML, but it is fairly uncommon for them to be working on "world modeling." I can tell you from my own experience talking to people who research world models that they respond to my concerns with "we just care if it works", as if that is also not my primary concern. Who the fuck isn't concerned with that? Philosophers?
[0] It can be easy to read "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" out of context as we're >60 years past where math has been the lingua franca of science. But math is a language, we invented it, and it should be surprising that this language is so powerful that we can work out the motion of the stars from a piece of paper. Math is really the closest thing we have to a magical language: https://web.archive.org/web/20210212111540/http://www.dartmo...
Broadly 'world' means 'the domain I'm interested in'. In current use in the DNN context 'world' tends to be physical space at a scale relevant to humans or robots (eg. autonomous vehicles). So when someone says 'world model' you have to ask 'what kind of world, and how is it represented?'.
We don't need to agree on one very specific meaning, which is good, because we would fail.
Yeh I still don't think there's a fixed definition of what a world model is or in what modality it will emerge. I'm unconvinced it will emerge as a satisfying 3d game-like first-person walkthrough.
Same terms - gentlemen's agreement. The loser owes the winner a meal whenever they meet :). For a HN visitor to blore, I'll be happy to host a meal anyway :)
Yeah, it's not quite there yet, but think of this as Stable Diffusion 1, or DALL-E 1/2. It's hard to imagine this not being a part of the VFX workflow within 5 years.
Incredibly disappointing release, especially for a company with so much talent and capital.
Looking at the worlds generated here https://marble.worldlabs.ai/ it looks a lot more like they are just doing image generation for multiview stereo 360 panoramas and then reprojecting that into space. The generations exhibit all the same image artifacts that come from this type of scanning/reconstruction work, all the same data shadow artifacts, etc.
This is more of a glorified image generator, a far cry from a "world model".
To be fair, multiview-consistent diffusion is extremely hard - it's an accomplishment in its own right to get it working, and still very useful. "World model" is probably a misnomer though (what even is a world model?). Their recent work on frame gen models is probably a bit closer to an actual world model in the traditional sense (https://www.worldlabs.ai/blog/rtfm).
They have $230m in funding and some of the best CS/AI researchers in the world. People like Skybox labs have already released stuff that is effectively the same as this with far less capital and resources. This is THE premier world model company, and the fact their first release is a far cry from the promise here feels like a bit of a bellwether.
I agree RTFM is in more of the "right" direction here, and what is presented here is a bit of a derivative of that. Which makes this release so much more crass, as it seems like a ploy to get platform buy in from users more so than a release of a "world model".
Yeah, I'm likewise a bit underwhelmed by the results.
If you go in with the expectation that you give it a single image and it's doing gaussian splatting from a single image and a prompt, it's phenomenal. If you deviate too far from the image viewpoint it breaks down, but it looks decent long enough to be very usable. But if you go in with the expectation that it's generating "worlds", it's not very good. This only passes as a world in a 20 second tech demo where the user isn't given camera controls.
My best guess is that they are forced (by investors, lack of investors, fear of the AI bubble, or whatever) to release something, and this was something they could polish up to production quality and host with reasonable GPU resources.
I assume this is definitely the case, with a drive to create platform economics on their sharing platform so that there is platform lock-in when any better thing releases. This is more of a platform launch than any notable model launch imo.
As someone with a barebones understanding of "world models," how does this differ from sophisticated game engines that generate three-dimensional worlds? Is it simply the adaptation of the transformer architecture to generating the 3-D world vs. using a static/predictable script as in game engines (learned dynamics vs. deterministic simulation mimicking 'generation')? Would love an explanation from SMEs.
Games are still mostly polygon based due to tooling (even Unreal's Nanite is a special variation of handling polygons). Some engines have tried voxels (Teardown; Minecraft generates polygons and would fall in the previous category as far as rendering goes) or even implicit surface models built by composing SDF-style primitives (Dreams on PlayStation and, more recently, unbound.io).
All of these have fairly "exact" representations, and generation techniques are also often fairly "exact" in trying to create worlds that won't break physics engines (a big part) or rendering engines. These are often hand-crafted algorithms, but nothing really stopped neural networks from being used at a higher level.
One important detail in most generation systems in games is that they are often built to be controllable so that they work with game logic (think how Minecraft generates the world to include biomes, villages, etc.) or are more or less artist controllable.
3D scanning has often relied on point clouds, but these were heavy, full of holes, etc., and were infeasible for direct rendering for a long time, so many methods were developed to make decent polygon meshes.
NeRFs and Gaussian splatting (GS) started appearing a few years back. These are more "approximate" and totally ignore polygon generation, instead relying on quantization of the world into NN-matrix "fields" (NeRF) or fuzzy point clouds (GS). Visually these have been impressive since they managed to capture "real" images well.
This system is built on GS since that probably meshed fairly well with neural network token and diffusion techniques for encoding inputs (images, texts).
They do mention mesh exports (there has been some research into polygon generation from GS).
If the system scales to huge worlds this could change game-dev, and there seems to be some aim with the control methods, but it'd probably require more control and world/asset management since you need predictability with existing things to produce in the long term (same as with code agents).
Your latter point is what makes me think this doesn't have comprehensive legs, just niche usage.
A typical game has thousands of hand placed nodes in 3D space, that do things like place lights, trigger story beats, account for physics and collisions etc. That wouldn't change with Gaussian splats, but if you needed to edit the world then even with deterministic generation, the whole world might change, and all your gameplay nodes are now misplaced.
That doesn't matter for some games, but I think it does matter for most.
Oh I agree fully, this was probably created by researchers and/or "AI-bros" with less experience as actual game developers (that they have actually added a way of placing objects is, after all, far more than most other tools have provided with their text focus).
That said, all those collisions, triggers, lights, etc could be authored together with blockouts in Unity, Godot or some other editor capable of creating levels that integrates with the rest of the game authoring process.
If they create a way to keep the contexts of generation (or rebuild them from marker objects with prompts that are kept in the level editor and continuously re-imported) and allow for a sane way to re-generate and keep chunks, then I feel that this could be fairly bad for world artists (yes, they'd probably still be needed to adjust things to not look like total slop).
You could in theory combine point clouds and Nanite: cull sub-pixel points and generate geometry on the fly by filling the voids between remaining points with polygons. The main issue is bandwidth, GPUs are barely able to handle Nanite; and this would be at least an order of magnitude more complex to do at runtime.
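The "cull sub-pixel points" step in the comment above can be sketched in a few lines. This is only an illustration under assumptions that are not in the thread: a pinhole camera, points given as hypothetical `(depth, radius)` pairs in camera space, and a focal length expressed in pixels. It is not how Nanite or any shipping engine does it.

```python
from typing import List, Tuple

# Hypothetical point representation: (depth along the view axis, world-space radius).
Point = Tuple[float, float]


def projected_diameter_px(radius: float, depth: float, focal_length_px: float) -> float:
    """Approximate on-screen diameter of a point under a pinhole camera model."""
    return 2.0 * radius * focal_length_px / depth


def cull_subpixel(points: List[Point], focal_length_px: float,
                  min_pixels: float = 1.0) -> List[Point]:
    """Keep only points in front of the camera that cover at least min_pixels."""
    kept = []
    for depth, radius in points:
        if depth <= 0.0:
            continue  # behind the camera, nothing to draw
        if projected_diameter_px(radius, depth, focal_length_px) >= min_pixels:
            kept.append((depth, radius))
    return kept
```

The per-point test is cheap; the bandwidth concern raised above comes from running it (plus the gap filling) over every point every frame.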
Nanite is doing a lot of offline precomputation, storing some sort of intermediate models etc.
I agree, but I don't think this work is for realtime creation (like those Google models) but rather offline authoring. So the fixups can be done later.
What does the Gaussian approach do that resolves the issue with voxel engines? I recall if you wanted to start doing animation it becomes a mess of computational complexity.
GS does 2 things that make it great for _rendering_ and _world approximation_: it's a view-dependent "fuzzy" thing, so rendering-wise you don't need to fill in blanks of reconstruction, and it also encodes view-dependent things like reflections (that should help an AI model infer beyond-view details).
The issue with real voxels (not MC style) is that they fill fixed spaces, which can then create gaps once you start animating. You probably have the same issues with GS (but that's probably why they are doing exports).
Whenever I see these and play with models like this (and the demos on this page), the movement in the world always feels like a dolly zoom. Things in the distance tend to stay in the distance, even as the camera moves in that direction, and only the local area changes features.
That's the thing about this. Calling things "world models" is only done to confuse people, because "world" is such a loose word. In this scenario the meaning is "3d scene". When others use it, they may mean "screen space physics model". In the context of LLMs it means something like "reasoning about real-world processes outside of text".
Nice tech!
Would be great if this can also work on factual data, like design drawings. With that it could be used for BIM and regulatory use. For example to showcase to residents how a new residential area will look that is planned. Or to test the layout of a planned airport.
An established founder makes claims X is the new frontier. X receives hundreds of millions in funding. Other less established founders claim they are working on X too. VCs suffering from terminal FOMO pump billions more into X. X becomes the next frontier. The previous frontiers are promptly forgotten about.
I think it's a bit confusing when it comes to terminology, this seems more graphics focused while I suspect that a 10 year plan as mentioned by YLC probably revolves around re-architecting AI systems to be less reliant on LLM style nets/refinements and better understand the world in a way that isn't as prone to hallucinations.
If biology operated this way animals would never have evolved. This is bunk, it has nothing to do with intelligence and everything to do with hyping the oxymoronic/paradox branded as spatial intelligence. It's a flimsy way of using binary to burn energy.
Senses do not represent; if we needed them to in order to survive, we'd be dead before we ever saw that branch, or that step, etc. This is the same mistaken approach cog-sci took in inventing the mind from the brain.
The problem is the whole brain prior to sensory and emotional integration is deeply involved so that an incredibly shallow oscillatory requirement fits atop the depth of long-term path-integration, memory consolidation involving allo and egocentric references of space and time, these are then correlated in affinities by sense emotion relays or values. None of this is here, it's all discarded for junk volumes made arbitrary, whereas the spaces here are immeasurably specific, fused by Sharp Wave Ripples.
There's no world model in the instantaneous shallow optic flow response (knowing when to slam on brakes, or pull in for a dive) and in the deep entanglements of memory and imagination and creativity.
This is one-size fits all nonsense. It's junk. It can't hope to resolve the space time requirements of shallow and deep experiences.
Counterpoint: Humans visualize stuff in their minds before trying new things or when learning new concepts. An AI system with LLM based language center and a world model to visualize during training and inference would help it overlap more of human intelligence. Also it looks cool.
Edit: After seeing your edited (longer) comment, I have no idea what you’re talking about.
those two words only describe AI models, as they are models. A "world model" is worse than those two words as it is oxymoronic.
The idea that words and space are being conflated as a formula for spatial intelligence is fundamentally absurd as our relationships to space have no resolution, both within any one language and worse, between them, as language is arbitrary. Language and thought are entirely separate forms. Aphasia has proved this since 2016.
AI developers have to face the music, these impoverished gimmicks aren't even magic, they are bunk. And debunkable once compared to the sciences of the brain.
Is that a more convoluted way to say that a next thing predictor can't exhibit complex behavior? Aka the stochastic parrot argument. Or that one modality can't be a good enough proxy for the other. If so, you probably have to pay more attention to the interpretability research.
But actually most people should start with strong definitions. Consciousness, intelligence, and other adjacent terms have never been defined rigorously enough, even if a ton of philosophers think otherwise. These discussions always dance around ill-defined terms.
Neurobio is built from the base units of consciousness outwards, not intuited interpretation. Eg prediction has nothing inherent to do with consciousness directly. That’s a process imposed on brains post hoc.
Easily refute prediction or error prediction as fundamental.
The path to intelligence or consciousness isn’t mimicry of interpretation.
In terms of strong definitions, start at the base, coders: oscillation, dynamics, topologies, sharp wave ripples, and I would say roughly 60 more strongly defined material units and processes. This reverse intuition is going nowhere and it’s pseudoscientific nonsense for social media timeline filling.
I started writing the counterargument, but somehow I think you have a weird idea of what both interpretability in ML and neurobiology are, especially seeing how you're dealing with things nobody has a full idea about in such absolutes
I'm floored. Incredible work.
Also check out their interactive examples on the webapp. It's a bit rougher around the edges but shows real user input/output. Arguably such examples could be pushed further to better quality output.
e.g. https://marble.worldlabs.ai/world/b75af78a-b040-4415-9f42-6d...
e.g. https://marble.worldlabs.ai/world/cbd8d6fb-4511-4d2c-a941-f4...
Unsurprisingly the results are by far the best in the area shown in the image in the prompt, and quickly deteriorate beyond it, or more than a couple meters behind the camera.
It's worlds better than just doing gaussian splats from images, but given how much the quality is influenced by the images, the limit of four images with a text prompt or eight images without a prompt is quite limiting. That's plenty to describe a chair, but almost nothing to describe a home or a space station. I hope they can extend those limits in future updates.
Very cool tech, extremely annoying voice over.
Fei-Fei is a great researcher. But to be honest, the progress her company has made in "world modeling" seems to deviate somewhat from what she has advertised, which is a bit disappointing. As this article (https://entropytown.com/articles/2025-11-13-world-model-lecu...) summarizes, she is mainly working on 3DGS applications. The problem is that, despite the substantial funding, this demo video clearly avoids the essentials; the camera movement is merely a panning motion. It's safe to assume that adding even one extra second to each shot would drastically reduce the quality. It offers almost no improvement over the earliest 3DGS demo, let alone the addition of any characters.
I'm confused, the article talks about static generation. It creates a gaussian splat or models, which are rendered by an engine. This isn't a real-time model or a normal video generator like Sora, or am I misreading?
This is not a world model; this is at best a reimplementation of the NVIDIA prior art around NeRF / 3D Gaussian Splatting and monocular depth, wrapped in a nice product and workflow. What they’re actually shipping is an offline asset generator: you feed it text, images, or video, it runs depth/structure estimation and neural 3D reconstruction, and you get a static splat/mesh world you can then render or simulate in a real engine. That’s useful and impressive engineering, but it’s very different from a proper “world model” in the RL/embodied-AI sense. Here there’s no online dynamics, no agent loop, and no interactive rollouts; it’s closer to a high-end NeRF/GS pipeline plus tooling than to something like Google’s Genie/2/3, which actually couples generative rendering with action-conditioned temporal evolution. Calling this a “world model” feels more like marketing language than a meaningful technical distinction.
In fact, my definition of a world model is closer to what Demis has hinted at in his discussions: that video gen models like Veo are able to intuit physics from just video training data suggests that there is an underlying manifold in reality that is essentially computable and thus is being simulated by these models. Building such a model would essentially mean building a physics engine of some kind that predicts this manifold.
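To make the distinction above concrete, here is a toy contrast between an action-conditioned model and a static asset. Everything in it is hypothetical: `toy_dynamics` is a hand-written stand-in for a learned transition function, and `StaticScene` is just a placeholder for a baked splat or mesh. It shows the shape of the two interfaces, not how Genie or Marble work internally.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[float, float]  # (position, velocity), purely illustrative


def toy_dynamics(state: State, action: float, dt: float = 0.1) -> State:
    """Stand-in for a learned transition: next_state = f(state, action)."""
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return (position, velocity)


def rollout(step: Callable[[State, float], State],
            state: State, actions: List[float]) -> List[State]:
    """Agent loop: the future depends on which actions are taken."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)
        trajectory.append(state)
    return trajectory


@dataclass(frozen=True)
class StaticScene:
    """Stand-in for an exported splat/mesh asset: no state, no actions."""
    splat_count: int

    def render(self, camera_x: float) -> str:
        # The output depends only on the camera, never on anything an agent did.
        return f"{self.splat_count} splats seen from x={camera_x:.1f}"
```

A rollout with different actions gives a different trajectory, while `StaticScene.render` returns the same thing for the same camera no matter what came before. That is the gap the comment is pointing at.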
I like that they distinguish between the collider mesh (lower poly) and the detailed mesh (higher poly).
As a game developer I'm looking for:
• Export low-poly triangle mesh (ideally OBJ or FBX format, something fairly generic, nothing too fancy)
• Export texture map
• Export normals
• Bonus: export the scene as "de-structured" objects (e.g. instead of a giant world mesh with everything baked into it, separate exports for foreground and background objects to make it more game engine-ready)
Gaussian splats are awesome, but not critical for my current renderers. Cool to have though.
Aren't the gaussian splats the output here? Or are these worlds fully meshed and textured assets?
From my understanding, admittedly from quite a shallow look so far, the model generates gaussian splats, and the collider could then be derived from those.
I guess from the splat and the colliders you could generate actual assets that could be interactable/animated/have physics etc. Unsure, exciting space though! I just don't know how I would properly use this in a game, the examples are all quite on-rails and seem to avoid interacting too much with stuff in the environment.
The page shows, near the bottom, how the main output is gaussian splats, but it can also generate triangular meshes (visual mesh + collider).
However, to my eye, the triangular meshes shown look pretty low quality compared to the splat: compare the triangulated books on the shelves, and the wooden chair by the door, as well as weird hole-like defects in the blanket by the fireplace.
It's also not clear if it's generating one mesh for the entire world; it looks like it is. That would make interactability and optimisation more difficult (no frustum culling etc., though you could feasibly chop the mesh up into smaller pieces I suppose).
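A rough sketch of the "chop the mesh up" idea: bucket triangles into grid cells by centroid, then skip whole chunks whose bounding box lies outside a frustum plane. The grid bucketing, the `(normal, offset)` plane convention, and all function names here are assumptions for illustration; an engine would do this with its own spatial structures.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]
Plane = Tuple[Vec3, float]  # (normal n, offset d); a point p is inside if dot(n, p) + d >= 0
CellKey = Tuple[int, int, int]


def chunk_triangles(triangles: List[Triangle], cell_size: float) -> Dict[CellKey, List[Triangle]]:
    """Split one big mesh into chunks by the grid cell containing each triangle's centroid."""
    chunks: Dict[CellKey, List[Triangle]] = defaultdict(list)
    for tri in triangles:
        centroid = [sum(v[axis] for v in tri) / 3.0 for axis in range(3)]
        key = tuple(math.floor(c / cell_size) for c in centroid)
        chunks[key].append(tri)
    return dict(chunks)


def bounding_box(triangles: List[Triangle]) -> Tuple[Vec3, Vec3]:
    """Axis-aligned bounding box over all vertices of a chunk."""
    verts = [v for tri in triangles for v in tri]
    lo = tuple(min(v[axis] for v in verts) for axis in range(3))
    hi = tuple(max(v[axis] for v in verts) for axis in range(3))
    return lo, hi


def box_outside_plane(lo: Vec3, hi: Vec3, plane: Plane) -> bool:
    """True if the whole box is on the outside of the plane (tests the corner furthest along n)."""
    normal, offset = plane
    corner = [hi[axis] if normal[axis] >= 0 else lo[axis] for axis in range(3)]
    return sum(normal[axis] * corner[axis] for axis in range(3)) + offset < 0


def visible_chunks(chunks: Dict[CellKey, List[Triangle]], frustum: List[Plane]) -> List[CellKey]:
    """Conservative culling: drop a chunk only if it is fully outside some frustum plane."""
    visible = []
    for key, tris in chunks.items():
        lo, hi = bounding_box(tris)
        if not any(box_outside_plane(lo, hi, plane) for plane in frustum):
            visible.append(key)
    return visible
```

With a single world mesh there is only one chunk, so the test can never reject anything; splitting is what gives the culling something to work with.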
Is Marble's definition of a "world model" the same as Yann LeCun's definition of a world model? And is that the same as Genie's definition of a world model?
Pretty sure it's used as a marketing term here. They train on images that you generate/give it, but the output of that training is not a model, it's a static 3d scene made up out of gaussian splats. You are not running inference on a model when traversing one of those scenes, you are just rendering the splats.
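"Just rendering the splats" boils down to sorting and blending, with no network in the loop. A minimal sketch of front-to-back alpha compositing for one pixel follows. It assumes each splat's contribution at that pixel has already been reduced to a hypothetical `(depth, rgb, alpha)` tuple; real 3D Gaussian Splatting renderers derive that alpha from the projected Gaussian and run this per tile on the GPU.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
# Hypothetical per-pixel contribution: (depth, color, alpha at this pixel).
SplatSample = Tuple[float, RGB, float]


def composite_pixel(samples: List[SplatSample], min_transmittance: float = 1e-4) -> RGB:
    """Blend splat contributions front to back: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through nearer splats
    for _depth, rgb, alpha in sorted(samples, key=lambda s: s[0]):
        weight = alpha * transmittance
        for channel in range(3):
            color[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # everything behind is effectively hidden
    return (color[0], color[1], color[2])
```

Moving the camera only changes the sort order and the per-pixel alphas; the splat parameters themselves stay fixed, which is why traversing a scene involves no model inference.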
At the very least it differs greatly from "world model" as understood in earlier robotics and AI research, wherein it referred to a model describing all the details of the world outside the system relevant to the problem at hand.
Very different, it would seem. Then again, it’s never been clear to me why LeCun believes that LLM architectures don’t inherently produce world models in the course of training.
Nor I.
IMO LLMs more or less literally cannot do what they do without a world model, not least because much of what language is, is a protocol for making assertions about that model, testing the degree to which it is shared, and seeking to alter the model one carries of one's interlocutor's model.
To the "parrot people" I suggest, there is no more optimized mechanism for the inner layers of a network to approach than one which most parsimoniously models the world, so as to correctly emit tokens reflective of that.
I understand that DeepMind is working on this too: https://deepmind.google/blog/genie-3-a-new-frontier-for-worl...
I wonder how their approaches and results compare?
Genie delivers on-the-fly generated video that responds to user inputs in real time.
Marble renders a static Gaussian Splat asset (like a 3D game engine asset) that you then render in a game engine.
Marble seems useful for lots of use cases - 3D design, online games, etc. You pay the GPU cost to render once, then you can reuse it.
Genie seems revolutionary but expensive af to render and deliver to end users. You never stop paying boatloads of H100 costs (probably several H100s or TPU equivalents per user session) per second.
You could make a VRChat type game with Marble.
You could make a VRChat game with Genie, but only the billionaires could afford to play it.
To be clear, Genie does some remarkably cool things. You can prompt it, "T-Rex tap dancing by" and it'll appear animated in the world. I don't think any other system can do this. But the cost is enormous and it's why we don't have a playable demo.
When the cost of GPU compute comes down, I'm sure we'll all be streaming a Google Stadia-like experience of "games" rendered on the fly. Multiplayer, with Hollywood grade visuals. Like playing real time Lord of the Rings or something wild.
Interestingly, there is a model like Google Genie that is open source and available to run on your local Nvidia desktop GPU. It's called DiamondWM [1], and it's a world model trained on FPS gameplay footage. It generates a 10 fps 160x160 image you can play through. Maybe we'll develop better models and faster techniques and the dream of local world models can one day be realized.
[1] https://diamond-wm.github.io/
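For a rough sense of how far that is from a modern game, here is a back-of-the-envelope comparison of raw pixel throughput. The 160x160 at 10 fps figures are the ones quoted above; the 1080p at 60 fps target is an assumption picked for comparison, and pixel count ignores everything else that matters (latency, image quality, model size).

```python
def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Raw number of pixels that must be produced each second."""
    return width * height * fps


diamond = pixels_per_second(160, 160, 10)    # figures quoted in the comment
target = pixels_per_second(1920, 1080, 60)   # assumed comparison target
ratio = target / diamond

print(diamond)  # 256000
print(target)   # 124416000
print(ratio)    # 486.0
```

So by this crude measure the gap to a 1080p60 stream is a bit under three orders of magnitude.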
Graphics have long since reached diminishing returns in gameplay; people aren't going to be playing VRChat tomorrow for the same reasons they aren't today.
AI can speed up asset development, but that is not a major bottleneck for video games. What matters is creative game design and backend systems, and designing for the interaction between players and systems is just about as hard as any management role, if not harder.
From what I can tell, you can actually export a mesh in (paid) Marble, whereas I haven't seen mesh exports offered in Genie 3 yet (could be wrong though).
Isn't this a Gaussian Splat model?
I work in AI and, to this day, I don't know what they mean by “world” in “world model”.
But "a physics" can be a model that describes any world, not just the one we live in. For ML models, this should be based on the data they're processing. Ideally we'd want this to be similar to our own, but if it is modeling a "world" where pi=3, then that's still "a physics".
The key points here are that a physics is a counterfactual description of the environment. You have to have language to formalize relationships between objects. In standard physics (and most of science) we use math[0], though we use several languages (different algebras, different groups, different branches, etc) of math to describe different phenomena. But the point is that an equation is designed to be the maximum compression of that description. I don't really care if you use numbers or symbols, what matters is if you have counterfactual, testable, consistent, and concise descriptions of "the world".
Oddly enough, there are a lot of physicists and former physicists that work in ML but it is fairly uncommon for them to be working on "world modeling." I can tell you from my own experience talking to people who research world models that they respond to my concerns as "we just care if it works" as if that is also not my primary concern. Who the fuck isn't concerned with that? Philosophers?
[0] It can be easy to read "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" out of context, as we're >60 years past it and math has long been the lingua franca of science. But math is a language, we invented it, and it should be surprising that this language is so powerful that we can work out the motion of the stars from a piece of paper. Math is really the closest thing we have to a magical language: https://web.archive.org/web/20210212111540/http://www.dartmo...
Broadly 'world' means 'the domain I'm interested in'. In current use in the DNN context 'world' tends to be physical space at a scale relevant to humans or robots (eg. autonomous vehicles). So when someone says 'world model' you have to ask 'what kind of world, and how is it represented?'.
We don't need to agree on one very specific meaning, which is good, because we would fail.
Yeh I still don't think there's a fixed definition of what a world model is or in what modality it will emerge. I'm unconvinced it will emerge as a satisfying 3d game-like first-person walkthrough.
Ye but there won't be, same as with "AGI" and "AI": it depends on whom you are asking *shrug*
I think absolutely it will in a year
but it sounds cool
This is going to take movie making to another level because now we can: 1. Generate a full scene 2. Generate a character with specific movements.
Combine these 2, and we can have moving cameras as needed (long takes). This is going to make storytelling very expressive.
Incredible times! Here's a bet: We'll have an AI superstar (A-list level) in the next 12 months.
An a-list level actor superstar within 12 months?
I’m willing to take that bet. Name any amount you’re willing to lose.
Before you agree: movies take more than 1 year to make and get published, and it takes more than 1 movie to make somebody an a-lister
Fair warning, when I last put up a bet in AI video arena, I won! https://www.linkedin.com/posts/anilgulecha_kitsune-activity-...
Same terms - gentlemen's agreement. The loser owes the winner a meal whenever they meet :). For a HN visitor to blore, I'll be happy to host a meal anyway :)
What was the 90 minute movie?
The LinkedIn thread there also seems AI generated
The last few % that make an A-list actor an A-list actor are the hardest part; I would bet you that it's going to take longer than 12 months
Hard disagree. CG in films is awful when done cheaply, and this all looks like really cheap CG.
Yeah, it's not quite there yet, but think of this as Stable Diffusion 1, or DALL-E 1/2. It's hard to imagine this not being a part of the VFX workflow within 5 years.
> This is going to make storytelling very expressive.
Finally! We should come up with a term for this new tech... maybe computer generated imagery?
Incredibly disappointing release, especially for a company with so much talent and capital.
Looking at the worlds generated here https://marble.worldlabs.ai/ it looks a lot more like they are just doing image generation for multiview stereo 360 panoramas and then reprojecting that into space. The generations exhibit all the same image artifacts that come from this type of scanning/reconstruction work, all the same data shadow artifacts, etc.
This is more of a glorified image generator, a far cry from a "world model".
To be fair, multiview-consistent diffusion is extremely hard - getting it right is an accomplishment in its own right, and still very useful. "World model" is probably a misnomer though (what even is a world model?). Their recent work on frame gen models is probably a bit closer to an actual world model in the traditional sense (https://www.worldlabs.ai/blog/rtfm).
They have $230m in funding and some of the best CS/AI researchers in the world. People like Skybox labs have already released stuff that is effectively the same as this with far less capital and resources. This is THE premier world model company, and the fact their first release is a far cry from the promise here feels like a bit of a bellwether.
I agree RTFM is in more of the "right" direction here, and what is presented here is a bit of a derivative of that. Which makes this release so much more crass, as it seems like a ploy to get platform buy in from users more so than a release of a "world model".
https://www.skyboxai.net/ https://worldgen.github.io/
Yeah, I'm likewise a bit underwhelmed by the results.
If you go in with the expectation that you give it a single image and it's doing gaussian splatting from a single image and a prompt it's phenomenal. If you deviate too far from the image viewpoint it breaks down, but it looks decent long enough to be very usable. But if you go in with the expectation that it's generating "worlds" it's not very good. This only passes as a world in a 20 second tech demo where the user isn't given camera controls
My best guess is that they are forced (by investors, lack of investors, fear of the AI bubble, or whatever) to release something, and this was something they could polish up to production quality and host with reasonable GPU resources
I assume this is definitely the case, with a drive to create platform economics on their sharing platform so that there is platform lock-in when any better thing releases. This is more of a platform launch than any notable model launch imo.
As someone with barebones understanding of "world models," how does this differ from sophisticated game engines that generate three-dimensional worlds? Is it simply the adaptation of transformer architecture in generating the 3-D world v/s using a static/predictable script as in game engines (learned dynamics vs deterministic simulation mimicking 'generation')? Would love an explanation from SMEs.
Games are still mostly polygon based due to tooling (even Unreal's Nanite is a special variation of handling polygons). Some engines have tried voxels (Teardown; Minecraft generates polygons and would fall in the previous category as far as rendering goes) or even implicit surface models built by composing SDF-style primitives (Dreams on PlayStation and, more recently, unbound.io).
All of these have fairly "exact" representations, and generation techniques are also often fairly "exact" in trying to create worlds that won't break physics engines (a big part) or rendering engines. These are often hand-crafted algorithms, but nothing really stopped neural networks from being used at a higher level.
One important detail in most generation systems in games is that they are often built to be controllable so they work with game logic (think of how Minecraft generates the world to include biomes, villages, etc.) or to be more or less artist controllable.
3D scanning has often relied on point clouds, but these were heavy, full of holes, etc., and were long infeasible for direct rendering, so many methods were developed to make decent polygon meshes from them.
NeRFs and Gaussian splatting (GS) started appearing a few years back. These are more "approximate" and totally ignore polygon generation, instead relying on quantization of the world into NN-matrix "fields" (NeRF) or fuzzy point clouds (GS). Visually these have been impressive, since they managed to capture "real" images well.
This system is built on GS since that probably meshed fairly well with neural network token and diffusion techniques for encoding inputs (images, texts).
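To make "fuzzy point clouds" a bit more concrete, here is a minimal 2D sketch of how a splat renderer shades one pixel: each splat contributes a Gaussian falloff around its center, and the contributions are blended front to back. This is a toy under stated assumptions (isotropic splats already projected to screen space, hypothetical `Splat` and `shade_pixel` names), not Marble's pipeline or any real renderer.

```python
import math
from dataclasses import dataclass


@dataclass
class Splat:
    x: float          # screen-space center
    y: float
    depth: float      # distance from the camera, used only for sorting
    sigma: float      # isotropic spread (real splats use a full covariance)
    color: tuple[float, float, float]
    opacity: float    # peak alpha in [0, 1]


def shade_pixel(px: float, py: float, splats: list[Splat]) -> tuple[float, float, float]:
    """Front-to-back alpha compositing of Gaussian falloffs at one pixel."""
    r = g = b = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for s in sorted(splats, key=lambda s: s.depth):
        d2 = (px - s.x) ** 2 + (py - s.y) ** 2
        alpha = s.opacity * math.exp(-0.5 * d2 / s.sigma ** 2)
        weight = alpha * transmittance
        r += weight * s.color[0]
        g += weight * s.color[1]
        b += weight * s.color[2]
        transmittance *= 1.0 - alpha
    return (r, g, b)


near_red = Splat(0.0, 0.0, depth=1.0, sigma=1.0, color=(1.0, 0.0, 0.0), opacity=0.5)
far_blue = Splat(0.0, 0.0, depth=2.0, sigma=1.0, color=(0.0, 0.0, 1.0), opacity=1.0)
print(shade_pixel(0.0, 0.0, [far_blue, near_red]))  # (0.5, 0.0, 0.5)
```

There are no triangles or hard surfaces anywhere in this; the "geometry" is just where the blobs happen to be dense and opaque.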
They do mention mesh exports (there has been some research into polygon generation from GS).
If the system scales to huge worlds this could change game-dev, and there seems to be some aim with the control methods, but it'd probably require more control and world/asset management since you need predictability with existing things to produce in the long term (same as with code agents).
Your latter point is what makes me think this doesn't have comprehensive legs, just niche usage.
A typical game has thousands of hand placed nodes in 3D space, that do things like place lights, trigger story beats, account for physics and collisions etc. That wouldn't change with Gaussian splats, but if you needed to edit the world then even with deterministic generation, the whole world might change, and all your gameplay nodes are now misplaced.
That doesn't matter for some games, but I think it does matter for most.
Oh I agree fully, this is probably more created by researchers and/or "AI-bros" with less experience as actual game developers (that they have actually added a way of placing objects is after all far more than most other tools have provided with their text focus).
That said, all those collisions, triggers, lights, etc could be authored together with blockouts in Unity, Godot or some other editor capable of creating levels that integrates with the rest of the game authoring process.
If they create a way to keep the contexts of generation (or rebuild them from marker objects with prompts that are kept in the level editor and continuously re-imported) and allow for a sane way to re-generate and keep chunks then I feel that this could be fairly bad for world artists (Yes, they'd probably still be needed to adjust things to not look like total slop).
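One way "re-generate and keep chunks" could work, sketched under heavy assumptions: derive a separate seed per chunk from the world seed, the chunk coordinates and a per-chunk revision counter, so bumping one chunk's revision leaves every other chunk (and the gameplay nodes placed in it) untouched. The names `chunk_seed` and `generate_chunk` are hypothetical, and the "generator" is a toy that scatters props; nothing here reflects how Marble actually works.

```python
import hashlib
import random


def chunk_seed(world_seed: int, cx: int, cy: int, revision: int = 0) -> int:
    """Stable seed for one chunk, independent of all other chunks."""
    key = f"{world_seed}:{cx}:{cy}:{revision}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def generate_chunk(world_seed, cx, cy, revision=0, n_props=3):
    """Toy generator: prop positions inside the chunk's unit square."""
    rng = random.Random(chunk_seed(world_seed, cx, cy, revision))
    return [(cx + rng.random(), cy + rng.random()) for _ in range(n_props)]


revisions = {(0, 0): 0, (1, 0): 0}
before = {c: generate_chunk(42, *c, revision=r) for c, r in revisions.items()}

revisions[(1, 0)] += 1  # regenerate only this chunk
after = {c: generate_chunk(42, *c, revision=r) for c, r in revisions.items()}

print(before[(0, 0)] == after[(0, 0)])  # untouched chunk is identical
print(before[(1, 0)] == after[(1, 0)])  # regenerated chunk differs
```

A real generative model would also have to keep the seams between chunks consistent, which this sketch ignores entirely.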
You could in theory combine point clouds and Nanite: cull sub-pixel points and generate geometry on the fly by filling the voids between remaining points with polygons. The main issue is bandwidth, GPUs are barely able to handle Nanite; and this would be at least an order of magnitude more complex to do at runtime. Nanite is doing a lot of offline precomputation, storing some sort of intermediate models etc.
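The "cull sub-pixel points" step can be sketched roughly like this, assuming a simple pinhole camera where a point's on-screen radius is its world radius times the focal length in pixels, divided by depth. The function names and the 0.5 px threshold are assumptions for illustration, not how Nanite or any real engine does it.

```python
def projected_radius_px(radius: float, depth: float, focal_px: float) -> float:
    """Approximate on-screen radius of a point under a pinhole camera."""
    return radius * focal_px / depth


def cull_sub_pixel(points, focal_px: float, min_radius_px: float = 0.5):
    """Keep only points whose projected radius is at least min_radius_px.

    points is an iterable of (radius, depth) pairs in world units.
    Points at or behind the camera (depth <= 0) are dropped as well.
    """
    return [
        (radius, depth)
        for radius, depth in points
        if depth > 0 and projected_radius_px(radius, depth, focal_px) >= min_radius_px
    ]


points = [(0.01, 1.0), (0.01, 50.0), (0.5, 50.0)]
visible = cull_sub_pixel(points, focal_px=1000.0)
print(visible)  # [(0.01, 1.0), (0.5, 50.0)]
```

The culling itself is cheap; the expensive part is the next step, building polygons between the surviving points every frame.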
I agree, but I don't think this work is for realtime creation (like those Google models) but rather offline authoring. So the fixups can be done later.
What does the Gaussian approach do that resolves the issue with voxel engines? I recall if you wanted to start doing animation it becomes a mess of computational complexity.
GS does 2 things that make it great for _rendering_ and _world approximation_: it's a view-dependent "fuzzy" thing, so rendering-wise you don't need to fill in the blanks of the reconstruction, and it also encodes view-dependent things like reflections (that should help an AI model infer beyond-view details).
The issue with real voxels (not MC style) is that they fill in fixed spaces, which can then create gaps once you start animating. You probably have the same issues with GS (but that's probably why they are doing exports).
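On the view-dependent part: a common way to store it, as far as I know including in the original 3D Gaussian Splatting implementation, is a handful of spherical harmonics coefficients per splat and color channel, evaluated against the viewing direction. Below is a degree-1 sketch for a single channel. The sign convention and the +0.5 offset follow what I believe that implementation uses, but treat them as assumptions; other renderers may differ.

```python
import math

SH_C0 = 0.28209479177387814  # 1 / (2 * sqrt(pi)), the constant band
SH_C1 = 0.4886025119029199   # sqrt(3 / (4 * pi)), scale of the linear band


def view_dependent_value(coeffs, view_dir):
    """Evaluate one color channel from 4 SH coefficients and a view direction."""
    x, y, z = view_dir
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    value = (
        SH_C0 * coeffs[0]
        - SH_C1 * y * coeffs[1]
        + SH_C1 * z * coeffs[2]
        - SH_C1 * x * coeffs[3]
    )
    return min(1.0, max(0.0, value + 0.5))  # clamp to a displayable range


shiny = [0.0, 0.0, 0.5, 0.0]  # brightness varies along the z axis
print(view_dependent_value(shiny, (0.0, 0.0, 1.0)))
print(view_dependent_value(shiny, (0.0, 0.0, -1.0)))
```

The same splat returns a brighter value from one side than from the other, which is the crude mechanism behind baked-in highlights and reflections.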
The model is predicting what the state of the world would look like after a given action.
Along with entertainment, they can be used for simulation training for robots. And allow for imagining potential trajectories
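In code, that kind of world model boils down to a function from (state, action) to next state, which you can call repeatedly to imagine a trajectory without touching the real environment. A toy sketch, with a hand-written grid dynamics function standing in for what would be a learned model (all names here are hypothetical):

```python
from typing import Callable, Sequence

State = tuple[int, int]  # toy state: a position on a grid
Action = str             # toy action: "left", "right", "up" or "down"

MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def toy_dynamics(state: State, action: Action) -> State:
    """Stand-in for a learned model: predict the next state for an action."""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


def rollout(
    model: Callable[[State, Action], State],
    start: State,
    actions: Sequence[Action],
) -> list[State]:
    """Imagine a trajectory by feeding each prediction back into the model."""
    states = [start]
    for action in actions:
        states.append(model(states[-1], action))
    return states


print(rollout(toy_dynamics, (0, 0), ["right", "right", "up"]))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

A robot planner would run many such rollouts with different action sequences and pick the one whose imagined outcome looks best.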
Marble is not that type of world model. It generates static Gaussian Splat assets that you can render using 3D libraries.
Whenever I see these and play with models like this (and the demos on this page), the movement in the world always feels like a dolly zoom [0]. Things in the distance tend to stay in the distance, even as the camera moves in that direction, and only the local area changes features.
[0] https://en.wikipedia.org/wiki/Dolly_zoom
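For anyone unfamiliar with the effect: in a true dolly zoom the camera moves while the field of view changes so that the subject keeps the same size in frame. With a subject of width w at distance d, filling the frame means w = 2 * d * tan(fov / 2). A small sketch of that relation (the function name is made up):

```python
import math


def dolly_zoom_fov(subject_width: float, distance: float) -> float:
    """Field of view in degrees that keeps subject_width filling the frame."""
    return math.degrees(2.0 * math.atan(subject_width / (2.0 * distance)))


for d in (2.0, 4.0, 8.0):
    print(d, round(dolly_zoom_fov(2.0, d), 2))
```

As the camera backs away the required field of view narrows, so the subject stays put while the background appears to stretch or compress around it.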
That's the thing about this. Calling things "world models" is only done to confuse people, because "world" is such a loose word. In this scenario the meaning is "3d scene". When others use it, they may mean "screen space physics model". In the context of LLMs it means something like "reasoning about real-world processes outside of text".
This "world model" is Image to Gaussian Splat. This is a static render that a web-based Gaussian Splat viewer then renders.
Other "world model"s are Image + (keyboard input) to Video or Streaming Images, that effectively function like a game engine / video hybrid.
Duplicate: https://news.ycombinator.com/item?id=45902732
Nice tech! Would be great if this can also work on factual data, like design drawings. With that it could be used for BIM and regulatory use. For example to showcase to residents how a new residential area will look that is planned. Or to test the layout of a planned airport.
This seems very interesting. Timely, given that Yann LeCun's vision also seems to align with world models being the next frontier: https://news.ycombinator.com/item?id=45897271
An established founder makes claims X is the new frontier. X receives hundreds of millions in funding. Other less established founders claim they are working on X too. VCs suffering from terminal FOMO pump billions more into X. X becomes the next frontier. The previous frontiers are promptly forgotten about.
I think it's a bit confusing when it comes to terminology, this seems more graphics focused while I suspect that a 10 year plan as mentioned by YLC probably revolves around re-architecting AI systems to be less reliant on LLM style nets/refinements and better understand the world in a way that isn't as prone to hallucinations.
What’s it going to do, take away funds from the otherwise extremely prudent AI sector?
What happens when you prompt one of these kind of models with de_dust? Will it autocomplete the rest of the map?
edit: Just tried it and it doesn't, but it does a good job of creating something like a CS map.
>What happens when you prompt one of these kind of models with de_dust?
Presumably de_dust2
You both should check out DiamondWM. It runs on Ubuntu and I think Windows, presuming you have an Nvidia GPU. It's exactly what you're talking about.
I linked it elsewhere in this thread.
I just want to give it a picture of my house and have it show what it could look like organized so I know where to put everything
Got the same use case. I tried last month with the Gemini app and it was awful; most of the rendering was messed up.
Are we nearing the capability to build something like the Mind's Game (from the Ender's Game book)
This is great. Can I use it with existing scan of my room to fill the gaps? Not a random world
Update - yes you can. To be tested.
You also can't regenerate with the new point of origin. Generate -> stop ... No way to continue?
Update - it is a paid feature
It doesn't seem all great researchers can create great companies or products
this prompt seems to be blocked, "Los Angeles moments before a 8mile wide asteroid impacts." others work but when I use that it's always 'too busy'.
Seems anything to do with asteroids (or explosions, I imagine) is blocked.
RIP GTA6
exciting towards world intelligence
Impressive!
If biology operated this way animals would never have evolved. This is bunk, it has nothing to do with intelligence and everything to do with hyping the oxymoronic/paradox branded as spatial intelligence. It's a flimsy way of using binary to burn energy.
Senses do not represent; if we needed them to in order to survive, we'd be dead before we ever saw that branch, or that step, etc. This is the same mistaken approach cog-sci took in inventing the mind from the brain.
The problem is the whole brain prior to sensory and emotional integration is deeply involved so that an incredibly shallow oscillatory requirement fits atop the depth of long-term path-integration, memory consolidation involving allo and egocentric references of space and time, these are then correlated in affinities by sense emotion relays or values. None of this is here, it's all discarded for junk volumes made arbitrary, whereas the spaces here are immeasurably specific, fused by Sharp Wave Ripples.
There's no world model in the instantaneous shallow optic flow response (knowing when to slam on brakes, or pull in for a dive) and in the deep entanglements of memory and imagination and creativity.
This is one-size fits all nonsense. It's junk. It can't hope to resolve the space time requirements of shallow and deep experiences.
I don't know who you are or why you're so negative, but I have an immediate use case for this and I'll come back and post a "Show HN" on this.
This is immensely useful tech for blocking out consistent scenes, eg. for video generation.
Beyond entertainment, this is going to be huge in the field of robotics.
I wish you could see what I see.
I’m really curious to see what you are cooking. What field are you in?
This is arbitrary, it has nothing to do with specifics or coherence. Study it carefully, particularly as it relates to general robotics.
Counterpoint: Humans visualize stuff in their minds before trying new things or when learning new concepts. An AI system with LLM based language center and a world model to visualize during training and inference would help it overlap more of human intelligence. Also it looks cool.
Edit: After seeing your edited (longer) comment, I have no idea what you’re talking about.
It’s irrelevant - it has nothing analogous to mental imagery. It’s pseudoscience.
Edit - of course you have no idea, you have no grasp of the oscillatory-dynamic origins of consciousness, nor does it seem anyone in AI.
histrionic and meretricious
those two words only describe AI models, as they are models. A "world model" is worse than those two words as it is oxymoronic.
The idea that words and space are being conflated as a formula for spatial intelligence is fundamentally absurd as our relationships to space have no resolution, both within any one language and worse, between them, as language is arbitrary. Language and thought are entirely separate forms. Aphasia has proved this since 2016.
AI developers have to face the music, these impoverished gimmicks aren't even magic, they are bunk. And debunkable once compared to the sciences of the brain.
https://mcgovern.mit.edu/2024/06/19/what-is-language-for/
Is that a more convoluted way to say that a next thing predictor can't exhibit complex behavior? Aka the stochastic parrot argument. Or that one modality can't be a good enough proxy for the other. If so, you probably have to pay more attention to the interpretability research.
But actually most people should start with strong definitions. Consciousness, intelligence, and other adjacent terms have never been defined rigorously enough, even if a ton of philosophers think otherwise. These discussions always dance around ill-defined terms.
Neurobio is built from the base units of consciousness outwards, not intuited interpretation. Eg prediction has nothing inherent to do with consciousness directly. That’s a process imposed on brains post hoc.
https://pubmed.ncbi.nlm.nih.gov/38579270/
And
https://mitpress.mit.edu/9780262552820/the-spontaneous-brain...
Easily refute prediction or error prediction as fundamental.
The path to intelligence or consciousness isn’t mimicry of interpretation.
In terms of strong definitions, start at the base, coders: oscillation, dynamics, Topologies, sharp wave ripples, and I would say roughly 60 more strongly defined material units and processes. This reverse intuition is going nowhere and it’s pseudoscientific nonsense for social media timeline filling.
I started writing the counterargument, but somehow I think you have a weird idea of what both interpretability in ML and neurobiology are, especially seeing how you're dealing with things nobody has a full idea about in such absolutes
Fundamentally incorrect across the board. We study ML for any signs of parallel function even from a tinkering level. Nope.
Look at Unlocking The brain both volumes, rhythms of the brain and the brain from inside out, and these are the tip.