My first impression coming away from this is skepticism.
Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.
Most of their examples seem like they could have been done with a right click drop down menu so they don't really need to "re-invent the mouse pointer".
So is this thing talking to Google's servers all the time for the AI integration? So it won't work if you're not connected to the internet? Privacy concerns are obvious; now Google wants to have an AI watching literally everything you do on your computer?
Does it cost the user anything for the LLM use? If it's free will it stay free forever? That's quite a lot to give away if they're expecting people to use it to change a single word like in one of their examples. I guess they're expecting to make the money back by gathering data about literally everything you do on your computer.
There might be a killer app for AI integration with personal computers that has yet to be invented, but this doesn't look like it.
The second half of your comment is a go-to-market concern but doesn't feel so relevant for a research prototype. It could be done with a private local model too, maybe not by Google.
But I don't think the voice problem is surmountable. I closed their image editing demo when I saw it required a mic.
It would be appealing as a Spotlight-like text pop-up interface where you type instructions, which would work in social/office environments, but that might only appeal to power users.
Yeah I think there could be something to the integration of AI in an operating system so that it can handle things going on in different applications the same way you can already copy and paste between things.
But if it's going to require phoning home to some Google/OpenAI/whoever then forget it. I don't want a constant connection to my OS from one of these companies.
This will sound like another brick in the paved road to dystopia but I'm kinda bullish on equipment that can recognize subvocalization. Or at least let me have a small drawing tablet with a stylus (think etch-a-sketch or Wacom Intuos) because at this point I'd rather practice writing and do away with typing altogether (even though I enjoy typing for typing's sake via MonkeyType).
The "Edit an Image" Demo at the bottom is pretty fun. Maybe this is just Google flexing their LLM inference capacity.
That demo was an absolute disaster for me on Firefox on Mac. It just fundamentally didn't work: the voice was way behind my pointer, there were multiple agents speaking over each other saying conflicting things, and it couldn't even move the crab to the bottom right of the image. Embarrassingly bad, I would say!
Right — it does seem cool but the voice is patching over a major gap. If I'm talking already, why wouldn't I just describe what I'm looking at and have the AI grab it for me?
I think they answer that question pretty convincingly: because if what you're looking at is already on the screen, it's much easier to point to it and say "that" than to describe it.
(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path)
Oh interesting, this is very cool. At first I thought it was just focus-follows-mouse, but it's more interesting: certain keywords trigger "add to prompt". Ignoring the voice functionality (which is admittedly crucial for now, because other inputs take over focus), I've often wanted to just have a continuous conversation with the LLM as I point and click (or tab over and select) at various things. It might be neat to have text input focus continue to go to the LLM while I'm typing text, etc.
Sometimes I go to a different page to take a screenshot, other times I'm browsing for a file, and other times I'm highlighting some log lines. Cursor did this well: selecting text in the terminal auto-focused the Cursor agent textbox, so you could talk to the agent, then select some more text, and never have to re-select the agent textbox. The agent is a top-level function in that system, not "just another app I have to switch to" to bring my context with me.
I have some small amount of bias because I've always felt input-constrained on computers. I have to move my hands to go places, and that's exasperating. I've tried head tracking, had a vim pedal for a while, and used tiling WMs and things like this to help, but while my vim-fu is pretty good and I work inside individual applications very well with it, my cross-application interface isn't as good.
In the end, perhaps we all have our home offices with our Apple Vision Pros and we talk to them like this to maneuver faster through our machines and get our ideas into them.
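As a rough illustration of the keyword-triggered "add to prompt" idea discussed above, here is a minimal browser-side sketch. It is not Google's actual design or the commenter's code; the use of the SpeechRecognition API and all names here are assumptions.

```typescript
// Sketch: when a deictic keyword ("this", "that", "here", "there") is heard,
// snapshot whatever the pointer is currently over and append it, plus the
// spoken phrase, to a running prompt context.
const promptContext: string[] = [];
let underPointer: Element | null = null;

document.addEventListener("pointermove", (ev) => {
  underPointer = ev.target as Element;
});

// SpeechRecognition is prefixed in some browsers; this is an assumption about
// the environment, not part of the prototype being discussed.
const Recognition =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

if (Recognition) {
  const rec = new Recognition();
  rec.continuous = true;
  rec.onresult = (e: any) => {
    const said: string = e.results[e.results.length - 1][0].transcript;
    if (/\b(this|that|here|there)\b/i.test(said) && underPointer) {
      // Capture what the pointer was on at the moment the keyword was spoken.
      promptContext.push(`[pointed at] ${underPointer.textContent?.slice(0, 200) ?? ""}`);
    }
    promptContext.push(`[said] ${said}`);
  };
  rec.start();
}
```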
Cool research. I wonder what we'll end up with.
My reaction to the first demo (recipe) is that it was slower than typing the same thing on your keyboard.
The second demo seems to be a wash: there's no time saved in saying "move this" versus "move crab". And an app-specific contextual menu would probably be faster.
The third demo doesn't seem to warrant the use of a pointer at all, since there is only one way to interpret the prompt.
None of this means that this approach will not be successful, but there's a reason why so many attempts to revolutionize user interfaces ended up going nowhere. Talking to your computer was always supposed to be the future, but in practice, it's slower and more finicky than typing.
In fact, the only new UI paradigm of the past 28+ years appears to have been touchscreens and swipe gestures on phones. But they are a matter of necessity. No one wants to finger-paint on a desktop screen.
Talking to your computer can only ever work for people in atomized work-from-home silos, surely. I can't really imagine living in a world where everybody is just muttering commands to the computer all the time.
This happens daily in radiology departments around the world
What I actually would be more willing to allow is a version of this that is built into macOS, runs locally, and never phones home. If Apple Intelligence offered an "AI sees everything on my screen" mode, I might turn it on.
I think this falls flat for a technical audience because we already know how to do this stuff. But there are a lot of people who don't know how to copy and paste, or use reverse image search, or apply a filter to a table. Being able to use plain language to do these things is a game changer for them. Sure, it's inefficient and inelegant, but it's an interface that will do for basic technical tasks what the iPad touch screen did for the mouse and keyboard.
one step back, for both technical and non-technical users, is knowing that's even a problem you have.
the agent occasionally spots your real problem like an experienced engineer
I sense a privacy problem brewing.
It reminds me of Microsoft Recall in the sense that some portion of the screen is going to be continuously transmitted outside of the user's control.
What happens when someone browses something very private (planning a surprise engagement, looking at medical data, planning a protest)? All that data gets slurped up by Google and becomes subject to a warrant or discovery, or feeds your advertising fingerprint.
Maybe the idea is that the data is sent to the AI only when you right-click, but that seems like a very thin firewall that a product manager will breach in the interest of delivering "predictive AI" via some kind of precomputed results.
At some point, I hope these megacorps will figure out that running these things locally might be the most cost-effective option. FWIW, I think the local models I run on my MacBook are good enough for most of the tasks this kind of interaction would ask for.
> What happens when someone browses something very private?
Profit!
you can really tell the people building these tools spend a lot of time alone. I work from a home office 90% of the time and I wouldn't want this to be my workflow. I don't want to talk to my computer, I want to listen to music while I work, and I want to not sound deranged and disturb everyone around me when I am working from a coffeeshop or the open-plan office or the airport or the train or whatever.
and that's aside from the obvious privacy problems.
Agreed. A lot of Google products feel anti-social (like Google Glass). They are definitely missing a human touch. Perhaps a byproduct of elitism and leetcode-grind filtering of employees, mixed with founder personality.
Nothing else was to be expected: one of the examples in this very article marks some text in a doc and prompts "make this more human".
yeah, i understand the frustration of needing to do all the communication through typing and clicking, and that it can feel limiting - but i want the computer to be less demanding of my physical reality, not more. i want to be able to talk to someone on the phone, work on something with my hands, and still successfully manage my compute tasks. improvement can only be made by requiring less attention to the screen and fewer hand movements, not by adding in anything new like voice.
> i want to be able to talk to someone on the phone, work on something with my hands, and still successfully manage my compute tasks.
Maybe you can share a scenario for that one? I can't think of a scenario where all of this needs to be true. It seems like a recipe for accidents.
Can I use this AI mouse pointer to tell the difference between hotdog or not hotdog?
Indeed you could!
This is how I always imagined FE development would work once ChatGPT 3 came out. Then Cursor appeared and seeing how successful they were with just a chat and a few tool calls, I thought I was over-complicating things.
Anyway, I built a prototype on this idea, but instead of relying only on hover, I press Option to select a node in a custom AST-ish semantic layer I designed around a minimalist UI grammar, and Option + up/down arrows to move to the parent/child node. This way, I have an accurate pointer to the element I want to talk about, plus a minimal context window (parent component, state, a few navigation-related queries).
What I learned from using it, though, is that the killer use case isn't necessarily the flashy "talk to this UI element" interaction shown in the Google demos. I do use it that way too; I have `Option + Shift + click` to copy a selector to the clipboard, so I can give an LLM connected to the live medium a precise reference to the element I want to discuss.
But the place where it has been most useful day to day is much simpler: source navigation. Point at the thing in the UI, jump to the code that is responsible for it. The difficult part is jumping to the code you care about (the code for the UI or for the semantic element?), but in my system that distinction usually turned out to be obvious, which is what makes the interaction useful.
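For the "copy a selector for the element under the pointer" half of this, a minimal browser-side sketch might look like the following. It is hypothetical, not the commenter's actual implementation, and skips their semantic layer entirely.

```typescript
// Sketch: Option+Shift+click copies a CSS path for the element under the
// pointer, so it can be pasted into an LLM conversation as a precise reference.
function cssPath(el: Element): string {
  const parts: string[] = [];
  let node: Element | null = el;
  while (node && node !== document.documentElement) {
    let part = node.tagName.toLowerCase();
    if (node.id) {
      parts.unshift(`#${node.id}`);  // an id anchors the path; stop here
      break;
    }
    const siblings = node.parentElement
      ? Array.from(node.parentElement.children).filter(s => s.tagName === node!.tagName)
      : [];
    if (siblings.length > 1) {
      part += `:nth-of-type(${siblings.indexOf(node) + 1})`;
    }
    parts.unshift(part);
    node = node.parentElement;
  }
  return parts.join(" > ");
}

document.addEventListener("click", async (ev) => {
  if (!(ev.altKey && ev.shiftKey)) return;  // Option + Shift on macOS
  ev.preventDefault();
  await navigator.clipboard.writeText(cssPath(ev.target as Element));
});
```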
Please don't.
I like text selection exactly how it is. I want precise controls.
It's fine for a touch interface like a phone, but on a computer I expect precision. As much as I can get.
while these examples might be easy fodder for criticism I do feel like this whole idea of talking to an LLM across multiple applications and anything your pointer is on will give it context is pretty powerful and cool idea.
I'm imagining a webpage with a link - instead of opening a new link to quickly google something or opening three new tabs based on hyperlinks, i can point at a paragraph or line and ask it to tell me about it.
Maybe I can point at a song on Spotify and have it find me the youtube video, or vice versa (of course this is assuming a tool like this wouldn't stay locked into one ecosystem.. which it will).
Point is that the concept of talking to the computer with mouse as pointer is pretty cool and i guess a step closer to that whole sci-fi "look at this part of the screen and do something"
Yeah. We just need 10x more compute. But constant AI analysis of everything is the ultimate direction.
I've been iterating on some 3D models for various wacky garage projects I have. It's fun. I've often wished I could click on an arbitrary place and say "add an eye bolt here" or somesuch.
Of course learning proper cad software is probably the right thing here, but having Claude write python scripts which generate HTML files which reference three.js to provide a 3d view has gotten me surprisingly far. If something could take my pointer click and reverse whatever coordinate transforms are between the source code and my screen such that the model sees my click in terms of the same coordinate system it's writing python in, well that would be pretty slick.
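That coordinate round trip is largely what three.js raycasting already provides. A minimal sketch, assuming the generated page keeps a standard camera, scene, and renderer around (the variable and function names here are mine, not from the comment):

```typescript
import * as THREE from "three";

// Map a pointer click back into the model's own coordinate system, so an
// "add an eye bolt here" instruction can carry real coordinates the
// generating script understands.
function pickModelPoint(
  ev: MouseEvent,
  camera: THREE.Camera,
  scene: THREE.Scene,
  renderer: THREE.WebGLRenderer
): { object: string; local: THREE.Vector3 } | null {
  const rect = renderer.domElement.getBoundingClientRect();
  // Normalized device coordinates (-1..1) for the click position.
  const ndc = new THREE.Vector2(
    ((ev.clientX - rect.left) / rect.width) * 2 - 1,
    -((ev.clientY - rect.top) / rect.height) * 2 + 1
  );
  const raycaster = new THREE.Raycaster();
  raycaster.setFromCamera(ndc, camera);
  const hit = raycaster.intersectObjects(scene.children, true)[0];
  if (!hit) return null;
  // World-space hit point, converted into the hit object's local frame --
  // the same frame the generated geometry code positions things in.
  const local = hit.object.worldToLocal(hit.point.clone());
  return { object: hit.object.name || "(unnamed)", local };
}
```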
Hah, I've been thinking the same thing. Recently I prototyped a 2D paint app to validate the idea using Chrome's Prompt API: https://arjo129.github.io/apps/voice_paint.html
Honestly, what would be epic is if this could be made to work with an XR headset. Imagine using the headset to capture the piece you are working on and just saying, "hey, can you drill some holes over here?"
Of course, it isn't a Google demo if you can't use it to book a table at a restaurant. (Shown at the bottom of the page.)
Reminds me of Put That There https://m.youtube.com/watch?v=RyBEUyEtxQo
Ah, yes - I was trying to remember the name.
Also featured in the Starfire vision video from 1992: https://youtu.be/jhe1DFY-SsQ?t=286
I'm pretty sure all these models have terms of service that make users assert they have permission to use the content they're feeding into them (clickwrap, infringement-is-the-user's-fault). This kind of integration makes a mockery of that.
Wiggle at CAPTCHAs, wiggle at Termux, wiggle at Emacs, wiggle at the Godot Editor, wiggle at my remote desktop.
(Not going to happen)
so will Google be monitoring whatever is on the screen continuously, or only when the user says the magic words (this, that, here, there)?
Indeed. "AI-enabled pointer" is misdirection. This isn't an AI-enabled pointer; it's sending screen to AI, which yes, includes pointer position. The AI doesn't live in the pointer. The AI lives, apparently, so thoroughly in the system that it can see and do anything, and the pointer is just a way of giving it context.
Google Recall. Hey, it's all about the marketing.
The next generation of OS should have constant video and audio recognition by an on-device LLM. This would provide valuable context for a lot of scenarios. So instead of the frequent copy-pasting we are used to, we could let agents access the context of our whole workflows across different apps.
But Google is a very ill-positioned candidate for such an OS. I would rather trust Apple and local-first, on-device models.
Next generation OS should absolutely -not- have always-on surveillance like you describe.
Reimagine the chat interface first. For example, let the user click where the LLM went off the rails.
A zigzag merge gesture is obviously a terrible idea until/unless everything is a touch screen. Did they even think about this stuff at all? Ergonomics and RSI aside, if a horizontal drag means add, why not just make a vertical drag mean merge? Not a fan of voice interaction generally, but it's something we'll all be grateful for as we get older. No need to accelerate it, though.
It's beautiful how the human mind can take something very obvious but overlooked and make it into this fantastic innovation. Fab stuff.
I've been doing something similar to this in a personal claude code frontend, though not particularly "magical".
I'm mostly using my system to make comments on long AI-generated documents (especially design documents). I find it works well to have the AI generate something, and then I read through it, making comments along the way.
You can get pretty far just repeating the things you see... "I'm reading [heading] and [comments]". But I do find some use in selecting content and saying "I don't agree with this" or whatever else.
The result is just an augmented message. It looks like:
<transcript>
Let's see what we've got here.
<selection doc="proposal.md" location="paragraph 3">
The system already...
</selection>
No, I don't like how this is approaching the problem, ...
</transcript>
Then I just send this as a user message. Claude Code (and I'm guessing any of the agentic systems) picks up on the markup very easily. It also helps to label it as a transcript, as it can then understand there may be errors, and that things like spelling and punctuation are inferred, not deliberate. (Some additional instruction is necessary to help it understand, for example, that it should look for homophones that might make more sense in context.)
It makes reviewing feel pretty relaxed and natural. I've played around with similar note-taking systems, which I think could be great for studying in school, but haven't had the focus on that particular problem to take it very far.
But I think the best thing really is giving the agent a richer understanding of what the user is experiencing and doing and just creating a rich representation of that. The keywords can be useful, but almost only as checkpoints: a keyword can identify the moment to take the transcript and package it up and deliver it.
One difference perhaps in design motivation: I have really embraced long latency interactions. I use ChatGPT with extended thinking by default, and just suck it up when the answer didn't really require thinking. I deliver 10 points of feedback at once instead of little by little. (Often halfway through I explicitly contradict myself, because I'm thinking out loud and my ideas are developing.) I just don't stress out about latency or feedback, and so low-latency but lower-intelligence interactions don't do it for me (such as ChatGPT's advanced voice mode, or probably Thinking Machine's work). I think this focus is in part a value statement: I'm trying to do higher quality work, not faster work.
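The message-assembly part of an approach like this can be very small. A hypothetical sketch of building the augmented transcript shown above (the type and function names are illustrative, not the actual frontend's code):

```typescript
// Interleave spoken transcript fragments and on-screen selections into one
// labeled user message, in the <transcript>/<selection> format shown above.
type ReviewEvent =
  | { kind: "speech"; text: string }
  | { kind: "selection"; doc: string; location: string; text: string };

function buildTranscriptMessage(events: ReviewEvent[]): string {
  const lines = events.map(ev =>
    ev.kind === "speech"
      ? ev.text
      : `<selection doc="${ev.doc}" location="${ev.location}">\n${ev.text}\n</selection>`
  );
  return `<transcript>\n${lines.join("\n")}\n</transcript>`;
}

// Example matching the message above:
const msg = buildTranscriptMessage([
  { kind: "speech", text: "Let's see what we've got here." },
  { kind: "selection", doc: "proposal.md", location: "paragraph 3", text: "The system already..." },
  { kind: "speech", text: "No, I don't like how this is approaching the problem, ..." },
]);
```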
this is pretty much built in with VSCode plugins.
you select text in VSCode and write a comment, and the LLM gets both
The image editing demo was fun... the model is not very well censored.
How about you give me my normal white cursor and an "AI enhanced" orange cursor only when I'm doing AI things. To use their words, that would be "intuitive AI that meets users across all the tools they use, without interrupting their flow"
The concept is good, but accuracy in a cluttered environment can be a concern, and misinterpreting context can be a problem.
Haha, April Fools! Good one.
Wait…it's May. Ugh, I'm so confused. :spiral eyes emoji:
This seems like one of those things that is usable infrequently enough to be forgotten/poorly developed/never used. (Even before accounting for the actual failure rate of the LLM, which will be non-zero.)
Perhaps a text box and file upload isn’t the perfect interface for every use case but it is versatile which is a huge barrier to overcome.
I spent quite some effort to _completely_ get rid of mouse usage in my computer workflows and I believe it paid off.
I tried it in AI Studio and it was extremely disappointing. It did not follow directions. I pointed at the door of the sand castle and said "make the door of the sand castle bigger"; it created another very big sand castle off to the side. Then I pointed at one flag of the castle and said "turn this flag into a blue flag". It turned two other random flags blue, but not the one I pointed at.
Interesting! I wonder how UI will evolve in the long-term? If there are browser-use/computer-use and clicky-clones automating pointer actions, do we really need complex UI anymore? If yes, when?
I've been playing with writing a visionOS app that allows an AI agent to be aware of what you're looking at at any given time.
At some point I fully expect eye tracking (or attention tracking) to be common enough to be a first-class input method.
No thanks
It only took Google and their AI offering to come up with Graffiti.
Don't build these things, instead build protocols and expose system level APIs for application developers to build things.
That example with the recipe is funny. Did they really need AI to copy two lines and then compute 2×1?
Google made a Microsoft Kinect.
I wonder what sort of monstrous power would be unleashed if Google used Plan9 as a foundation.
They'd half-finish it then bury it, like they did with Fuchsia which is heavily Plan-9-inspired.
Google needs to beat OpenAI and Anthropic in coding models, because that's where the big money is going. I love using the Gemini Pro model for quick questions, but that's not where I'm spending the real money.
They have so many great software engineers but seem unable to use them to speed up coding AI research. Hopefully with Sergey's focus it will get better.
This cursor thing is just another experiment nobody cares about.
Both of the text based demos would have been simpler and faster with traditional mouse and keyboard interactions. What is the AI adding?
What the AI adds is that you don't need to provide a button, menu item, or keyboard shortcut for each possible action, and the user doesn't have to know that the command is there or how to access it.
I prefer keyboard operation myself, but I can see how this could become useful in the future, for certain use cases.
What would be useful as well, then, is if you could bind such a repeatedly used AI command to a button, menu item, or keyboard shortcut in a way that it can still be used while pointing at "this" and "that".
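One hypothetical way such a binding could work in a browser context: the shortcut carries the saved natural-language command, and "this" is resolved at trigger time from whatever the pointer is over. All names here are assumptions, not part of the prototype:

```typescript
// Track what the pointer is currently over.
let pointerTarget: Element | null = null;
document.addEventListener("pointermove", (ev) => {
  pointerTarget = document.elementFromPoint(ev.clientX, ev.clientY);
});

const savedCommand = "make this more concise";  // an example of a reusable prompt

document.addEventListener("keydown", (ev) => {
  if (!(ev.ctrlKey && ev.shiftKey && ev.key.toLowerCase() === "m")) return;  // arbitrary binding
  if (!pointerTarget) return;
  // "this" is grounded in the element under the pointer at trigger time.
  const context = pointerTarget.textContent ?? "";
  void runAssistant(savedCommand, context);
});

// Stand-in for whatever actually talks to the model; hypothetical.
async function runAssistant(command: string, context: string): Promise<void> {
  console.log(`Would send to the model: "${command}" with context:\n${context}`);
}
```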
> What is the AI adding?
More $$$ for the PM who launched the product.
They're going to take your ability to do anything and spread it across many places so you have to run around to do things, same as all the moneyed technology.
Hype-flavored surveillance!
Yeah it's a gimmick
It tracks what's on the screen and sends it back to Alphabet. If you're watching a video about BBQ, enjoy a bunch of ads for Omaha Steaks and the Big Green Egg in your Gmail.
On a less serious note, the audience for this is people who want to optimize for what seems like the least amount of effort.
It feels like everything modern is like this. No value added, just the appearance of it.
do not want
Just seven hours ago there was a plea on HN [0] to please not do this. Seriously, what are they smoking at Google right now?
[0] https://news.ycombinator.com/item?id=48107027
It's like watching a demo from the old Xerox PARC, except everybody has only bad ideas. Like an opposite Xerox PARC.
It's like a demo from Xerox PARC in an alternate universe where everything is run by marketing department MBAs. Oh wait, that's the one we live in now.
Nice, cute, silly little feel-good demo so that we can all pretend like we’re all going to be making decisions and micro-managing AIs by pointing at things in 5 years. It’s going to be great! The future is bright!
I don't understand why we need to move from an explicit operation like, say, circling something, to a fuzzy one where you have to hope the machine understands what you're pointing at.
I also don't think people want to constantly talk to their computers.
Like a dream come true...
Nightmares are dreams as well and this is a nightmare like Windows Recall.
Technically wonderful though.
This is pretty neat
> We’ve been exploring new AI-powered capabilities to help the pointer not only understand what it’s pointing at, but also why it matters to the user.
We couldn't quite track you well enough before. So we're fixing that under the guise of "AI powered capabilities."
being able to make precise edits would be huge for AI
Really interesting! This change makes for a faster UX and easier use.
Maybe I'm misunderstanding, but what is new about the pointer itself? Seems to be functionally the same as selecting + tooltips / context menus.
Shush, how is anyone going to get promoted with that kind of talk!?
> but what is new about the pointer itself?
I'm hoping for a const-reference joke.
please leave the pointer alone. He's been with us so long without enshittification.
There's already a product that does this lol
Aaaaand now I can't remember the name of it
Thanks, I hate it.
If its offline I love it. Otherwise I hate it.
Thanks, I hate it
what the hell is going on at google
AI.