One thing I have found very odd about the current wave of AI tools is that there seems to be an unspoken element of giving up and admitting failure in other areas of computing.
Programming copilots are often sold on how they can automate drudgery and boilerplate, which implies we are incapable of or uninterested in designing programming languages, tools, and patterns which do not require boilerplate or drudgery.
Teaching models to use traditional GUI apps implies we have given up on or are not even bothering to create proper hooks for an automation system to utilise.
Something about it feels wrong to me, because it bakes existing inefficiencies into the system. Can we really not solve the inefficiencies instead of pouring unfathomable amounts of compute into working around them?
This is not a computer problem, it's a human one. It's not that we don't have APIs and hooks because they're so difficult to implement - we don't have them because software producers don't want or care for us to have them.
Enabling automation will never be zero effort and anything more than zero effort for something with such a low ROI is a no-go by default. But increasingly, automation is actually seen as a danger to their business models and companies sometimes even go out of their way to prevent it.
Looking at the screen the same way a user does is the only way to win.
Focusing on the GUI applications: There have been a few GUI automation solutions over the years - since the post's software is from MS, I'll take UI Automation as an example. Works well with Win32 controls, not sure how well it works with the XAML-based toolkits.
But not all software is written with those UI frameworks. Some use different widget frameworks, some immediate GUIs, others just render a webpage and either use HTML or fully render the controls themselves. And without everybody using the same standard, the only standard we have for parsing their output is the pixels they render to.
Computer-based agents have no limits; that’s the advantage. Sure, a proper automation hook is better if it’s available, but a lot of the time it isn’t, either due to lack of resources or monopolistic behavior.
To a considerable extent, we are stuck in the world we live in; but I am reminded of a quote by Guillaume Allais:
> My entire job seems to be repeating variations of "never start by forgetting the user's stated intent only to then attempt to guess it".
This is awesome, can't wait for evals against Claude Computer Use!
Can we first test this with basic sysadmin work in a simple shell?
Can't wait to replace "apt-get install" with "gpt-get install" and then have it solve all the dependency errors by itself.
This has been possible for a year already. My project gptme does it just fine (like many other tools), especially now with Claude 3.5.
I know that it exists. I was just hoping we can make such interactions (practically) bug-free before we move on to the next big thing.
Threat actors can't wait for you to start doing this either.
How can you write metrics against something that's non-deterministic?
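One common answer, sketched here as an assumption rather than anything the paper prescribes: treat each run as a pass/fail trial, repeat the task many times, and report the success rate with a confidence interval instead of a single number. The `wilson_interval` helper and the 8-of-10 figure below are purely illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical numbers: the agent completed the task in 8 of 10 runs.
low, high = wilson_interval(8, 10)
print(f"success rate 0.80, 95% CI [{low:.2f}, {high:.2f}]")
```

With only ten runs the interval is wide, which is the point: a non-deterministic agent needs enough repeated trials before two systems can be compared meaningfully.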
Can it detect ads and mask them out?
I’m reminded of Permutation City, where your personal AI intercepts ads sent to you, but ad companies of course have their own AI for tricking your AI, so of course you have a countermeasure AI to intercept that, and so on and so forth.
If these sorts of tools kill the ad business, it would be so incredibly cool, and justify Nvidia’s half-of-the-economy-or-whatever market cap.
Let's hope so! But now that I'm thinking about it more: nvidia might go into the advertisement business themselves :(
Only like 6%
I have a little bit of a vice of enjoying some "idle" games. I have intended to do some very basic manual screen carving & ocr & computer vision to try to "read" my state in these games, & have multi-actor "play" models for them, just for fun really & to decrease time sunk gaming (by spending significant time coding/learning).
This certainly seems like it has a lot of promise to make that much, much, much easier. Game UIs are less uniform, so this might be harder or not easily applicable, but hopefully.
As someone who has done this to many games over a few decades, I can definitively say: 100% of the time, it ruins the fun of the game.
I can't say exactly why. Maybe you feel like you haven't earned it. Maybe it's the idle nature of farming that we really enjoy...
Depends what you consider fun, and how far you take it. Some people enjoy programming more than repetitive clicking in a GUI. For a clicker game, writing a bot lets you iterate on strategies more easily - is it faster to get to level 2 if I buy the upgrade for A or B first? For Trackmania, it lets you get a world record and a YouTube video with 14M views.
https://youtu.be/Dw3BZ6O_8LY
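To make the "A or B first?" question concrete, here is a toy simulation. Everything in it is hypothetical: the upgrade costs, the income gains, the target, and the rule that the bot buys the next queued upgrade as soon as it can afford it. A real game would need its own numbers, read from the screen or from memory.

```python
def ticks_to_target(order, upgrades, target, base_rate=1.0, max_ticks=100_000):
    """Count ticks until banked coins reach `target`, buying upgrades in `order`.

    upgrades maps a name to (cost, income_gain). Each tick adds the current
    income, then buys the next queued upgrade if it is affordable.
    """
    coins, rate = 0.0, base_rate
    queue = list(order)
    for tick in range(1, max_ticks + 1):
        coins += rate
        if coins >= target:
            return tick
        if queue and coins >= upgrades[queue[0]][0]:
            cost, gain = upgrades[queue.pop(0)]
            coins -= cost
            rate += gain
    raise RuntimeError("target not reached within max_ticks")


# Made-up upgrades: A is cheap and weak, B is expensive and strong.
UPGRADES = {"A": (10, 1.0), "B": (30, 5.0)}

a_first = ticks_to_target(["A", "B"], UPGRADES, target=200)
b_first = ticks_to_target(["B", "A"], UPGRADES, target=200)
print(f"A first: {a_first} ticks, B first: {b_first} ticks")
```

With these particular numbers, buying the cheap upgrade first wins, but the answer flips easily as costs change, which is exactly why scripting the comparison beats clicking through both orders by hand.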
Yeah. I appreciate the warning & enjoy the personal tale, but it's just that guy's story & it's being projected as an absolute.
If I don't enjoy the experience anymore that's fine with me too. I think I'd still feel a sense of accomplishment, feel like I'd advanced as a human and mastered my environment and machines for diving in here.
I don't feel the agency I want to have. These games make me want to extend myself, my agency. Playing them manually offers some very low-grade enjoyment, but that sense of missing out gnaws at me, and I'm not at all dissuaded by parent trying to ward me off. If I do end up winning so hard I don't care anymore, me right now would regard that as a victory condition & relief from this pressure I feel about ineffectively plodding through as I do now.
You might want to look at Serpent AI: granted the repo is now in an archived state, but it did similar things to those you mention.
https://github.com/SerpentAI/SerpentAI
Since this is a research paper with promising ideas but non-functional code, what are people using as the best-in-class agents for computer automation? For example:
1. Claude for computer use
2. Various startup offerings—if you have recommendations, please list them
3. Established tools like Playwright, Selenium, and WebDriver, combined with screenshots and LLM-based guidance
What tools or approaches are actually working for building useful automation solutions?
Are you sure about the non-working code point?
I've yet to try it but my understanding is the repo here has got working code along with installation instructions:
https://github.com/microsoft/OmniParser
I confirm it works: I got the gradio demo working locally and it's pretty reasonable.
Slight rough edges (to be expected) and you do need to read the README with attention but it's all par for the course. I had to install einops which wasn't in the requirements.txt and even though I had downloaded the HF models they released, it still needed to pull in another model when I first ran the demo.
Thanks for the tip, will try again.
Our agent is available via NPM: http://testdriver.ai
Computer Use, Agent.exe and so on, but nothing actually is useful yet. It's all very terrible. And then to think we had perfection already (and Claude is good at it); emacs... No need for any of this; everything can be scripted.
Does it also tell the coordinates (x,y) of the annotated box w.r.t. the screenshot dimensions?
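Whatever format the parser returns, mapping a box onto the screenshot is simple arithmetic. A minimal sketch, assuming (this is an assumption, not the repo's documented output format) that each box arrives as normalized `(x1, y1, x2, y2)` ratios in the range 0 to 1; `box_to_pixels` and `click_point` are hypothetical helper names:

```python
def box_to_pixels(box, width, height):
    """Scale a normalized (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * width),
        round(y1 * height),
        round(x2 * width),
        round(y2 * height),
    )


def click_point(box, width, height):
    """Return the pixel at the center of the box, e.g. as a click target."""
    left, top, right, bottom = box_to_pixels(box, width, height)
    return ((left + right) // 2, (top + bottom) // 2)


# Hypothetical box covering the lower-middle of a 1920x1080 screenshot.
print(box_to_pixels((0.25, 0.5, 0.75, 1.0), 1920, 1080))
print(click_point((0.25, 0.5, 0.75, 1.0), 1920, 1080))
```

If the boxes turn out to be in pixels already, or in `(x, y, w, h)` form, only the unpacking line needs to change.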
Has anyone gotten this to work?
Cloning the repo and downloading the models through HuggingFace or manually does not seem to work; you get errors indicating missing files.
I tried as well. Seems like it is a proprietary model.
:( I was so, so excited to try this when I found it yesterday, it has like 3 star emojis in my list of models. I’ll post here if I get it working tomorrow, I guess. I doubt they’d release a model on HF without intending to make it usable.
EDIT: surely it’s just broken, the repo does include .safetensors weights. Maybe the problem is the “suspicious”-flagged PyTorch weight for “icon detection”, whatever that means?
FOSS-washing?
This was literally just published; IMHO it's a little premature to be accusing them of that at such an early stage.
More likely they just slipped up with getting everything uploaded properly - it's easily done, and luckily easily corrected, so we'll likely see issues get resolved fairly swiftly.
See my more detailed comments above but I confirm this is working.
Looks like a few tweaks were made to the GitHub repo ~13 hours ago, which may explain the issues others had earlier and why it's now fine for me.