> but the ultimate goal is to crowdsource a high quality set of browser sessions to train an open source local model.
Could you say more on this? I see that it's an open-source implementation of PLAN with Selenium and Claude's Cursor, but where will the "successes" of browser sessions be stored? Also, will it include an anonymization feature to remove PII from authenticated use cases?
The next step will be adding functionality to convert and save a BrowserStep[] into a portable file format, plus additional conversion functions to turn those files into .jsonl that can be fed into the transformers library, etc. For the PII piece, there are no current plans to introduce anonymization features, but we're open to suggestions.
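As a rough illustration of that conversion step, here is a minimal sketch of serializing recorded steps to .jsonl. The field names (`screenshot`, `action`, `coordinate`, `reasoning`) are assumptions for illustration, not Cerebellum's actual BrowserStep schema:

```python
import json

def steps_to_jsonl(steps, path):
    """Serialize a list of browser-step dicts into a .jsonl file,
    one JSON object per line, suitable for dataset loaders.

    NOTE: the record fields below are hypothetical, not the real
    BrowserStep[] shape.
    """
    with open(path, "w", encoding="utf-8") as f:
        for step in steps:
            record = {
                "screenshot": step.get("screenshot"),  # e.g. base64 PNG
                "action": step.get("action"),          # e.g. "click"
                "coordinate": step.get("coordinate"),  # e.g. [x, y]
                "reasoning": step.get("reasoning"),
            }
            f.write(json.dumps(record) + "\n")
```

One object per line is the key property: it lets `datasets`/`transformers` stream the file without loading it all into memory.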
Can this work with local models ?
Not at the moment, since you need a local model with strong segmentation capabilities (x, y) and none exist ATM. We hope to train one in the future, and one of Cerebellum's roadmap items is to add the ability to save your sessions as a training dataset.
Any idea how Sonnet does this? Is the image annotated with bounding boxes on text boxes etc., along with their coordinates, before being sent to Sonnet, and does it respond with a box name or a coordinate? Is SAM2 used to segment everything before it's sent to Sonnet?
They don't discuss this at all on their blog other than "Training Claude to count pixels accurately was critical." My speculation on how they accomplished it is either explicit tokenizer support with spatial encoding, similar to how single-digit tokenization improves math abilities, or extensive pretraining like Molmo.
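To make the single-digit-tokenization analogy concrete, here is a toy tokenizer (purely speculative, not how Anthropic's tokenizer actually works) that splits every digit into its own token so coordinates like (412, 87) are seen digit by digit:

```python
import re

def tokenize_with_single_digits(text):
    """Toy tokenizer: each digit becomes its own token, words stay whole.

    Speculative illustration of the idea that per-digit tokens give a
    model a consistent positional view of numbers; not a real tokenizer.
    """
    return re.findall(r"\d|[A-Za-z]+|[^\sA-Za-z\d]", text)

tokenize_with_single_digits("click at (412, 87)")
# ['click', 'at', '(', '4', '1', '2', ',', '8', '7', ')']
```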
Do you not think it could work with a shim layer that handled the browser interaction via code and selenium?
Selenium works on WebDriver v4, and the screenshot is transferred as an image over the WebDriver protocol. Perhaps modifying the DOM before triggering the screenshot and then reverting the changes could work. PRs are welcome!
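A minimal sketch of that modify-then-revert idea, using Selenium's Python bindings. The `driver` argument is assumed to be a Selenium 4 WebDriver instance; the link selector and red outline are illustrative choices, not anything Cerebellum does:

```python
# Assumes `driver` is a selenium.webdriver instance (Selenium 4).
# The selector ('a') and outline style are illustrative assumptions.
HIGHLIGHT_JS = (
    "document.querySelectorAll('a')"
    ".forEach(el => el.style.outline = '2px solid red');"
)
REVERT_JS = (
    "document.querySelectorAll('a')"
    ".forEach(el => el.style.outline = '');"
)

def annotated_screenshot(driver):
    """Temporarily annotate the DOM, capture a screenshot over the
    WebDriver protocol, then revert so the page is left untouched."""
    driver.execute_script(HIGHLIGHT_JS)
    try:
        return driver.get_screenshot_as_png()
    finally:
        # Revert even if the screenshot call raises.
        driver.execute_script(REVERT_JS)
```

The `try/finally` matters: if the screenshot fails, the DOM changes are still rolled back, so the session the model sees afterward is unmodified.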
OP here, happy to answer any questions you may have!
What do you think about this tool changing the landscape of software testing?
I think it could change the roles of SDETs and other quality-assurance jobs dominated by Selenium and Playwright. I mean, think about it: it could halve the number of testers needed to do the same work.
I think if you added additional function calls to detect visual bugs or breaking flows, tools such as this could automate much of QA in addition to detecting non-intuitive UI design patterns.
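Those "additional function calls" could be exposed to the model as a tool definition. Here is a hypothetical sketch in the JSON-schema style Anthropic's tool-use API accepts; the tool name and every field are assumptions for illustration, not part of Cerebellum's API:

```python
# Hypothetical tool definition (name and fields are assumptions):
# a function-call the model could use to flag visual bugs during a session.
report_visual_bug_tool = {
    "name": "report_visual_bug",
    "description": "Flag a visual defect or broken flow in the screenshot.",
    "input_schema": {
        "type": "object",
        "properties": {
            "severity": {"type": "string",
                         "enum": ["low", "medium", "high"]},
            "coordinate": {"type": "array",
                           "items": {"type": "integer"},
                           "description": "Approximate [x, y] of the defect."},
            "description": {"type": "string"},
        },
        "required": ["severity", "description"],
    },
}
```

A QA harness would collect these tool calls alongside the normal navigation steps, turning each browsing session into a bug report as a side effect.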
Any plans for a Python version?
Update: We had a contributor start a Python port, stay tuned!
It's on the roadmap! A few other priorities are higher at the moment, but we'd be excited to see a PR for it in the meantime.
Thanks for using Selenium!
You don't need an LLM.
Build an interface to build a knowledge graph.
Nodes contain words: verbs are actions, nouns are past verbs. Action is movement in space.
Very cool!