And the other thing is tacit or tribal knowledge. An AI system is good when data is structured and available, not so much when data is scattered and the connect-the-dots knowledge lives mostly in a dev's or tester's head.
My recipe: memory + context, combined with a seamless UI to capture the dev/tester mindset, will make any AI system customizable. It doesn't have to be a pure LLM system; it can be 90% RAG or some kind of graph tagging and 10% LLM usage. That creates a moat that is easily defensible; otherwise a new LLM upgrade wipes out whatever moat you might have.
I think Claude Code can write very good end-to-end tests given the right constructs.
I have been building a desktop app (Electron-based) that interacts with Anthropic's AgentSDK and the local file system.
It's 100% spec-driven and Claude Code has written every line. I do large features instead of small ones (the spec in each issue is around 300 lines of markdown).
I have had it generate Playwright tests from the start. It was doing okay, but one change made it amazing: I created a spec-driven pull request to use data-testid attributes for selectors.
Every new feature adds tests and verifies it hasn't broken existing features.
I don't even bother with unit tests. It's working amazingly well.
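The data-testid convention above can be sketched as follows. This is a minimal illustration, not the commenter's actual code: the test IDs and flow are hypothetical, but `page.getByTestId()` is Playwright's real locator for the default `data-testid` attribute.

```typescript
// Helper that builds the raw CSS selector Playwright's getByTestId(id)
// resolves to under the default testIdAttribute ("data-testid").
// Using stable test IDs instead of text or CSS classes keeps AI-generated
// tests from breaking on cosmetic UI changes.
function byTestId(id: string): string {
  return `[data-testid="${id}"]`;
}

// Illustrative Playwright usage (element names are hypothetical):
//
//   await page.getByTestId('new-note').click();
//   await page.getByTestId('note-title').fill('Groceries');
//   await expect(page.getByTestId('note-list')).toContainText('Groceries');

console.log(byTestId('note-list'));
```

The point of the convention is that the selector contract lives in the markup (`<button data-testid="new-note">`), so the spec can simply enumerate the IDs a feature must expose and the generated tests stay stable.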
I tried Claude Code, and it did write some good-quality e2e tests, but my biggest worry was full coverage. It's really difficult to quantify e2e test coverage the way developers quantify unit test coverage; arguably it's impossible. Specs are just one artifact, just as code is just one of the many artifacts that full system-wide e2e coverage needs. Adding production logs and production incidents, which I also tried, gave me some sense of full e2e coverage.

If you are using Claude Code for both dev and testing, it's like having your cake and eating it too: if Claude misrepresents or misinterprets a requirement for whatever reason, that error percolates into both the code and the tests. A third-party testing tool is more appropriate, with all the data flowing into it (specs, legacy tests, prod incidents, code); then perhaps we can expect full, unbiased test coverage.

I am not talking about wannabe enterprise apps or hobby apps. I am talking about post-v0 enterprise apps with real customers, real downside if they go down, a rich data set of past incidents, and not-so-perfect code, which are now increasingly using agentic AI to produce more non-human code. They need a third-party tool that ingests their data, builds a knowledge-graph (KG) understanding of it, and prevents critical bugs from leaking into production by generating a small number of high-quality, high-coverage tests.
Interesting approach. I have noticed the same issue — AI tools generate a lot of code and unit tests, but real user-flow or edge-case testing often gets skipped. Having something that reads the PR context and suggests missing scenarios could actually catch problems earlier.
I agree, but I want to add that specs alone might not give you full testing coverage; you have to add other artifacts too, like prod logs and incidents, and use some layer of ontology + KG to produce meaningful data connections and understanding. A vector DB alone will only give you semantic search and is grossly incompetent at connecting data artifacts. For example, to a vector DB the word "apple" and the company Apple might look the same without the surrounding context.
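The "apple" ambiguity above is exactly what typed graph nodes avoid. A minimal sketch, with entirely hypothetical data, of why a KG with entity types and relations can disambiguate where raw embedding similarity cannot:

```typescript
// Tiny in-memory knowledge graph: nodes carry a type, edges carry a relation.
// Embedding similarity would score "apple" (fruit) and "Apple" (company) as
// near-identical strings; the graph keeps them as distinct typed entities.
type Node = { id: string; label: string; type: string };
type Edge = { from: string; rel: string; to: string };

const nodes: Node[] = [
  { id: 'fruit:apple', label: 'apple', type: 'Food' },
  { id: 'org:apple', label: 'Apple', type: 'Company' },
  { id: 'incident:42', label: 'checkout outage', type: 'Incident' },
];

const edges: Edge[] = [
  // An incident links to the company, never to the fruit.
  { from: 'incident:42', rel: 'reportedBy', to: 'org:apple' },
];

// Resolve a mention by label AND required type, not string match alone.
function resolve(label: string, type: string): Node | undefined {
  return nodes.find(
    (n) => n.label.toLowerCase() === label.toLowerCase() && n.type === type,
  );
}

console.log(resolve('apple', 'Company')?.id); // the company node
console.log(resolve('apple', 'Food')?.id);    // the fruit node
```

The same pattern scales to the artifacts discussed here: spec nodes, incident nodes, and code nodes connected by typed edges, so a test generator can traverse "incident → affected flow → spec" instead of relying on fuzzy text similarity.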
I always translate specs into Gherkin BDD scenarios and drive the tests off that; without this linkage, test coverage diverges from user flows.
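As an illustration of that linkage (the feature and steps are hypothetical, not from the commenter's project), a spec line like "users can archive a note" might become:

```gherkin
Feature: Note archiving
  Scenario: User archives a note from the list
    Given a signed-in user with one note titled "Groceries"
    When the user archives the note "Groceries"
    Then "Groceries" no longer appears in the active list
    And "Groceries" appears in the archived list
```

Each `Then`/`And` step then maps one-to-one onto an e2e assertion, which is what keeps coverage anchored to user flows rather than to implementation details.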
Interesting, man!
Let's connect if you'd like to hear some lessons learned.