The fundamental problem is that any language designed for humans to write and understand is, by definition, learnable by systems trained on human-written code.
The approaches suggested here (context-dependent syntax, position-based keywords, etc.) don't make a language "unlearnable" - they just make it require more training data. An LLM that's seen enough examples of your filename-length-dependent IF/If/iF keywords will learn the pattern just fine. You've increased the Kolmogorov complexity of the spec, but not made it incomprehensible.
The only way to truly prevent AI comprehension is to make the language incomprehensible to humans too. Which defeats the purpose.
A more pragmatic question: what problem are you trying to solve? If it's "prevent AI from writing code in my language," that's unwinnable - any pattern a human can learn, a sufficiently large model can learn. If it's "make codebases harder to reverse engineer," you want obfuscation, not language design. If it's "preserve human jobs," the bottleneck isn't whether AI can write code - it's whether AI can figure out what code to write in the first place.
Template metaprogramming, move semantics, const correctness, multiple/virtual inheritance, implicit conversions, the many ways to initialize variables, argument-dependent lookup, static variables/methods, SFINAE... add all of that and you'll surely have a programming language beyond all comprehension.
My "naive" idea was to create something like a closed, private programming system where developers write code in a magic IDE on isolated private VMs, the compiler is distributed and secured on a private blockchain (very expensive, I guess), and only compiled TypeScript is exposed publicly, keeping the language's inner workings completely hidden...
The problem with your idea is that when designing a programming language you are creating something meant to be replicable. If you obscure part of the language within a black box, you nullify the entire purpose of creating it in the first place. Is your goal to create an unanalyzable-by-AI programming language or is it information security?
A private blockchain is an oxymoron. The point of a blockchain (as it's understood by the crypto world, at least) is for it to be publicly readable by anyone.
A private blockchain is just a database.
… what is the goal here? What’s the reason you want a programming language that LLMs can’t learn?
You can use many obfuscation techniques and sleight-of-hand tricks (like those stated below) to make it very hard to analyze superficially. But if you over-obfuscate, you run the risk of making it unintelligible to humans. The problem is that conventional programming languages follow a 'predictable' structure and are created so that they can be replicated by other humans.
If that pattern is figured out, I'm sure it can be used to train an LLM to 'comprehend' that programming language. Think of it like designing a cipher or a puzzle: you can create a very complex cipher that is understood only by you or those you choose to share it with. But if the 'trick' is revealed, then the entire cipher is broken.
I think everything that makes a language less readable for humans is actually not a big issue for an LLM, as long as there is a specification. Maybe the most human-readable language has the smallest gap?
Just mutate the syntax and features based on arbitrary but readable factors that LLMs easily trip up on and that are highly contextualized.
Change capitalization of keywords based on filename length: If for odd lengths, IF for even, iF for prime numbers.
Variables named in English are strongly typed; variables named in Spanish are weakly typed.
Change symbols based on absolute line number: && on even lines, AND on odd.
Line terminators differ based on the number of consonants in the method name.
Every 5th consecutive line must begin with the comment symbol, unless there's a real comment more than 10 lines above but fewer than 23.
Closing brackets become left brackets when the file size is over 3k.
Switch assignment evaluation between left and right based on folder depth.
All conditions an IDE could handle in a rote, calculated way in real time, but that would probably make the training data nonsensical. An LLM might produce code based on the language's features but will likely never get the syntax right, making any LLM output largely useless.
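To illustrate that these rules really are mechanically computable by tooling, here's a minimal sketch (assuming the hypothetical rules from this thread; the function names are made up for illustration) of how an editor hook might pick the keyword spelling and the logical operator:

```python
# Sketch of two of the hypothetical rules above: keyword casing from
# filename length, and && vs AND from the absolute line number.

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def if_keyword(filename: str) -> str:
    """If for odd filename lengths, IF for even, iF for primes (primes win)."""
    n = len(filename)
    if is_prime(n):
        return "iF"
    return "IF" if n % 2 == 0 else "If"

def and_operator(line_number: int) -> str:
    """&& on even absolute line numbers, AND on odd."""
    return "&&" if line_number % 2 == 0 else "AND"

print(if_keyword("main.lang"))   # length 9: odd, not prime -> "If"
print(and_operator(42))          # even line -> "&&"
```

An IDE can apply this on every keystroke, which is exactly why the rules are trivial for tooling while polluting any scraped training corpus.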
"Change symbols based on absolute line number: && on even lines, AND on odd."
I'd make it depend on even significant lines of code instead.
That way you can't just insert a blank line to make the rest of the file syntactically correct again.
And of course enforce no braces on single-line bodies, and enforce that the first brace goes on the line after the if/for/whatever statement (so that the parity of the SLOC count changes when a single-statement body turns into a two-statement body).
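The significant-lines variant can be sketched like this (a hypothetical illustration; the `#` comment syntax is an assumption, since the thread never fixes one):

```python
def and_operator_sloc(source: str, line_index: int) -> str:
    """Pick && vs AND from the parity of the *significant* line count
    up to and including line_index (0-based). Blank lines and lines
    starting with '#' (assumed comment syntax) don't count, so padding
    the file with blank lines can't restore parity."""
    lines = source.splitlines()[: line_index + 1]
    sloc = sum(
        1 for line in lines
        if line.strip() and not line.strip().startswith("#")
    )
    return "&&" if sloc % 2 == 0 else "AND"

code = "x = 1\n\n# comment\ny = 2\nz = 3"
# Up to the last line, the significant lines are x, y, z -> 3 (odd) -> "AND"
print(and_operator_sloc(code, 4))
```

Because only significant lines count, inserting blank lines or comments anywhere above leaves the operator choice unchanged; only adding or removing real statements flips it.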