I like HOCON, although it's a bit obscure. It's a JSON syntax superset with the same data model, designed for human written config.
It doesn't have schema support; however, you arguably don't need it, because the software that reads the config specifies the type of each key when the value is read, and casting takes place then. If the software expects a string, it reads the value as a string; if it expects a number, it's parsed as a number, and so on.
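A rough sketch of that read-time typing idea in Python. This is not HOCON's API: `TypedConfig`, `get_string` and `get_int` are hypothetical names, the sample values are made up, and the raw values are simply kept as text until the caller asks for a type.

```python
class TypedConfig:
    """Keeps every raw value as text; the caller picks the type when reading."""

    def __init__(self, raw):
        # raw: dict mapping keys to the text exactly as written in the config
        self._raw = dict(raw)

    def get_string(self, key):
        # No guessing: whatever was written is returned as a string.
        return self._raw[key]

    def get_int(self, key):
        # Conversion happens here and fails loudly if the text isn't an integer.
        return int(self._raw[key])


# Hypothetical config values, including a hash that looks like a float.
config = TypedConfig({"gameServerVersion": "1234567e89", "maxPlayers": "64"})
version = config.get_string("gameServerVersion")
players = config.get_int("maxPlayers")
```

Since the reader asks for a string, the hash never gets a chance to be interpreted as a number.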
HOCON has hierarchical merging, include files, a more convenient syntax, ability to read environment variables, substitutions, comments and a few other convenience features. In Conveyor, a tool for packaging desktop apps I wrote that uses it, it's also extended so you have "hash-bang includes". Those are includes that specify a program to run instead of a file, the output is then included and parsed at that point. This lets you escape from declarative config to a fully dynamic computation if you need to. You can disable this feature with a command line flag if you don't trust the config you're parsing (and also env var substitution).
You can also render the whole config to regular JSON if you need to.
I find that set of features to nicely balance config complexity with read/writeability. The main issue is that the main library is not well maintained, and the best implementation is for the JVM. You could give it a C API these days with Native Image but nobody has.
Main downside vs yaml is that IDEs and editors can use YAML schemas to give auto-completion whereas they don't do that for HOCON.
Does the HOCON parser complain and refuse to proceed if it encounters a comma right before a closing curly brace like the JSON parsers with which I've interacted?
No no. The syntax is designed for usability, trailing commas are fine.
There's a little tutorial here, with a slider that shows how you can start with json and transform it. The right hand side is valid HOCON at every step:
It can be hard to find an implementation of this if you're working in a not-so-popular language (looking at you Swift) but JSONC has had the best developer experience of anything I've tried so far.
It's basically JSON with single and multi line comments and trailing commas.
Better alternatives to YAML are JSON (particularly JSON with comments), properties files and XML files.
Basically any mainstream configuration syntax.
The really big problem with XML wasn't actually the verbosity of XML itself but the fact that it was popular in a time before Rails popularized "convention over configuration".
YAML is alright when you create it by hand. Problems start when YAML is templated, and frankly it's just dumb, though common. Templating the source without proper escaping could bite even with JSON.
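To illustrate that last point with JSON, here is a small Python sketch using only the standard `json` module. The value is a made-up short hash (digits plus one "e"), not one from the article.

```python
import json
import math

value = "12345e6789"  # hypothetical short hash: digits plus a single "e"

# Naive templating: the value is pasted in without quoting or escaping.
naive_doc = '{"version": %s}' % value
naive = json.loads(naive_doc)["version"]

# Letting the serializer build the document keeps the string a string.
safe_doc = json.dumps({"version": value})
safe = json.loads(safe_doc)["version"]

print(type(naive).__name__, math.isinf(naive))
print(type(safe).__name__, safe)
```

The templated document is still valid JSON, so nothing complains; the value just comes back as a float instead of a string.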
My last big enterprise job, we did a lot of YAML templating and of course ran into issues like this all the time, eventually though we solved it by requiring defined schemas for all configs and validating those schemas in our pipeline. More overhead but that validation also caught lots of issues aside from the YAML gotchas, it was a decent setup to work with in the end.
As soon as I saw the YAML one-line snippet I knew it was a quoting issue.
My first thought was that git hash started with a zero, and the parser stripped it off because it was parsing it as a number and not a string.
But then I saw "infinity" and was like "wait is the git hash all numbers except there's the letter 'e' in there somewhere"?
Ugh.
Always quote your YAML strings!
I just wrote a configuration format writer for an app I maintain. The C libyaml library (ugh, yes, I have to write this in C) lets you specify the quoting style it will emit for strings... always choose some form of quoting.... always.
Also it seems kinda short-sighted for this company to use the short git hash. The short hash is nice for display purposes, but there's always the possibility of a collision (git is smart and will print out a revision that has no current collisions within your repo, but that doesn't mean it won't happen in the future). But for a configuration setting like that, there's really no reason not to use the full hash.
We had a logfile parser that tried to parse a value and fell back to float().
But if you passed it a hex value that happened to contain only decimal digits and an 'e', then Python interpreted it as a number in exponent notation, which happened to be larger than what it could natively represent, so it returned `Infinity`!
Obviously the bug was on our end with the naive `float()`, which after deliberation should not be used in this generic parser case, but it was interesting tracking down this bug from the DB values all the way up to the parser.
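For anyone who wants to see it, a minimal Python sketch of that failure mode. `parse_naive` and `parse_careful` are hypothetical names, not the actual parser, and the sample value is made up.

```python
import math


def parse_naive(text):
    """Roughly the failure mode: try int, then fall back to float()."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_careful(text):
    """Accept a float only when it is finite; otherwise keep the text."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        number = float(text)
    except ValueError:
        return text
    return number if math.isfinite(number) else text


sample = "123e4567"  # hypothetical hex value: decimal digits plus one "e"
naive_result = parse_naive(sample)
careful_result = parse_careful(sample)
```

Note that the careful variant only covers the overflow case. A hex value like `123e4` would still be read as a finite number, so the real fix is knowing the expected type of the field.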
Every time I read one of these types of stories, I keep telling myself that there needs to be a stricter subset of YAML with far fewer features and things like required string quoting. I feel like it has to exist, but I don't know where it'd be found.
Since such a subset would be valid YAML, it sounds like requiring string quoting would be a straightforward option to add to any YAML parser as a feature, without requiring much of a spec at all.
Extra fun: there's now JSONC which does support trailing commas and comments, and projects will use it with a .json extension still because Microsoft would hate for anything to ever be clear and make sense.
JSON but with the structure of YAML, would be nearly perfect to me. JSON was not designed for humans to read and write directly, it happens to be easy enough that it's quite common but it's not exactly a nice experience. Take that base and remove the need to quote key values, enforce indentation for clear structure and to remove the need for braces and commas (the no trailing commas rule in JSON kills me too), and support the markdown-like list syntax alternative and I think really that's all most people want out of YAML anyways. I don't understand or ever want to work with their references or variables or whatever the hell they add, any of that is much better off in code than in config as far as I'm concerned.
However, this could've been caught by a validator. Whatever is loading this config file should know that `gameServerVersion` is a string and when it got a number, it should've thrown an error. It could also further validate that it gets a hex string lest something feed it "Infinity".
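A minimal sketch of such a check in Python. The function name and the accepted length range (7 to 40 lowercase hex characters) are assumptions for illustration, not something the article specifies.

```python
import re

# Assumption: a version is a lowercase hex string between a short and a full hash.
HEX_HASH_RE = re.compile(r"[0-9a-f]{7,40}")


def validate_game_server_version(value):
    """Reject anything that is not a hex string of plausible hash length."""
    if not isinstance(value, str):
        raise TypeError(
            f"gameServerVersion must be a string, got {type(value).__name__}: {value!r}"
        )
    if not HEX_HASH_RE.fullmatch(value):
        raise ValueError(f"gameServerVersion is not a hex hash: {value!r}")
    return value
```

With this in the loader, a float raises a `TypeError` and the literal string "Infinity" raises a `ValueError`, so the problem shows up at load time rather than in a query later.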
Not technically, but practically yes. The chance of having a single letter "e" and everything else a number would be very rare for a full SHA1 hash. I'm not interested in doing the calculations but if anyone wants to do it, it sounds like a cool math problem to solve
While I haven't done the math, you're right that it would need to be an exponential of some kind (assuming this is JavaScript-based).
In JavaScript, Number.MAX_VALUE (the biggest number that can be represented without returning Infinity) is a 309-digit number, and Git SHA-1 hashes are only 40 characters long.
However, this does mean that the chance will be even smaller than you say. Assuming that all other conditions are met (sole "e", every other character is a digit)
* if the sole "e" is in any place from the 2nd to the 36th, the resulting number is almost guaranteed to be returned as Infinity, unless the number happens to begin with "0e". (Leading zeroes, either in the significand or the exponent, would be required for this to not be a guarantee. Any number beginning "0e", on the other hand, will evaluate to 0 and not become Infinity.)
* if it's in the 37th place, the resulting number (assuming no leading zeroes) has around a 27% chance to be representable without being returned as Infinity.
* if it's in the 38th or 39th place, the resulting number is guaranteed to be small enough that it won't be turned into Infinity.
* if it's in the 1st or 40th place, it probably won't be parsed as a number, so it will not cause this error.
I'm also not interested in doing the full calculations, and in the end this isn't going to make a significant difference to the chances IMO (except in the "0e" case mentioned above, which seems like it'd make a noticeable difference), but it's interesting to think about the edge cases.
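The cases above can be poked at with a few lines of Python. Python's `float` is the same IEEE 754 double as a JavaScript Number, though the two parsers differ in details, so treat this as an approximation. `parses_to_infinity` and `with_e_at` are hypothetical helpers.

```python
import math


def parses_to_infinity(candidate):
    """True if float() reads the text as a number too large for a double."""
    try:
        return math.isinf(float(candidate))
    except ValueError:
        return False  # not parsed as a number at all


def with_e_at(position, exponent_digit="9"):
    """Build a 40-character all-digit string with one 'e' at a 1-based position."""
    significand = "1" * (position - 1)
    exponent = exponent_digit * (40 - position)
    return significand + "e" + exponent


for position in (1, 2, 37, 38, 39, 40):
    print(position, parses_to_infinity(with_e_at(position)))
```

Changing `exponent_digit` or swapping in a leading "0e" shows the exceptions mentioned in the list.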
I've run into this exact YAML problem when trying to use the git hash as a label in a templated kubernetes config file. Same solution — wrap it in quotes — and exact same frustration.
Here's to you, and to whoever else runs into this problem next.
Nice one! I have hit something similar in the past around text input parsing. I bet that `e` notation for entering numbers has caused more bugs than it has actually been used as intended, at least in non-scientific code.
I do quote strings to try and avoid this problem, though usually I'll leave single-word strictly alphabetical values bare--the only gotcha I'm aware of here is true/false/yes/no which I never use as string values anyways. But what's the point of two sets of quotes?
Actually, yes. We use yamerl parser for parsing YAML in Erlang and Erlang has both strings and atoms; yamerl generally turns unquoted YAML strings into atoms and double-quoted YAML strings into strings.
But nobody else cares about how yamerl parses YAML for Erlang, and lots of tools try to re-normalize YAML as they process it; so if we put e.g.

    config:
      option: "this_must_be_a_string"

into a Helm chart, and then run it through helm, the resulting YAML will be

    config:
      option: this_must_be_a_string

and when fed to our Erlang application, the option's value will end up being an atom, and it will blow up when we try to e.g. concatenate it with another string. But '"this_must_be_a_string"' resists all such normalization attempts, gets to be passed around as-is, and ends up being parsed as a string with embedded starting and ending quotes but those can be easily stripped away as a global post-pass on a parsed config, before consuming it any further.
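The post-pass described above is small. Here is a sketch of the idea in Python rather than Erlang, so it is an illustration and not the code from that application; `strip_embedded_quotes` is a hypothetical name.

```python
def strip_embedded_quotes(node):
    """Recursively remove one pair of surrounding double quotes from string values."""
    if isinstance(node, dict):
        return {key: strip_embedded_quotes(value) for key, value in node.items()}
    if isinstance(node, list):
        return [strip_embedded_quotes(item) for item in node]
    if isinstance(node, str) and len(node) >= 2 and node[0] == '"' and node[-1] == '"':
        return node[1:-1]
    return node


# What the parser hands back when the YAML value was '"this_must_be_a_string"'.
parsed = {"config": {"option": '"this_must_be_a_string"'}}
cleaned = strip_embedded_quotes(parsed)
```

Only one pair of quotes is removed per string, and non-string values pass through untouched.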
Heh, maybe I missed the joke too. Thing is all of the problems people generally think of with YAML strings are solved with just one set of quotes so I don't quite get it either way. Maybe this would have some effect on strings that themselves contain quotes? But that's a special case in every single language with quoted strings, not a YAML problem.
See the sibling comment. But it is a joke, in a sense that this is all very silly. Maybe we shouldn't have been using such a fringe programming language in the first place but who knew 15 years ago that Go would exist?
I once half-assed a tool that summarized deployed builds, importing a CSV to Google Sheets as a team UI. Well, one day the import noticed one of those commit IDs must be a number because it doesn't contain any letters, and then a formula used exponential notation to create a hyperlink that didn't work.
Edit: Now that I think about it, it might have been only digits with a leading zero that was dropped.
That's because YAML is mainly intended for human interaction. Machine interaction is bungee-strapped later.
Same goes for CLI (shell), same for SQL. The many bugs with them exist because they are not designed for machines. Proper formats for machines don't have those issues (MsgPack, Protobuf, hell even XML/HTML is better).
I definitely had "string parsed as number" on my Yaml bugs bingo card. Did not exactly have "floats and hex have the same alphabet and overlapping grammars" but I'll put that on my bingo card for the next Yaml bugs game, which will be... tomorrow. :(
It’s easy to point at YAML and shake one’s head: “ya did it again old chap!”. Yet, software engineering is full of foot guns like that.
I tend to joke that working in the industry made me lose all the trust I ever had in me. I don’t believe any deployed code unless it can be confirmed by two independent sources and a runtime telemetry.
Here, I expected a hash collision, which is rare, but in my lifetime I’ve seen my share of “it really shouldn’t happen but somehow it happened”. A couple of weeks ago I saw the author of a random library implementing a queue (with ACKs!) for UDP packet transmission.
I wonder if you could fix this specific case by keeping track of the last game version to be interpreted this way and using that when querying for `Infinity`. That wouldn't be a platonically correct solution but it should work no?
The solution to this nonsense is and has always been s-expressions. XML reinvented s-expressions but they made a bad job of it and made them much too verbose.
I always use s-exprs for problem-free config files but sadly everybody else takes one look and says "but all those parentheses! I'd rather have my strings interpreted as very large numbers, thank you."
Meta: I don't understand "RNG" (random number generation, right?) in the title, since the article is about a quoting issue causing a value to be interpreted as a floating-point value in scientific notation, rather than a string. Is that the "random number generation" the title refers to, perhaps? I like puns, but maybe I'm just tired. :)
In video gamer parlance, any output or action that a human can't predict gets called "RNG". Anything from true random loot drops to deterministic but chaotic systems can be called "RNG". Thus in the blog title they describe the resulting git hash as having the quality of being RNG, because you can't really predict whether your git rev hash will contain certain characters. They got unlucky and triggered a bug from the chaotic git hash.
Not a gamer so first time I heard this. For some reason it bothers me so much. It's not random, it's perfectly reproducible and has a crystal clear explanation. They are using short hashes for version numbers which is looking for problems in the first place.
I don't think it's fair to characterize it as "sloppy" or "clickbait" at all. The author is a game developer, and they use terminology in a way that is common and accepted within their field -- sure, it's nonstandard and "wrong" outside of the game industry...but I'd say it's a little hypocritical for software developers to complain about jargon.
As a supporting example: the game Super Metroid alternates every frame between checking collisions from left-to-right or right-to-left. The direction of collision checks can make a difference for speedrunning tricks, and there's no (practical) way to control for it, so it's referred to as "RNG". From a player's perspective, the fact that it's just a frame counter rather than an LCG or something is irrelevant -- it's a luck-based factor that's outside of the player's control.
And the same frame counter is used as a source of entropy elsewhere in the game, so there's an argument that it's not even wrong to call it RNG. Similarly, a Git hash is a SHA-1 of the repo contents, commit message, commit date, etc., and a cryptographic hash and pseudo-random generator are very similar constructions...so calling it RNG is a little cute but not exactly inaccurate.
I suppose it's "RNG" if the commit has exactly one 'e' and otherwise only numbers, so that YAML interprets it as scientific notation. I assume otherwise it's always interpreted as a String, as a fallback.
I was also confused by this as a commit id is deterministic. I suspect that the OP may be confused about how git hashes are computed.
But thinking deeper, given that the hash is computed from a small amount of entropy (the commit time), plus a seed (the committed code changes), and the previous value (previous commit hash), this is actually fairly similar to the definition of a PRNG.
That is to say, it's not an RNG, but for some approximations it's indistinguishable from one.
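To make the determinism concrete, here is a sketch of how a Git object id is derived: SHA-1 over a short header plus the object body. The commit body below is simplified (same identity for author and committer, a single parent), and the tree, parent, and author values are placeholders.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 over '<kind> <size>' plus a NUL byte, followed by the object body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree, parent, author_line, message):
    """Assemble a simplified commit body and hash it like any other object."""
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {author_line}\n"
        f"committer {author_line}\n"
        f"\n"
        f"{message}\n"
    ).encode()
    return git_object_id("commit", body)


# Placeholder inputs: only the timestamp differs between the two commits.
tree = "a" * 40
parent = "b" * 40
first = commit_id(tree, parent, "A Dev <dev@example.com> 1700000000 +0000", "Fix bug")
second = commit_id(tree, parent, "A Dev <dev@example.com> 1700000001 +0000", "Fix bug")
```

Identical inputs always give the same id, while a one-second change in the timestamp gives an unrelated-looking one, which is why the result feels random from the outside.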
> I suspect that the OP may be confused about how git hashes are computed.
I don't think they are. The use of "RNG" is referring to how a "random" commit SHA led to this bug. Commit SHAs are, for all intents and purposes, random.
You could test this against thousands of SHAs and never encounter one that meets the criteria to trigger the bug.
To get this specific bug - one 'e' in the 7th digit and 9 digits less than 'a' - the probability is 1/16*(10/16)^9 = 0.0009095, just under 1 in 1000. But if the 'e' is in the 2nd, 3rd, 4th, 5th, or 6th position you'd get the same bug, so actually about 6 in 1000.
But around 1% of the hashes will be all decimal digits and will parse as numbers ((10/16)^10 = 0.009). That's common enough that I've seen that one cause errors in our code too (somebody had code reading the hash then trying to append it to a string).
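The arithmetic is quick to check in Python, assuming each character of a 10-character short hash is an independent, uniformly random hex digit:

```python
p_digit = 10 / 16  # a hex character is one of 0-9
p_e = 1 / 16       # a hex character is exactly "e"

# One "e" at a fixed position and decimal digits in the other nine places.
p_fixed_position = p_e * p_digit ** 9

# Counting six positions (2nd through 7th) as triggering the bug.
p_any_of_six = 6 * p_fixed_position

# All ten characters are decimal digits, so the hash reads as an integer.
p_all_digits = p_digit ** 10

print(f"{p_fixed_position:.7f} {p_any_of_six:.5f} {p_all_digits:.5f}")
```

That works out to roughly 0.09% for a single position, a bit over 0.5% across the six positions, and about 0.9% for the all-digits case.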
I agree it isn't a bug in the RNG itself, but it is a bug in randomness propagation, which is part of randomness generation.
For instance consider this bug:
secretKey = Hexadecimal(Crypto.Rand())[0:16]
The person likely intended to generate a secretKey with 16 bytes of entropy. Instead they generated a secretKey which is 16 bytes long, but only contains 8 bytes of entropy. I would call this an RNG bug.
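The same mistake is easy to reproduce with Python's `secrets` module. This assumes the pseudocode's `Crypto.Rand()` returns 16 random bytes, which the original doesn't state.

```python
import secrets

# The buggy pattern: 16 random bytes become 32 hex characters, and slicing
# to 16 characters keeps only 8 of those bytes.
truncated_key = secrets.token_bytes(16).hex()[:16]

# Keeping the full hex string preserves all 16 bytes of entropy.
full_key = secrets.token_hex(16)

# Decode back to bytes to count how much randomness each key carries.
truncated_entropy_bytes = len(bytes.fromhex(truncated_key))
full_entropy_bytes = len(bytes.fromhex(full_key))
```

Each hex character carries 4 bits, so a 16-character hex string holds 64 bits no matter how many random bytes went in.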
YAML Norway problem strikes again!
I wish YAML wasn't so common in Python ecosystem...
Solution: quote values. Like in JSON.
It's not really a solution though because you only notice you need to do it after it triggers a bug.
You need a format that doesn't let you make this mistake in the first place, like [everything except YAML].
If the yaml is generated by code it should be quoted automatically, given that you have a sane yaml library.
try this:
It will correctly print the quoted version.
At least with Python, strings are strings and there is a minimal risk you store a git hash coming from the output of a git subprocess command as a number. The only way this bug could have happened is if doing manual string templating. If you were doing that, equally bad or worse bugs are waiting for you in any serialization format.
> The only way this bug could have happened is if doing manual string templating.
I agree you should not do that, but it is common in the YAML world - if you're using YAML you've already decided you don't care if things are reliable.
Given that they were using string templating, this would have been caught earlier using a different format.
Of course there are other string templating mistakes that other formats would not catch (e.g. forgetting to escape strings), but they are still better than YAML.
The problem obviously is using string templating without proper escaping
Any format would fail if used like that, be it YAML, JSON, XML...
This problem is specific to YAML because it doesn't require quoting.
> It's not really a solution though because you only notice you need to do it after it triggers a bug.
… the first couple of times.
After causing the same bug multiple times it's the user at fault, not the tool.
YAML does a decent enough job.
I think these are the relevant YAML spec sections from a quick glance. If someone wants to correct me, feel free.
YAML spec on double quotes
> The double-quoted style is specified by surrounding “"” indicators. This is the only style capable of expressing arbitrary strings, by using “\” escape sequences. This comes at the cost of having to escape the “\” and “"” characters.
YAML spec on plain style
> The plain (unquoted) style has no identifying indicators and provides no form of escaping. It is therefore the most readable, most limited and most context sensitive style. In addition to a restricted character set, a plain scalar must not be empty or contain leading or trailing white space characters.
> After causing the same bug multiple times it's the user at fault, not the tool.
Absolutely not. If the tool is so bad that users commonly make a mistake then the tool should prevent that mistake.
This is a basic UX tenet that unfortunately many people do not know.
It’s a trade off in how the parsing works. No tool is perfect. :shrugs:
The accepted safe default is to use double quotes. If folks don’t know that then that’s on them, not the tool. It’s in the spec.
Good workmen don’t blame their tools.
Edit — I know the UX tenet. I’ve worked in places where they thought using YAML for end users was a good idea. It’s not. It never will be. It’s a backend tool for engineers. So I agree with you in that specific UX case.
But this isn’t an end-user UX case. It’s backend platform configuration.
Use the right tool for the job, and use it the way it’s been designed —> Double quote strings in YAML.
> Good workmen don’t blame their tools.
Good workmen that are given shoddy tools absolutely blame them.
> Good workmen don’t blame their tools.
Good workmen talk shit of particular tools and brands all the time, they just avoid having them in their toolbox. They don’t blame their tools, not all tools.
> the first couple of times
Times every yaml user in the multiverse.
It’s a rite of passage.
Just like convincing your boss to switch the backend over to K8s and then realising a year later it was a mistake.
Everyone does it.
What's a better alternative? This problem is annoying but to me not more annoying than having to read or write JSON, and TOML is terrible with nested configs. If there were an option with the structure of YAML minus the ridiculous string handling I'd switch to it for sure.
Looked at the sibling comments. Has everybody collectively forgotten that XML exists? This stuff was solved decades ago, XSDs can check types.
TOML nesting is indeed a joke. YAML & JSON "look clean", but try to match up nesting in a large document without help from text editor highlights, then see how easy it is in XML.
XML has universal support everywhere, has first-class tooling for everything that's being re-invented in these other languages, but it's not "cool" any more.
Maybe you have worked with a different XML than I but I have only terrible memories of working with XML. Starting from the parser inconsistencies (that even led to security vulnerabilities in Apple, see https://blog.siguza.net/psychicpaper/).
On a higher level, the fact that so many different kinds of information could exist at each level was nothing but headaches. In YAML, or in JSON, it's pretty straightforward. You have an object, it has children, children have types/values etc.
In XML, you have to keep in mind what the tag of the element is, what its attributes are, and then what the child elements are, and then whatever the heck CDATA is.
I think my fellow posters are looking at the past through nostalgic rose tinted glasses. XML was terrible and I am glad it's not used as widely anymore.
I too really don't get the hate for XML
Features like schema feel very natural in XML, whereas JSON schema feels otherworldly and is seldom used
Being able to clearly distinguish between multiple semantically different types of strings is a blessing
And coming back to your example, how is the knowledge in JSON of which attribute is an array, which is an object, which is a string and when order of attributes as well as duplicates matter and when not any easier than XML?
My two biggest issues:
1. No consistency in whether a value is an attribute or a child element, they seem completely interchangeable and redundant. I'm sure there's some nuance I'm missing here but I've worked with my fair share of XML and it has yet to be relevant to me at least. This adds a lot of mostly arbitrary decisions to the design phase, makes it difficult to clearly refer to specific values, and means it's impossible to directly parse XML into an object structure in most languages like you can with JSON.
2. It's awful to read and write. Everything is just a mess of tags and any structure is quickly lost. Line splitting is awkward and often not even attempted (seriously I had coworkers push back when I applied basic formatting to a config file because they claimed it was easier to read with lines 3x the width of anyone's screen). Needing to name the object in both the opening and closing tag wastes space and is absolutely ridiculous when writing by hand (sure editors can kinda handle this now but not perfectly and sometimes I'm using vi over SSH because that's all there is).
I really don't understand why so many people are still so attached to it myself.
XML is a great generic framework for markup languages. When you aren't doing markup, XML is terrible. However, JSON fits the role.
The opposite problem exists in Minecraft chat message formatting - they used JSON for markup, when it should have been XML.
As someone who previously worked in an XML-heavy environment, I would rather have an NFL linebacker dropkick me in the head than deal with XML again. Tim Bray himself has had doubts [1] at one point.
XML is too big of a hammer for the space it fills.
[1] https://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog
It's not "cool" anymore because it's painful to work with despite all the tooling and support
> It's not "cool" anymore because it's painful to work with
That's a kinda weird way to phrase it; being painful to work with has nothing to do with "coolness"... that's a legitimate complaint.
Despite all the tooling and support, XML is painful to work with.
I think they agree with you and meant that sarcastically.
Pkl (from Apple land) and Dhall (from Haskell land) both solve some of these pain points as well as some others, especially being more seamless about integrating schema with config.
Jsonnet, I haven't used personally but I know people who have raved about it.
Ones I know less about include KCL, CUE, and Nickel.
I don't believe that executable configuration languages are a good fit as a primary configuration source; I would prefer to have them spit out static config before use. From your list KCL fits that bill (and is a really nice config language).
I liked what I saw of Pkl, wanted to use it when it was released but it seemed the only parser was JVM-based and it was intended more to be transpiled into other config languages. If that's changing definitely worth revisiting it. Dhall I had to look up, it seems nice as long as the formatting used on website examples is not enforced, because to me that looks like an absolute nightmare but my problems are with the whitespace and not the structure itself.
I like HOCON, although it's a bit obscure. It's a JSON syntax superset with the same data model, designed for human written config.
It doesn't have schema support, but you arguably don't need it, because the software that reads the config specifies the type of each key when the value is read, and casting takes place then. If the software expects a string, it reads the value as a string; if it expects a number, it's parsed as a number, and so on.
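A rough Python sketch of that "type is decided at read time" idea. To be clear, `TypedConfig` and its getters are made up for illustration and are not the HOCON library's API; the point is only that the raw text is never guessed at, the caller says what it wants.

```python
class TypedConfig:
    """Hypothetical wrapper: values stay raw text until a typed getter is called."""

    def __init__(self, raw: dict[str, str]):
        self._raw = raw

    def get_string(self, key: str) -> str:
        # The caller asked for a string, so no numeric guessing happens.
        return self._raw[key]

    def get_int(self, key: str) -> int:
        # Conversion happens here, and fails loudly on bad input.
        return int(self._raw[key])


cfg = TypedConfig({"gameServerVersion": "1234567e89", "port": "8080"})
version = cfg.get_string("gameServerVersion")  # remains text, never a float
port = cfg.get_int("port")
```

With this shape, a hash that happens to look like scientific notation is only a problem if the reading code explicitly asks for a number.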
HOCON has hierarchical merging, include files, a more convenient syntax, ability to read environment variables, substitutions, comments and a few other convenience features. In Conveyor, a tool for packaging desktop apps I wrote that uses it, it's also extended so you have "hash-bang includes". Those are includes that specify a program to run instead of a file, the output is then included and parsed at that point. This lets you escape from declarative config to a fully dynamic computation if you need to. You can disable this feature with a command line flag if you don't trust the config you're parsing (and also env var substitution).
You can also render the whole config to regular JSON if you need to.
I find that set of features to nicely balance config complexity with read/writeability. The main issue is that the main library is not well maintained, and the best implementation is for the JVM. You could give it a C API these days with Native Image but nobody has.
Main downside vs yaml is that IDEs and editors can use YAML schemas to give auto-completion whereas they don't do that for HOCON.
Does the HOCON parser complain and refuse to proceed if it encounters a comma right before a closing curly brace like the JSON parsers with which I've interacted?
No no. The syntax is designed for usability, trailing commas are fine.
There's a little tutorial here, with a slider that shows how you can start with json and transform it. The right hand side is valid HOCON at every step:
https://conveyor.hydraulic.dev/15.0/configs/hocon/
JSONC. https://onury.io/jsonc/
It can be hard to find an implementation of this if you're working in a not-so-popular language (looking at you Swift) but JSONC has had the best developer experience of anything I've tried so far.
It's basically JSON with single and multi line comments and trailing commas.
JSON5 has a heck of a lot broader support AFAICT. It's basically JavaScript's notation: https://json5.org/
Json5 is fine. Of the boring config options, it's the best.
The better alternatives to YAML are JSON (particularly JSON with comments), properties files and XML files.
Basically any mainstream configuration syntax.
The really big problem with xml wasn't actually the verbosity of xml itself but the fact that it was popular in a time before rails popularized "convention over configuration".
https://rcl-lang.org/
I'm not a fan of that import function.
Protobuf/thrift?
YAML is alright when you create it by hand. Problems start when YAML is templated, and frankly it's just dumb, though common. Templating the source without proper escaping could bite even with JSON.
My last big enterprise job, we did a lot of YAML templating and of course ran into issues like this all the time, eventually though we solved it by requiring defined schemas for all configs and validating those schemas in our pipeline. More overhead but that validation also caught lots of issues aside from the YAML gotchas, it was a decent setup to work with in the end.
or anywhere
As soon as I saw the YAML one-line snippet I knew it was a quoting issue.
My first thought was that git hash started with a zero, and the parser stripped it off because it was parsing it as a number and not a string.
But then I saw "infinity" and was like "wait is the git hash all numbers except there's the letter 'e' in there somewhere"?
Ugh.
Always quote your YAML strings!
I just wrote a configuration format writer for an app I maintain. The C libyaml library (ugh, yes, I have to write this in C) lets you specify the quoting style it will emit for strings... always choose some form of quoting.... always.
Also it seems kinda short-sighted for this company to use the short git hash. The short hash is nice for display purposes, but there's always the possibility of a collision (git is smart and will print out a revision that has no current collisions within your repo, but that doesn't mean it won't happen in the future). But for a configuration setting like that, there's really no reason not to use the full hash.
We had a similar problem in Python with float().
We had a logfile parser that tried to parse a value and fell back to float().
But if you passed it a hex value that happened to contain only decimal digits and an 'e', then Python interpreted it as a number in exponent notation, which happened to be larger than what it could natively represent, so it returned `Infinity`!
Obviously the bug was on our end with the naive `float()`, which after deliberation should not be used in this generic parser case, but it was interesting tracking down this bug from the DB values all the way up to the parser.
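For anyone who wants to see it, here is a minimal sketch of that failure mode and one possible guard. The function names and the "plain decimal" regex are my own assumptions, not the parser described above.

```python
import math
import re

# Assumed shape of an ordinary decimal literal: optional sign, digits, optional fraction.
_PLAIN_NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")


def parse_naive(token: str):
    """Fall back to float() for anything float() accepts."""
    try:
        return float(token)
    except ValueError:
        return token


def parse_guarded(token: str):
    """Only treat the token as a number if it looks like a plain decimal."""
    if _PLAIN_NUMBER.fullmatch(token):
        value = float(token)
        if math.isfinite(value):
            return value
    return token
```

`parse_naive("1234e5678")` overflows to infinity, while `parse_guarded` hands the same token back as text. Note the guard still turns an all-digit hex value into a number, so it narrows the problem rather than removing it.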
Every time I read one of these types of stories, I keep telling myself that there needs to be a stricter subset of YAML with far fewer features and things like required string quoting. I feel like it has to exist, but I don't know where it'd be found.
StrictYAML https://hitchdev.com/strictyaml/
Since such a subset would be valid YAML, it sounds like requiring string quoting would be a straightforward option to add to any YAML parser as a feature, without requiring much of a spec at all.
What else would be needed?
JSON?
The trailing comma thing gets me nearly every time I edit a json config file.
I don't think json supports comments, which makes it a non-starter for a lot of folks.
Extra fun: there's now JSONC which does support trailing commas and comments, and projects will use it with a .json extension still because Microsoft would hate for anything to ever be clear and make sense.
JSON but with the structure of YAML, would be nearly perfect to me. JSON was not designed for humans to read and write directly, it happens to be easy enough that it's quite common but it's not exactly a nice experience. Take that base and remove the need to quote key values, enforce indentation for clear structure and to remove the need for braces and commas (the no trailing commas rule in JSON kills me too), and support the markdown-like list syntax alternative and I think really that's all most people want out of YAML anyways. I don't understand or ever want to work with their references or variables or whatever the hell they add, any of that is much better off in code than in config as far as I'm concerned.
Yaml should be used as a lesson learned, not sure why people still use it as a data format. You should stop even tolerating it, refuse on sight.
So, yes, YAML's implicit typing is a landmine.
However, this could've been caught by a validator. Whatever is loading this config file should know that `gameServerVersion` is a string and when it got a number, it should've thrown an error. It could also further validate that it gets a hex string lest something feed it "Infinity".
Good thing this wasn't a kernel-mode driver for an anti-virus program.
A number is also a string. Crazy, right?
The following code will only validate when gameServerVersion is a hex string between 9 and 32 characters:
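As a sketch of that kind of check in Python (the function name is invented for illustration, and this is an assumption about what such a validator could look like, not any particular schema tool):

```python
import re

# Pattern taken from the description above: 9 to 32 hex characters.
_HEX_VERSION = re.compile(r"[0-9a-fA-F]{9,32}")


def validate_game_server_version(value: object) -> str:
    """Return the value unchanged if it is an acceptable hex string, else raise."""
    if not isinstance(value, str):
        # A YAML loader that guessed "number" ends up here.
        raise TypeError(
            f"gameServerVersion must be a string, got {type(value).__name__}"
        )
    if not _HEX_VERSION.fullmatch(value):
        raise ValueError(
            f"gameServerVersion is not a 9-32 character hex string: {value!r}"
        )
    return value
```

A float such as infinity fails the type check, and the literal text "Infinity" fails the pattern check.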
It would have prevented the Git Hash Bug originally described. It's just good practice to validate things on the way in. Even if they were using JSON as their config file, they should still validate it.
Always use the full hash everywhere.
Never store the short hash anywhere. It's good only for immediate commandline use, and nothing else.
This doesn't really solve the underlying issue, though.
Not technically, but practically yes. The chance of having a single letter "e" and everything else a number would be very rare for a full SHA1 hash. I'm not interested in doing the calculations but if anyone wants to do it, it sounds like a cool math problem to solve.
While I haven't done the math, you're right that it would need to be an exponential of some kind (assuming this is JavaScript-based).
In JavaScript, Number.MAX_VALUE (the biggest number that can be represented without returning Infinity) is a 309-digit number, and Git SHA-1 hashes are only 40 characters long.
However, this does mean that the chance will be even smaller than you say. Assuming that all other conditions are met (sole "e", every other character is a digit)
* if the sole "e" is in any place from the 2nd to the 36th, the resulting number is almost guaranteed to be returned as Infinity, unless the number happens to begin with "0e". (Leading zeroes, either in the significand or the exponent, would be required for this to not be a guarantee. Any number beginning "0e", on the other hand, will evaluate to 0 and not become Infinity.)
* if it's in the 37th place, the resulting number (assuming no leading zeroes) has around a 27% chance to be representable without being returned as Infinity.
* if it's in the 38th or 39th place, the resulting number is guaranteed to be small enough that it won't be turned into Infinity.
* if it's in the 1st or 40th place, it probably won't be parsed as a number so will not return an error.
I'm also not interested in doing the full calculations, and in the end this isn't going to make a significant difference to the chances IMO (except in the "0e" case mentioned above, which seems like it'd make a noticeable difference), but it's interesting to think about the edge cases.
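If anyone wants to poke at the edge cases, here's a quick sketch. It assumes Python's `float()` is a fair stand-in, since it uses the same 64-bit IEEE 754 double as a JavaScript Number; the helper name is made up.

```python
import math


def parse_with_e_at(position: int, digit: str = "1", length: int = 40) -> float:
    """Build a `length`-character string of `digit` with a single 'e' at the
    1-indexed `position`, then parse it as a float."""
    chars = [digit] * length
    chars[position - 1] = "e"
    return float("".join(chars))


early = parse_with_e_at(20)      # 19-digit significand, 20-digit exponent
late = parse_with_e_at(38)       # 37-digit significand, 2-digit exponent
zero = float("0e" + "1" * 38)    # significand is 0, so the exponent is irrelevant
```

An 'e' in the 1st or 40th place makes `float()` raise `ValueError`, i.e. the string is not a number at all.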
Doesn't matter if it hits Infinity or not, if it can be parsed as a number that is already a bug.
For a hash of N hex digits, the probability of having a single e and the rest decimal is N*10^(N-1)/16^N
With 10 hex digits like on the blog, that'd be 0.9% so about once every 100 builds
With the full 40 digits: 2.7e-8 so about once every 100M builds
Assuming my math is correct.
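A quick way to check those numbers, as a sketch. It assumes every hex digit is uniformly random and, like the formula above, counts an 'e' in any position, including the first and last.

```python
from fractions import Fraction


def single_e_probability(n: int) -> Fraction:
    """P(exactly one 'e' and n-1 decimal digits) for n uniformly random hex digits."""
    # n positions for the 'e', 10 choices for each other digit, 16**n strings in total.
    return Fraction(n * 10 ** (n - 1), 16 ** n)


short = float(single_e_probability(10))  # roughly 0.9%
full = float(single_e_probability(40))   # roughly 2.7e-8
```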
Technically no, but realistically yes.
I've run into this exact YAML problem when trying to use the git hash as a label in a templated kubernetes config file. Same solution — wrap it in quotes — and exact same frustration.
Here's to you, and to whoever else runs into this problem next.
Serendipitously, in this other thread sebstefan just mentioned that in PHP, also,
because PHP, also, treats "0e1234" and "0e4567" as numerically equal when you use the weaker-typed '==' instead of the stronger-typed '==='.
https://news.ycombinator.com/item?id=41510252#41510814
https://stackoverflow.com/questions/22140204/why-md524061070...
Nice one! I have hit something similar in the past around text input parsing. I bet that `e` notation for entering numbers has caused more bugs than it has actually been used as intended, at least in non-scientific code.
That's why I always put several quotes in YAML values, just to be safe:
I do quote strings to try and avoid this problem, though usually I'll leave single-word strictly alphabetical values bare--the only gotcha I'm aware of here is true/false/yes/no which I never use as string values anyways. But what's the point of two sets of quotes?
does several actually help or are you just being funny?
Actually, yes. We use yamerl parser for parsing YAML in Erlang and Erlang has both strings and atoms; yamerl generally turns unquoted YAML strings into atoms and double-quoted YAML strings into strings.
But nobody else cares about how yamerl parses YAML for Erlang and lots of tools try to re-normalize YAML as they process it; so if we put e.g.
into a Helm chart, and then run it through helm, the resulting YAML will be and when fed to our Erlang application, the option's value will end up being an atom, and it will blow up when we try to e.g. concatenate it with another string. But '"this_must_be_a_string"' resists all such normalization attempts, gets to be passed around as-is, and ends up being parsed as a string with embedded starting and ending quotes but those can be easily stripped away as a global post-pass on a parsed config, before consuming it any further.
Heh, maybe I missed the joke too. Thing is all of the problems people generally think of with YAML strings are solved with just one set of quotes so I don't quite get it either way. Maybe this would have some effect on strings that themselves contain quotes? But that's a special case in every single language with quoted strings, not a YAML problem.
See the sibling comment. But it is a joke, in a sense that this is all very silly. Maybe we shouldn't have been using such a fringe programming language in the first place but who knew 15 years ago that Go would exist?
Or that Go would be so popular ;)
It's always a gamble when using a programming language with a small community. Will this one get big?
I once half-assed a tool that summarized deployed builds, importing a CSV to Google Sheets as a team UI. Well, one day the import noticed one of those commit IDs must be a number because it doesn't contain any letters, and then a formula used exponential notation to create a hyperlink that didn't work.
Edit: Now that I think about it, it might have been only digits with a leading zero that was dropped.
That's because YAML is mainly intended for human interaction. Machine interaction is bungee-strapped later.
Same goes for CLI (shell), same for SQL. All the many bugs with them are because they are not for machines. Proper formats for machines don't have those issues (MsgPack, Protobuf, hell even XML/HTML is better).
I definitely had "string parsed as number" on my Yaml bugs bingo card. Did not exactly have "floats and hex have the same alphabet and overlapping grammars" but I'll put that on my bingo card for the next Yaml bugs game, which will be... tomorrow. :(
It’s easy to point at YAML and shake one’s head: “ya did it again old chap!”. Yet, software engineering is full of foot guns like that.
I tend to joke that working in the industry made me lose all the trust I ever had in me. I don’t believe any deployed code unless it can be confirmed by two independent sources and a runtime telemetry.
Here, I expected a hash collision, which is rare, but in my lifetime I’ve seen my share of “it really shouldn’t happen but somehow it happened”. A couple of weeks ago I saw the author of a random library implementing a queue (with ACKs!) for UDP packet transmission.
Truly an amazing career for chaos keepers.
YAML is egregiously terrible. JSON is slightly better but still bad.
I want an interchange format that lets you declare the types on either side:
Just let me explicitly declare the types and a kajillion problems go away forever.
I wonder if you could fix this specific case by keeping track of the last game version to be interpreted this way and using that when querying for `Infinity`. That wouldn't be a platonically correct solution but it should work no?
> This value is set dynamically by a TeamCity deployment job.
There you have it. Whoever is using teamcity is doomed to fail like this. Not a YAML problem, but a management problem.
1. Anything in YAML that is not explicitly a number, add quotes to ensure it is treated as a string.
2. Use monotonically increasing variables for versions of software.
Now you wait for the short hash collision. Why?
Just quote your strings.
The solution to this nonsense is and has always been s-expressions. XML reinvented s-expressions but they made a bad job of it and made them much too verbose.
I always use s-exprs for problem-free config files but sadly everybody else takes one look and says "but all those parentheses! I'd rather have my strings interpreted as very large numbers, thank you."
NestedText + Pydantic.
Yelling At My Laptop
Meta: I don't understand "RNG" (random number generation, right?) in the title, since the article is about a quoting issue causing a value to be interpreted as a floating-point value in scientific notation, rather than a string. Is that the "random number generation" the title refers to, perhaps? I like puns, but maybe I'm just tired. :)
In video gamer parlance any output or action a human can't predict gets called "RNG". Anything from true random loot drops to deterministic but chaotic systems can be called "RNG". In the blog title they describe the exact resulting git hash as having the quality of being RNG because you can't really predict if your git rev hash will have certain characters. They got unlucky and triggered a bug from the chaotic git hash.
Not a gamer so first time I heard this. For some reason it bothers me so much. It's not random, it's perfectly reproducible and has a crystal clear explanation. They are using short hashes for version numbers which is looking for problems in the first place.
Sloppy technical writing.
Or clickbait since the title sounds more interesting than “oh YAML is dumb”.
I don't think it's fair to characterize it as "sloppy" or "clickbait" at all. The author is a game developer, and they use terminology in a way that is common and accepted within their field -- sure, it's nonstandard and "wrong" outside of the game industry...but I'd say it's a little hypocritical for software developers to complain about jargon.
As a supporting example: the game Super Metroid alternates every frame between checking collisions from left-to-right or right-to-left. The direction of collision checks can make a difference for speedrunning tricks, and there's no (practical) way to control for it, so it's referred to as "RNG". From a player's perspective, the fact that it's just a frame counter rather than an LCG or something is irrelevant -- it's a luck-based factor that's outside of the player's control.
And the same frame counter is used as a source of entropy elsewhere in the game, so there's an argument that it's not even wrong to call it RNG. Similarly, a Git hash is a SHA-1 of the repo contents, commit message, commit date, etc., and a cryptographic hash and pseudo-random generator are very similar constructions...so calling it RNG is a little cute but not exactly inaccurate.
sorry, I work in game development so whenever we hit something that doesn't happen every time we call it RNG for fun
I suppose it's "RNG" if the commit has exactly one 'e' and otherwise only numbers, so that YAML interprets it as scientific notation. I assume otherwise it's always interpreted as a String, as a fallback.
I was also confused by this as a commit id is deterministic. I suspect that the OP may be confused about how git hashes are computed.
But thinking deeper, given that the hash is computed from a small amount of entropy (the commit time), plus a seed (the committed code changes), and the previous value (previous commit hash), this is actually fairly similar to the definition of a PRNG.
That is to say, it's not an RNG, but for some approximations it's indistinguishable from one.
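To make the "deterministic but unpredictable" point concrete, here's a rough sketch of how a git object id is derived: SHA-1 over a `<type> <size>\0` header plus the object body. The commit body below is a placeholder I made up (author, message, timestamps); only the hashing scheme is git's.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over a '<kind> <size>\\0' header followed by the object body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def fake_commit(timestamp: int) -> bytes:
    # Placeholder author and message; the tree id is git's well-known empty tree.
    return (
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        f"author A U Thor <author@example.com> {timestamp} +0000\n"
        f"committer A U Thor <author@example.com> {timestamp} +0000\n"
        "\n"
        "Bump version\n"
    ).encode()


first = git_object_id("commit", fake_commit(1700000000))
second = git_object_id("commit", fake_commit(1700000001))
```

Same input always gives the same id, but moving the timestamp by one second gives a completely different one, which is why nobody can plan around which characters show up.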
> I suspect that the OP may be confused about how git hashes are computed.
I don't think they are. The use of "RNG" is referring to how a "random" commit sha led to this bug. Commit SHAs are, for all intents and purposes, random.
You could test this against thousands of SHAs and never encounter one that meets the criteria to trigger the bug.
It's way less than thousands.
To get this specific bug - one 'e' in the 7th digit and 9 digits less than 'a' - is 1/16*(10/16)^9 = 0.0009095. Just under 1 in 1000. But if the 'e' is in the 2nd,3rd,4th,5th,6th you'd get the same bug, so actually about 6 in 1000.
But around 1% of the hashes will be all decimal digits and will parse as numbers ((10/16)^10 = 0.009). That's common enough that I've seen that one cause errors in our code too (somebody had code reading the hash then trying to append it to a string).
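You can also just count. A rough sketch, assuming SHA-1 of the integers 0, 1, 2, ... is a reasonable stand-in for real commit ids, and with deliberately simple regexes that are my approximation of "parses as a number" rather than any particular YAML parser's rules:

```python
import hashlib
import re

_ALL_DIGITS = re.compile(r"\d+")
_EXPONENT = re.compile(r"\d+e\d+")


def classify(short_hash: str) -> str:
    """Label a short hash as integer-like, exponent-like, or plain string."""
    if _ALL_DIGITS.fullmatch(short_hash):
        return "int"
    if _EXPONENT.fullmatch(short_hash):
        return "exp"
    return "str"


def survey(samples: int = 200_000, length: int = 10) -> dict[str, float]:
    """Fraction of sampled short hashes falling into each class."""
    counts = {"int": 0, "exp": 0, "str": 0}
    for i in range(samples):
        digest = hashlib.sha1(str(i).encode()).hexdigest()[:length]
        counts[classify(digest)] += 1
    return {kind: n / samples for kind, n in counts.items()}
```

The "int" share should land near the 1% figure above, and "exp" counts every 'e' position that has digits on both sides, so it comes out a bit higher than the 6 in 1000 for the Infinity case alone.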
This was my thinking, yeah; been playing too much Hearthstone lately so I had RNG on my mind o_o
OK, we've de-RNG'd the title above.
I agree it isn't a bug in the RNG itself, but it is a bug in randomness propagation, which is part of randomness generation.
For instance consider this bug:
secretKey = Hexadecimal(Crypto.Rand())[0:16]
The person likely intended to generate a secretKey with 16 bytes of entropy. Instead they generated a secretKey which is 16 bytes long, but only contains 8 bytes of entropy. I would call this an RNG bug.
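In runnable form, as a sketch: this assumes `Crypto.Rand()` in the pseudocode returns 16 random bytes, and uses Python's `secrets` module in its place. The function names are mine.

```python
import secrets


def buggy_key() -> str:
    # 16 random bytes become 32 hex characters; keeping 16 of them keeps only 8 bytes.
    return secrets.token_bytes(16).hex()[:16]


def intended_key() -> bytes:
    # 16 bytes of entropy, kept as raw bytes instead of a truncated hex string.
    return secrets.token_bytes(16)
```

Decoding `buggy_key()` back from hex gives 8 bytes, which is the whole bug.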
And this, kids, is why dynamic typing is inherently bad for purposes other than quick scripting/prototyping.
infinite is a valid ieee float value. this is a yaml issue, not a dynamic typing one