"My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.'"

In Spain it used to be as low as 13 a few decades ago, but that law was obviously written before the rural exodus from inner Spain to the cities (from the '60s to almost the '80s), when children from early puberty worked or helped on the farm, in the fields, or at home, and by age 14 had far more duties and responsibilities than today. And yes, that yielded more maturity.

Thus, the law had to be fixed for more urban/civilized times, raising the age to 16. Although, when the ages were close enough (such as 15 and 19, as happened in a recent case), the young adult had their charges dropped entirely.

He was really brilliant, made contributions all over the place in the math/physics/tech field, and had a sort of wild and quirky personality that people love telling stories about.

A funny quote about him from Edward “a guy with multiple equations named after him” Teller:

> Edward Teller observed "von Neumann would carry on a conversation with my 3-year-old son, and the two of them would talk as equals, and I sometimes wondered if he used the same principle when he talked to the rest of us."

Are there many von-Neumann-like multidisciplinarians nowadays? It feels like unless one is razor-sharp and fully committed to one field, one is not taken seriously by those who made their careers in it (and who have the last word on it).

IMO they do exist; the issue is the popular attitude that it's not possible anymore, not a lack of genius. If everyone has a built-in assumption that it can't happen anymore, then we will naturally prune away the social pathways that enable it.

I think there are none. The world has gotten too complicated for that. It was early days in quantum physics, information theory, and computer science. I don’t think it is early days in anything that consequential anymore.

Centuries ago, the limitation of most knowledge was the difficulty of discovery; once known, it was accessible to most scholars. Take Calculus, which is taught in every high school in America. The problem is that we're getting to a point where new fields are built on such extreme prerequisites that even the known knowledge is extremely hard for talented university students to learn, let alone what is required to discover and advance the field. Until we are able to augment human intelligence, the days of the polymath advancing multiple fields are mostly over. I would also argue that the standards for peer-reviewed papers and PhDs have significantly dropped (due to the incentive structure that rewards spamming as many papers as possible), which only hurts the advancement of knowledge.

Sounds like the increased difficulty could be addressed with new models and the right abstraction layers. E.g., there’s incredible complexity in modern computing, but you don’t need to know assembly in order to build a Web app, to reason about architecture, or to work with functional paradigms. However, this doesn’t seem to happen in the natural sciences. I wonder if adopting better models runs into the gatekeepers protecting their status, tenures, and the status quo.

Neither does a Web app developer need to know how to use a CNC machine or make a transistor. Your example is about different levels of abstraction than what I meant.

I was replying to “even the known knowledge is extremely hard for talented university students to learn”. If complexity of the known knowledge one must learn to substantially contribute is the reason becoming an accomplished multidisciplinary is impossible nowadays, then it sounds like we could use some better models and levels of abstraction.

More than that, as professionals' career paths in fields develop, the organisations they work for specialize, becoming less amenable to the generalist. ('Why should we hire this mathematician who is also an expert in legal research? Their attention is probably divided, and meanwhile we have a 100% mathematician in the candidate pool fresh from an expensive dedicated PhD program with a growing family to feed.')

I'm obviously using the archetype of Leibniz here as an example but pick your favorite polymath.

Is it fair to say that the number of publicly accomplished multidisciplinarians alive at a particular moment is not rising as might be expected, proportionally to the total number of suitably educated people?

My favorite Von Neumann anecdote/quote is this one:

John Von Neumann once said to Felix Smith:
"Young man, in mathematics you don't understand things. You just get used to them."
This was a response to Smith's fear about the method of characteristics.

It took me a while to fully grasp what he meant, but after diving into Mathematics and Physics for a while, I now hold it as one of the capital T truths of learning.

I felt like I finally understood Shannon entropy when I realized that it's a subjective quantity -- a property of the observer, not the observed.

The entropy of a variable X is the amount of information required to drive the observer's uncertainty about the value of X to zero. As a corollary, your uncertainty and mine about the value of the same variable X could be different. This is trivially true, as we could each have received different information about X. H(X) should really be H_{observer}(X), or even better, H_{observer, time}(X).

As clear as Shannon's work is in other respects, he glosses over this.

What's often lost in the discussions about whether entropy is subjective or objective is that, if you dig a little deeper, information theory gives you powerful tools for relating the objective and the subjective.

Consider cross entropy of two distributions H[p, q] = -Σ p_i log q_i. For example maybe p is the real frequency distribution over outcomes from rolling some dice, and q is your belief distribution. You can see the p_i as representing the objective probabilities (sampled by actually rolling the dice) and the q_i as your subjective probabilities. The cross entropy is measuring something like how surprised you are on average when you observe an outcome.

The interesting thing is that H[p, p] <= H[p, q], which means that if your belief distribution is wrong, your cross entropy will be higher than it would be if you had the right beliefs, q=p. This is guaranteed by the concavity of the logarithm. This gives you a way to compare beliefs: whichever q gets the lowest H[p,q] is closer to the truth.

You can even break cross entropy into two parts, corresponding to two kinds of uncertainty: H[p, q] = H[p] + D[p||q]. The first term is the entropy of p, and it is the aleatoric uncertainty: the inherent randomness in the phenomenon you are trying to model. The second term is the KL divergence, and it tells you how much additional uncertainty you have as a result of having wrong beliefs, which you could call epistemic uncertainty.
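These identities are easy to check numerically. A minimal sketch in Python (the loaded-die distribution is made up for illustration):

```python
import math

def cross_entropy(p, q):
    """H[p, q] = -sum_i p_i * log2(q_i), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    # H[p] is just the cross entropy of p with itself
    return cross_entropy(p, p)

def kl(p, q):
    # D[p||q] = H[p, q] - H[p]
    return cross_entropy(p, q) - entropy(p)

# p: "objective" frequencies of a loaded die; q: a uniform belief
p = [0.25, 0.25, 0.125, 0.125, 0.125, 0.125]
q = [1 / 6] * 6

# Gibbs' inequality: wrong beliefs can only increase average surprise
assert cross_entropy(p, q) >= entropy(p)
# The decomposition H[p, q] = H[p] + D[p||q]
assert abs(cross_entropy(p, q) - entropy(p) - kl(p, q)) < 1e-12
print(entropy(p), cross_entropy(p, q), kl(p, q))
```

Here H[p] comes out to exactly 2.5 bits, while believing the die is fair costs about 0.085 extra bits of surprise per roll.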

Thanks, that's an interesting perspective. It also highlights one of the weak points in the concept, I think, which is that this is only a tool for updating beliefs to the extent that the underlying probability space ("ontology" in this analogy) can actually "model" the phenomenon correctly!

It doesn't seem to shed much light on when or how you could update the underlying probability space itself (or when to change your ontology in the belief setting).

This kind of thinking will lead you to ideas like algorithmic probability, where distributions are defined using universal Turing machines that could model anything.

I think what you're getting at is the construction of the sample space - the space of outcomes over which we define the probability measure (e.g. {H,T} for a coin, or {1,2,3,4,5,6} for a die).

Let's consider two possibilities:

1. Our sample space is "incomplete"

2. Our sample space is too "coarse"

Let's discuss 1 first. Imagine I have a special die with a hidden binary state which I can control, which forces the die to come up either even or odd. If your sample space is only which side faces up, and I randomize the hidden state appropriately, it appears like a normal die. If your sample space is enlarged to include the hidden state, the entropy of each roll is reduced by one bit. You will not be able to distinguish between a truly random die and a die with a hidden state if your sample space is incomplete. Is this the point you were making?
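The one-bit reduction is just the logarithm of the halved outcome set; a quick sanity check:

```python
import math

# Entropy of one roll of a fair six-sided die, in bits,
# when the sample space is only "which face is up"
h_visible = math.log2(6)

# Enlarge the sample space with the hidden even/odd control bit:
# conditioned on that bit, only 3 faces remain possible
h_given_hidden = math.log2(3)

print(h_visible - h_given_hidden)  # exactly 1 bit
```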

On 2: Now let's imagine I can only observe whether the die comes up even or odd. This is a coarse-graining of the sample space (we get strictly less information - or, we only get some "macro" information). Of course, a coarse-grained sample space is necessarily an incomplete one! We can imagine comparing the outcomes from a normal die, to one which with equal probability rolls an even or odd number, except it cycles through the microstates deterministically e.g. equal chance of {odd, even}, but given that outcome, always goes to next in sequence {(1->3->5), (2->4->6)}.

Incomplete or coarse sample spaces can indeed prevent us from inferring the underlying dynamics. Many processes can have the same apparent entropy on our sample space from radically different underlying processes.

Right, this is exactly what I'm getting at - learning a distribution over a fixed sample space can be done with Bayesian methods, or entropy-based methods like the OP suggested, but I'm wondering if there are methods that can automatically adjust the sample space as well.

For well-defined mathematical problems like dice rolling and fixed classical mechanics scenarios and such, you don't need this I guess, but for any real-world problem I imagine half the problem is figuring out a good sample space to begin with. This kind of thing must have been studied already, I just don't know what to look for!

There are some analogies to algorithms like NEAT, which automatically evolves a neural network architecture while training. But that's obviously a very different context.

We could discuss completeness of the sample space, and we can also discuss completeness of the hypothesis space.

In Solomonoff Induction, which purports to be a theory of universal inductive inference, the "complete hypothesis space" consists of all computable programs (note that all current physical theories are computable, so this hypothesis space is very general). Then induction is performed by keeping all programs consistent with the observations, weighted by two terms: the program's prior likelihood, and the probability that program assigns to the observations (the programs can be deterministic and assign probability 1).

The "prior likelihood" in Solomonoff Induction is based on the program's complexity (well, it is 2^(-Complexity), where the complexity is the length of the shortest representation of that program).

Altogether, the procedure looks like: maintain a belief which is a mixture of all programs consistent with the observations, weighted by their complexity and the likelihood they assign to the data. Of course, this procedure is still limited by the sample/observation space!

That's our best formal theory of induction in a nutshell.
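To make the flavor of it concrete, here's a deliberately tiny toy sketch, not real Solomonoff induction: the "programs" are just repeating string patterns, the hand-picked hypothesis list stands in for the space of all programs, and complexity is taken to be pattern length:

```python
from fractions import Fraction

# Toy hypothesis space: each "program" emits its pattern repeated forever.
programs = ["0", "1", "01", "10", "001", "011", "0110"]

def prior(prog):
    # 2^(-complexity), with complexity = length of the pattern
    return Fraction(1, 2 ** len(prog))

def generates(prog, obs):
    # Does this program's output start with the observed string?
    stream = (prog * (len(obs) // len(prog) + 1))[:len(obs)]
    return stream == obs

def predict_next(obs):
    """Posterior probability that the next symbol is '1':
    a complexity-weighted mixture of all consistent programs."""
    consistent = [p for p in programs if generates(p, obs)]
    total = sum(prior(p) for p in consistent)
    mass_one = sum(prior(p) for p in consistent
                   if p[len(obs) % len(p)] == "1")
    return mass_one / total

print(predict_next("00"))  # -> 1/5
```

After observing "00", both "0" (complexity 1) and "001" (complexity 3) remain consistent, but the simpler program gets 4x the weight, so the mixture leans heavily toward predicting another "0".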

Someone else pointed me to Solomonoff induction too, which looks really cool as an "idealised" theory of induction and it definitely solves my question in abstract. But there are obvious difficulties with that in practice, like the fact that it's probably uncomputable, right?

I mean I think even the "Complexity" coefficient should be uncomputable in general, since you could probably use a program which computes it to upper bound "Complexity", and if there was such an upper bound you could use it to solve the halting problem etc. Haven't worked out the details though!

Would be interesting if there are practical algorithms for this. Either direct approximations to SI or maybe something else entirely that approaches SI in the limit, like a recursive neural-net training scheme? I'll do some digging, thanks!

Correct anything that's wrong here. Cross entropy is the comparison of two distributions, right? Is the objectivity sussed out in relation to the overlap cross-section? And is the subjectivity sussed out not on average but as deviations from the average? Just trying to understand it in my framework, which might be wholly off the mark.

Cross entropy lets you compare two probability distributions. One way you can apply it is to let the distribution p represent "reality" (from which you can draw many samples, but whose numerical value you might not know) and to let q represent "beliefs" (whose numerical value is given by a model). Then by finding q to minimize cross-entropy H[p, q] you can move q closer to reality.

I'm not sure what you mean by objectivity and subjectivity in this case.

With the example of beliefs, you can think of cross entropy as the negative expected value of the log probability you assigned to an outcome, weighted by the true probability of each outcome. If you assign larger log probabilities to more likely outcomes, the cross entropy will be lower.
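That expectation is exactly what lets you estimate H[p, q] purely from samples of p, without ever knowing p's numerical values. A sketch with made-up numbers:

```python
import math
import random

random.seed(0)

p = [0.5, 0.3, 0.2]  # "reality": true outcome frequencies
q = [0.4, 0.4, 0.2]  # "beliefs": our model's probabilities

# Exact cross entropy: expected surprise under the true distribution
exact = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

# Monte Carlo estimate: average surprise over draws from reality --
# this only needs samples from p, not p's numbers themselves
samples = random.choices(range(3), weights=p, k=100_000)
estimate = sum(-math.log2(q[s]) for s in samples) / len(samples)

print(exact, estimate)  # the two agree to a couple of decimal places
```

This is also why cross-entropy loss in machine learning is computed as an average of negative log probabilities over a dataset: the dataset plays the role of samples from p.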

This doesn't really make entropy itself observer dependent. (Shannon) entropy is a property of a distribution. It's just that when you're measuring different observers' beliefs, you're looking at different distributions (which can have different entropies the same way they can have different means, variances, etc).

Entropy is a property of a distribution, but since math does sometimes get applied, we also attach distributions to things (eg. the entropy of a random number generator, the entropy of a gas...). Then when we talk about the entropy of those things, those entropies are indeed subjective, because different subjects will attach different probability distributions to that system depending on their information about that system.

Some probability distributions are objective. The probability that my random number generator gives me a certain number is given by a certain formula. Describing it with another distribution would be wrong.

Another example, if you have an electron in a superposition of half spin-up and half spin-down, then the probability to measure up is objectively 50%.

Another example, GPT-2 is a probability distribution on sequences of integers. You can download this probability distribution. It doesn't represent anyone's beliefs. The distribution has a certain entropy. That entropy is an objective property of the distribution.

Of those, the quantum superposition is the only one that has a chance at being considered objective, and it's still only "objective" in the sense that (as far as we know) your description provided as much information as anyone can possibly have about it, so nobody can have a more-informed opinion and all subjects agree.

The others are both partial-information problems which are very sensitive to knowing certain hidden-state information. Your random number generator gives you a number that you didn't expect, and for which a formula describes your best guess based on available incomplete information, but the computer program that generated it knew which one to choose and would not have picked any other. Anyone who knew the hidden state of the RNG would also have assigned a different probability to that number being chosen.

You might have some probability distribution in your head for what will come out of GPT-2 on your machine at a certain time, based on your knowledge of the random seed. But that is not the GPT-2 probability distribution, which is objectively defined by model weights that you can download, and which does not correspond to anyone’s beliefs.

I'm of the view that strictly speaking, even a fair die doesn't have a probability distribution until you throw it. It just so happens that, unless you know almost every detail about the throw, the best you can usually do is uniform.

So I would say the same of GPT-2. It's not a random variable unless you query it. But unless you know unreasonably many details, the best you can do to predict the query is the distribution that you would call "objective."

I think this gets into unanswerable metaphysical questions about when we can say mathematical objects, propositions, etc. really exist.

But I think if we take the view that it's not a random variable until we query it, that makes it awkward to talk about how GPT-2 (and similar models) is trained. No one ever draws samples from the model during training, but the whole justification for the cross-entropy-minimizing training procedure is based on thinking about the model as a random variable.

A more plausible way to argue for objectiveness is to say that some probability distributions are objectively more rational than others given the same information. E.g. when seeing a symmetrical die it would be irrational to give 5 a higher probability than the others. Or it seems irrational to believe that the sun will explode tomorrow.

The probability distribution is subjective for both parts -- because it, once again, depends on the observer observing the events in order to build a probability distribution.

E.g. your random number generator generates 1, 5, 7, 8, 3 when you run it. It generates 4, 8, 8, 2, 5 when I run it. I.e. we have received different information about the random number generator to build our subjective probability distributions. The level of entropy of our probability distributions is high because we have so little information to be certain about the representativeness of our distribution sample.

If we continue running our random number generator for a while, we will gather more information, thus reducing entropy, and our probability distributions will both start converging towards an objective "truth." If we ran our random number generators for a theoretically infinite amount of time, we will have reduced entropy to 0 and have a perfect and objective probability distribution.

Would you say that all claims about the world are subjective, because they have to be based on someone’s observations?

For example my cat weighs 13 pounds. That seems objective, in the sense that if two people disagree, only one can be right. But the claim is based on my observations. I think your logic leads us to deny that anything is objective.

I do believe in objective reality, but probabilities are subjective. Your cat weighs 13 pounds, and now that you've told me, I know it too. If you asked me to draw a probability distribution for the weight of your cat, I'd draw a tight gaussian distribution around that, representing the accuracy of your scale. My cat weighs a different amount, but I won't tell you how much, so if we both draw a probability distribution, they'll be different. And the key thing is that neither of us has an objectively correct probability distribution, not even me. My cat's weight has an objectively correct value which even I don't know, because my scale isn't good enough.

All right now, here's the big question: how do you know that the evidence your sensory apparatus reveals to you is correct? What I'm getting at is this: the only experience that is directly available to you is your sensory data. And this sensory data is merely a stream of electrical impulses which stimulate your computing center. In other words, all that I really know about the outside universe is relayed to me through my electrical connections.

Why, that would mean that... I really don't know what the outside universe is like at all, for certain.

Sorry, this is a major misinterpretation, or at least a completely different one. I don't know how to put it in a more productive way; I think your comment is very confused. You don't need to run a random number generator "for a while" in order to build up a probability distribution.

This might be a frequentist vs bayesian thing, and I am bayesian. So maybe other people would have a different view.

I don't think you need to have any information to have a probability distribution; your distribution already represents your degree of ignorance about an outcome. So without even sampling it once, you already should have a uniform probability distribution for a random number generator or a coin flip. If you do personally have additional information to help you predict the outcome -- you're skilled at coin-flipping, or you wrote the RNG and know an exploit -- then you can compress that distribution to a lower-entropy one.

But you don't need to sample the distribution to do this. You can have that information before the first coin toss. Sampling can be one way to get information but it won't necessarily even help. If samples are independent, then each sample really teaches you barely anything about the next. RNGs eventually do repeat so if you sample it enough you might be able to find the pattern and reduce the entropy to zero, but in that case you're not learning the statistical distribution, you're deducing the exact internal state of the RNG and predicting the exact next outcome, because the samples are not actually independent. If you do enough coin flips you might eventually find that there's a slight bias to the coin, but that really takes an extreme number of tosses and only reduces the entropy a tiny tiny bit; not at all if the coin-tossing procedure had no bias to begin with.
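That "tiny tiny bit" is easy to quantify with the binary entropy function:

```python
import math

def h_bernoulli(p):
    """Binary (Shannon) entropy of a coin with heads probability p, in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(h_bernoulli(0.5))   # 1.0 bit for a fair coin
print(h_bernoulli(0.51))  # ~0.9997 bits: a 1% bias barely dents the entropy
```

Detecting that 51/49 bias in the first place would also take on the order of ten thousand tosses, which is the "extreme number" referred to above.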

However the objective truth is just that the next toss will land heads. That's the only truth that experiment can objectively determine. Any other doubt that it might-have-counterfactually-landed-tails is subjective, due to a subjective lack of sufficient information to predict the outcome. We can formalize a correct procedure to convert prior information into a corresponding probability distribution, we can get a unanimous consensus by giving everybody the same information, but the probability distribution is still subjective because it is a function of that prior information.

The best introduction that I can recommend is this type-written PDF from E.T. Jaynes, called "probability theory with applications in science and engineering": https://bayes.wustl.edu/etj/science.pdf.html

It requires a lot of attention to read and follow the math, but it's worthwhile. Jaynes is a pretty passionate writer, and in his writing he's clearly battling against some enemies (who might be ghosts), but on the other hand this also makes for more entertaining reading and I find that's usually a benefit when it comes to a textbook.

"Entropy is a property of matter that measures the degree of randomization or disorder at the microscopic level", at least when considering the second law.

Right, but the very interesting thing is it turns out that what's random to me might not be random to you! And the reason that "microscopic" is included is because that's a shorthand for "information you probably don't have about a system, because your eyes aren't that good, or even if they are, your brain ignored the fine details anyway."

Entropy in physics is usually the Shannon entropy of the probability distribution over system microstates given known temperature and pressure. If the system is in equilibrium then this is objective.

That's not a problem, as the GP's post is trying to state a mathematical relation, not a historical attribution. Often newer concepts shed light on older ones. As Baez's article says, Gibbs entropy is Shannon's entropy of an associated distribution (multiplied by the constant k).

It is a problem because all three come with baggage. Almost none of the things discussed in this thread are valid when discussing actual physical entropy, even though the equations are superficially similar. And then there are lots of people being confidently wrong because they assume that it’s just one concept. It really is not.

Don't see how the connection is superficial. Even the classical macroscopic definition of entropy as ΔS = ∫ dQ/T can be derived from the information theory perspective, as Baez shows in the article (using entropy-maximizing distributions and Lagrange multipliers). If you have a more specific critique, it would be good to discuss.

In classical physics there is no real objective randomness. Particles have a defined position and momentum and those evolve deterministically. If you somehow learned these then the shannon entropy is zero. If entropy is zero then all kinds of things break down.

So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness, even though temperature does not really seem to be a quantum thing.

> If entropy is zero then all kinds of things break down.

Entropy is a macroscopic variable and if you allow microscopic information, strange things can happen! One can move from a high entropy macrostate to a low entropy macrostate if you choose the initial microstate carefully. But this is not a reliable process which you can reproduce experimentally, ie. it is not a thermodynamic process.

A thermodynamic process P is something which takes a macrostate A to a macrostate B, independent of which microstate a0, a1, a2, ... in A you started with. If the process depended on the microstate, it wouldn't be something we would recognize, as we are looking from the macro perspective.

Which we don’t know precisely. Entropy is about not knowing.

> If you somehow learned these then the shannon entropy is zero.

Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space. (You need an appropriate extension of Shannon’s entropy to continuous distributions.)

> So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness

> Which we don’t know precisely. Entropy is about not knowing.

No, it is not about not knowing. This is an instance of the intuition from Shannon’s entropy does not translate to statistical Physics.

It is about the number of possible microstates, which is completely different. In Physics, entropy is a property of a bit of matter, it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.

> Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space.

No, 0. In this case, there is a single state with p = 1, and S = -k Σ p ln(p) = 0.

This is the same if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).

The probability p of a microstate is always between 0 and 1, therefore p ln(p) is never positive and S is always non-negative.

You get the same using Boltzmann’s approach, in which case Ω = 1 and S = k ln(Ω) is also 0.

> (You need an appropriate extension of Shannon’s entropy to continuous distributions.)

>>> Particles have a defined position and momentum [...] If you somehow learned these then the shannon entropy is zero.

>> Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space [and diverges to minus infinity if you define precisely the position and momentum of the particles and the volume in phase space goes to zero]

> [It's zero also] if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).

> The probability p of an microstate is always between 0 and 1, therefore p ln(p) is always negative and S is always positive.

The points in the phase space are not "microstates" with probability between 0 and 1. It's a continuous distribution and if it collapses to a point (i.e. you somehow learned the exact positions and momentums) the density at that point is unbounded. The entropy is also unbounded and goes to minus infinity as the volume in phase space collapses to zero.

You can avoid the divergence by dividing the continuous phase space into discrete "microstates" but having a well-defined "microstate" corresponding to some finite volume in phase space is not the same as what was written above about "particles having a defined position and momentum" that is "somehow learned". The microstates do not have precisely defined positions and momentums. The phase space is not reduced to a single point in that case.

If the phase space is reduced to a single point I'd like to see your proof that S(ρ) = −k ∫ ρ(x) log ρ(x) dx = 0

I hadn't realized that "differential" entropy and shannon entropy are actually different and incompatible, huh.

So the case I mentioned, where you know all the positions and momentums has 0 shannon entropy and -Inf differential entropy. And a typical distribution will instead have Inf shannon entropy and finite differential entropy.

Wikipedia has some pretty interesting discussion about differential entropy vs the limiting density of discrete points, but I can't claim to understand it or whether it could bridge the gap here.
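The divergence is concrete in the Gaussian case, where differential entropy has a closed form (h = ½ log₂(2πeσ²) bits):

```python
import math

def gaussian_diff_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in bits: (1/2) log2(2*pi*e*sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

for sigma in (1.0, 0.1, 1e-6, 1e-12):
    print(sigma, gaussian_diff_entropy(sigma))
# Unlike Shannon entropy, this goes negative for narrow distributions
# and diverges to -inf as sigma -> 0 (the "point in phase space" limit).
```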

Quantum mechanics solves the issue of the continuity of the state space. However, as you probably know, in quantum mechanics all the positions and momentums cannot simultaneously have definite values.

> In Physics, entropy is a property of a bit of matter, it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.

Enthalpy is also dependent on your choice of state variables, which is in turn dictated by which observables you want to make predictions about: whether two microstates are distinguishable, and thus whether they are part of the same macrostate, depends on the tools you have for distinguishing them.

A calorimeter does not care about anyone’s choice of state variables. Entropy is not only something that exists in abstract theoretical constructs, it is something we can get experimentally.

If information-theoretical and statistical mechanics entropies are NOT the same (or at least, deeply connected) then what stops us from having a little guy[0] sort all the particles in a gas to extract more energy from them?

Sounds like a non-sequitur to me; what are you implying about the Maxwell's demon thought experiment vs the comparison between Shannon and stat-mech entropy?

Yeah but distributions are just the accounting tools to keep track of your entropy. If you are missing one bit of information about a system, your understanding of the system is some distribution with one bit of entropy. Like the original comment said, the entropy is the number of bits needed to fill in the unknowns and bring the uncertainty down to zero. Your coin flips may be unknown in advance to you, and thus you model it as a 50/50 distribution, but in a deterministic universe the bits were present all along.

It's an objective quantity, but you have to be very precise in stating what the quantity describes.

Unbroken egg? Low entropy. There's only one way the egg can exist in an unbroken state, and that's it. You could represent the state of the egg with a single bit.

Broken egg? High entropy. There are an arbitrarily-large number of ways that the pieces of a broken egg could land.

A list of the locations and orientations of each piece of the broken egg, sorted by latitude, longitude, and compass bearing? Low entropy again; for any given instance of a broken egg, there's only one way that list can be written.

Zip up the list you made? High entropy again; the data in the .zip file is effectively random, and cannot be compressed significantly further. Until you unzip it again...

Likewise, if you had to transmit the (uncompressed) list over a bandwidth-limited channel. The person receiving the data can make no assumptions about its contents, so it might as well be random even though it has structure. Its entropy is effectively high again.
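You can watch this happen by measuring byte frequencies before and after compression; a sketch, where the CSV-ish "list" is invented for illustration:

```python
import math
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte-frequency distribution, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A highly structured "list": the same record header repeated many times
text = b"piece_id,lat,lon,bearing\n" * 1000
packed = zlib.compress(text)

print(len(text), byte_entropy(text))      # low: very predictable bytes
print(len(packed), byte_entropy(packed))  # higher: structure squeezed out
```

The compressed output is a tiny fraction of the original size, and its per-byte entropy is higher because the compressor has already exploited the structure a receiver could otherwise have assumed.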

Entropy is calculated as a function of a probability distribution over possible messages or symbols. The sender might have a distribution P over possible symbols, and the receiver might have another distribution Q over possible symbols. Then the "true" distribution over possible symbols might be another distribution yet, call it R. The mismatch between these is what leads to various inefficiencies in coding, decoding, etc [1]. But both P and Q are beliefs about R -- that is, they are properties of observers.

the subjectivity doesn't stem from the definition of the channel but from the model of the information source. what's the prior probability that you intended to say 'weave', for example? that depends on which model of your mind we are using. frequentists argue that there is an objectively correct model of your mind we should always use, and bayesians argue that it depends on our prior knowledge about your mind

(i mean, your information about what the channel does is also potentially incomplete, so the same divergence in definitions could arise there too, but the subjectivity doesn't just stem from the definition of the channel; and shannon entropy is a property that can be imputed to a source independent of any channel)

I really liked the approach my stat mech teacher used. In nearly all situations, entropy just ends up being the log of the number of ways a system can be arranged (https://en.wikipedia.org/wiki/Boltzmann%27s_entropy_formula) although I found it easiest to think in terms of pairs of dice rolls.
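The dice version is easy to play with in code; here's a minimal sketch (mine, in Python) where each ordered pair of faces is a microstate and the sum is the macrostate:

```python
import math
from collections import Counter

# Each ordered pair of faces (a, b) is a microstate; the sum is the macrostate.
microstates = Counter(a + b for a in range(1, 7) for b in range(1, 7))

for total, ways in sorted(microstates.items()):
    # Boltzmann-style entropy (taking k_B = 1): S = log(number of arrangements)
    print(f"sum={total}: {ways} ways, S={math.log(ways):.3f}")
```

A sum of 7 has six microstates and hence the highest entropy; 2 and 12 have one microstate each, so S = log(1) = 0.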

And this is what I prefer too, although with the clarification that it's the number of ways that a system can be arranged without changing its macroscopic properties.

It's, unfortunately, not very compatible with Shannon's usage in any but the shallowest sense, which is why it stays firmly in the land of physics.

> not very compatible with Shannon's usage in any but the shallowest sense

The connection is not so shallow, there are entire books based on it.

“The concept of information, intimately connected with that of probability, gives indeed insight on questions of statistical mechanics such as the meaning of irreversibility. This concept was introduced in statistical physics by Brillouin (1956) and Jaynes (1957) soon after its discovery by Shannon in 1948 (Shannon and Weaver, 1949). An immense literature has since then been published, ranging from research articles to textbooks. The variety of topics that belong to this field of science makes it impossible to give here a bibliography, and special searches are necessary for deepening the understanding of one or another aspect. For tutorial introductions, somewhat more detailed than the present one, see R. Balian (1991-92; 2004).”

I don't dispute that the math is compatible. The problem is the interpretation thereof. When I say "shallowest", I mean the implications of each are very different.

As far as I'm aware, there is no information-theoretic equivalent of the 2nd or 3rd laws of thermodynamics, so the intuition a student works up from physics about how and why entropy matters just doesn't transfer. Likewise, even if an information science student is well versed in the concept of configuration entropy, that's 15 minutes of one lecture in statistical thermodynamics. There's still the rest of the course to consider.

Assuming each of the N microstates for a given macrostate is equally probable, with p = 1/N, the Shannon entropy is -Σ p·log(p) = -N·p·log(p) = -1·log(1/N) = log(N), which is the physics interpretation.
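A quick numerical sanity check of that identity (a sketch of mine, assuming N equiprobable microstates):

```python
import math

N = 8
p = [1 / N] * N  # N equally likely microstates

# Shannon entropy of the uniform distribution...
shannon = -sum(pi * math.log(pi) for pi in p)

# ...equals Boltzmann's log(N)
assert abs(shannon - math.log(N)) < 1e-12
```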

In the continuous version, you would get log(V) where V is the volume in phase space occupied by the microstates for a given macrostate.

Liouville's theorem, which says that volume is conserved in phase space, implies that a macroscopic process can move all the microstates of a macrostate A into a macrostate B only if the volume of B is bigger than the volume of A. This implies that the entropy of B must be bigger than the entropy of A, which is the Second Law.

The second law of thermodynamics is time-asymmetric, but the fundamental physical laws are time-symmetric, so from them you can only predict that the entropy of B should be bigger than the entropy of A irrespective of whether B is in the future or the past of A. You need the additional assumption (Past Hypothesis) that the universe started in a low entropy state in order to get the second law of thermodynamics.

> If our goal is to predict the future, it suffices to choose a distribution that is uniform in the Liouville measure given to us by classical mechanics (or its quantum analogue). If we want to reconstruct the past, in contrast, we need to conditionalize over trajectories that also started in a low-entropy past state — that is the “Past Hypothesis” that is required to get stat mech off the ground in a world governed by time-symmetric fundamental laws.

The second law of thermodynamics is about systems that are well described by a small set of macroscopic variables. The evolution of an initial macrostate prepared by an experimenter who can control only the macrovariables is reproducible. When a thermodynamical system is prepared in such a reproducible way the preparation is happening in the past, by definition.

The second law is about how part of the information that we had about a system - constrained to be in a macrostate - is “lost” when we “forget” the previous state and describe it using just the current macrostate. We know more precisely the past than the future - the previous state is in the past by definition.

The "can be arranged" part is the tricky one. E.g. you might know from context that some states are impossible (where the probability distribution is zero), even though they combinatorially exist. That changes the entropy for you.

That is why information and entropy are different things. Entropy is what you know you do not know. That knowledge of the magnitude of the unknown is what is being quantified.

Also, here is the point where I think the article is wrong (or not precise enough), as its phrasing would include the unknown unknowns, which are not entropy IMO:

> I claim it’s the amount of information we don’t know about a situation

For information theory, I've always thought of entropy as follows:

"If you had a really smart compression algorithm, how many bits would it take to accurately represent this file?"

I.e., highly repetitive inputs compress well because they don't have much entropy per bit. Modern compression algorithms are good enough on most data to be used as a reasonable approximation of the true entropy.
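As a rough illustration (hedged: zlib is nowhere near an optimal compressor, so this only gives an upper bound on the entropy rate):

```python
import os
import zlib

def compressed_bits_per_byte(data: bytes) -> float:
    # Compressed size approximates the source entropy from above:
    # a real compressor can only overshoot the true entropy rate.
    return 8 * len(zlib.compress(data, 9)) / len(data)

repetitive = b"abc" * 10_000     # highly predictable
random_ish = os.urandom(30_000)  # incompressible

print(compressed_bits_per_byte(repetitive))  # well under 1 bit/byte
print(compressed_bits_per_byte(random_ish))  # close to 8 bits/byte
```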

I've always favored this down-to-earth characterization of the entropy of a discrete probability distribution. (I'm a big fan of John Baez's writing, but I was surprised glancing through the PDF to find that he doesn't seem to mention this viewpoint.)

Think of the distribution as a histogram over some bins. Then, the entropy is a measurement of, if I throw many many balls at random into those bins, the probability that the distribution of balls over bins ends up looking like that histogram. What you usually expect to see is a uniform distribution of balls over bins, so the entropy measures the probability of other rare events (in the language of probability theory, "large deviations" from that typical behavior).

More specifically, if P = (P1, ..., Pk) is some distribution, then the probability that throwing N balls (for N very large) gives a histogram looking like P is about 2^(-N * [log(k) - H(P)]), where H(P) is the entropy. When P is the uniform distribution, then H(P) = log(k), the exponent is zero, and the estimate is 1, which says that by far the most likely histogram is the uniform one. That is the largest possible entropy, so any other histogram has probability 2^(-c*N) of appearing for some c > 0, i.e., is very unlikely, and exponentially more so the more balls we throw; the entropy measures just how much. "Less uniform" distributions are less likely, so the entropy also measures a certain notion of uniformity. In large deviations theory this specific claim is called "Sanov's theorem" and the role the entropy plays is that of a "rate function."
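To make the estimate concrete, here's a sketch (my own, with arbitrary example numbers) comparing the exact multinomial probability of a histogram "looking like P" against the Sanov exponent; they agree up to lower-order Stirling corrections of size O(log N):

```python
import math

LOG2 = math.log(2)

def log2_multinomial_prob(counts, k):
    # Exact log2 of the probability that N uniform balls over k bins
    # land on exactly these counts: multinomial(N; counts) * k^(-N).
    N = sum(counts)
    out = math.lgamma(N + 1) / LOG2 - N * math.log2(k)
    for c in counts:
        out -= math.lgamma(c + 1) / LOG2
    return out

k, N = 4, 400
P = [0.4, 0.3, 0.2, 0.1]
counts = [int(N * p) for p in P]          # a histogram "looking like" P
H = -sum(p * math.log2(p) for p in P)     # entropy of P, in bits
sanov_exponent = -N * (math.log2(k) - H)  # log2 of 2^(-N*[log(k) - H(P)])

# The exact exponent sits a few log2(N)'s below the Sanov estimate.
print(log2_multinomial_prob(counts, k), sanov_exponent)
```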

The counting interpretation of entropy that some people are talking about is related, at least at a high level, because the probability in Sanov's theorem is the number of outcomes that "look like P" divided by the total number, so the numerator there is indeed counting the number of configurations (in this case of balls and bins) having a particular property (in this case looking like P).

There are lots of equivalent definitions and they have different virtues, generalizations, etc, but I find this one especially helpful for dispelling the air of mystery around entropy.

Hey did you want to say relative entropy ~ rate function ~ KL divergence. Might be more familiar to ML enthusiasts here, get them to be curious about Sanov or large deviations.

That's right, here log(k) - H(p) is really the relative entropy (or KL divergence) between p and the uniform distribution, and all the same stuff is true for a different "reference distribution" of the probabilities of balls landing in each bin.

For discrete distributions the "absolute entropy" (just sum of -p log(p) as it shows up in Shannon entropy or statistical mechanics) is in this way really a special case of relative entropy. For continuous distributions, say over real numbers, the analogous quantity (integral of -p log(p)) isn't a relative entropy since there's no "uniform distribution over all real numbers". This still plays an important role in various situations and calculations...but, at least to my mind, it's a formally similar but conceptually separate object.
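The identity behind that statement is easy to verify numerically (a small sketch with made-up numbers): the relative entropy from p to the uniform distribution is exactly log(k) - H(p):

```python
import math

def H(p):
    # Shannon entropy in bits
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    # Relative entropy D(p || q) = sum_i p_i log2(p_i / q_i)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

k = 4
p = [0.4, 0.3, 0.2, 0.1]
uniform = [1 / k] * k

# D(p || uniform) == log(k) - H(p)
assert abs(kl(p, uniform) - (math.log2(k) - H(p))) < 1e-12
```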

Information entropy is literally the strict lower bound on how efficiently information can be communicated (expected number of transmitted bits) if the probability distribution which generates this information is known, that's it. Even in contexts such as calculating the information entropy of a bit string, or the English language, you're just taking this data and constructing some empirical probability distribution from it using the relative frequencies of zeros and ones or letters or n-grams or whatever, and then calculating the entropy of that distribution.
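The empirical-distribution step can be sketched like this (single-letter frequencies only; a serious estimate would use n-grams or a better model):

```python
import math
from collections import Counter

def empirical_entropy(s: str) -> float:
    # Entropy (bits/symbol) of the empirical symbol distribution of s
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

print(empirical_entropy("aaaa"))  # one symbol: no uncertainty at all
print(empirical_entropy("abab"))  # two equally frequent symbols: 1 bit/symbol
```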

I can't say I'm overly fond of Baez's definition, but far be it from me to question someone of his stature.

"I have largely avoided the second law of thermodynamics, which says that entropy always increases. While fascinating, this is so problematic that a good explanation would require another book!"

For those interested I am currently reading "Entropy Demystified" by Arieh Ben-Naim which tackles this side of things from much the same direction.

I sometimes ponder where new entropy/randomness comes from. If we take the earliest state of the universe to be an infinitely dense point particle which expanded, there must have been some randomness, or say variety, which led it to expand in a non-uniform way, which in turn led to the dominance of matter over anti-matter and the creation of galaxies, clusters, etc.
If we take an isolated system in which certain static particles are present, could it be the case that a small subset of the particles acquires motion and thus introduces entropy? Can entropy be induced automatically, at least on a quantum level?
If anyone can help me understand this it would be very helpful, and it might help explain the origin of the universe in a better way.

Symmetry breaking is the general phenomenon that underlies most of that.

The classic example is this:

Imagine you have a perfectly symmetrical sombrero[1], and there's a ball balanced on top of the middle of the hat. There's no preferred direction it should fall in, but it's _unstable_. Any perturbation will make it roll down hill and come to rest in a stable configuration on the brim of the hat. The symmetry of the original configuration is now broken, but it's stable.

He argues that the randomness you are looking for comes from quantum fluctuations, and if this randomness did not exist, the universe would probably never have "happened".

Thanks for the reference, it will take some time before I watch the whole video.
Can you tell me, in short, what those quantum fluctuations are? Are they part of some physical law?

Am I the only one who can't download the PDF, or is the file server down? I can see the blog page, but when I try downloading the ebook it just doesn't work.

If the file server is down, could anyone upload the ebook for download?

Hmmm, I've noticed that the list of things that contribute to entropy omits particles which under "normal circumstances" on Earth exist in bound states; for example, it doesn't mention W bosons or gluons. But in some parts of the universe they're not bound and are in a different state of matter, e.g. quark-gluon plasma. I wonder how, or if, this was taken into account.

I like the formulation of 'the amount of information we don't know about a system that we could in theory learn'. I'm surprised there's no mention of the Copenhagen interpretation's interaction with this definition; under a lot of QM interpretations, 'unavailable information' is different from available information.

>I have largely avoided the second law of thermodynamics ... Thus, the aspects of entropy most beloved by physics popularizers will not be found here.

But personally, this bit is the most exciting to me.

>I have tried to say as little as possible about quantum mechanics, to keep the physics prerequisites low. However, Planck’s constant shows up in the formulas for the entropy of the three classical systems mentioned above. The reason for this is fascinating: Planck’s constant provides a unit of volume in position-momentum space, which is necessary to define the entropy of these systems. Thus, we need a tiny bit of quantum mechanics to get a good approximate formula for the entropy of hydrogen, even if we are trying our best to treat this gas classically.

There's a fundamental nature to entropy, but as usual it's not very enlightening for our poor monkey brains, so to explain it you need to enumerate all of its high-level behavior; but its high-level behavior is accidental and can't be summarized in a concise form.

My definition: Entropy is a measure of the accumulation of non-reversible energy transfers.

Side note: All reversible energy transfers involve an increase in potential energy. All non-reversible energy transfers involve a decrease in potential energy.

That definition doesn't work well because you can have changes in entropy even if no energy is transferred, e.g. by exchanging some other conserved quantity.

The side note is wrong in letter and spirit; turning potential energy into heat is one way for something to be irreversible, but neither of those statements is true.

For example, consider an iron ball being thrown sideways. It hits a pile of sand and stops. The iron ball is not affected structurally, but its kinetic energy is transferred (almost entirely) to heat energy. If the ball is thrown slightly upwards, potential energy increases but the process is still irreversible.

Also, the changes of potential energy in corresponding parts of two Carnot cycles are directionally the same, even if one is ideal (reversible) and one is not (irreversible).

After years of thought I dare to say the 2nd law of thermodynamics is a tautology. "Entropy is increasing" means every system tends toward its most probable state, which means the most probable outcome is the most probable outcome.

If I were to write a book with that title, I would get to the point a bit faster, probably as follows.

Entropy is just a number you can associate with a probability distribution. If the distribution is discrete, so you have a set p_i, i = 1..n, which are each positive and sum to 1, then the definition is:

S = - sum_i p_i log( p_i )

Mathematically we say that entropy is a real-valued function on the space of probability distributions. (Elementary exercises: show that S >= 0 and it is maximized on the uniform distribution.)
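Both exercises are easy to check numerically; a throwaway sketch (mine, not part of any course):

```python
import math
import random

def S(p):
    # S = -sum_i p_i log(p_i); terms with p_i = 0 contribute nothing
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

n = 5
uniform = [1 / n] * n
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    p = [wi / sum(w) for wi in w]          # a random distribution on n points
    assert 0 <= S(p) <= S(uniform) + 1e-9  # S >= 0, maximized at uniform

print(S(uniform), math.log(n))  # the maximum value is log(n)
```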

That is it. I think there is little need for all the mystery.

So the only thing you need to know about entropy is that it's a real-valued number you can associate with a probability distribution? And that's it? I disagree. There are several numbers that can be associated with a probability distribution, and entropy is an especially useful one, but to understand why entropy is useful, or why you'd use that function instead of a different one, you need to know a few more things than just what you've written here.

In particular, the expectation (or variance) of a real-valued random variable can also be seen as "a real-valued number you can associate with a probability distribution".

Thus, GP's statement is basically: "entropy is like expectation, but different".

Exactly, saying that's all there is to know about entropy is like saying all you need to know about chess are the rules and all you need to know about programming is the syntax/semantics.

Knowing the plain definition or the rules is nothing but a superficial understanding of the subject. Knowing how to use the rules to actually do something meaningful, having a strategy, that's where meaningful knowledge lies.

The problem is that this doesn't get at many of the intuitive properties of entropy.

A different explanation (based on macro- and micro-states) makes it intuitively obvious why entropy is non-decreasing with time or, with a little more depth, what entropy has to do with temperature.

That doesn't strike me as a problem. Definitions are often highly abstract and counterintuitive, with much study required to understand at an intuitive level what motivates them. Rigour and intuition are often competing concerns, and I think definitions should favour the former. The definition of compactness in topology, or indeed just the definition of a topological space, are examples of this - at face value, they're bizarre. You have to muck around a fair bit to understand why they cut so brilliantly to the heart of the thing.

The above evidently only suffices as a definition, not as an entire course. My point was just that I don't think any other introduction beats this one, especially for a book with the given title.

In particular it has always been my starting point whenever I introduce (the entropy of) macro- and micro-states in my statistical physics course.

Correct! And it took me just one paragraph, not the 18 pages of meandering (and, I think, confusing) text that it takes the author of the PDF to introduce the same idea.

Haha, you reminded me of that idea in software engineering that "it's easy to make an algorithm faster if you accept that at times it might output the wrong result; in fact, you can make it infinitely fast".

Thanks for defining it rigorously. I think people are getting offended on John Baez's behalf because his book obviously covers a lot more - like why does this particular number seem to be so useful in so many different contexts? How could you have motivated it a priori? Etcetera, although I suspect you know all this already.

But I think you're right that a clear focus on the maths is useful for dispelling misconceptions about entropy.

Misconceptions about entropy are misconceptions about physics. You can't dispel them by focusing on the maths and ignoring the physics entirely - especially if you just write an equation without any conceptual discussion, not even a mathematical one.

I didn't say to only focus on the mathematics. Obviously wherever you apply the concept (and it's applied to much more than physics) there will be other sources of confusion. But just knowing that entropy is a property of a distribution, not a state, already helps clarify your thinking.

For instance, you know that the question "what is the entropy of a broken egg?" is actually meaningless, because you haven't specified a distribution (or a set of micro/macro states in the stat mech formulation).

Ok, I don’t think we disagree. But knowing that entropy is a property of a distribution given by that equation is far from “being it” as a definition of the concept of entropy in physics.

Anyway, it seems that - like many others - I just misunderstood the “little need for all the mystery” remark.

> is far from “being it” as a definition of the concept of entropy in physics.

I simply do not understand why you say this. Entropy in physics is defined using exactly the same equation. The only thing I need to add is the choice of probability distribution (i.e. the choice of ensemble).

I really do not see a better "definition of the concept of entropy in physics".

(For quantum systems one can nitpick a bit about density matrices, but in my view that is merely a technicality on how to extend probability distributions to Hilbert spaces.)

I’d say that the concept of entropy “in physics” is about (even better: starts with) the choice of a probability distribution. Without that you have just a number associated with each probability distribution - distributions without any physical meaning so those numbers won’t have any physical meaning either.

But that’s fine, I accept that you may think that it’s just a little detail.

(Quantum mechanics has no mystery either.

ih/2pi dA/dt = AH - HA

That’s it. The only thing one needs to add is a choice of operators.)

Sarcasm aside, I really do not think you are making much sense.

Obviously one first introduces the relevant probability distributions (at least the micro-canonical ensemble). But once you have those, your comment still does not offer a better way to introduce entropy other than what I wrote. What did you have in mind?

In other words, how did you think I should change this part of my course?

Many students will want to know where the minus sign comes from. I like to write the formula instead as S = sum_i p_i log( 1 / p_i ), where (1 / p_i) is the "surprise" (i.e., the expected number of trials until the first success) associated with a given outcome (or symbol), and we average it over all outcomes (i.e., weight it by the probability of the outcome). We take the log of the "surprise" because entropy is an extensive quantity, so we want it to be additive.
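In code, the rewritten form reads quite naturally (a tiny sketch of mine):

```python
import math

p = [0.5, 0.25, 0.25]
# S = sum_i p_i * log2(1 / p_i): the probability-weighted
# average of the log-"surprise" of each outcome
S = sum(pi * math.log2(1 / pi) for pi in p)
print(S)  # 1.5 bits
```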

As of this moment there are six other top-level comments which each try to define entropy, and frankly they are all wrong, circular, or incomplete. Clearly the very definition of entropy is confusing, and the definition is what my comment provides.

I never said that all the other properties of entropy are now immediately visible. Instead I think it is the only universal starting point of any reasonable discussion or course on the subject.

And lastly I am frankly getting discouraged by all the dismissive responses. So this will be my last comment for the day, and I will leave you in the careful hands of, say, the six other people who are obviously so extremely knowledgeable about this topic. /s

One could also say that it’s just a consequence of the passage of time (as in getting away from a boundary condition). The decay of radioactive atoms is also a measure of the arrow of time - of course we can say that’s the same thing.

CP violation may (or may not) be more relevant regarding the arrow of time.

My first contact with entropy was in chemistry and thermodynamics, and I didn't get it. In fact, I didn't get anything from engineering thermodynamics books such as Çengel's and so on.

This seems like a great resource for referencing the various definitions. I've tried my hand at developing an intuitive understanding: https://spacechimplives.substack.com/p/observers-and-entropy. TLDR - it's an artifact of the model we're using. In the thermodynamic definition, the energy accounted for by the terms of our model is information. The energy that's not is entropic energy. Hence it's not "useable" energy, and the process isn't reversible.

Entropy is the distribution of potential over negative potential.

This could be said "the distribution of what ever may be over the surface area of where it may be."

This is erroneously taught in conventional information theory as "the number of configurations in a system", or the available information that has yet to be retrieved. Entropy includes the unforeseen and the out of scope.

Entropy is merely the predisposition to flow from high to low pressure (potential). That is it. Information is a form of potential.

Philosophically what are entropy's guarantees?

- That there will always be a super-scope, which may interfere in ways unanticipated;

- everything decays the only mystery is when and how.

It sounds like log-probability is the manifold surface area.

Distribution of potential over negative potential. Negative potential is the "surface area", and available potential distributes itself "geometrically". All this is iterative obviously, some periodicity set by universal speed limit.

It really doesn't sound like you disagree with me.

Baez seems to use the definition you call erroneous: "It’s easy to wax poetic about entropy, but what is it? I claim it’s the amount of information we don’t know about a situation, which in principle we could learn."

But it is possible to account for the unforeseen (or out-of-vocabulary) by, for example, a Good-Turing estimate. This satisfies your demand for a fully defined state space while also being consistent with GP's definition.

You are referring to the conceptual device you believe belongs to you and your equations. Entropy creates attraction and repulsion, even causing working bias. We rely upon it for our system functions.

All definitions of entropy stem from one central, universal definition: entropy is the amount of energy unable to be used for useful work. Or, put better: entropy describes the effect that not all energy consumed can be used for work.

There's a good case to be made that the information-theoretic definition of entropy is the most fundamental one, and the version that shows up in physics is just that concept as applied to physics.

My favorite course I took as part of my physics degree was statistical mechanics. It leaned way closer to information theory than I would have expected going in, but in retrospect should have been obvious.

Unrelated: my favorite bit from any physics book is probably still the introduction of the first chapter of "States of Matter" by David Goodstein: "Ludwig Boltzmann, who spent much of his life studying statistical mechanics, died in 1906, by his own hand. Paul Ehrenfest, carrying on the work, died similarly in 1933. Now it is our turn to study statistical mechanics."

Not really. Information theory applies to anything probability applies to, including many situations that aren't "physics" per se. For instance it has a lot to do with algorithms and data as well. I think of it as being at the level of geometry and calculus.

Yeah, people seemingly misunderstand that the entropy applied to thermodynamics is simply an aggregate statistic that summarizes the complex state of the thermodynamic system as a single real number.

The fact that entropy always rises etc, has nothing to do with the statistical concept of entropy itself. It simply is an easier way to express the physics concept that individual atoms spread out their kinetic energy across a large volume.

I'm not sure that's quite the right perspective. It's not a coincidence that entropy increases over time; the increase in entropy seems to be very fundamental to the way physics goes. I prefer the interpretation "physics doesn't care what direction the arrow of time points, but we perceive it as pointing in the direction of increasing entropy". Although that's not totally satisfying either.

A well known anecdote reported by Shannon:

"My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.'"

See the answers to this MathOverflow SE question (https://mathoverflow.net/questions/403036/john-von-neumanns-...) for references on the discussion whether Shannon's entropy is the same as the one from thermodynamics.

Von Neumann was the king of kings

So much so, he has his own entropy!

https://en.wikipedia.org/wiki/Von_Neumann_entropy

I disagree... Von Neumann went beyond being a King of Kings, the man was a God (or a "Monster Mind" according to Feynman) :)

He's a certified Martian: https://en.wikipedia.org/wiki/The_Martians_(scientists).

I was hoping the Wikipedia article might explain why this might have been.

https://emilkirkegaard.dk/en/2022/11/a-theory-of-ashkenazi-g...

Emil Kirkegaard is a self-described white nationalist eugenicist who thinks the age of consent is too high. I wouldn't trust anything he has to say.

No need for ad hominems. This suffices to place doubt on the article's premises (and therefore any conclusion):

>> This hasn’t been strictly shown mathematically, but I think it is true.

> Emil Kirkegaard is a self-described white nationalist

That's simply a lie.

> who thinks the age of consent is too high

Too high in which country? Such laws vary strongly, even by US state, and he is from Denmark. Anyway, this has nothing to do with the topic at hand.

In Spain it used to be as low as 13 a few decades ago; but that law was obviously written before the rural exodus of inner Spain into the cities (from the 60's to almost the 80's), when children from early puberty on got to work/help in the fields or at home, and by age 14 they had far more duties and accountabilities than today. And yes, that yielded more maturity.

Thus, the law had to be adjusted for more urban/civilized times, up to 16. Although, depending on age/mentality closeness (such as 15-19, as happened in a recent case), the young adult had their charges totally dropped.


It's odd... as someone interested in but not fully immersed in the sciences, I see his name pop up everywhere.

He was really brilliant, made contributions all over the place in the math/physics/tech field, and had a sort of wild and quirky personality that people love telling stories about.

A funny quote about him from Edward "a guy with multiple equations named after him" Teller:

> Edward Teller observed "von Neumann would carry on a conversation with my 3-year-old son, and the two of them would talk as equals, and I sometimes wondered if he used the same principle when he talked to the rest of us."

Are there many von-Neumann-like multidisciplinarians nowadays? It feels like unless one is razor sharp and fully committed to one field, one is not taken seriously by those who made careers in it (and who have the last word on it).

IMO they do exist, but the popular attitude that it's not possible anymore is the issue, not a lack of genius. If everyone has a built-in assumption that it can't happen anymore, then we will naturally prune away the social pathways that enable it.

I think there are none. The world has gotten too complicated for that. It was early days in quantum physics, information theory, and computer science. I don’t think it is early days in anything that consequential anymore.

It’s the early days in a lot of fields, but they tend to be fiendishly difficult like molecular biology or neuroscience.

Centuries ago, the limitation on most knowledge was the difficulty of discovery; once known, it was accessible to most scholars. Take calculus, which is taught in every high school in America. The problem is that we're getting to a point where new fields are built on such extreme prerequisites that even the known knowledge is extremely hard for talented university students to learn, let alone what is required to discover and advance the field. Until we are able to augment human intelligence, the days of the polymath advancing multiple fields are mostly over. I would also argue that the standards for peer-reviewed papers and for obtaining PhDs have significantly dropped (due to the incentive structure of spamming as many papers as possible), which is only hurting the advancement of knowledge.

Sounds like the increased difficulty could be addressed with new models and the right abstraction layers. E.g., there's incredible complexity in modern computing, but you don't need to know assembly in order to build a Web app, to reason about architecture, or to work in functional paradigms. However, this doesn't seem to happen in the natural sciences. I wonder if adopting better models runs into gatekeepers protecting their status, tenure, and the status quo.

Of course it happens in the natural sciences. The neuroscientist doesn't need to do quantum mechanical calculations to do research.

Neither does a Web app developer need to know how to use CNC or make a transistor. Your example is about different levels of abstraction than what I meant.

I was replying to “even the known knowledge is extremely hard for talented university students to learn”. If complexity of the known knowledge one must learn to substantially contribute is the reason becoming an accomplished multidisciplinary is impossible nowadays, then it sounds like we could use some better models and levels of abstraction.

More than that, as professionals' career paths in fields develop, the organisations they work for specialize, becoming less amenable to the generalist. ('Why should we hire this mathematician who is also an expert in legal research? Their attention is probably divided, and meanwhile we have a 100% mathematician in the candidate pool fresh from an expensive dedicated PhD program with a growing family to feed.')

I'm obviously using the archetype of Leibniz here as an example but pick your favorite polymath.

Are they fiendishly difficult or do we just need a von Neumann to come along and do what he did for quantum mechanics to them?

There have been a very small number of thinkers as publicly accomplished as von Neumann, ever. One other who comes to mind is Carl F. Gauss.

Is it fair to say that the number of publicly accomplished multidisciplinaries alive at a particular moment is not rising as might be expected, proportionally to the total number of suitably educated people?

Genius Edward Teller Describes 1950s Genius John Von Neumann

https://youtu.be/Oh31I1F2vds?t=189 describes von Neumann's struggle in his final days, when he could no longer think. Thinking was the activity he loved most.

Euler.

JvN was one of the smartest ever, but Euler was there centuries before and shows up in so many places.

If I had a time machine I'd love to get those two together for a stiff drink and some banter.

My favorite Von Neumann anecdote/quote is this one:

John Von Neumann once said to Felix Smith: "Young man, in mathematics you don't understand things. You just get used to them." This was a response to Smith's fear about the method of characteristics.

It took me a while to fully grasp what he meant, but after diving into Mathematics and Physics for a while, I now hold it as one of the capital T truths of learning.

Even mortals such as ourselves can apply some of Von Neumann's ideas in our everyday lives:

https://en.m.wikipedia.org/wiki/Fair_coin#Fair_results_from_...
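The linked trick (von Neumann's fair-coin extractor) can be sketched in a few lines of Python; the bias value and helper names here are made up for illustration:

```python
import random

def biased_flip(bias, rng):
    """One flip of a coin that lands 'H' with probability `bias`."""
    return 'H' if rng.random() < bias else 'T'

def fair_flip(bias, rng):
    """Von Neumann's extractor: flip the biased coin twice.
    HT and TH each occur with probability bias*(1-bias), so they are
    equally likely; map HT -> 'H', TH -> 'T', and discard HH / TT."""
    while True:
        a, b = biased_flip(bias, rng), biased_flip(bias, rng)
        if a != b:
            return a

rng = random.Random(0)                   # fixed seed for repeatability
flips = [fair_flip(0.7, rng) for _ in range(10_000)]
print(flips.count('H') / len(flips))     # close to 0.5 despite the 0.7 bias
```

The cost of the fairness is that you throw away flips: with bias b, each attempt succeeds with probability 2b(1-b), so a very biased coin wastes most of its tosses.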

I've seen many people arguing he's the most intelligent person that ever lived

Some say Hungarians are actually aliens.

https://slatestarcodex.com/2017/05/26/the-atomic-bomb-consid...

An Introduction here : https://www.youtube.com/watch?v=IPMjVcLiNKc


I felt like I finally understood Shannon entropy when I realized that it's a subjective quantity -- a property of the observer, not the observed.

The entropy of a variable X is the amount of information required to drive the observer's uncertainty about the value of X to zero. As a correlate, your uncertainty and mine about the value of the same variable X could be different. This is trivially true, as we could each have received different information about X. H(X) should be H_{observer}(X), or even better, H_{observer, time}(X).

As clear as Shannon's work is in other respects, he glosses over this.

What's often lost in the discussions about whether entropy is subjective or objective is that, if you dig a little deeper, information theory gives you powerful tools for relating the objective and the subjective.

Consider cross entropy of two distributions H[p, q] = -Σ p_i log q_i. For example maybe p is the real frequency distribution over outcomes from rolling some dice, and q is your belief distribution. You can see the p_i as representing the objective probabilities (sampled by actually rolling the dice) and the q_i as your subjective probabilities. The cross entropy is measuring something like how surprised you are on average when you observe an outcome.

The interesting thing is that H[p, p] <= H[p, q], which means that if your belief distribution is wrong, your cross entropy will be higher than it would be if you had the right beliefs, q=p. This is guaranteed by the concavity of the logarithm. This gives you a way to compare beliefs: whichever q gets the lowest H[p,q] is closer to the truth.

You can even break cross entropy into two parts, corresponding to two kinds of uncertainty: H[p, q] = H[p] + D[p||q]. The first term is the entropy of p and it is the aleatoric uncertainty, the inherent randomness in the phenomenon you are trying to model. The second term is the KL divergence and it tells you how much additional uncertainty you have as a result of having wrong beliefs, which you could call epistemic uncertainty.
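A quick numerical check of the identities in the comment above (the dice distributions are made up for illustration):

```python
import math

def cross_entropy(p, q):
    """H[p, q] = -sum_i p_i log2 q_i: average surprise when outcomes
    follow p but you believe q (in bits)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H[p] = H[p, p]: the aleatoric (inherent) uncertainty."""
    return cross_entropy(p, p)

def kl(p, q):
    """D[p||q]: extra bits paid for believing q when the truth is p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/6] * 6                         # "reality": a fair die
q = [1/4, 1/4, 1/8, 1/8, 1/8, 1/8]   # wrong beliefs about it

# Wrong beliefs cost extra surprise: H[p, q] >= H[p, p] ...
assert cross_entropy(p, q) > entropy(p)
# ... and the excess is exactly the KL divergence: H[p, q] = H[p] + D[p||q]
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-9
```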

Thanks, that's an interesting perspective. It also highlights one of the weak points in the concept, I think, which is that this is only a tool for updating beliefs to the extent that the underlying probability space ("ontology" in this analogy) can actually "model" the phenomenon correctly!

It doesn't seem to shed much light on when or how you could update the underlying probability space itself (or when to change your ontology in the belief setting).

This kind of thinking will lead you to ideas like algorithmic probability, where distributions are defined using universal Turing machines that could model anything.

Amazing! I had actually heard about Solomonoff induction before but my brain didn't make the connection. Thanks for the shortcut =)

You can sort of do this over a suitably large (or infinite) family of models all mixed, but from an epistemological POV that’s pretty unsatisfying.

From a practical POV it’s pretty useful and common (if you allow it to describe non- and semi-parametric models too).

Couldn't you just add a control (PID/Kalman filter/etc) to converge on the stability of some local "most" truth?

Could you elaborate? To be honest I have no idea what that means.

I think what you're getting at is the construction of the sample space - the space of outcomes over which we define the probability measure (e.g. {H,T} for a coin, or {1,2,3,4,5,6} for a die).

Let's consider two possibilities:

1. Our sample space is "incomplete"

2. Our sample space is too "coarse"

Let's discuss 1 first. Imagine I have a special die that has a hidden binary state which I can control, which forces the die to come up either even or odd. If your sample space is only which side faces up, and I randomize the hidden state appropriately, it appears like a normal die. If your sample space is enlarged to include the hidden state, the entropy of each roll is reduced by one bit. You will not be able to distinguish between a truly random die and a die with a hidden state if your sample space is incomplete. Is this the point you were making?

On 2: Now let's imagine I can only observe whether the die comes up even or odd. This is a coarse-graining of the sample space (we get strictly less information - or, we only get some "macro" information). Of course, a coarse-grained sample space is necessarily an incomplete one! We can imagine comparing the outcomes from a normal die, to one which with equal probability rolls an even or odd number, except it cycles through the microstates deterministically e.g. equal chance of {odd, even}, but given that outcome, always goes to next in sequence {(1->3->5), (2->4->6)}.

Incomplete or coarse sample spaces can indeed prevent us from inferring the underlying dynamics. Many processes can have the same apparent entropy on our sample space from radically different underlying processes.
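The hidden-state die in case 1 can be sketched directly (the parity mechanism and numbers are illustrative):

```python
import math
import random

rng = random.Random(42)

def roll_with_hidden_state():
    """Hypothetical die with a hidden binary state that forces parity."""
    state = rng.random() < 0.5               # hidden bit, randomized fairly
    faces = (2, 4, 6) if state else (1, 3, 5)
    return state, rng.choice(faces)

rolls = [roll_with_hidden_state() for _ in range(60_000)]
faces = [f for _, f in rolls]

# Seen alone, the face looks like a fair die: ~1/6 per face, log2(6) bits.
# Given the hidden state, only 3 faces are possible: log2(3) bits.
# So learning the state removes exactly one bit of entropy per roll:
assert abs(math.log2(6) - math.log2(3) - 1.0) < 1e-12
```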

Right, this is exactly what I'm getting at - learning a distribution over a fixed sample space can be done with Bayesian methods, or entropy-based methods like the OP suggested, but I'm wondering if there are methods that can automatically adjust the sample space as well.

For well-defined mathematical problems like dice rolling and fixed classical mechanics scenarios and such, you don't need this I guess, but for any real-world problem I imagine half the problem is figuring out a good sample space to begin with. This kind of thing must have been studied already, I just don't know what to look for!

There are some analogies to algorithms like NEAT, which automatically evolves a neural network architecture while training. But that's obviously a very different context.

We could discuss completeness of the sample space, and we can also discuss completeness of the hypothesis space.

In Solomonoff Induction, which purports to be a theory of universal inductive inference, the "complete hypothesis space" consists of all computable programs (note that all current physical theories are computable, so this hypothesis space is very general). Then induction is performed by keeping all programs consistent with the observations, weighted by two terms: the program's prior likelihood, and the probability that the program assigns to the observations (the programs can be deterministic and assign probability 1).

The "prior likelihood" in Solomonoff Induction comes from the program's complexity (well, 2^(-Complexity)), where the complexity is the length of the shortest representation of that program.

Altogether, the procedure looks like: maintain a belief which is a mixture of all programs consistent with the observations, weighted by their complexity and the likelihood they assign to the data. Of course, this procedure is still limited by the sample/observation space!

That's our best formal theory of induction in a nutshell.
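Real Solomonoff induction ranges over all programs and is uncomputable, but the recipe described above can be illustrated with a toy, hand-picked "program" space. Everything here, the three hypotheses and their complexities, is invented for illustration:

```python
# Toy hypothesis space: three hand-written "programs" with invented
# complexities in bits. Each predicts the next bit from the prefix.
hypotheses = {
    "all_zeros":   (2, lambda seq: 0),
    "all_ones":    (2, lambda seq: 1),
    "alternating": (3, lambda seq: (1 - seq[-1]) if seq else 0),
}

def predict(observed):
    """P(next bit = 1): keep programs consistent with every observation,
    weight each by its prior 2^(-complexity), and mix their predictions."""
    weights = {
        name: 2.0 ** -c
        for name, (c, f) in hypotheses.items()
        if all(f(observed[:i]) == bit for i, bit in enumerate(observed))
    }
    total = sum(weights.values())
    return sum(w for name, w in weights.items()
               if hypotheses[name][1](observed) == 1) / total

print(predict([]))         # prior prediction: a mixture of all three programs
print(predict([0, 1, 0]))  # only "alternating" survives, so it predicts 1
```

Here the deterministic programs assign probability 1 to their predictions, as the comment notes, so "keeping consistent programs" is just filtering out falsified ones.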

Someone else pointed me to Solomonoff induction too, which looks really cool as an "idealised" theory of induction and it definitely solves my question in abstract. But there are obvious difficulties with that in practice, like the fact that it's probably uncomputable, right?

I mean I think even the "Complexity" coefficient should be uncomputable in general, since you could probably use a program which computes it to upper bound "Complexity", and if there was such an upper bound you could use it to solve the halting problem etc. Haven't worked out the details though!

Would be interesting if there are practical algorithms for this. Either direct approximations to SI or maybe something else entirely that approaches SI in the limit, like a recursive neural-net training scheme? I'll do some digging, thanks!

Correct anything that's wrong here. Cross entropy is the comparison of two distributions, right? Is the objectivity sussed out in relation to the overlap cross section? And is the subjectivity sussed out not on average but as deviations from the average? Just trying to understand it in my framework, which might be wholly off the mark.

Cross entropy lets you compare two probability distributions. One way you can apply it is to let the distribution p represent "reality" (from which you can draw many samples, but whose numerical value you might not know) and to let q represent "beliefs" (whose numerical value is given by a model). Then by finding q to minimize cross-entropy H[p, q] you can move q closer to reality.

You can apply it other ways. There are lots of interpretations and uses for these concepts. Here's a cool blog post if you want to find out more: https://blog.alexalemi.com/kl-is-all-you-need.html

I'm not sure what you mean by objectivity and subjectivity in this case.

With the example of beliefs, you can think of cross entropy as the negative expected value of the log probability you assigned to an outcome, weighted by the true probability of each outcome. If you assign larger log probabilities to more likely outcomes, the cross entropy will be lower.

This doesn't really make entropy itself observer dependent. (Shannon) entropy is a property of a distribution. It's just that when you're measuring different observers' beliefs, you're looking at different distributions (which can have different entropies the same way they can have different means, variances, etc).

Entropy is a property of a distribution, but since math does sometimes get applied, we also attach distributions to things (e.g. the entropy of a random number generator, the entropy of a gas...). Then when we talk about the entropy of those things, those entropies are indeed subjective, because different subjects will attach different probability distributions to that system depending on their information about that system.

Some probability distributions are objective. The probability that my random number generator gives me a certain number is given by a certain formula. Describing it with another distribution would be wrong.

Another example, if you have an electron in a superposition of half spin-up and half spin-down, then the probability to measure up is objectively 50%.

Another example, GPT-2 is a probability distribution on sequences of integers. You can download this probability distribution. It doesn't represent anyone's beliefs. The distribution has a certain entropy. That entropy is an objective property of the distribution.

Of those, the quantum superposition is the only one that has a chance at being considered objective, and it's still only "objective" in the sense that (as far as we know) your description provided as much information as anyone can possibly have about it, so nobody can have a more-informed opinion and all subjects agree.

The others are both partial-information problems which are very sensitive to knowing certain hidden-state information. Your random number generator gives you a number that you didn't expect, and for which a formula describes your best guess based on available incomplete information, but the computer program that generated it knew which one to choose and would not have picked any other. Anyone who knew the hidden state of the RNG would also have assigned a different probability to that number being chosen.

You might have some probability distribution in your head for what will come out of GPT-2 on your machine at a certain time, based on your knowledge of the random seed. But that is not the GPT-2 probability distribution, which is objectively defined by model weights that you can download, and which does not correspond to anyone’s beliefs.

I'm of the view that strictly speaking, even a fair die doesn't have a probability distribution until you throw it. It just so happens that, unless you know almost every detail about the throw, the best you can usually do is uniform.

So I would say the same of GPT-2. It's not a random variable unless you query it. But unless you know unreasonably many details, the best you can do to predict the query is the distribution that you would call "objective."

I think this gets into unanswerable metaphysical questions about when we can say mathematical objects, propositions, etc. really exist.

But I think if we take the view that it's not a random variable until we query it, that makes it awkward to talk about how GPT-2 (and similar models) is trained. No one ever draws samples from the model during training, but the whole justification for the cross-entropy-minimizing training procedure is based on thinking about the model as a random variable.

A more plausible way to argue for objectiveness is to say that some probability distributions are objectively more rational than others given the same information. E.g. when seeing a symmetrical die it would be irrational to give 5 a higher probability than the others. Or it seems irrational to believe that the sun will explode tomorrow.

The probability distribution is subjective for both parts -- because it, once again, depends on the observer observing the events in order to build a probability distribution.

E.g. your random number generator generates 1, 5, 7, 8, 3 when you run it. It generates 4, 8, 8, 2, 5 when I run it. I.e. we have received different information about the random number generator to build our subjective probability distributions. The level of entropy of our probability distributions is high because we have so little information to be certain about the representativeness of our distribution sample.

If we continue running our random number generator for a while, we will gather more information, thus reducing entropy, and our probability distributions will both start converging towards an objective "truth." If we ran our random number generators for a theoretically infinite amount of time, we would have reduced entropy to 0 and have a perfect and objective probability distribution. But this is impossible.

Would you say that all claims about the world are subjective, because they have to be based on someone’s observations?

For example my cat weighs 13 pounds. That seems objective, in the sense that if two people disagree, only one can be right. But the claim is based on my observations. I think your logic leads us to deny that anything is objective.

I do believe in objective reality, but probabilities are subjective. Your cat weighs 13 pounds, and now that you've told me, I know it too. If you asked me to draw a probability distribution for the weight of your cat, I'd draw a tight gaussian distribution around that, representing the accuracy of your scale. My cat weighs a different amount, but I won't tell you how much, so if we both draw a probability distribution, they'll be different. And the key thing is that neither of us has an objectively correct probability distribution, not even me. My cat's weight has an objectively correct value which even I don't know, because my scale isn't good enough.

All right now, here's the big question: how do you know that the evidence your sensory apparatus reveals to you is correct? What I'm getting at is this: the only experience that is directly available to you is your sensory data. And this sensory data is merely a stream of electrical impulses which stimulate your computing center. In other words, all that I really know about the outside universe is relayed to me through my electrical connections.

Sorry, this is a major misinterpretation, or at least a completely different one. I don't know how to put it in a more productive way; I think your comment is very confused. You don't need to run a random number generator "for a while" in order to build up a probability distribution.

A representative sample then? Please tell me where I went wrong -- I mean this sincerely.

This might be a frequentist vs bayesian thing, and I am bayesian. So maybe other people would have a different view.

I don't think you need to have any information to have a probability distribution; your distribution already represents your degree of ignorance about an outcome. So without even sampling it once, you already should have a uniform probability distribution for a random number generator or a coin flip. If you do personally have additional information to help you predict the outcome -- you're skilled at coin-flipping, or you wrote the RNG and know an exploit -- then you can compress that distribution to a lower-entropy one.

But you don't need to sample the distribution to do this. You can have that information before the first coin toss. Sampling can be one way to get information but it won't necessarily even help. If samples are independent, then each sample really teaches you barely anything about the next. RNGs eventually do repeat, so if you sample one enough you might be able to find the pattern and reduce the entropy to zero, but in that case you're not learning the statistical distribution, you're deducing the exact internal state of the RNG and predicting the exact next outcome, because the samples are not actually independent. If you do enough coin flips you might eventually find that there's a slight bias to the coin, but that really takes an extreme number of tosses and only reduces the entropy a tiny tiny bit; not at all if the coin-tossing procedure had no bias to begin with.

However, the objective truth is just that the next toss will land heads. That's the only truth that experiment can objectively determine. Any other doubt that it might-have-counterfactually-landed-tails is subjective, due to a subjective lack of sufficient information to predict the outcome. We can formalize a correct procedure to convert prior information into a corresponding probability distribution, and we can get a unanimous consensus by giving everybody the same information, but the probability distribution is still subjective because it is a function of that prior information.

I only slightly understand, I'm sorry; I'm not educated enough to understand much of this.

Did you take stats at MIT? I'm going to go through their online material, because I very much am very confused.

I appreciate your curiosity!

The best introduction that I can recommend is this type-written PDF from E.T. Jaynes, called "probability theory with applications in science and engineering": https://bayes.wustl.edu/etj/science.pdf.html

It requires a lot of attention to read and follow the math, but it's worthwhile. Jaynes is a pretty passionate writer, and in his writing he's clearly battling against some enemies (who might be ghosts), but on the other hand this also makes for more entertaining reading and I find that's usually a benefit when it comes to a textbook.

I read through the first "lecture" yesterday. I'll devote some time for (hopefully) the rest today.

Thank you!

"Entropy is a property of matter that measures the degree of randomization or disorder at the microscopic level", at least when considering the second law.

Right, but the very interesting thing is it turns out that what's random to me might not be random to you! And the reason that "microscopic" is included is because that's a shorthand for "information you probably don't have about a system, because your eyes aren't that good, or even if they are, your brain ignored the fine details anyway."

Right but in chemistry class the way it’s taught via Gibbs free energy etc. makes it seem as if it’s an intrinsic property.

Entropy in physics is usually the Shannon entropy of the probability distribution over system microstates given known temperature and pressure. If the system is in equilibrium then this is objective.

Entropy in Physics is usually either the Boltzmann or Gibbs entropy, named after two men who were both dead before Shannon was born.

That's not a problem, as the GP's post is trying to state a mathematical relation, not a historical attribution. Often newer concepts shed light on older ones. As Baez's article says, Gibbs entropy is Shannon's entropy of an associated distribution (multiplied by the constant k).

It is a problem because all three come with baggage. Almost none of the things discussed in this thread are invalid when discussing actual physical entropy even though the equations are superficially similar. And then there are lots of people being confidently wrong because they assume that it’s just one concept. It really is not.

Don't see how the connection is superficial. Even the classical macroscopic definition of entropy as ΔS = ∫ dQ/T can be derived from the information theory perspective, as Baez shows in the article (using entropy-maximizing distributions and Lagrange multipliers). If you have a more specific critique, it would be good to discuss.

In classical physics there is no real objective randomness. Particles have a defined position and momentum and those evolve deterministically. If you somehow learned these then the shannon entropy is zero. If entropy is zero then all kinds of things break down.

So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness, even though temperature does not really seem to be a quantum thing.

> If entropy is zero then all kinds of things break down.

Entropy is a macroscopic variable and if you allow microscopic information, strange things can happen! One can move from a high entropy macrostate to a low entropy macrostate if you choose the initial microstate carefully. But this is not a reliable process which you can reproduce experimentally, ie. it is not a thermodynamic process.

A thermodynamic process P is something which takes a macrostate A to a macrostate B, independent of which microstate a0, a1, a2... in A you started off with. If the process depends on the microstate, then it wouldn't be something we would recognize, as we are looking from the macro perspective.

> Particles have a defined position and momentum

Which we don’t know precisely. Entropy is about not knowing.

> If you somehow learned these then the shannon entropy is zero.

Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space. (You need an appropriate extension of Shannon’s entropy to continuous distributions.)

> So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness

Or you may study statistical mechanics :-)

> Which we don’t know precisely. Entropy is about not knowing.

No, it is not about not knowing. This is an instance where the intuition from Shannon’s entropy does not translate to statistical physics.

It is about the number of possible microstates, which is completely different. In Physics, entropy is a property of a bit of matter, it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.

> Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space.

No, 0. In this case, there is a single state with p = 1, and S = -k Σ p ln(p) = 0.

This is the same if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).

The probability p of a microstate is always between 0 and 1, therefore p ln(p) is never positive and S is never negative.

You get the same using Boltzmann’s approach, in which case Ω = 1 and S = k ln(Ω) is also 0.

> (You need an appropriate extension of Shannon’s entropy to continuous distributions.)

Gibbs’ entropy.

> Or you may study statistical mechanics

Indeed.
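The p = 1 case above is easy to check numerically. A small sketch with Boltzmann's constant set to 1 for a unit-free comparison:

```python
import math

k = 1.0  # Boltzmann's constant, set to 1 for a unit-free check

def gibbs_entropy(probs):
    """S = -k * sum_i p_i ln(p_i), skipping p = 0 terms."""
    return -k * sum(p * math.log(p) for p in probs if p > 0)

# A single microstate with p = 1 gives S = 0, matching Boltzmann's
# S = k ln(Omega) with Omega = 1:
assert gibbs_entropy([1.0]) == 0.0
# A uniform distribution over Omega microstates recovers S = k ln(Omega):
omega = 8
assert abs(gibbs_entropy([1 / omega] * omega) - k * math.log(omega)) < 1e-12
```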

>>> Particles have a defined position and momentum [...] If you somehow learned these then the shannon entropy is zero.

>> Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space [and diverges to minus infinity if you define precisely the position and momentum of the particles and the volume in phase space goes to zero]

> [It's zero also] if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).

> The probability p of an microstate is always between 0 and 1, therefore p ln(p) is always negative and S is always positive.

The points in the phase space are not "microstates" with probability between 0 and 1. It's a continuous distribution and if it collapses to a point (i.e. you somehow learned the exact positions and momentums) the density at that point is unbounded. The entropy is also unbounded and goes to minus infinity as the volume in phase space collapses to zero.

You can avoid the divergence by dividing the continuous phase space into discrete "microstates" but having a well-defined "microstate" corresponding to some finite volume in phase space is not the same as what was written above about "particles having a defined position and momentum" that is "somehow learned". The microstates do not have precisely defined positions and momentums. The phase space is not reduced to a single point in that case.

If the phase space is reduced to a single point I'd like to see your proof that S(ρ) = −k ∫ ρ(x) log ρ(x) dx = 0

I hadn't realized that "differential" entropy and Shannon entropy are actually different and incompatible, huh.

So the case I mentioned, where you know all the positions and momentums, has 0 Shannon entropy and -Inf differential entropy. And a typical distribution will instead have Inf Shannon entropy and finite differential entropy.

Wikipedia has some pretty interesting discussion about differential entropy vs the limiting density of discrete points, but I can't claim to understand it or whether it could bridge the gap here.

> So the case I mentioned, where you know all the positions and momentums has 0 shannon entropy

No, Shannon entropy is not applicable in that case.

https://en.wikipedia.org/wiki/Entropy_(statistical_thermodyn...

Quantum mechanics solves the issue of the continuity of the state space. However, as you probably know, in quantum mechanics all the positions and momentums cannot simultaneously have definite values.

> possible microstates

Conditional on the known macrostate. Because we don’t know the precise microstate - only which microstates are possible.

If your reasoning is that « experimental entropy can be measured so it’s not about that » then it’s not about macrostates and microstates either!

> In Physics, entropy is a property of a bit of matter, it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.

Enthalpy is also dependent on your choice of state variables, which is in turn dictated by which observables you want to make predictions about: whether two microstates are distinguishable, and thus whether they are part of the same macrostate, depends on the tools you have for distinguishing them.

A calorimeter does not care about anyone’s choice of state variables. Entropy is not only something that exists in abstract theoretical constructs, it is something we can get experimentally.

That's actually the normal view; saying that info and stat-mech entropy are the same is the outlier position, most popularized by Jaynes.

If information-theoretical and statistical mechanics entropies are NOT the same (or at least, deeply connected) then what stops us from having a little guy[0] sort all the particles in a gas to extract more energy from them?

[0] https://en.wikipedia.org/wiki/Maxwell%27s_demon

Sounds like a non-sequitur to me; what are you implying about the Maxwell's demon thought experiment vs the comparison between Shannon and stat-mech entropy?

Yeah but distributions are just the accounting tools to keep track of your entropy. If you are missing one bit of information about a system, your understanding of the system is some distribution with one bit of entropy. Like the original comment said, the entropy is the number of bits needed to fill in the unknowns and bring the uncertainty down to zero. Your coin flips may be unknown in advance to you, and thus you model it as a 50/50 distribution, but in a deterministic universe the bits were present all along.

Trivial example: if you know the seed of a pseudo-random number generator, a sequence generated by it has very low entropy.

But if you don't know the seed, the entropy is very high.
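A toy check of the seed point, using Python's stdlib generator as a stand-in for "the RNG"; the seed range and output size are arbitrary choices:

```python
import math
import random
from collections import Counter

def first_output(seed):
    """First byte from Python's stdlib generator for a given seed
    (a stand-in for 'the RNG' in the comments above)."""
    return random.Random(seed).randrange(256)

n_seeds = 1024
counts = Counter(first_output(s) for s in range(n_seeds))

# Knowing the seed: the output is fully determined, 0 bits of entropy.
# Knowing only that the seed is uniform over n_seeds values: your
# uncertainty is the entropy of the induced output distribution,
# close to the maximum of log2(256) = 8 bits for a decent generator.
H = -sum((c / n_seeds) * math.log2(c / n_seeds) for c in counts.values())
print(f"{H:.2f} bits of uncertainty without the seed")
```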

Theoretically, it's still only the entropy of the seed-space + time-space it could have been running in, right?

To shorten this for you with my own (identical) understanding: "entropy is just the name for the bits you don't have".

Entropy + Information = Total bits in a complete description.

It's an objective quantity, but you have to be very precise in stating what the quantity describes.

Unbroken egg? Low entropy. There's only one way the egg can exist in an unbroken state, and that's it. You could represent the state of the egg with a single bit.

Broken egg? High entropy. There are an arbitrarily-large number of ways that the pieces of a broken egg could land.

A list of the locations and orientations of each piece of the broken egg, sorted by latitude, longitude, and compass bearing? Low entropy again; for any given instance of a broken egg, there's only one way that list can be written.

Zip up the list you made? High entropy again; the data in the .zip file is effectively random, and cannot be compressed significantly further. Until you unzip it again...

Likewise, if you had to transmit the (uncompressed) list over a bandwidth-limited channel. The person receiving the data can make no assumptions about its contents, so it might as well be random even though it has structure. Its entropy is effectively high again.

Baez has a video (a good accompaniment, imho), with slides:

https://m.youtube.com/watch?v=5phJVSWdWg4&t=17m

He illustrates the derivation of Shannon entropy with pictures of trees.

> it's a subjective quantity -- a property of the observer, not the observed

Shannon's entropy is a property of the source-channel-receiver system.

Can you explain this in more detail?

Entropy is calculated as a function of a probability distribution over possible messages or symbols. The sender might have a distribution P over possible symbols, and the receiver might have another distribution Q over possible symbols. Then the "true" distribution over possible symbols might be another distribution yet, call it R. The mismatch between these is what leads to various inefficiencies in coding, decoding, etc [1]. But both P and Q are beliefs about R -- that is, they are properties of observers.

[1] https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Co...
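A small numeric sketch of that coding inefficiency (Python; P and Q here are made-up distributions): the expected code length under a mistaken belief Q is never less than the entropy under P, and the gap is exactly the KL divergence.

```python
import math

def cross_entropy_bits(p, q):
    # Expected bits/symbol when symbols drawn from p are encoded
    # with a code that is optimal for q (code lengths ~ -log2 q_i).
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution of the symbols
q = [1/3, 1/3, 1/3]     # receiver's mistaken belief

h_true = cross_entropy_bits(p, p)   # entropy of p: 1.5 bits/symbol
h_used = cross_entropy_bits(p, q)   # log2(3) ~ 1.585 bits/symbol
overhead = h_used - h_true          # KL(p || q) ~ 0.085 bits/symbol
assert overhead > 0                 # mismatch always costs (Gibbs' inequality)
```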

> he glosses over this

All of information theory is relative to the channel. This bit is well communicated.

What he glosses over is the definition of "channel", since it's obvious for electromagnetic communications.

https://archive.is/9vnVq

Shannon entropy is subjective for Bayesians and objective for frequentists.

The entropy is objective if you completely define the communication channel, and subjective if you weave the definition away.

the subjectivity doesn't stem from the definition of the channel but from the model of the information source. what's the prior probability that you intended to say 'weave', for example? that depends on which model of your mind we are using. frequentists argue that there is an objectively correct model of your mind we should always use, and bayesians argue that it depends on our prior knowledge about your mind

(i mean, your information about what the channel does is also potentially incomplete, so the same divergence in definitions could arise there too, but the subjectivity doesn't just stem from the definition of the channel; and shannon entropy is a property that can be imputed to a source independent of any channel)

I really liked the approach my stat mech teacher used. In nearly all situations, entropy just ends up being the log of the number of ways a system can be arranged (https://en.wikipedia.org/wiki/Boltzmann%27s_entropy_formula), although I found it easiest to think in terms of pairs of dice rolls.

And this is what I prefer too, although with the clarification that it's the number of ways that a system can be arranged without changing its macroscopic properties.

It's, unfortunately, not very compatible with Shannon's usage in any but the shallowest sense, which is why it stays firmly in the land of physics.

> not very compatible with Shannon's usage in any but the shallowest sense

The connection is not so shallow, there are entire books based on it.

“The concept of information, intimately connected with that of probability, gives indeed insight on questions of statistical mechanics such as the meaning of irreversibility. This concept was introduced in statistical physics by Brillouin (1956) and Jaynes (1957) soon after its discovery by Shannon in 1948 (Shannon and Weaver, 1949). An immense literature has since then been published, ranging from research articles to textbooks. The variety of topics that belong to this field of science makes it impossible to give here a bibliography, and special searches are necessary for deepening the understanding of one or another aspect. For tutorial introductions, somewhat more detailed than the present one, see R. Balian (1991-92; 2004).”

https://arxiv.org/pdf/cond-mat/0501322

I don't dispute that the math is compatible. The problem is the interpretation thereof. When I say "shallowest", I mean the implications of each are very different.

Insofar as I'm aware, there is no information-theoretic equivalent to the 2nd or 3rd laws of thermodynamics, so the intuition a student works up from physics about how and why entropy matters just doesn't transfer. Likewise, even if an information science student is well versed in the concept of configuration entropy, that's 15 minutes of one lecture in statistical thermodynamics. There's still the rest of the course to consider.

Assuming each of the N microstates for a given macrostate is equally probable with p = 1/N, the Shannon entropy is -Σ p·log(p) = -N·(1/N)·log(1/N) = log(N), which is the physics interpretation.
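That reduction is easy to check numerically (Python; N = 1024 is an arbitrary example):

```python
import math

N = 1024                                   # number of equally likely microstates
p = [1 / N] * N
S = -sum(pi * math.log2(pi) for pi in p)   # Shannon entropy in bits
assert abs(S - math.log2(N)) < 1e-9        # equals log(N): Boltzmann's formula
```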

In the continuous version, you would get log(V) where V is the volume in phase space occupied by the microstates for a given macrostate.

Liouville's theorem that the volume is conserved in phase space implies that any macroscopic process can only move all the microstates from a macrostate A into a macrostate B only if the volume of B is bigger than the volume of A. This implies that the entropy of B should be bigger than the entropy of A which is the Second Law.

The second law of thermodynamics is time-asymmetric, but the fundamental physical laws are time-symmetric, so from them you can only predict that the entropy of B should be bigger than the entropy of A irrespective of whether B is in the future or the past of A. You need the additional assumption (the Past Hypothesis) that the universe started in a low entropy state in order to get the second law of thermodynamics.

> If our goal is to predict the future, it suffices to choose a distribution that is uniform in the Liouville measure given to us by classical mechanics (or its quantum analogue). If we want to reconstruct the past, in contrast, we need to conditionalize over trajectories that also started in a low-entropy past state — that is the “Past Hypothesis” that is required to get stat mech off the ground in a world governed by time-symmetric fundamental laws.

https://www.preposterousuniverse.com/blog/2013/07/09/cosmolo...

The second law of thermodynamics is about systems that are well described by a small set of macroscopic variables. The evolution of an initial macrostate prepared by an experimenter who can control only the macrovariables is reproducible. When a thermodynamical system is prepared in such a reproducible way the preparation is happening in the past, by definition.

The second law is about how part of the information that we had about a system - constrained to be in a macrostate - is “lost” when we “forget” the previous state and describe it using just the current macrostate. We know more precisely the past than the future - the previous state is in the past by definition.

The "can be arranged" is the tricky part. E.g. you might know from context that some states are impossible (where the probability distribution is zero), even though they combinatorially exist. That changes the entropy to you.

That is why information and entropy are different things. Entropy is what you know you do not know. That knowledge of the magnitude of the unknown is what is being quantified.

Also, this is the point where I think the article is wrong (or not precise enough), as it would include the unknown unknowns, which are not entropy IMO:

> I claim it’s the amount of information we don’t know about a situation

Exactly. If you want to reuse the term "entropy" in information theory, then fine. Just stop trying to make a physical analogy. It's not rigorous.

I spend time just staring at the graph on this page.

https://en.wikipedia.org/wiki/Thermodynamic_beta

Also known as "the number of bits to describe a system". For example, 2^N equally probable states, N bits to describe each state.

For information theory, I've always thought of entropy as follows:

"If you had a really smart compression algorithm, how many bits would it take to accurately represent this file?"

i.e., highly repetitive inputs compress well because they don't have much entropy per bit. Modern compression algorithms are good enough on most data to be used as a reasonable approximation for the true entropy.
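This is easy to see with an off-the-shelf compressor (Python; note DEFLATE only gives a crude upper bound on the true entropy, not a tight estimate):

```python
import os
import zlib

repetitive = b"abc" * 10_000         # ~30 kB of highly structured data
random_ish = os.urandom(30_000)      # ~30 kB from the OS entropy pool

small = len(zlib.compress(repetitive, 9))
large = len(zlib.compress(random_ish, 9))

assert small < 200        # low entropy per byte -> compresses enormously
assert large > 29_000     # near-maximal entropy -> essentially incompressible
```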

The essence of entropy as a measure of information content

I've always favored this down-to-earth characterization of the entropy of a discrete probability distribution. (I'm a big fan of John Baez's writing, but I was surprised glancing through the PDF to find that he doesn't seem to mention this viewpoint.)

Think of the distribution as a histogram over some bins. Then, the entropy is a measurement of, if I throw many many balls at random into those bins, the probability that the distribution of balls over bins ends up looking like that histogram. What you usually expect to see is a uniform distribution of balls over bins, so the entropy measures the probability of other rare events (in the language of probability theory, "large deviations" from that typical behavior).

More specifically, if P = (P1, ..., Pk) is some distribution, then the probability that throwing N balls (for N very large) gives a histogram looking like P is about 2^(-N * [log(k) - H(P)]), where H(P) is the entropy. When P is the uniform distribution, then H(P) = log(k), the exponent is zero, and the estimate is 1, which says that by far the most likely histogram is the uniform one. That is the largest possible entropy, so any other histogram has probability 2^(-c*N) of appearing for some c > 0, i.e., is very unlikely and exponentially moreso the more balls we throw, but the entropy measures just how much. "Less uniform" distributions are less likely, so the entropy also measures a certain notion of uniformity. In large deviations theory this specific claim is called "Sanov's theorem" and the role the entropy plays is that of a "rate function."

The counting interpretation of entropy that some people are talking about is related, at least at a high level, because the probability in Sanov's theorem is the number of outcomes that "look like P" divided by the total number, so the numerator there is indeed counting the number of configurations (in this case of balls and bins) having a particular property (in this case looking like P).

There are lots of equivalent definitions and they have different virtues, generalizations, etc, but I find this one especially helpful for dispelling the air of mystery around entropy.
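The balls-in-bins estimate above can be checked directly (Python; the bin counts are a made-up example). The exact probability of a histogram is multinomial, and the Sanov exponent matches it up to a polynomial-in-N prefactor:

```python
import math

def exact_prob(counts):
    # P(exactly these counts) when N uniform balls land in k bins:
    # multinomial coefficient / k^N.
    N, k = sum(counts), len(counts)
    coef = math.factorial(N)
    for c in counts:
        coef //= math.factorial(c)
    return coef / k**N

def sanov_estimate(counts):
    # 2^(-N * [log2(k) - H(P)]), the large-deviations estimate.
    N, k = sum(counts), len(counts)
    H = -sum((c / N) * math.log2(c / N) for c in counts if c)
    return 2 ** (-N * (math.log2(k) - H))

counts = [50, 30, 20]   # N = 100 balls, k = 3 bins, P = (0.5, 0.3, 0.2)
e, s = exact_prob(counts), sanov_estimate(counts)
assert e < s                       # estimate ignores the polynomial prefactor
assert abs(math.log2(s / e)) < 10  # ...but gets the exponential rate right
```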

Hey did you want to say relative entropy ~ rate function ~ KL divergence? Might be more familiar to ML enthusiasts here, get them to be curious about Sanov or large deviations.

That's right, here log(k) - H(p) is really the relative entropy (or KL divergence) between p and the uniform distribution, and all the same stuff is true for a different "reference distribution" of the probabilities of balls landing in each bin.

For discrete distributions the "absolute entropy" (just sum of -p log(p) as it shows up in Shannon entropy or statistical mechanics) is in this way really a special case of relative entropy. For continuous distributions, say over real numbers, the analogous quantity (integral of -p log(p)) isn't a relative entropy since there's no "uniform distribution over all real numbers". This still plays an important role in various situations and calculations...but, at least to my mind, it's a formally similar but conceptually separate object.

PBS Spacetime‘s entropy playlist: https://youtube.com/playlist?list=PLsPUh22kYmNCzNFNDwxIug8q1...

A bit off-color but classic: https://www.youtube.com/watch?v=wgltMtf1JhY

Ah JCB, how I love your writing, you are always so very generous.

Your This Week's Finds were a hugely enjoyable part of my undergraduate education and beyond.

Thank you again.

Information entropy is literally the strict lower bound on how efficiently information can be communicated (expected number of transmitted bits) if the probability distribution which generates this information is known, that's it. Even in contexts such as calculating the information entropy of a bit string, or the English language, you're just taking this data and constructing some empirical probability distribution from it using the relative frequencies of zeros and ones or letters or n-grams or whatever, and then calculating the entropy of that distribution.
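One concrete way to see the bound: Huffman codes achieve it exactly for dyadic probabilities and come within one bit per symbol otherwise. A from-scratch sketch (Python; the distributions are made-up examples, and this naive list-merging Huffman is for illustration, not efficiency):

```python
import heapq
import math

def huffman_expected_length(probs):
    """Expected bits/symbol of an optimal prefix code for `probs`."""
    count = 0                      # tiebreaker so heap tuples stay comparable
    heap = []
    for i, p in enumerate(probs):
        heapq.heappush(heap, (p, count, [i]))
        count += 1
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for sym in a + b:
            lengths[sym] += 1      # every merge deepens these leaves by one
        heapq.heappush(heap, (p1 + p2, count, a + b))
        count += 1
    return sum(p * l for p, l in zip(probs, lengths))

p = [0.5, 0.25, 0.125, 0.125]
H = -sum(pi * math.log2(pi) for pi in p)              # 1.75 bits/symbol
# Dyadic probabilities: Huffman meets the entropy bound exactly.
assert abs(huffman_expected_length(p) - H) < 1e-12

q = [0.4, 0.3, 0.3]
Hq = -sum(qi * math.log2(qi) for qi in q)
# Non-dyadic: expected length (1.6) strictly exceeds the entropy (~1.571).
assert Hq < huffman_expected_length(q) < Hq + 1
```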

I can't say I'm overly fond of Baez's definition, but far be it from me to question someone of his stature.

"I have largely avoided the second law of thermodynamics, which says that entropy always increases. While fascinating, this is so problematic that a good explanation would require another book!"

For those interested I am currently reading "Entropy Demystified" by Arieh Ben-Naim which tackles this side of things from much the same direction.

I sometimes ponder where new entropy/randomness comes from, like if we take the earliest state of the universe as an infinitely dense point particle which expanded. There must have been some randomness or variety which led it to expand in a non-uniform way, which led to the dominance of matter over antimatter, or the creation of galaxies, clusters etc. If we take an isolated system in which certain static particles are present, could a small subset of the particles acquire motion and thus introduce entropy? Can entropy be induced automatically, at least on a quantum level? If anyone can help me explain that, it would be very helpful and could help explain the origin of the universe in a better way.

Symmetry breaking is the general phenomenon that underlies most of that.

The classic example is this:

Imagine you have a perfectly symmetrical sombrero[1], and there's a ball balanced on top of the middle of the hat. There's no preferred direction it should fall in, but it's _unstable_. Any perturbation will make it roll down hill and come to rest in a stable configuration on the brim of the hat. The symmetry of the original configuration is now broken, but it's stable.

1: https://m.media-amazon.com/images/I/61M0LFKjI9L.__AC_SX300_S...

Yes, but what will initiate that perturbation?

I saw this video, which explained it for me (it's German; maybe the automatic subtitles will work for you): https://www.youtube.com/watch?v=hrJViSH6Klo

He argues that the randomness you are looking for comes from quantum fluctuations, and if this randomness did not exist, the universe would probably never have "happened".

Thanks for the reference will take some time before I see the whole video. Can you tell me what those quantum fluctuations are in short? Are they part of some physical law?

My goto source for understanding entropy: http://philsci-archive.pitt.edu/8592/1/EntropyPaperFinal.pdf

Am I only one that can't download the pdf, or is the file server down? I can see the blog page but when I try downloading the ebook it just doesn't work..

If the file server is down.. anyone could upload the ebook for download?

Hmmm, that list of things that contribute to entropy omits particles which under "normal circumstances" on Earth exist in bound states; for example, it doesn't mention W bosons or gluons. But in some parts of the universe they're not bound but in a different state of matter, e.g. quark-gluon plasma. I wonder how, or if, this was taken into account.

I like the formulation of 'the amount of information we don't know about a system that we could in theory learn'. I'm surprised there's no mention of the Copenhagen interpretation's interaction with this definition, under a lot of QM theories 'unavailable information' is different from available information.

The book might disappoint some..

>I have largely avoided the second law of thermodynamics ... Thus, the aspects of entropy most beloved by physics popularizers will not be found here.

But personally, this bit is the most exciting to me.

>I have tried to say as little as possible about quantum mechanics, to keep the physics prerequisites low. However, Planck’s constant shows up in the formulas for the entropy of the three classical systems mentioned above. The reason for this is fascinating: Planck’s constant provides a unit of volume in position-momentum space, which is necessary to define the entropy of these systems. Thus, we need a tiny bit of quantum mechanics to get a good approximate formula for the entropy of hydrogen, even if we are trying our best to treat this gas classically.

There's a fundamental nature to entropy, but as usual it's not very enlightening for our poor monkey brains, so to explain it you need to enumerate all its high-level behavior; but its high-level behavior is accidental and can't be summarized in a concise form.

This complexity underscores the richness of the concept

I'd say it underscores its accidental nature.

My definition: Entropy is a measure of the accumulation of non-reversible energy transfers.

Side note: All reversible energy transfers involve an increase in potential energy. All non-reversible energy transfers involve a decrease in potential energy.

That definition doesn't work well because you can have changes in entropy even if no energy is transferred, e.g. by exchanging some other conserved quantity.

The side note is wrong in letter and spirit; turning potential energy into heat is one way for something to be irreversible, but neither of those statements is true.

For example, consider an iron ball being thrown sideways. It hits a pile of sand and stops. The iron ball is not affected structurally, but its kinetic energy is transferred (almost entirely) to heat energy. If the ball is thrown slightly upwards, potential energy increases but the process is still irreversible.

Also, the changes of potential energy in corresponding parts of two Carnot cycles are directionally the same, even if one is ideal (reversible) and one is not (irreversible).

However, while your definition effectively captures a significant aspect of entropy, it might be somewhat limited in scope

Closely related recent discussion on The Second Law of Thermodynamics (2011) (franklambert.net):

https://news.ycombinator.com/item?id=40972589

After years of thought I dare to say the 2nd law of thermodynamics is a tautology. "Entropy is increasing" means every system tends toward its most probable state, which means the most probable is the most probable.

I think that’s right, though it’s non-obvious that more probable systems are disordered. At least as non-obvious as Pascal’s triangle is.

Which is to say, worth saying from a first principles POV, but not all that startling.

Closely related recent discussion: https://news.ycombinator.com/item?id=40972589

If I were to write a book with that title, I would get to the point a bit faster, probably as follows.

Entropy is just a number you can associate with a probability distribution. If the distribution is discrete, so you have a set p_i, i = 1..n, which are each positive and sum to 1, then the definition is:

S = - sum_i p_i log( p_i )

Mathematically we say that entropy is a real-valued function on the space of probability distributions. (Elementary exercises: show that S >= 0 and it is maximized on the uniform distribution.)

That is it. I think there is little need for all the mystery.
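The two elementary exercises can at least be sanity-checked numerically (Python; random distributions as a spot check, not a proof):

```python
import math
import random

def S(p):
    """Entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

n = 5
uniform = [1 / n] * n
assert abs(S(uniform) - math.log(n)) < 1e-12   # uniform achieves log(n)

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    p = [wi / total for wi in w]
    # S >= 0, and no distribution beats the uniform one.
    assert -1e-12 <= S(p) <= S(uniform) + 1e-12
```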

So the only thing you need to know about entropy is that it's a real-valued number you can associate with a probability distribution? And that's it? I disagree. There are several numbers that can be associated with a probability distribution, and entropy is an especially useful one, but to understand why entropy is useful, or why you'd use that function instead of a different one, you'd need to know a few more things than just what you've written here.

In particular, the expectation (or variance) of a real-valued random variable can also be seen as "a real-valued number you can associate with a probability distribution".

Thus, GP's statement is basically: "entropy is like expectation, but different".

Exactly, saying that's all there is to know about entropy is like saying all you need to know about chess are the rules and all you need to know about programming is the syntax/semantics.

Knowing the plain definition or the rules is nothing but a superficial understanding of the subject. Knowing how to use the rules to actually do something meaningful, having a strategy, that's where meaningful knowledge lies.

Of course that is not my statement. See all my other replies to identical misinterpretations of my comment.

The problem is that this doesn't get at many of the intuitive properties of entropy.

A different explanation (based on macro- and micro-states) makes it intuitively obvious why entropy is non-decreasing with time or, with a little more depth, what entropy has to do with temperature.

That doesn't strike me as a problem. Definitions are often highly abstract and counterintuitive, with much study required to understand at an intuitive level what motivates them. Rigour and intuition are often competing concerns, and I think definitions should favour the former. The definition of compactness in topology, or indeed just the definition of a topological space, are examples of this - at face value, they're bizarre. You have to muck around a fair bit to understand why they cut so brilliantly to the heart of the thing.

The above evidently only suffices as a definition, not as an entire course. My point was just that I don't think any other introduction beats this one, especially for a book with the given title.

In particular it has always been my starting point whenever I introduce (the entropy of) macro- and micro-states in my statistical physics course.

That definition is on page 18, I agree it could've been reached a bit faster but a lot of the preceding material is motivation, puzzles, and examples.

This definition isn't the end goal, the physics things are.

That covers one and a half of the twelve points he discusses.

Correct! And it took me just one paragraph, not the 18 pages of meandering (and I think confusing) text that it takes the author of the pdf to introduce the same idea.

You didn’t introduce any idea. You said it’s “just a number” and wrote down a formula without any explanation or justification.

I concede that it was much shorter though. Well done!

Haha, you reminded me of that idea in software engineering that "it's easy to make an algorithm faster if you accept that at times it might output the wrong result; in fact you can make it infinitely fast".

Thanks for defining it rigorously. I think people are getting offended on John Baez's behalf because his book obviously covers a lot more - like why does this particular number seem to be so useful in so many different contexts? How could you have motivated it a priori? Etcetera, although I suspect you know all this already.

But I think you're right that a clear focus on the maths is useful for dispelling misconceptions about entropy.

Misconceptions about entropy are misconceptions about physics. You can't dispel them by focusing on the maths and ignoring the physics entirely - especially if you just write an equation without any conceptual discussion, not even mathematical.

I didn't say to only focus on the mathematics. Obviously wherever you apply the concept (and it's applied to much more than physics) there will be other sources of confusion. But just knowing that entropy is a property of a distribution, not a state, already helps clarify your thinking.

For instance, you know that the question "what is the entropy of a broken egg?" is actually meaningless, because you haven't specified a distribution (or a set of micro/macro states in the stat mech formulation).

Ok, I don’t think we disagree. But knowing that entropy is a property of a distribution given by that equation is far from “being it” as a definition of the concept of entropy in physics.

Anyway, it seems that - like many others - I just misunderstood the “little need for all the mystery” remark.

> is far from “being it” as a definition of the concept of entropy in physics.

I simply do not understand why you say this. Entropy in physics is defined using exactly the same equation. The only thing I need to add is the choice of probability distribution (i.e. the choice of ensemble).

I really do not see a better "definition of the concept of entropy in physics".

(For quantum systems one can nitpick a bit about density matrices, but in my view that is merely a technicality on how to extend probability distributions to Hilbert spaces.)

I’d say that the concept of entropy “in physics” is about (even better: starts with) the choice of a probability distribution. Without that you have just a number associated with each probability distribution - distributions without any physical meaning so those numbers won’t have any physical meaning either.

But that’s fine, I accept that you may think that it’s just a little detail.

(Quantum mechanics has no mystery either.

ih/2pi dA/dt = AH - HA

That’s it. The only thing one needs to add is a choice of operators.)

Sarcasm aside, I really do not think you are making much sense.

Obviously one first introduces the relevant probability distributions (at least the micro-canonical ensemble). But once you have those, your comment still does not offer a better way to introduce entropy other than what I wrote. What did you have in mind?

In other words, how did you think I should change this part of my course?

Right, I see what you're saying. I agree that there is a lot of subtlety in the way entropy is actually used in practice.

Many students will want to know where the minus sign comes from. I like to write the formula instead as S = sum_i p_i log( 1 / p_i ), where (1 / p_i) is the "surprise" (i.e., expected number of trials before first success) associated with a given outcome (or symbol), and we average it over all outcomes (i.e., weight it by the probability of the outcome). We take the log of the "surprise" because entropy is an extensive quantity, so we want it to be additive.

Everyone who sees that formula can immediately see that it leads to principle of maximum entropy.

Just like everyone seeing Maxwell's equations can immediately see that you can derive the speed of light classically.

Oh dear. The joy of explaining the little you know.

As of this moment there are six other top-level comments which each try to define entropy, and frankly they are all wrong, circular, or incomplete. Clearly the very definition of entropy is confusing, and the definition is what my comment provides.

I never said that all the other properties of entropy are now immediately visible. Instead I think it is the only universal starting point of any reasonable discussion or course on the subject.

And lastly I am frankly getting discouraged by all the dismissive responses. So this will be my last comment for the day, and I will leave you in the careful hands of, say, the six other people who are obviously so extremely knowledgeable about this topic. /s

The definition by itself without intuition of application is of little use

Don’t forget it’s the only measure of the arrow of time.

One could also say that it’s just a consequence of the passage of time (as in getting away from a boundary condition). The decay of radioactive atoms is also a measure of the arrow of time - of course we can say that’s the same thing.

CP violation may (or may not) be more relevant regarding the arrow of time.

[flagged]

Please don't post comments just to be a dick.

[flagged]

You are completely right of course. I am merely a professor in theoretical physics who has been teaching this stuff for a number of years now.

[flagged]

The way I understand it is with an analogy to probability. To me, events are to microscopic states like random variable is to entropy.

My first contact with entropy was in chemistry and thermodynamics and I didn't get it. Actually I didn't get anything from engineering thermodynamics books such as Çengel and so.

You must go to statistical mechanics or information theory to understand entropy. Or trying these PRICELESS NOTES from Prof. Suo: https://docs.google.com/document/d/1UMwpoDRZLlawWlL2Dz6YEomy...

This seems like a great resource for referencing the various definitions. I've tried my hand at developing an intuitive understanding: https://spacechimplives.substack.com/p/observers-and-entropy. TLDR - it's an artifact of the model we're using. In the thermodynamic definition, the energy accounted for in the terms of our model is information. The energy that's not is entropic energy. Hence why it's not "useable" energy, and the process isn't reversible.

Hawking on the subject

https://youtu.be/wgltMtf1JhY

How do you get to the actual book / tweets? The link just takes me back to the forward...

http://math.ucr.edu/home/baez/what_is_entropy.pdf

MC Hawking already explained this

https://youtu.be/wgltMtf1JhY

ΔS = ΔQ/T

[flagged]

Entropy is the distribution of potential over negative potential.

This could be said "the distribution of what ever may be over the surface area of where it may be."

This is erroneously taught in conventional information theory as "the number of configurations in a system" or the available information that has yet to be retrieved. Entropy includes the unforeseen, and out of scope.

Entropy is merely the predisposition to flow from high to low pressure (potential). That is it. Information is a form of potential.

Philosophically what are entropy's guarantees?

- That there will always be a super-scope, which may interfere in ways unanticipated;

- everything decays the only mystery is when and how.

This answer is as confident as it's wrong and full of gibberish.

Entropy is not a "distribution", it's a functional that maps a probability distribution to a scalar value, i.e. a single number.

It's the negative mean log-probability of a distribution.

It's an elementary statistical concept, independent of physical concepts like “pressure”, “potential”, and so on.

It sounds like log-probability is the manifold surface area.

Distribution of potential over negative potential. Negative potential is the "surface area", and available potential distributes itself "geometrically". All this is iterative obviously, some periodicity set by universal speed limit.

It really doesn't sound like you disagree with me.

Baez seems to use the definition you call erroneous: "It’s easy to wax poetic about entropy, but what is it? I claim it’s the amount of information we don’t know about a situation, which in principle we could learn."

> Entropy includes the unforeseen, and out of scope.

Mmh, no it doesn't. You need to define your state space, otherwise it's an undefined quantity.

But it is possible to account for the unforeseen (or out-of-vocabulary) by, for example, a Good-Turing estimate. This satisfies your demand for a fully defined state space while also being consistent with GP's definition.

You are referring to the conceptual device you believe bongs to you and your equations. Entropy creates attraction and repulsion, even causing working bias. We rely upon it for our system functions.

Undefined is uncertainty is entropic.

Entropy is a measure, it doesn't create anything. This is highly misleading.

> bongs

indeed

All definitions of entropy stem from one central, universal definition: Entropy is the amount of energy unable to be used for useful work. Or better put grammatically: entropy describes the effect that not all energy consumed can be used for work.

There's a good case to be made that the information-theoretic definition of entropy is the most fundamental one, and the version that shows up in physics is just that concept as applied to physics.

My favorite course I took as part of my physics degree was statistical mechanics. It leaned way closer to information theory than I would have expected going in, but in retrospect should have been obvious.

Unrelated: my favorite bit from any physics book is probably still the introduction of the first chapter of "States of Matter" by David Goodstein: "Ludwig Boltzmann, who spent much of his life studying statistical mechanics, died in 1906, by his own hand. Paul Ehrenfest, carrying on the work, died similarly in 1933. Now it is our turn to study statistical mechanics."

That would mean that information theory is not part of physics, right? So information theory and entropy are part of metaphysics?

Well it's part of math, which physics is already based on.

Whereas metaphysics is, imo, "stuff that's made up and doesn't matter". Probably not the most standard take.

I'm wondering, isn't Information Theory as much part of physics as Thermodynamics is?

Would you say that Geometry is as much a part of physics as Optics is?

Not really. Information theory applies to anything probability applies to, including many situations that aren't "physics" per se. For instance it has a lot to do with algorithms and data as well. I think of it as being at the level of geometry and calculus.

Yeah, people seem to misunderstand that entropy as applied to thermodynamics is simply an aggregate statistic that summarizes the complex state of a thermodynamic system as a single real number.

The fact that entropy always rises, etc., has nothing to do with the statistical concept of entropy itself. It is simply an easier way to express the physical fact that individual atoms spread their kinetic energy out across a large volume.

I'm not sure that's quite the right perspective. It's not a coincidence that entropy increases over time; the increase in entropy seems to be very fundamental to the way physics goes. I prefer the interpretation "physics doesn't care what direction the arrow of time points, but we perceive it as pointing in the direction of increasing entropy". Although that's not totally satisfying either.

This definition is far from universal.

I think what you describe is the application of entropy in the thermodynamic setting, which doesn't apply to "all definitions".