For those interested in looking a little deeper into the characteristic function, it may be worth pointing out that the characteristic function is equal to the Fourier transform (with the sign of the argument reversed) of the probability distribution in question.
In my own experience teaching probability theory to physicists and engineers, establishing this connection is often a good way of helping people build intuition for why characteristic functions are so useful, why they crop up everywhere in probability theory, and why we can extract so much useful information about a distribution by looking at the characteristic function (since this group of students tends to already be rather familiar with Fourier transforms).
Yes, this provides good intuition about why it is useful: the PDF of the sum of two random variables is the convolution of the original PDFs. A convolution is awkward to work with, but by the convolution theorem it is a multiplication in the Fourier domain. This immediately suggests that the Fourier transform of a PDF would be a useful thing to work with.
If you don't say that this is what you are doing then it all seems quite mysterious.
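To make that concrete, here is a small numerical sketch of my own (not from the comment above): discretise the N(0,1) density, convolve it with itself, and check that the result is the N(0,2) density, exactly as the Fourier picture predicts (the characteristic function exp(-t²/2), squared, is exp(-t²), which is the characteristic function of N(0,2)).

```python
import numpy as np

dx = 0.01
x = np.arange(-8, 8, dx)

def normal_pdf(z, sigma):
    return np.exp(-z**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

f = normal_pdf(x, 1.0)                    # density of X ~ N(0, 1)

# Density of X1 + X2 (independent copies) as a discretised convolution integral.
conv = np.convolve(f, f) * dx
u = 2 * x[0] + dx * np.arange(len(conv))  # grid of possible sums x_i + x_j

# Multiplication in the Fourier domain says the result should be N(0, 2).
print(np.max(np.abs(conv - normal_pdf(u, np.sqrt(2)))))  # prints a tiny number
```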
> the PDF of the sum of two random variables is the convolution of the original PDFs
(Probably obvious to everyone reading, but the variables should be independent.)
But I'd rather assume the variables are independent and then blame statistics when I get the wrong answer!
This is a good place to use cumulants. Instead of working with joint characteristic functions, which gets messy, cumulants let you isolate the effects of correlation into a separate term. The only limitation is that this doesn't work if the relevant moments don't exist.
As a physicist, the moment when everything just clicked was when I realised that connected Feynman diagrams were basically the cumulants of that distribution. Then almost everything in physics is about "what is the characteristic/moment/cumulant generating function?" and associated Legendre transforms.
A little-known bit of history: Feynman developed a diagrammatic method for expressing the moments of PGFs in his study of the stochastic theory of fission chains, before his work on QED. See:
https://www.osti.gov/biblio/1775045
Wow. I am not a physicist, but I use pdfs and moments and cumulants all the time. I came up with my own method to calculate cumulants for affine processes using some recursions, and they work. But if I hear you right, I might have stumbled upon something that Feynman did 70 years ago, and he probably did it better. Any good links you can recommend?
> As a physicist, the moment when everything just clicked was when I realised that connected Feynman diagrams were basically the cumulants of that distribution.
And the generating function of the cumulants is the logarithm of the generating function of the distribution (Fourier transform).
I feel like it's almost criminal of textbook writers not to mention this when introducing the characteristic function... At least as an aside or a footnote, for readers already familiar with Fourier transforms.
I had not made that connection and find that incredibly useful. Thank you for pointing that out.
But isn't a characteristic function just "the" way to bridge the gap between sets, functions, and logic? (...a three-way bridge!?)
I mean, it was useful for me to think of it as a translation from sets and logic (this variable x is in the set, xor not) into functions (a function f(x) that returns 1, or true, whenever x is in set S).
How the heck is that a Fourier transform!?
You're thinking of a "characteristic function" in the sense of the "indicator function" of a subset (https://en.wikipedia.org/wiki/Indicator_function), which is a different thing from the characteristic function of a probability distribution.
“Characteristic function” is (was) an overloaded term.
What you described is more often referred to as an “indicator function” these days, with “characteristic function” denoting the transform (Fourier, Laplace, or Z, depending on context). Closely related to “moment generating functions”, to the point of being almost interchangeable.
So the same thing, but: characteristic functions as I knew them before these posts are a rudimentary 2-variable finite version. Point and line (but the line is a curve, a circle because e).
But the new and improved 21st century characteristic functions are n-variable and have a full continuous spectrum of values between zero (false) and one (true), but only potentially, lest the infinite realize itself (which would make the theories illogical).
this way of thinking about it makes sense to me, even if it's ever so slightly wrong by some nitpickable point https://en.wikipedia.org/wiki/Moment-generating_function
You can think of it like this:
- The characteristic function of a random variable X is defined as the function that maps t --> ExpectedValue[ exp( i * t * X ) ]
- Computing this expected value is the same as regarding t as a constant and integrating the function x --> exp( i * t * x) with respect to the distribution of X, i.e. if X has the density f, we compute the integral of f(x) * exp( i * t * x) with respect to x over the domain of f.
- on the other hand: computing the Fourier transform of f (here representing the density of X) and evaluating it at point t (i.e. computing (F(f))(t) if F represents the Fourier transform) is the same as fixing t and computing the integral of f(x) * exp( -i * t * x) with respect to x.
- Rearranging the integrand in the previous expression to f(x) * exp( i * -t * x), we see that it is the same as the integrand used in the characteristic function, only with a -t instead of a t.
Hope that helps :)
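If it helps, here is a quick numerical check of the bullet points above (my own sketch, using the standard normal, whose characteristic function is known to be exp(-t²/2)):

```python
import numpy as np

dx = 0.001
x = np.arange(-10, 10, dx)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # density of X ~ N(0, 1)

def char_fn(t):
    # E[exp(i t X)], approximated by quadrature against the density.
    return np.sum(f * np.exp(1j * t * x)) * dx

def fourier(t):
    # Fourier transform of f with the e^{-i t x} sign convention.
    return np.sum(f * np.exp(-1j * t * x)) * dx

for t in (0.5, 1.0, 2.0):
    print(abs(char_fn(t) - np.exp(-t**2 / 2)),   # matches the known ch.f. of N(0, 1)
          abs(char_fn(t) - fourier(-t)))         # and equals the FT evaluated at -t
```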
https://en.m.wikipedia.org/wiki/Characteristic_function_(pro...
Herbert S. Wilf (1990): Generatingfunctionology. https://www2.math.upenn.edu/~wilf/gfology2.pdf
"I have long struggled with understanding what probability-generating functions are and how to intuit them. There were two pieces of the puzzle missing for me, and we’ll go through both in this article."
Great article. For more, I really recommend Analytic Combinatorics:
https://ac.cs.princeton.edu/home/
Second this. This class is a classic example of a conceptual blockbuster. Once one learns it, the complexity analysis of algorithms will never be the same again. In general, if a techie wants to spend their spare time learning new stuff, they will be better off focusing more on conceptual material like this, as the return will compound over the years.
“If we want to encode the vector [6,2,8,4] in a single expression we can create a function containing those numbers: f(x) = 6 + 2x + 8x² + 4x³
…or if you flip the vector and use x=10:
Yes, but the polynomial form generalizes to coefficients of an arbitrary field, not just naturals. If your vector were, say, [1.3, 2.197656, pi, -1/2, 3*2i] then there wouldn’t be a reasonable base you could pick for a place-value representation.
Indeed. However, note that this is limited to encoding values between 0 and 9.
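A tiny sketch of my own (the helper names are made up for illustration) showing why the digits-only restriction matters: with coefficients between 0 and 9 the vector can be read back off from the single number f(10), but a coefficient of 10 or more "carries" and collides with a different vector.

```python
def encode_at_10(vec):
    # Evaluate the polynomial with coefficients vec (lowest degree first) at x = 10.
    return sum(c * 10**k for k, c in enumerate(vec))

def decode_from_10(n, length):
    # Read the coefficients back off as base-10 digits; only valid if every
    # coefficient was between 0 and 9.
    return [(n // 10**k) % 10 for k in range(length)]

print(encode_at_10([6, 2, 8, 4]))    # 4826
print(decode_from_10(4826, 4))       # [6, 2, 8, 4] -- recovered exactly
print(encode_at_10([16, 2, 8, 4]))   # 4836 -- same value as [6, 3, 8, 4], so not recoverable
```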
Probably worth noting that as we now know, polynomials (over a field) are a vector space, not just convertible to one. The set of formal monomials { x^0, x^1, x^2, … } is an orthogonal basis.
A basis yes, but not orthogonal.
How so? It’s impossible to express x^n as any linear combination of x^k, k != n.
You have mixed up these two concepts:
https://en.wikipedia.org/wiki/Linear_independence
https://en.wikipedia.org/wiki/Orthogonal_basis
Okay, I think I get it… I thought one would/could define orthogonality of polynomials in terms of an inner product equivalent to the familiar K^n dot product, ie. sum of pairwise products of equal-degree coefficients, but I guess polynomials just don’t work that way. I can’t say I understand the hows and whys of the "correct" inner product given by Wikipedia :/
> I thought one would/could define orthogonality of polynomials in terms of an inner product equivalent to the familiar K^n dot product, ie. sum of pairwise products of equal-degree coefficients
Well, you can. But then you essentially get the real coordinate space, and the fact that the vectors are polynomials rather than plain real-valued tuples becomes an extra story slapped on top that has no real relevance.
Polynomial vector spaces become useful when we treat polynomials as functions, and the vector space of polynomials as a subspace of some other space of functions. With function spaces, the inner product is defined as an integral of the product of two functions (possibly with a weight function thrown in).
Or maybe the coefficient-based inner product spaces of polynomials have some uses, too. But they are less common than the function-based (integral) inner product.
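As a concrete illustration of the two inner products (my own sketch): under the coefficient-wise inner product the monomials 1 and x are orthogonal, but under the function-space inner product on, say, [0, 1] they are not.

```python
import sympy as sp

x = sp.symbols("x")

# Coefficient-based inner product of p = 1 (coeffs [1, 0]) and q = x (coeffs [0, 1]).
coeff_inner = 1 * 0 + 0 * 1
print(coeff_inner)                          # 0 -> orthogonal in the coordinate sense

# Function-space inner product <p, q> = integral of p(x) * q(x) over [0, 1].
func_inner = sp.integrate(1 * x, (x, 0, 1))
print(func_inner)                           # 1/2 -> not orthogonal as functions on [0, 1]
```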
Yeah, that's what I figured. Thanks for your comments, I learned something new!
In a footnote OP says:
> I’m not yet good enough to intuitively get why the curvature of the probability-generating would be related to variance, but I’d be happy to receive pointers here.
Here’s my intuition for this.
The characteristic function is the Fourier transform of the density.
If the density is in “t” units, the ch.f. is in f= 1/t units. It is the “inverse domain.” (I’m using “f” to suggest frequency, ie the Fourier coordinate.)
Of course it is not a simple coordinate transformation! But some intuition does carry over.
This is reflected in all sorts of ways. It’s one reason why the IFT formula is so functionally close to the FT formula.
Anyway.
Because of this, the behavior of the FT (ch.f.) very close to the origin (“f=0”) tells you about the tails of the distribution (t = 1/f is large).
In particular, high curvature around the origin tells you the tails are heavy. That’s the variance.
This extends to the fourth moment. You can get even sharper curvature around the origin of the FT (ch.f. at f=0) with a large coefficient on the fourth order term. This corresponds to a large fourth moment of the pdf, or a high kurtosis.
It’s useful to recall that, because of analytic continuation, knowing all the derivatives at the one point f=0 determines the ch.f. everywhere, and thereby determines the complete density. This corresponds to the fact that knowing all the moments determines the full density.
So in a very real sense, you only need the ch.f. in a tight neighborhood of the origin!
(Provided all moments are finite.)
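For what it's worth, the standard Taylor expansion at the origin makes the curvature claim precise (nothing here beyond the usual moment expansion of the ch.f.):

```latex
\varphi_X(t) = \mathbb{E}\!\left[e^{itX}\right]
             = 1 + it\,\mathbb{E}[X] - \tfrac{t^2}{2}\,\mathbb{E}[X^2] + O(t^3),
\qquad \varphi_X''(0) = -\,\mathbb{E}[X^2],
```

and taking the logarithm (the cumulant generating function mentioned upthread) isolates the variance as the quadratic coefficient:

```latex
\log \varphi_X(t) = it\,\mathbb{E}[X] - \tfrac{t^2}{2}\operatorname{Var}(X) + O(t^3).
```

So the curvature at the origin is exactly the second moment (or, after the log, the variance), and the fourth-order coefficient carries the kurtosis story from the comment above.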
I've always wondered why the hell generating functions existed, and I think this line sums it up:
> When de Moivre invented much of modern probability in the mid-1700s, he didn’t have vectors! Vectors are an 1800s invention.
Doesn't explain why we still teach them 300 years later though. That's what the second half of the article covers.
Probably worth mentioning the moment-generating function as well, since it's a bit more elementary than the characteristic function and shares many of its properties. It's also more simply related to the probability generating function: you can go from one to the other with a basic change of coordinates (t -> log(x)). In my experience the moment generating function is also usually the easier one to calculate.
In fact most properties of the PGF come from the moment-generating/characteristic function, including why the second derivative is related to the variance. The second derivative of the moment generating function at zero is the second moment E[X^2], and the second derivative of the logarithm of the MGF at zero is the variance (the second cumulant).
The one property that's somewhat unique to the PGF is how composition relates to drawing a randomly-sized sample, which I can see could be useful.
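To illustrate the composition property (a Monte Carlo sketch of my own, not from the comment above): if N ~ Poisson(λ) and the X_i are iid Bernoulli(p), the random sum S = X_1 + ... + X_N has PGF G_N(G_X(x)) = exp(λ((1-p+px) - 1)) = exp(λp(x-1)), i.e. S ~ Poisson(λp).

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, trials = 3.0, 0.4, 200_000

N = rng.poisson(lam, size=trials)   # random sample size
S = rng.binomial(N, p)              # sum of N iid Bernoulli(p) draws

def empirical_pgf(samples, x):
    return np.mean(x ** samples)    # estimate of E[x^S]

for x in (0.2, 0.5, 0.9):
    # Both columns should agree up to Monte Carlo error: the PGF of Poisson(lam * p).
    print(x, empirical_pgf(S, x), np.exp(lam * p * (x - 1)))
```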
One can also think of probability generating functions as (flipped) Z transforms, moment generating functions as (flipped) Laplace transforms, and characteristic functions as Fourier transforms of the respective PMF/PDF. A lot of their properties then follow from simple properties of Signals and Systems.
Do you have a reference that explains this in more detail? I'd be curious to know.
I don't have a reference off the top of my head, but the main idea is as follows:
The definition of MGF of a random variable with PDF f(x) is
E[e^{sX}] = int_{-inf}^{inf} f(x) e^{sx} dx
The definition of Laplace Transform of a signal f(t) is
F(s) = int _{-inf}^{inf} f(t) e^{-st} dt
Hence the MGF is a 'flipped' Laplace transform (the same integral with s -> -s).
Now, we know that the MGF of a sum of independent RVs is the product of their MGFs. So if we take the inverse Laplace transform, the density of the sum is the convolution of the individual densities.
Similarly, taking a derivative in the frequency domain is the same as multiplying by the time variable in the time domain: so M'_X(s) is the 'flipped Laplace transform' of x f(x), and its value at s=0 is the 'DC gain' of that signal, i.e. E[X].
And so on... the properties are all immediate consequences of the definition of the MGF, and since that definition is essentially the same as that of a Laplace transform, each one has an equivalent property in signals and systems as well.
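A quick symbolic check of the 'flip' (my own sketch, using X ~ Exp(1), whose Laplace transform is 1/(1+s) and whose MGF is 1/(1-s) for s < 1):

```python
import sympy as sp

x = sp.symbols("x", positive=True)
s = sp.symbols("s", positive=True)
f = sp.exp(-x)                      # density of X ~ Exp(1), supported on x >= 0

# Laplace transform of the density: integral of f(x) e^{-s x} over x >= 0.
laplace = sp.simplify(sp.integrate(f * sp.exp(-s * x), (x, 0, sp.oo)))
print(laplace)                      # 1/(s + 1)

# By the definitions above, the MGF is the same integral with s -> -s.
print(laplace.subs(s, -s))          # 1/(1 - s), the MGF of Exp(1) (valid for s < 1)
```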
Is there a relationship between algebraic polynomial encoding of sequences, and https://en.wikipedia.org/wiki/G%C3%B6del_numbering_for_seque... ?
Does an encoding of a sequence in a given Gödel numbering, also somehow "retrievably" encode the probability space of the sequence's terms?
Generating functions are also a great tool in combinatorics (see for example the book Analytic Combinatorics by Flajolet and Sedgewick).
I rather like the derivation of the CLT using MGFs.
https://courses.cs.washington.edu/courses/cse312/20su/files/...
Tangentially related: http://bactra.org/notebooks/cumulants.html
I think the 12th order G(t) example is missing another term with coefficient 1/5 -- since these coefficients must sum to 1
Unless it’s already been fixed, the queen’s term (t^12) has 2/5 in the article.
Lost the plot at "... and for sensible v, this is equivalent to ..." :(
For that, I would look at early calculus/pre-calc, I think, examining infinite series and their properties and equivalencies.
There are certain forms like that which have well-known values they converge to as you keep adding terms out to infinity. Sometimes that convergence is only possible if your domain is limited, e.g. to [0,1].
That clears it, thanks! I didn't figure out that "sensible" referred to convergent G(t).
It doesn't mean that in general; I think the author is saying "for sensible v" somewhat informally to mean "for v for which this makes sense".
You also don't really have to worry about convergence, since these are formal power series.
This is just the sum of a geometric sequence.
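For reference, the identity being invoked (a general recap of my own; the article's specific v is whatever expression sits inside its sum): for |v| < 1,

```latex
\sum_{k=0}^{\infty} v^k = \frac{1}{1-v},
\qquad\text{e.g.}\qquad
\sum_{k=0}^{\infty} p\,(1-p)^k\,t^k = \frac{p}{1-(1-p)\,t}
\quad\text{for } |(1-p)\,t| < 1,
```

which is the kind of closed form a PGF collapses to once the series converges; the "sensible v" caveat is just that convergence requirement.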