Case study in the politics of mathematics: optimality of Nash equilibria

580 words

Back in the Soviet era, Leonid Vitaliyevich Kantorovich got himself in trouble. He developed a theory of optimization problems, originally for the sake of optimizing plywood production, but he saw a potential for using his techniques to optimize the entire Soviet economy. So he sent letters to the central planning bureau to convince them to make use of his ideas.

He has been involved in advanced mathematical research since the age of 15; in 1939 he invented linear programming, one of the most significant contributions to economic management in the twentieth century. Kantorovich has spent most of his adult life battling to win acceptance for his revolutionary concept from Soviet academic and economic bureaucracies; the value of linear programming to Soviet economic practices was not really recognized by his country’s authorities until 1965, when Kantorovich was awarded a Lenin prize for his work.

Excerpt from original CIA file on Kantorovich.

There was just one problem, arising from the theory of linear programming duality. For any linear optimization problem, we can derive a dual problem. If you solve both the primal and dual problems, they turn out to have the same solution value, and moreover the optimal solution to the one certifies optimality of the solution to the other. These “dual solutions” can have clear interpretations. In the case of optimizing resource allocations, the dual solution can be interpreted as market prices.

LP duality theory connects the notion of optimal resource allocation as used for central planning with the efficient market hypothesis and Nash equilibria. This connection can be interpreted as “capitalist markets find the optimal allocation of goods and services”. Obviously, the communists did not like that.

This interpretation has been popular in the US as well, where it is usually framed in terms of the slightly weaker "first welfare theorem" and Smith's invisible hand.

Of course there are numerous solid arguments for why these interpretations are bogus. To name just a few: markets are not convex, barriers to market entry are non-zero, humans are not perfectly rational nor omniscient, computation and communication are neither free nor instantaneous, negative externalities are common, the dual LP only takes prices into account and not net cash flow of individuals, and the “welfare function” doesn’t necessarily exist.

All of this was known 50+ years ago, but apart from the 1950 cutesy example of the prisoner’s dilemma, game theorists didn’t do much of note to dispute the notion that Nash equilibria are typically good. Nothing really got formalized until 20 or so years ago with the introduction of the price of anarchy. The price of anarchy (stability) of a game is the ratio of social welfare in the optimal solution versus the social welfare in the worst (best) Nash equilibrium. The PoA and PoS can be unboundedly bad, and recent decades have seen a lot of exciting work happening here.
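As a toy illustration of the definition (my own, with made-up payoff numbers), the prisoner’s dilemma already has a price of anarchy of 3: the only Nash equilibrium is mutual defection, while mutual cooperation maximizes welfare.

# Toy example: price of anarchy of a prisoner's dilemma with hypothetical payoffs.
# Payoffs are (row player, column player); social welfare is their sum.
payoffs = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 4),
    ("defect",    "cooperate"): (4, 0),
    ("defect",    "defect"):    (1, 1),
}

def welfare(profile):
    return sum(payoffs[profile])

# Defecting is a dominant strategy, so mutual defection is the unique Nash equilibrium.
worst_equilibrium = ("defect", "defect")
optimum = max(payoffs, key=welfare)  # mutual cooperation

print(welfare(optimum) / welfare(worst_equilibrium))  # price of anarchy: 6 / 2 = 3.0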

Mathematics is political. When we want to apply mathematical theorems to the real world, we need to make certain simplifying assumptions. Or actually, if we want to think any thoughts about the real world, but whatever. The simplifications we make determine the theorems we can prove. We can work on theorems that “prove” that capitalism is good or we can work on theorems that “prove” that capitalism is bad. Both would be valid mathematics, but both individuals and communities can decide to value one over the other, resulting in undergrads being taught either “10/10 scientists agree: capitalism is good” or “beware: capitalism can be arbitrarily bad”.

What does it mean to call something a social construct?

430 words

Many explanations of the term “social construct” are severely lacking. Let me explain what I personally mean when I call something a social construct. Largely, it means that we could have done it differently.

Gender is a social construct. Our ideas of which things are for men and which are for women have differed across place and time. Little boys used to wear dresses, women used to be considered mentally unfit to own property, gender nonconforming people (like women wearing pants) used to be shunned by their community. And that is just Western countries over the past century, without considering different millennia or cultures. We could change our current notions of gender without the gods coming down to interfere with our ambitions.

Money is a social construct. By choosing to redistribute or not, we shape the role of money and the way it influences the rich and poor. Same for (dis)allowing usury, theft or private ownership of the means of production, to name a few socially constructed aspects of private property and thus of money. Every facet of money is constructed through a social process, but the resulting structure can impact a person’s life as much as the law of gravity.

Mathematics is a social construct. (If you’re a platonist, read this as “our practice and understanding of mathematics”; I refer to the mathematics we know, not the mathematics that can be.) There is no linear “tech tree” forcing people to develop Euclidean geometry before graph theory or predator/prey models. No law of nature tells pure mathematicians to abhor the thought of their work finding practical use. Instead of sorting computational problems into various Turing degrees (computable problems have the lowest Turing degree, the halting problem has a higher one, and there are infinitely many other Turing degrees, some higher or lower than the halting problem and some incomparable), we could have been doing something useful instead. Peer reviewers use their own judgement to decide whether a result is novel, interesting and proven with enough rigor, no objective measurement devices involved. The choice of what is Real Mathematics and what is not is similarly arbitrary.

I tell people that a thing is socially constructed when I want to remind them that it could be different. When they’re talking about the thing as if it is a fixture of life and I want to invite them to imagine how it could be changed. Because no matter your philosophy of mathematics or external reality, the way we understand and value various ideas is always different among individuals and communities.

Deepmind will start playing fair?

195 words

In case you missed it, last week brought the most exciting piece of AI news of the past couple years. Deepmind’s Alphastar will be playing real non-NDA’d humans for multiple games (via Ars Technica).

This is cool news. Partly because they finally seem intent on fixing their cheating with superhuman micro, but mostly because repeat games could allow humans to develop counter-strategies. The exhibition matches earlier this year only had the humans play one match against any given bot, thus giving the bot a huge informational advantage. (The neural nets were trained through supervised learning on human matches before being pitted against each other.) These new matches will hopefully end up being more fair. However, Blizzard’s press release is vague enough that Deepmind could still decide to play only a handful of games, which would prevent humans from gaining enough knowledge to devise counter-strategies. Worse, the Alphastar team could decide to stop playing once they observe online message boards sharing good tactics between human players, or any other nasty hacking of p’s.

The hype labs haven’t been doing well in fair play or reproducibility so far, but I’m willing to hope they’ll improve.

Ovens have secret built-in automatic timers

269 words

Every oven I’ve ever used has had a secret function, a mechanism that automatically tells you when the food is ready. It is wonderful and I want to tell you about it.

So most ovens control their temperature using a bimetallic strip. When the temperature inside is less than the target temperature, the strip closes a circuit that activates the heating. As soon as the temperature is high enough, the strip will have deformed enough to open the circuit and stop the heating. In many ovens, especially older ones, you can hear this as a soft *click*. If you are lucky, the mechanism is sensitive enough to rapidly switch on and off to stay at temperature, at least for a couple of seconds.

If you eat frozen pizza, it often only has to be heated to a sufficient temperature. When it reaches this temperature, the pizza stops cooling down the air around it, thereby allowing the oven to reach its target temperature and start saying *click*. So the sound will tell you when the food is ready, no need to read the packaging to find the correct baking time.

The same happens for dishes that are ready when enough water has evaporated, or when a certain endothermic chemical reaction has stopped happening. All are done the moment the oven says *click*. There might be some exceptions to this phenomenon, but I have yet to run into one. Which is great because I always forget to read oven instructions on packaging or recipes before throwing them out. Try it out with your own electrically powered food heating units.

Which companies are adversaries?

234 words

I’ve been looking on and off for mp3 players for a couple months. I wanted a device with proper playlist support, bluetooth, and sufficient battery and storage capacity. It had to be cheap and with a UI that does not make me wish for death.

I ended up buying a $30 second hand Nokia Lumia 650. I deleted everything except the music player and maps app, downloaded maps of every country I might reasonably ever visit and the complete contents of Wikivoyage, copied my music onto it from my pc and put it permanently in airplane mode. It is a bit too laggy, but other than that I like this setup a lot.

But more important than my love for Windows Phone is my hatred for Android and iOS. I dislike the former for its role in the global surveillance economy and its butt-ugly interface. I dislike the latter because of its adversarial pricing model and excessively walled garden.

I don’t want to get my dinner from the pathologically neoliberal butcher, the dominant-strategy-playing externality-indifferent brewer or the stalking, price-discriminating, search-engine-optimizing baker. Their antithesis probably consists of the local organic farmer’s market, self-hosted FOSS software and artisan everything, but I do like economies of scale.

I’m still searching for the synthesis. For now, I’ll start with trying to minimize my interactions with companies who relate to their users or customers in a very adversarial manner.

Celebrating 20 years of Pokémon Snap

207 words

Pokémon Snap first came out today 20 years ago. It is an on-rails shooter game. You’re in a car that drives around an island filled with pokémon, and your goal is to take pretty pictures of the ones that live there. It was one of my favourite games when I was little. Its relaxing gameplay and cute jokes still managed to put a smile on my face when I played the game again recently for the first time in years.

“apple-shaped food”

It takes only a couple hours to take a picture of every pokemon in the game, though with high variance since some pokemon are really tricky to find. The game feels quite modern despite its age.

At the end of every course you can show a number of photos to Professor Oak and he will rate their quality on a couple different aspects, like how well the pokémon fits in the frame and if they are in a nice pose. The amazing thing is that the algorithm’s ratings almost exactly match my own opinion of which photos are good. You don’t see that often in algorithms.

The original Pokémon Snap website is surprisingly still online. It is blazingly fast. Wirth’s law is my least favourite law of nature.

Beyond Condorcet winners: Majority Judgement

631 words

There have been rumours on Wikipedia that Michel Balinski passed away this February 4th. If this is indeed the case, then my heartfelt sympathies go out to those around him. I never knew him personally, but he was an amazing scientist and I’ve used results of his more than once. He was quite a phenomenal speaker.

Today I want to talk about a single-winner voting system that Balinski introduced with Rida Laraki. It is called majority judgement and it is so brilliant that it almost makes me wonder what voting theorists could have been doing both before and since then.

One big concept in social choice theory is majority rule: if most people think that A is better than B, then A should be the winner. Most multi-candidate voting systems generalize this in various ways, always preserving that if candidate A would beat every other candidate B, C, etc., in a pairwise competition, then A should win the election. If candidate A satisfies this criterion, we call A a Condorcet winner. The leading wisdom in social choice theory was that any good voting rule should let the Condorcet winner win (if it exists).

According to my informal sample from the Effective Altruism community, EA’s favourite voting system seems to be approval voting, which is one of these voting systems that generalizes majority rule to multiple candidates.

The genius of majority judgement is that it moves past the Condorcet winner paradigm entirely.

To illustrate, let’s assume we’re running an election between two candidates, the red candidate and the blue candidate, and every voter gets a ballot on which they grade each candidate as “Amazing”, “Good”, “Mediocre” or “Terrible” to express how good of a president they think that candidate would be. Let us for simplicity assume that our voting population consists of three groups, the A’s, B’s and C’s, and everyone in a group votes the exact same way. The outcome of the election is in the table below.

                   A          B          C
% of population    40         20         40
Red candidate      Amazing    Mediocre   Terrible
Blue candidate     Good       Terrible   Amazing

The red candidate wins the Condorcet-style election: 60% of the population prefers the red candidate over the blue candidate. But the blue candidate is clearly better: 80% of the population considers the blue candidate to be “Good” or better, while 60% of the population considers the red candidate to be “Mediocre” or worse.

Majority judgement is a voting rule designed to have the blue candidate win in the above election. The actual algorithm is a bit involved, but the first step is to compare the median votes: if at least 50% of the voters think the blue candidate is “Good” or better and at least 50% of the voters think that the red candidate is “Mediocre” or worse, then the blue candidate wins. If the median opinion is a tie, a more complicated tie-breaking rule kicks in. (The exact rule satisfies a number of optimality criteria and is the only rule to do so; I’ll skip the details in this post.)
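To make that first step concrete, here is a minimal sketch of the median comparison on the election above (my own illustration, not Balinski and Laraki’s full procedure; the tie-breaking rule is omitted):

# Minimal sketch of the first step of majority judgement on the example above.
# Grades are ordered from worst to best; the tie-breaking rule is omitted.
GRADES = ["Terrible", "Mediocre", "Good", "Amazing"]

# 40% grade red "Amazing" and blue "Good", 20% grade red "Mediocre" and
# blue "Terrible", 40% grade red "Terrible" and blue "Amazing".
ballots = (
    [{"red": "Amazing", "blue": "Good"}] * 40
    + [{"red": "Mediocre", "blue": "Terrible"}] * 20
    + [{"red": "Terrible", "blue": "Amazing"}] * 40
)

def median_grade(candidate):
    scores = sorted(GRADES.index(ballot[candidate]) for ballot in ballots)
    return GRADES[scores[len(scores) // 2]]

for candidate in ["red", "blue"]:
    print(candidate, median_grade(candidate))
# Prints "red Mediocre" and "blue Good": the blue candidate wins on median grade.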

I think the concept is very elegant, and I believe that the outcomes really would be better than with a system that elects Condorcet winners. In a talk that Balinski gave, which I was lucky enough to attend, he pointed out another advantage of the majority judgement rule: it allows voters to express what they think of the candidates. You wouldn’t be asking anyone to “vote for the lesser evil”, everyone can keep their conscience clear. Majority judgement admits a clear way of expressing frustration with both candidates: rank both of them very badly. It also helps that the different options are given in words instead of ranking by numbers, for the latter turns out to entice voters to rate their favourite candidate 10/10 and all others 0/10.

Gaussian tail bounds

812 words

One-dimensional tail bounds

The standard normal distribution N(0,1) has probability density function \frac{1}{\sqrt{2\pi}}e^{-x^2/2}. This function has no nice symbolic antiderivative, but we do at times want to (upper) bound expressions of the form \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2} \mathrm{d}t. How can we do this?

One way is to follow this approach. Since t\geq x everywhere, we can upper bound \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2} \mathrm{d}t \leq \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{t}{x} e^{-t^2/2} \mathrm{d}t = \frac{1}{x\sqrt{2\pi}}e^{-x^2/2}.

There is another tail bound which is a bit weaker for large x, but I like the proof better. We’ll give a tail bound by looking at the moment-generating function \lambda \mapsto \mathbb{E}[e^{\lambda X}], where X \sim N(0,1) is our normally distributed random variable. We can explicitly calculate this expectation and find \mathbb{E}[e^{\lambda X}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{\lambda x - x^2/2}\mathrm{d}x = \frac{1}{\sqrt{2\pi}}e^{\lambda^2/2}\int_{-\infty}^\infty e^{-(x-\lambda)^2/2}\mathrm{d}x. The last term is just the entire Gaussian integral shifted a bit, and hence \mathbb{E}[e^{\lambda X}] = e^{\lambda^2/2}. Now we use Chernoff’s bound (an easy corollary of Markov’s inequality) to find \mathbb{P}[X \geq t] \leq \mathbb{E}[e^{\lambda X}]e^{-\lambda t}, which we can minimize over the choice of \lambda; setting \lambda=t, we conclude that \mathbb{P}[X \geq t] \leq e^{-t^2/2}.
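As a quick numerical sanity check (my own addition; it assumes numpy and scipy are installed), we can compare both bounds against the exact tail probability:

# Compare the two tail bounds against the exact Gaussian tail (needs numpy and scipy).
import numpy as np
from scipy.stats import norm

for x in [0.5, 1.0, 2.0, 4.0]:
    exact = norm.sf(x)                                   # P[X >= x] for X ~ N(0,1)
    first_bound = np.exp(-x**2 / 2) / (x * np.sqrt(2 * np.pi))
    chernoff_bound = np.exp(-x**2 / 2)
    print(f"x={x}: exact={exact:.2e}, first bound={first_bound:.2e}, "
          f"Chernoff bound={chernoff_bound:.2e}")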

Multi-variate Gaussians

Let X \in \mathbb{R}^d be N(0,I_d) normally distributed, i.e., X is a vector with iid Gaussian N(0,1) entries. What tail bounds do we get on \|X\|? We start off with Markov’s inequality again: for any 0 < \lambda < 1/2, \mathbb{P}[\|X\| > t] = \mathbb{P}[e^{\lambda\|X\|^2} > e^{\lambda t^2}] \leq \frac{\mathbb{E}[e^{\lambda\|X\|^2}]}{e^{\lambda t^2}}.

Deriving the moment generating function \lambda \mapsto \mathbb{E}[e^{\lambda X_1^2}] of a single squared coordinate is an elementary calculation: for \lambda < 1/2, \int_{-\infty}^\infty e^{\lambda x^2} \cdot e^{-x^2/2} \mathrm{d}x = \int_{-\infty}^\infty e^{-(1-2\lambda)x^2/2}\mathrm{d}x = \frac{\sqrt{2\pi}}{\sqrt{1-2\lambda}}.

The coordinates of X are iid, so \mathbb{E}[e^{\lambda\|X\|^2}] = \mathbb{E}[e^{\lambda X_1^2}]^d = (1-2\lambda)^{-d/2}. Taking the threshold to be t\sqrt{d}, the minimizer is at \lambda=(1-1/t^2)/2, and we find, requiring t \geq 1 for the last inequality, \mathbb{P}[\|X\| > t\sqrt{d}] \leq e^{-d(t^2-2\log t - 1)/2} \leq e^{-d(t-1)^2/2}.
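A small Monte Carlo check of this bound (again my own addition, assuming numpy is available):

# Monte Carlo check of P[||X|| > t*sqrt(d)] <= exp(-d(t-1)^2/2) (needs numpy).
import numpy as np

rng = np.random.default_rng(0)
d, samples, t = 20, 200_000, 1.3

X = rng.standard_normal((samples, d))
empirical = np.mean(np.linalg.norm(X, axis=1) > t * np.sqrt(d))
bound = np.exp(-d * (t - 1) ** 2 / 2)
print(f"empirical tail: {empirical:.4f}, bound: {bound:.4f}")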

Operator norm of Gaussian matrices

The operator norm or spectral norm of a d \times d matrix M is defined as \|M\| := \max_{x \in \mathbb{R}^d \setminus \{0\}} \frac{\|Mx\|}{\|x\|}.

Now if M were a matrix with every entry independently N(0,1), what would the largest singular value of this random Gaussian matrix be? I’ll give an easy tail bound based on a net argument.

An \eta-net, \eta > 0, on the sphere is a subset N \subset \mathbb{S}^{d-1} such that for every point x \in \mathbb{S}^{d-1} there is a net element \omega \in N with \|x-\omega\| \leq \eta, while every two net elements are at distance at least \eta from each other. A greedy algorithm can construct an \eta-net, and any such net has size at most (4/\eta)^d (see e.g. Jiří Matoušek, Lectures on Discrete Geometry, Springer, 2002, page 314). The proof is based on a simple packing argument: the balls of radius \eta/2 around the net elements are disjoint and all fit inside the ball of radius 1+\eta/2 centered at the origin.

Now let N\subset \mathbb{S}^{d-1} be a 1/2-net. By the above, the size of the net is bounded by |N| \leq 8^d.

The function x \mapsto \|Mx\| is \|M\|-Lipschitz. Hence we can bound \|M\| \leq \max_{x\in\mathbb{S}^{d-1}} \min_{\omega \in N} \left( \|M\omega\| + \|M\|\cdot\|x-\omega\| \right) \leq \max_{\omega \in N} \|M\omega\| + \|M\|/2. Rearranging, we have proved that \|M\| \leq 2\max_{\omega\in N} \|M\omega\|.

Now, as M\omega is N(0,I_d) normally distributed for any \omega\in\mathbb{S}^{d-1}, we can use the union bound over all points of N and conclude that, for all t \geq 1, \mathbb{P}[\|M\| \geq 2t\sqrt{d}] \leq 8^d e^{-d(t-1)^2/2}.
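In other words, \|M\| = O(\sqrt{d}) with high probability. A quick empirical look (my own addition, assuming numpy) at how the operator norm actually scales:

# Empirical scaling of the operator norm of a d x d Gaussian matrix (needs numpy).
# The net argument above guarantees ||M|| = O(sqrt(d)) with high probability;
# empirically the ratio ||M||/sqrt(d) hovers around 2.
import numpy as np

rng = np.random.default_rng(1)
for d in [10, 50, 200]:
    M = rng.standard_normal((d, d))
    opnorm = np.linalg.norm(M, ord=2)  # largest singular value
    print(f"d={d}: ||M|| = {opnorm:.1f}, ||M||/sqrt(d) = {opnorm / np.sqrt(d):.2f}")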

Maximum of n Gaussians

For n independent identically distributed variables X_1,\ldots,X_n \sim N(0,1), the union bound combined with the one-dimensional tail bound gives \mathbb{P}[\max_{i \leq n} X_i \geq t] \leq ne^{-t^2/2}, and this bound is tight up to a constant factor.

From this, we can find that the expected maximum is \mathbb{E}[\max_{i \leq n} X_i] = O(\sqrt{\ln n}). We derive this by bounding the integral \mathbb{E}[\max_{i \leq n} X_i] \leq \int_0^\infty \mathbb{P}[\max_{i \leq n} X_i \geq t] {\rm d}t. Split the integral at t = \sqrt{2\ln n}, bound the integrand in the first part by 1 (it is a probability) and in the second part by ne^{-t^2/2}; the first part contributes at most \sqrt{2\ln n} and the second part at most ne^{-\ln n}/\sqrt{2\ln n} \leq 1.
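A short simulation (my own addition, assuming numpy) showing that the expected maximum indeed grows like \sqrt{2\ln n}:

# Empirical check that E[max of n standard Gaussians] grows like sqrt(2 ln n) (needs numpy).
import numpy as np

rng = np.random.default_rng(2)
trials = 500
for n in [10, 100, 1000, 10000]:
    samples = rng.standard_normal((trials, n))
    empirical = samples.max(axis=1).mean()
    print(f"n={n}: empirical E[max] = {empirical:.2f}, sqrt(2 ln n) = {np.sqrt(2 * np.log(n)):.2f}")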

Average width of the simplex

Let x_1,\dots,x_{d+1} \in \mathbb{R}^d be the vertices of a regular simplex such that \|x_i\| = 1 for all i \in [d+1]. If \omega \in \mathbb{S}^{d-1} is chosen uniformly at random, the expectation of the difference \max_{i,j\in[d+1]} |\omega^{\mathsf{T}}(x_i-x_j)| is called the average width of the simplex. We can bound this up to a constant factor using our knowledge of Gaussians. Let H_t := \{y\in\mathbb{R}^d : \omega^{\mathsf{T}}y = t\}. The (d-2)-dimensional volume of H_t\cap \mathbb{S}^{d-1} is (1-t^2)^{(d-2)/2} times the volume of \mathbb{S}^{d-2} by Pythagoras’ theorem. Recalling that (1+1/\lambda)^\lambda \approx e, you can deduce that the distribution of \omega^{\mathsf{T}}x_i is approximately N(0,1/d). The Gaussian tail bound now says that the average width of the simplex is O(\frac{\sqrt{\ln d}}{\sqrt d}).
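For completeness, here is a Monte Carlo estimate of the average width (my own addition, assuming numpy; I embed the simplex in the hyperplane of \mathbb{R}^{d+1} orthogonal to the all-ones vector):

# Monte Carlo estimate of the average width of the regular simplex with unit-norm
# vertices, compared to sqrt(ln d / d) (needs numpy). The vertices are the centered
# and normalized standard basis vectors of R^(d+1); random directions are drawn
# uniformly from the d-dimensional hyperplane containing the simplex.
import numpy as np

rng = np.random.default_rng(3)
for d in [10, 100, 1000]:
    ones_dir = np.ones(d + 1) / np.sqrt(d + 1)
    vertices = np.eye(d + 1) - 1 / (d + 1)
    vertices /= np.linalg.norm(vertices, axis=1, keepdims=True)

    widths = []
    for _ in range(200):
        omega = rng.standard_normal(d + 1)
        omega -= omega.dot(ones_dir) * ones_dir   # restrict to the simplex's hyperplane
        omega /= np.linalg.norm(omega)
        projections = vertices @ omega
        widths.append(projections.max() - projections.min())
    print(f"d={d}: average width ~ {np.mean(widths):.3f}, "
          f"sqrt(ln d / d) = {np.sqrt(np.log(d) / d):.3f}")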

Two conflicting concepts

185 words

Sometimes you hear a word or concept that changes how you look at the world. For me, these include speciesism and epistemic injustice.

Speciesism is analogous to racism and sexism, but for species: treating another being differently because they are of another species. Speciesism is about intent; if you eat chickens because they are chickens and not humans, that is speciesist, but if you eat chickens because you concluded from observation that they are incapable of suffering, that is not speciesist.

Epistemic injustice is when someone is wronged in their capacity as a knower. If you unjustly limit somebody’s ability to access or express knowledge, like forbidding them from learning to read or speak, that is an epistemic injustice.

I am an outspoken anti-speciesist and I think we should do what we can to prevent epistemic injustice in all forms. But some animals have learned enough language to meaningfully communicate with humans. Does that mean I should find it reprehensible that there are no schools for animals? I think I should and I think I do, but I feel hesitant to firmly claim the position.

A case for donation splitting

703 words

TLDR: if welfare compounds then risk-aversion is good.

Within EA circles, the question of splitting donations pops up every once in a while. Should you donate all your money to the single top-rated charity in your top-rated cause area, or is there reason to split your donations between various causes or interventions?

People other than me have written and talked about this under various headers; I’ll list a small subset:
- Reasons not to diversify (Giving What We Can)
- Reasons to diversify: the value of information, explore vs exploit (Amanda Askell @ 80k)
- Reasons both for and against: risk aversion, diminishing returns, EV maximization (Slate Star Codex)
- An in-depth blog post with many arguments both for and against (EA forum)

Not listed but probably talked about before: splitting your donations gives you extra practice at donating, which might lead to you making better donation decisions in the future.

In this post I want to make an argument in favour of splitting donations based on compounding economic returns and measurement error. Specifically, compounding returns favour consistent growth over slightly higher but more variable growth.

Let’s consider a 100-year time horizon. Suppose that there are 100 charities, C_1,\dots,C_{100}, whose effectiveness is heavy-tailed: donating $1000 to charity C_i allows them to produce i*\$1000 in welfare after a year. Charity evaluator BestowCapably measures the effectiveness of every charity C_i every year j and finds an effectiveness of i + s_{i,j}, where the s_{i,j} are independent N(0, \sigma^2) random variables. Let’s assume BestowCapably’s measurement error \sigma does not go down over time.

The way I think of these quantities is that effectiveness is a heavy-tailed distribution and that measurement error is multiplicative (instead of additive).

We assume all welfare gains are reinvested in charity the next year, so that the gains compound over years. The initial welfare is 1. We consider three different donation strategies: donate everything to the single best rated charity, split the donation between the top three rated charities, or split the donation between the top ten rated charities. We plot the compounded welfare after 100 years versus \sigma below.

In the above graph, we see that, for low measurement error, donation splitting is worse than donating everything to the best charity, but for high measurement error, the situation reverses and splitting donations wins out.

Section of doubt

The code I’ve used (included below) to simulate the scenario has a couple of researcher degrees of freedom. It is unclear whether measurement error should scale with charity effectiveness. I used Gaussian noise without any justification. The range of \sigma I plot was chosen to give a nice result. The range of charity effectivenesses has close to no justification. The same stable result can be gotten by donating everything to AMF and nothing to speculative cause areas. The splitting incentive I illustrated only holds at the margin, not for the average donation. Because \sigma is fixed, the magnitude of the effect of donation splitting in this model depends heavily on the number of charities (fewer charities means a greater effect).

Nonetheless, if you care about multi-year impacts, it might be wise to consider more than just short-term expected value. Risk-aversion translates to expected counterfactual impact when results compound.

Appendix: Python code

import random
import matplotlib.pyplot as plt
import math

charitycount = 100
yearstocompound = 100

# The charities are {1,...,n}.
# Charity i has true effectiveness i.
# Each effectiveness measurement carries Gaussian noise with standard deviation stddev.
# Outputs a list of tuples (i, i + noise).
def measurecharities(n, stddev):
    charities = []
    for effectiveness in range(1, n+1):
        charities.append((effectiveness, random.gauss(effectiveness, stddev)))
    return charities

# Given a list of tuples (x, y),
# calculates the average of the x's for
# the k tuples with the highest y value.
def avgtop(measurements, k):
    sortedlist = sorted(measurements, key=lambda tup: tup[1], reverse=True)
    total = 0.0
    for i in range(k):
        total += sortedlist[i][0]
    return total / k

# Split donations among k charities
for k in [1,3,10]:
    x = []
    y = []
    # We plot the effect for different noise magnitudes
    for stddev in range(1,251):
        logwelfare = 0.0
        for i in range(yearstocompound):
            welfaregain = avgtop(measurecharities(charitycount, stddev), k)
            logwelfare += math.log(welfaregain)
        x.append(stddev)
        y.append(max(1,logwelfare))
    plt.plot(x, y,label=k)
plt.legend()
plt.xlabel('Error in measuring effectiveness')
plt.ylabel('Log(' + str(yearstocompound) + '-year compounded welfare gains)')
plt.title('Donating to top k out of ' + str(charitycount) + ' charities')
plt.show()