Is the economy linear programming or is it gradient descent?

444 words

This will be a bad take, but also a fun one. There will be no thinking before typing, and no second draft.

There is a good tradition of people knowing that in a functioning economy, prices for goods are just the Lagrange multipliers of the constraints of that economy. Many people are aware of this fact, and many more would make much better decisions if they knew it. You can learn this from any good text on linear optimization, such as The Best Uses of Economic Resources by L.V. Kantorovich (never mind that this text is from 1965, or that more recent textbooks don’t emphasize this view much). No good text will completely ignore this perspective though; often an example of a farmer or the diet problem is discussed.

The linear optimization perspective aligns well with the typical Effective Altruism mindset: prices represent exchange value, so to be good we should allocate our money to maximize the value produced per dollar spent. If a thing is very cost-inefficient, don’t spend anything on it. If everyone behaved that way, we would collectively solve the optimization problem that is our economy. The result would be an economy producing the most utility possible.

The convex optimization perspective on the economy contrasts with a more recent view implicitly held by many people: everything is gradient descent, the best algorithm is always gradient descent. Machine Learning? Gradient descent. Bandit problems? Gradient descent. Maximum flow? Gradient descent. Online algorithms? Gradient descent!

If prices carry gradient information (actually, the inverse Hessian times the gradient would probably be a closer match, but who cares about telling a coherent story; not me), then we should spend money on a good in proportion to one over its price. If everyone spends their money that way, we will descend our loss function as quickly as possible, arriving at the optimal, utility-maximizing economy in little time.
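To make the contrast concrete, here is a toy sketch of the two spending rules. The goods, prices, and budget are invented for illustration, and equal marginal utility per unit of each good is assumed:

```python
# Toy comparison of the two spending rules. Goods, prices, and the
# budget are made up; assume equal marginal utility per unit of good.
prices = [1.0, 2.0, 4.0]
budget = 12.0

# "Lagrange multiplier" reading: prices are exchange values, so put
# the entire budget on the most cost-effective good.
lp_style = [budget if p == min(prices) else 0.0 for p in prices]

# "Gradient" reading: spend on each good proportionally to 1/price.
weights = [1.0 / p for p in prices]
gd_style = [budget * w / sum(weights) for w in weights]
```

Under the first rule all money chases the cheapest source of utility; under the second, even the expensive good gets a slice.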

If prices are Lagrange multipliers, you should be frugal and spend most of your money on bed nets and stuff. If prices are gradient thingies, you can and must buy some of that expensive tasty vegan white chocolate. I just did exactly that. Yum yum yum.

Edit: How did I mess up this post so badly? This Lagrange multipliers versus gradient descent argument is a silly red herring. The real fun would have been in comparing a gradient descent based interpretation of prices with a mirror descent based interpretation. It wouldn’t be too hard to shitpost a legitimate utilitarian argument based on that, arguing that it is everyone’s moral duty to learn the difference between gradient descent and mirror descent and to apply this knowledge in everyday life. Ah well. Writing prompt for whoever reads this.

Browsers leaking previously visited pages?

107 words

I have this WordPress plugin called Statify running. It produces a pretty visitor count graph, a list of which pages were visited today, and the referer addresses of visitors. The strange thing is that these sometimes contain meaningful addresses that do not look like spam, but belong to pages that in no way link to me. One example is today’s list:


None of the 4 listed pages seem to contain a link to my blog. I am guessing that some browser reports a referer when it shouldn’t, but I am not sure.

AGI and the intelligence scale

223 words

The AI-Safety-as-an-effective-cause hypothesis has already been soundly refuted by others, but I think it never hurts to show more disdain for this self-recursive Bayesian superintelligence nonsense. With that in mind, let’s talk about Bostrom’s awful book. Specifically, this graph:

We see a line starting at the left of the image and continuing on to the right. At 8/60 of the way from the left, a marking says "Mouse". At 11/60, "Chimp". At 14/60, "Village idiot". At 14.3/60, "Einstein". The line continues a long way to the right, indicated to go on for a long time past the end of the image. Caption of the image says: "Figure 8. A less anthropomorphic scale? The gap between a dumb and a clever person may appear large from an anthropocentric perspective, yet in a less parochial view the two have nearly indistinguishable minds.[9] It will almost certainly prove harder and take longer to build a machine intelligence that has a general level of smartness comparable to that of a village idiot than to improve such a system so that it becomes much smarter than any human." The "vill" in "village" is highlighted because I found the correct spot in the book through searching "vill" by ctrl+f

There are two claims embedded in here. One is “when you are past village idiot level, then Einstein level is very close”. The other is “there is a lot of room beyond Einstein level”. I would argue that both are preposterous, in particular if you define intelligence by some kind of problem-solving ability.

For this post I want to focus on the empirical prediction made by these two claims. Take any task for which “AI” is better than unskilled humans, for example a game. Look at the growth of the rating over time. The claims predict that there will be very little (development/training) time between an AI becoming competitive with unskilled humans and surpassing the very best humans. Proof by picture:

Plot with time on x-axis, skill on y-axis, and horizontal lines marking "best humans" and "worst humans" close together. A squiggly diagonal line, increasing, is marked "AI", with vertical lines denoting where it crosses the "best humans" and "worst humans" mark. The space between the vertical lines is small and marked as "very short time".

Does that prediction hold up? Let’s look at everything DeepMind and OpenAI did. Go? Nope. Starcraft? Nope. Rubik’s cube? Nope. Any other shitty thing? Nope. The prediction failed just as badly before the neural net hype, considering chess and Go and Age of Empires and any other game you could play against the computer.


Bostrom is talking out of his ass.

Bayes’ rule is full of shit

273 words

We all know the type who think Bayes’ rule is the key to rational thought. They are talking out of their ass. This short note intends to explain why I say that, but without any trite arguments and just focusing on the computer science. In short, because Bayes’ rule is an information-theoretic concept without computational content.

For one, conditional probability distributions are not computable. Of course, you say, no sane person would limit themselves to only computable endeavors? And moreover, that result only applies when there is no noise, i.e., in the realm of spherical cows! Your objection is valid, but we have to note this theoretical impossibility result for the sake of completeness.

The real issue occurs in real-world complexity classes. Random SAT/UNSAT is hard in the appropriate density range (I have trouble finding a specific reference, but anyone working on SAT solvers will tell you this). So if I pick a random SAT instance from a probability distribution known to both of us, and I tell you the bits encoding the instance one by one, your Bayesian updates on the probability that my instance is satisfiable will be hard as well. Both in theory (this is a slightly stronger claim than just \mathcal{P} \neq \mathcal{NP}; if I recall correctly, ACM SIGACT News Volume 50 Issue 1 or 2 will tell you that everyone believes this hardness statement) and in practice. No matter how much computing power you have, I can easily sample SAT instances that make your Bayes’ rule calculations suffer.

This hardness carries over to everyday problems. Bayes’ rule is typically insufficient to tell you the answers you are looking for.

Economic systems are all the same when you take out the stupid parts.

946 words

I’ve got a brilliant hot take here. Consider the following:

  • Capitalism but without money
  • Communism but without central planning
  • Kantorovich-style dual certificates but without being a dumb-ass

In this post I will argue that each of these three is better than its normie counterpart, and that they are all the exact same thing.

Capitalism but without money

Capitalism has a lot of good things going for it. In particular, it is a viable way of allowing agents to communicate what they consider worthwhile doing and of communicating to agents how they can make themselves of use. I am not saying that its current instantiation is super good at those things, just that putting prices on things is a way for society to collaborate.

However, as the second fundamental theorem of welfare economics tells us, there are a bunch of free variables in capitalism that could be set any which way: how much money does every agent have? Most settings of these free parameters will lead to sub-optimal outcomes, so under an assumption of naturalness we should look for a variant that is unique in its class. I argue that we get this by removing money.

Specifically, let’s get rid of money but keep prices. Prices will no longer reflect what you pay, just how much effort and resources were needed to procure it. When you “buy” one product instead of another, you are communicating which things are worth the effort to make and which are not.

Let’s take my grocery trip from today as an example. I bought cheap bread and expensive fake chicken pieces, even though in both categories the more expensive option is the better one. I have plenty of money, so I could have easily bought expensive bread and expensive fake chicken. But I think the expensive bread is much more expensive yet only a little better, while the expensive fake chicken is a lot better. By buying what I bought, I communicate what I think is worth the effort/resources and what is not. I would have bought the same if I were a millionaire, because money was no part of my purchasing decision. Only the prices matter.

Why would capitalism without money be better? For one, some people are too poor to buy healthy food, which is a huge loss of welfare. For two, billionaires will literally buy private yachts and £250,000 watches with useless tourbillons. These are clear examples of money causing resources to be spent badly, and there are many less visible examples as well.

Communism but without central planning

I won’t say too much about this. Communism was a disaster for many reasons. Economies are not legible in a top-down fashion. Even if they were, a central bureau couldn’t obtain all necessary data in time. Central planning is inherently undemocratic. We need a decentralized way of arranging the economy, led largely by consumers and built on the assumption that people can decide for themselves what they find important or worthwhile.

Kantorovich-style dual certificates but without being a dumb-ass

Kantorovich was a Soviet mathematician who invented the discipline of linear programming for the sake of planning the Soviet economy.

For those who are unaware, a linear program is an optimization problem that can be expressed as \max c^{\rm T} x, ~{\rm s.t.}~ Ax \leq b, with given c \in \mathbb{R}^d, b \in \mathbb{R}^n, A \in \mathbb{R}^{n \times d}, and variables x \in \mathbb{R}^d. That is to say, one aims to maximize a linear function \sum_{i=1}^d c_i x_i of variables x_1,x_2,\dots,x_d subject to linear inequalities of the form \sum_{i=1}^d a_{ji}x_i \leq b_j for j = 1,2,\dots,n. This is a very powerful framework for optimization and is used in many practical applications.

Any linear program allows for a dual program, given by \min b^{\rm T} y, ~{\rm s.t.}~ A^{\rm T}y = c, y \geq 0. The optimal solutions to these two programs have the same objective value, and each tells you a lot about what an optimal solution to the other looks like.

Kantorovich basically considered x_1,\dots,x_d to be numbers describing the economy: what goods should be produced and where. The linear constraints Ax \leq b represent the constraints of the economy: how many workers, resources and factories are there and what are their capacities. The objective function c^{\rm T}x encodes the goal of maximizing the welfare of the Soviet citizens. Thus, the optimal solution to this program would tell you the optimal way the Soviet economy could be run.

So far so noble. The dual linear program to Kantorovich’s optimization problem represents market prices. Its solution tells you the exchange value of different goods. If you know the correct exchange values, that will roughly tell you what the rest of the economy should look like, and if you know the shape of the economy, you know what the exchange values are.

This is where it got political for Kantorovich. In an optimal dual solution, every primal constraint that is not tight gets dual value 0. The Soviet economy had surplus labor (it was constrained by other factors), so Kantorovich put the value of labor at 0. The Party thought this flew in the face of Marx’s teachings, and it is a small miracle that Kantorovich survived.

But this view of prices as dual solutions makes a lot of sense. Once you leave Kantorovich’s worst simplifying assumptions, no price will be exactly 0. At that point, you get into territory looking like interior-point methods (a class of algorithms for solving linear programming problems), in which your dual solution (prices) tells you how to improve your primal solution (production and consumption of goods and services), and your primal solution tells you how to improve your dual solution.
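The primal/dual picture can be poked at numerically. Here is a minimal sketch using scipy and a made-up two-goods, two-resources economy; the sign flip on the marginals is needed because linprog minimizes:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up economy: two goods, two resources (labor and material).
# maximize c^T x  subject to  A x <= b,  x >= 0
c = np.array([3.0, 5.0])           # welfare per unit of each good
A = np.array([[1.0, 2.0],          # labor used per unit of each good
              [3.0, 1.0]])         # material used per unit of each good
b = np.array([10.0, 15.0])         # available labor and material

res = linprog(-c, A_ub=A, b_ub=b, method="highs")  # linprog minimizes
production = res.x                 # primal solution: what to produce
prices = -res.ineqlin.marginals    # dual solution: price of each resource
```

Here production comes out as [4, 3] and prices as [2.4, 0.2]: one extra unit of labor would buy 2.4 extra units of welfare, which is exactly the exchange-value reading of the dual solution.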

A lot of people are the smartest person in their country

556 words

Definition For a point set S \subset \mathbb{R}^n and a point x \in S, we say that x is Pareto-optimal if, for every y \in S with y \neq x, there exists an index i \leq n such that x_i > y_i.

Exercise For P any probability distribution on \mathbb{R}^n where the different coordinates are independent random variables and independent x_1,\dots,x_m \sim P, the expected number of Pareto optimal points among them is \sum_{i_1 = 1}^m \frac{1}{i_1} \sum_{i_2 = 1}^{i_1} \frac{1}{i_2} \cdots \sum_{i_{n-1}=1}^{i_{n-2}} \frac{1}{i_{n-1}}. (hint: induction on n)

Principal component analysis is a standard tool for anyone who wants to do quantitative science. You start with a pile of data points that vary along different axes, find the axis of most variation in the data, find the axis of second most variation, and so on. Bonus points if afterwards you pronounce that this statistical ghost is a real and meaningful quantity. See: Spearman’s g, big-five personality traits, among others.

If your data is normally distributed, and let’s face it, all data is normally distributed if you squint enough, then the principal components are exactly the eigenvectors of your covariance matrix. That is, when you re-parametrize your data so that the principal components align with the coordinate directions, the coordinates of any data point are independent random variables.

The different quantities resulting from a principal component analysis are always orthogonal directions. The first principal components are the axes of largest variation in the data set (the single component of Spearman’s g, the five personality traits), but they don’t come equipped with any sense of being the “most important”. Just the biggest variance, which is a notion that is well-defined only when you choose the units of the original coordinates well. Therefore, we cannot say that one of the components is “the real one”, i.e., we cannot impose a meaningful linear ordering of quality after calculating the principal components.

Anyone can tell you that intelligence varies along different dimensions. Easy example: some people are good at math but bad with languages, for others it is the other way around. How many distinct directions are there? I don’t think it is a stretch to think there are at least 11 different ones. Turning the exercise into a computation, we can now count the expected number of Pareto-optimal points for given m, n.

def initialize(i):
    # tab[k-1] = (k, 1/k): the terms of the innermost sum
    tab = []
    for k in range(1, i + 1):
        tab.append((k, 1.0 / k))
    return tab

def iterate(tab):
    # Peel off one nested sum: replace each value by the prefix sum of
    # the values so far, divided by the entry's index.
    s = 0.0
    for i in range(len(tab)):
        s = s + tab[i][1]
        tab[i] = (tab[i][0], s / tab[i][0])
    return tab

def final(tab):
    out = 0.0
    for (k, value) in tab:
        out = out + value
    return out

def count(m, n):
    # Expected number of Pareto-optimal points among m points in n
    # dimensions, via the nested-sum formula from the exercise.
    if n == 1:
        return 1
    table = initialize(m)
    while n > 2:
        table = iterate(table)
        n = n - 1
    return final(table)

In our intelligence interpretation, this counts how many people can be said to be “the smartest”, in the sense that nobody else is smarter in all 11 different ways. That is, we can call someone “the smartest” if their location in intelligence space is Pareto-optimal among all people.

The above code needs memory linear in m, but count(4882495, 11) = 427462.8 is well within reach, where 4882495 is the population count of Ireland. Making the appropriate division, we conclude that one in every 11.4 Irish people is the smartest person in Ireland.
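As a sanity check on the formula, one can also estimate the expected number of Pareto optima by brute-force simulation. The parameters below are arbitrary small values; for m = 50 points in n = 3 dimensions the nested sum from the exercise evaluates to roughly 10.93:

```python
import random

def pareto_count(points):
    # A point is Pareto-optimal if no other point strictly beats it
    # in every coordinate.
    def dominates(q, p):
        return all(qi > pi for qi, pi in zip(q, p))
    return sum(1 for p in points
               if not any(dominates(q, p) for q in points if q is not p))

# Monte Carlo estimate for m = 50 i.i.d. points in n = 3 dimensions.
random.seed(0)
m, n, trials = 50, 3, 500
est = sum(pareto_count([tuple(random.random() for _ in range(n))
                        for _ in range(m)])
          for _ in range(trials)) / trials
```

With continuous coordinates, ties have probability zero, so the strict-inequality definition from the post applies directly.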

Discrete optimization

129 words

Professor: Welcome back to Algorithms class. Today, we will cover Dijkstra’s shortest path algorithm, the core of all navigation software. Next week’s assignment is to write code to compute shortest paths in a number of real world instances.

Professor: Suppose we have a road network described by a graph with n vertices and m edges and a vector c \in \mathbb{R}^E assigning a travel time to every edge. We want to minimize total travel time, and –

Student: I have a question. What if we want to take other things into account, like finding a simple route with few turns?

Professor: The formalism allows for that by changing the objective and graph. Anyway, so we want to minimize travel time and the first step is to [blah blah blah]
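For the curious, a minimal sketch of the algorithm the professor is about to cover, using a binary heap; the graph representation here is my own choice, not the course’s:

```python
import heapq

def dijkstra(adj, source):
    # adj: dict mapping each vertex to a list of (neighbor, travel_time)
    # pairs; travel times must be non-negative.
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist
```

Encoding “few turns” in this formalism would mean modifying the edge weights or expanding the graph, exactly as the professor says.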

Operations Research

98 words

Professor: Welcome back to Operations Research class. Last week we discussed the mathematical fundamentals of optimization theory, and this week we will see how to apply it to practice. Suppose we have n machines and m employees working on k different products. We want to maximize total profit, and –

Student: I have a question. What if we want to take other things into account, like work satisfaction among employees?

Professor: The formalism allows for that by changing the objective and constraints. Anyway, so we want to maximize profit and the first step is to [blah blah blah]

Case study in the politics of mathematics: optimality of Nash equilibria

580 words

Back in the Soviet era, Leonid Vitaliyevich Kantorovich got himself in trouble. He developed a theory of optimization problems, originally for the sake of optimizing plywood production, but he saw a potential for using his techniques to optimize the entire Soviet economy. So he sent letters to the central planning bureau to convince them to make use of his ideas.

He has been involved in advanced mathematical research since the age of 15; in 1939 he invented linear programming, one of the most significant contributions to economic management in the twentieth century. Kantorovich has spent most of his adult life battling to win acceptance for his revolutionary concept from Soviet academic and economic bureaucracies; the value of linear programming to Soviet economic practices was not really recognized by his country’s authorities until 1965, when Kantorovich was awarded a Lenin prize for his work.

Excerpt from original CIA file on Kantorovich.

There was just one problem, arising from the theory of linear programming duality. For any linear optimization problem, we can derive a dual problem. If you solve both the primal and the dual problem, they turn out to have the same optimal objective value, and moreover the optimal solution to the one certifies optimality of the solution to the other. These “dual solutions” can have clear interpretations. In the case of optimizing resource allocations, the dual solution can be interpreted as market prices.

LP duality theory connects the notion of optimal resource allocation as used for central planning with the efficient market hypothesis and Nash equilibria. This connection can be interpreted as “capitalist markets find the optimal allocation of goods and services”. Obviously, the communists did not like that.

This interpretation has been popular in the US as well, where it is seen in light of the slightly weaker “first welfare theorem” and Smith’s invisible hand.

Of course there are numerous solid arguments for why these interpretations are bogus. To name just a few: markets are not convex, barriers to market entry are non-zero, humans are not perfectly rational nor omniscient, computation and communication are neither free nor instantaneous, negative externalities are common, the dual LP only takes prices into account and not net cash flow of individuals, and the “welfare function” doesn’t necessarily exist.

All of this was known 50+ years ago, but apart from the 1950 cutesy example of the prisoner’s dilemma, game theorists didn’t do much of note to dispute the notion that Nash equilibria are typically good. Nothing really got formalized until 20 or so years ago with the introduction of the price of anarchy. The price of anarchy (stability) of a game is the ratio of social welfare in the optimal solution versus the social welfare in the worst (best) Nash equilibrium. The PoA and PoS can be unboundedly bad, and recent decades have seen a lot of exciting work happening here.
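The canonical worked example is Pigou’s two-link network, a sketch of which fits in a few lines (the latency functions are the textbook ones, nothing library-specific):

```python
# Pigou's example: a unit mass of traffic goes from s to t over two
# parallel links. Link 1 has constant latency 1; link 2 has latency
# equal to its own load x.
def total_cost(x):
    # x is the fraction of traffic taking link 2
    return (1.0 - x) * 1.0 + x * x

# Nash equilibrium: link 2 is never worse (its latency x <= 1), so
# everyone takes it and x = 1.
eq_cost = total_cost(1.0)

# Social optimum: minimize (1 - x) + x^2, attained at x = 1/2.
opt_cost = total_cost(0.5)

price_of_anarchy = eq_cost / opt_cost  # 4/3
```

Selfish routing loses a third of the achievable welfare even in this two-edge toy, which is the kind of gap the PoA literature quantifies.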

Mathematics is political. When we want to apply mathematical theorems to the real world, we need to make certain simplifying assumptions. Or actually, if we want to think any thoughts about the real world, but whatever. The simplifications we make determine the theorems we can prove. We can work on theorems that “prove” that capitalism is good or we can work on theorems that “prove” that capitalism is bad. Both would be valid mathematics, but both individuals and communities can decide to value one over the other, resulting in undergrads being taught either “10/10 scientists agree: capitalism is good” or “beware: capitalism can be arbitrarily bad”.

What does it mean to call something a social construct?

430 words

Many explanations of the term “social construct” are severely lacking. Let me explain what I personally mean when I call something a social construct. Largely, it means that we could have done it differently.

Gender is a social construct. Our ideas of which things are for men and which are for women have differed over place and time. Little boys used to wear dresses, women used to be considered mentally unfit to own property, gender nonconforming people (like women wearing pants) used to be shunned by their community. And that is just Western countries over the past century, without considering different millennia or cultures. We could change our current notions of gender without the gods coming down to interfere with our ambitions.

Money is a social construct. By choosing to redistribute or not, we shape the role of money and the way it influences the rich and poor. Same for (dis)allowing usury, theft or private ownership of the means of production, to name a few socially constructed aspects of private property and thus of money. Every facet of money is constructed through a social process, but the resulting structure can impact a person’s life as much as the law of gravity.

Mathematics is a social construct (if you’re a platonist, read this as “our practice and understanding of mathematics”). I refer to the mathematics we know, not the mathematics that can be. There is no linear “tech tree” forcing people to develop Euclidean geometry before graph theory or predator/prey models. No law of nature tells pure mathematicians to abhor the thought of their work finding practical use. Instead of sorting computational problems into various Turing degrees (computable problems have the lowest Turing degree, the halting problem a higher one, and there are infinitely many other Turing degrees, some higher or lower than the halting problem and some incomparable), we could have been doing something useful instead. Peer reviewers use their own judgement to decide whether a result is novel, interesting and proven with enough rigor, no objective measurement devices involved. The choice of what is Real Mathematics and what is not is similarly arbitrary.

I tell people that a thing is socially constructed when I want to remind them that it could be different. When they’re talking about the thing as if it is a fixture of life and I want to invite them to imagine how it could be changed. Because no matter your philosophy of mathematics or external reality, the way we understand and value various ideas always differs among individuals and communities.