Is a PhD right for me?

575 words

When I was almost finished with my studies, I thought for a long time about whether I should do a PhD. On the one hand, I was in actual love with my master’s research topic and I had the opportunity to continue right where I left off. On the other hand, PhD students lead a stressful life of imposter syndrome and are paid a pittance.

I spent quite some time searching for articles and videos with titles like “why I quit my phd” to see whether any of the authors’ feelings and thoughts sounded like things I might think myself. That was a useful exercise, and one I would recommend to any prospective PhD student.

Right now, I am two years into my PhD. ((I have heard multiple colleagues say that this is the worst part: long enough in to think you should have visible results by now, but not long enough to actually have them.)) The title question remains. Here is a list of things that, had I known them, would have allowed me to make a more informed decision.

  • Imposter syndrome starts the day you’re hired and won’t stop for many years to come.
  • It is normal to fantasize about quitting every so often.
  • You will at some point realize how much more money you would make in industry. That is how much you’re effectively paying to do a PhD.
  • You will like your job less than you expected beforehand.
  • It is probably better to drown your sorrows in online procrastination than in substance abuse.
  • Free weekends will be rare, or at least you will feel guilty about them. Other people will always seem to be working more than you. ((This is partly sampling bias: you do notice when your advisor sends you an email at 11pm on a Sunday, but you don’t notice when she doesn’t.))

    There are good reasons for working outside office hours. Using your brain for 8 hours straight is hardly possible. Working less during the day and then a bit in the evening or on weekends is nicer.

    Also, publish-or-perish is a terrible incentive structure, and that does play a part.

  • Depending on how good you are at making friends, your first conferences and workshops will be sad and lonely and exhausting.
  • Your sense of self-worth will become tied to your sense of how your research is going.
  • When your proofs or experiments are not working, you will feel frustrated and miserable for days/weeks/months on end.
  • When your theories and data do start working, the world will shine, your heart will feel like it is made of clouds, and it will be blissful. For an hour or so, until you find the irrecoverable bug in your proof.
  • Academia is a scam. The fraction of your time reserved for research is a strictly decreasing function of time. Research directions are chosen because they are uncrowded and occasionally interesting, not because the outcomes would have any relevance to the world at large. Every single postdoc is sad and lonely because they have to move long distances every year or so, preventing them from having any friends.

That said, getting to do a PhD is undoubtedly a privilege. While the whole world is burning, trillions of non-human animals are brutally slaughtered every year for needless human consumption, and 815 million humans are too poor to afford enough food, you get to play a part in the Unidirectional March Of Human Progress™ by spending four years in your safe, rich bubble thinking about a useless problem that Erdős once mentioned. Be sure to enjoy it.

Beyond Condorcet winners: Majority Judgement

631 words

There have been rumours on Wikipedia that Michael Balinski passed away this February 4th. If this is indeed the case, then my heartfelt sympathies go out to those around him. I never knew him personally, but he was an amazing scientist and I’ve used results of his more than once. He was quite a phenomenal speaker.

Today I want to talk about a single-winner voting system that Balinski introduced with Rida Laraki. It is called majority judgement and it is so brilliant that it almost makes me wonder what voting theorists could have been doing both before and since then.

One big concept in social choice theory is majority rule: if most people think that A is better than B, then A should be the winner. Most multi-candidate voting systems generalize this in various ways, always preserving that if candidate A would beat every other candidate B, C, etc., in a pairwise competition, then A should win the election. If candidate A satisfies this criterion, we call A a Condorcet winner. The leading wisdom in social choice theory was that any good voting rule should let the Condorcet winner win (if one exists).

According to my informal sample from the Effective Altruism community, EA’s favourite voting system seems to be approval voting, which is one of these voting systems that generalizes majority rule to multiple candidates.

The genius of majority judgement is that it moves past the Condorcet winner paradigm altogether.

To illustrate, let’s assume we’re running an election between two candidates, the red candidate and the blue candidate, and every voter gets a ballot offering, for each candidate, the options “Amazing”, “Good”, “Mediocre” and “Terrible” to express how good of a president they think that candidate would be. Let us for simplicity assume that our voting population consists of three groups, the A’s, B’s and C’s, and that everyone in a group votes the exact same way. The outcome of the election is in the table below.

                   A          B          C
% of population    40         20         40
Red candidate      Amazing    Mediocre   Terrible
Blue candidate     Good       Terrible   Amazing

The red candidate wins the Condorcet-style election: 60% of the population prefers the red candidate over the blue candidate. But the blue candidate is clearly better: 80% of the population considers the blue candidate to be “Good” or better, while 60% of the population considers the red candidate to be “Mediocre” or worse.

Majority judgement is a voting rule designed to have the blue candidate win in the above election. The actual algorithm is a bit involved, but the first step is to compare the median votes: if at least 50% of the voters think the blue candidate is “Good” or better, and at least 50% of the voters think that the red candidate is “Mediocre” or worse, then the blue candidate wins. If the median opinions are tied, a more complicated tie-breaking rule kicks in. ((The exact rule satisfies a number of optimality criteria and is the only rule to do so. For this post I want to skip the details.))
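
To make that first step concrete, here is a minimal sketch in Python of the median-grade comparison on the example above. It covers only the first step: the real tie-breaking procedure (repeatedly removing median grades) is more involved, and the lower-median convention used here is an assumption of this sketch.

GRADES = ["Terrible", "Mediocre", "Good", "Amazing"]  # worst to best

def median_grade(votes):
    """votes: list of (grade, % of population) pairs; returns the lower median grade."""
    total = sum(share for _, share in votes)
    cumulative = 0
    # walk through the ballots from worst grade to best
    for grade, share in sorted(votes, key=lambda v: GRADES.index(v[0])):
        cumulative += share
        if cumulative >= total / 2:
            return grade

red = [("Amazing", 40), ("Mediocre", 20), ("Terrible", 40)]
blue = [("Good", 40), ("Terrible", 20), ("Amazing", 40)]

print(median_grade(red))   # Mediocre
print(median_grade(blue))  # Good, so blue wins the median comparison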

I think the concept is very elegant, and I believe that the outcomes really would be better than with a system that elects Condorcet winners. In a talk that Balinski gave, which I was lucky enough to attend, he pointed out another advantage of the majority judgement rule: it allows voters to express what they think of the candidates. You wouldn’t be asking anyone to “vote for the lesser evil”; everyone can keep their conscience clear. Majority judgement admits a clear way of expressing frustration with both candidates: grade both of them very badly. It also helps that the different options are given in words instead of numeric rankings, for the latter turn out to entice voters to rate their favourite candidate 10/10 and all others 0/10.

Funding opportunity in academic publishing

472 words

I’ve recently listened to a talk by Jean-Sébastien Caux, the founder, implementer and chairman of SciPost. It looks like a charity that might be worth funding.

Academic publishing is a complete disaster. The incentive structure is messed up. Papers must have more than a certain minimum of content × coolness, but simultaneously they shouldn’t be too long. This results in researchers sticking two unrelated papers together and calling them one thing, and cutting single papers up into multiple pieces to get published. If your result is too simple it will get rejected, so people make their papers more difficult on purpose. There is no pipeline for communicating minor improvements and comments on other people’s papers to the outside world.

Peer review might not literally be a farce, but it is closer to being one than anyone is really comfortable with. Because it all happens behind closed doors, peer reviews seldom contain constructive feedback, and reviewers most likely harm their own careers by putting time and effort into reviewing that could be spent doing research. People submit everything to the best journals first, trying one rung lower when their paper gets rejected. The review process can take years. Reviews are all hidden from the wider public, so the only way to judge a paper’s quality if you’re not in the field is by looking at citation counts and the journal it appeared in.

Publishers make a lot of profit selling the research community’s own results back to them. Journals and impact factors are silly inventions that date back to the dark ages before the internet existed, and they serve little to no use in modern times.

Enter SciPost. Imagine a love child of the free software movement, open access academic publishing, and modern discussion platform design. Submission is free. A manuscript is public from the moment one of the editors decides it is probably worth getting reviewed, with all the fancy DOI’s and what-not that you could ask for. Both the content and the platform itself are licensed under free licences. Reviews are public, either with the author’s name or anonymously, which turns out to greatly improve review quality. Reviews are citable objects with DOI’s and everything. A number of reviews get invited, but anyone can submit a review if they’d like. People can post comments on reviews. Their funding comes entirely from sponsors. Their average costs per publication are under $400, way less than the article processing fees of most open access journals. They hold themselves to principles of openness way beyond the Fair Open Access Principles.

Right now SciPost publishes papers in physics. They want to expand to other disciplines, but money is the major bottleneck. Over the past 3 years they’ve gotten around $250k in total funding, so the marginal gains from additional funds should be pretty good.

Gaussian tail bounds

812 words

One-dimensional tail bounds

The standard normal distribution N(0,1) has probability density function \frac{1}{\sqrt{2\pi}}e^{-x^2/2}. There is no nice closed form for its antiderivative, but we do at times want to (upper) bound expressions of the form \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2} \mathrm{d}t. How can we do this?

One way is the following. Since t\geq x everywhere on the domain of integration, we can upper bound \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2} \mathrm{d}t \leq \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{t}{x} e^{-t^2/2} \mathrm{d}t = \frac{1}{x\sqrt{2\pi}}e^{-x^2/2}.

There is another tail bound which is a bit weaker for large x, but I like the proof better. We’ll give a tail bound by looking at the moment-generating function \lambda \mapsto \mathbb{E}[e^{\lambda X}], where X \sim N(0,1) is our normally distributed random variable. We can explicitly calculate this expectation and find \mathbb{E}[e^{\lambda X}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{\lambda x - x^2/2}\mathrm{d}x = \frac{1}{\sqrt{2\pi}}e^{\lambda^2/2}\int_{-\infty}^\infty e^{-(x-\lambda)^2/2}\mathrm{d}x. The last integral is just the entire Gaussian integral shifted a bit, and hence \mathbb{E}[e^{\lambda X}] = e^{\lambda^2/2}. Now we use Chernoff’s bound (an easy corollary of Markov’s inequality) to find \mathbb{P}[X \geq t] \leq \mathbb{E}[e^{\lambda X}]e^{-\lambda t}, which we minimize over the choice of \lambda by setting \lambda=t, and we conclude that \mathbb{P}[X \geq t] \leq e^{-t^2/2}.
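
As a quick sanity check, here is a small numerical comparison of the two bounds against the true tail probability, using scipy’s survival function (a sketch; the test points are arbitrary choices).

# Numerically compare the true Gaussian tail with the two upper bounds above.
import numpy as np
from scipy.stats import norm

for x in [1.0, 2.0, 3.0, 5.0]:
    tail = norm.sf(x)                                       # P[X >= x]
    bound1 = np.exp(-x**2 / 2) / (x * np.sqrt(2 * np.pi))   # integration trick
    bound2 = np.exp(-x**2 / 2)                              # Chernoff bound
    print(f"x={x}: tail={tail:.2e}  bound1={bound1:.2e}  bound2={bound2:.2e}")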

Multi-variate Gaussians

Let X \in \mathbb{R}^d be N(0,I_d) normally distributed, i.e., X is a vector with iid Gaussian N(0,1) entries. What tail bounds do we get on \|X\|? We start off with Markov’s inequality again: for any 0 < \lambda < 1/2, \mathbb{P}[\|X\| > t] = \mathbb{P}[e^{\lambda\|X\|^2} > e^{\lambda t^2}] \leq \frac{\mathbb{E}[e^{\lambda\|X\|^2}]}{e^{\lambda t^2}}.

Deriving the moment-generating function \lambda \mapsto \mathbb{E}[e^{\lambda X_1^2}] of a single squared coordinate is an elementary calculation: for \lambda < 1/2, \int_{-\infty}^\infty e^{\lambda x^2} \cdot e^{-x^2/2} \mathrm{d}x = \int_{-\infty}^\infty e^{-(1-2\lambda)x^2/2}\mathrm{d}x = \frac{\sqrt{2\pi}}{\sqrt{1-2\lambda}}. Dividing by the normalization \sqrt{2\pi} gives \mathbb{E}[e^{\lambda X_1^2}] = (1-2\lambda)^{-1/2}.

The coordinates of X are iid, so \mathbb{E}[e^{\lambda\|X\|^2}] = \mathbb{E}[e^{\lambda X_1^2}]^d = (1-2\lambda)^{-d/2}. Taking t\sqrt{d} as the threshold, the minimizer is at \lambda=(1-1/t^2)/2, and we find, requiring t \geq 1 for the last inequality, \mathbb{P}[\|X\| > t\sqrt{d}] \leq e^{-d(t^2-2\log t - 1)/2} \leq e^{-d(t-1)^2/2}.
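
A quick Monte Carlo check of this bound (a sketch; the dimension, threshold and sample count are arbitrary choices):

# Compare the empirical tail of ||X|| for X ~ N(0, I_d) with the bound above.
import numpy as np

rng = np.random.default_rng(0)
d, t = 20, 1.2
samples = rng.standard_normal((100_000, d))
norms = np.linalg.norm(samples, axis=1)

empirical = np.mean(norms > t * np.sqrt(d))
bound = np.exp(-d * (t**2 - 2 * np.log(t) - 1) / 2)
print(empirical, bound)  # the bound should dominate the empirical frequency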

Operator norm of Gaussian matrices

The operator norm or spectral norm of a d \times d matrix M is defined as \|M\| := \max_{0 \neq x \in \mathbb{R}^d} \frac{\|Mx\|}{\|x\|}.

Now if M were a matrix with every entry independently N(0,1), what would the largest singular value of this random Gaussian matrix be? I’ll give an easy tail bound based on a net argument.

An \eta-net, \eta > 0, on the sphere is a subset N \subset \mathbb{S}^{d-1} such that for every point x \in \mathbb{S}^{d-1} there is a net element n \in N such that \|x-n\| \leq \eta, but every two net elements are at distance at least \eta from each other. A greedy algorithm can construct an \eta-net, and any \eta-net has size at most (4/\eta)^d. ((See e.g., Jiří Matoušek, Lectures on Discrete Geometry (Springer, 2002), page 314. The proof is based on a simple packing argument: the balls of radius \eta/2 around the net elements are disjoint and fit inside the ball of radius 1+\eta/2 centered at the origin, so comparing volumes gives |N| \leq ((1+\eta/2)/(\eta/2))^d \leq (4/\eta)^d for \eta \leq 2.))

Now let N\subset \mathbb{S}^{d-1} be a 1/2-net. By the above, the size of the net is bounded by |N| \leq 8^d.

The function x \mapsto \|Mx\| is \|M\|-Lipschitz. Hence we can bound \|M\| \leq \max_{x\in\mathbb{S}^{d-1}} \min_{\omega \in N} \left(\|M\omega\| + \|M\|\cdot\|x-\omega\|\right) \leq \max_{\omega\in N} \|M\omega\| + \|M\|/2. So we have now proved that \|M\| \leq 2\max_{\omega\in N} \|M\omega\|.

Now, as M\omega is N(0,I_d) normally distributed for any \omega\in\mathbb{S}^{d-1}, we can use the union bound over all points of N and conclude that, for all t \geq 1, \mathbb{P}[\|M\| \geq 2t\sqrt{d}] \leq 8^d e^{-d(t-1)^2/2}.
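
The largest singular value of a d \times d Gaussian matrix is in fact known to concentrate around 2\sqrt{d}, so this bound has the right order. A quick empirical check (a sketch; d is an arbitrary choice):

# The largest singular value of a d x d Gaussian matrix is close to 2*sqrt(d).
import numpy as np

rng = np.random.default_rng(0)
d = 500
M = rng.standard_normal((d, d))
print(np.linalg.norm(M, 2), 2 * np.sqrt(d))  # spectral norm vs 2*sqrt(d)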

Maximum of n Gaussians

For n independent identically distributed variables X_1,\ldots,X_n \sim N(0,1), the union bound gives \mathbb{P}[\max_{i \leq n} X_i \geq t] \leq ne^{-t^2/2}, and this bound is tight up to a constant factor.

From this, we can find that the expected maximum is \mathbb{E}[\max_{i \leq n} X_i] = O(\sqrt{\ln n}). We derive this by bounding the integral \mathbb{E}[\max_{i \leq n} X_i] \leq \int_0^\infty \mathbb{P}[\max_{i \leq n} X_i \geq t] {\rm d}t. Split the integral at t_0 = \sqrt{2\ln n}, bound the integrand in the first part by 1 (it is a probability) and in the second part by ne^{-t^2/2}, which by the first one-dimensional tail bound integrates to at most ne^{-t_0^2/2}/t_0 = 1/\sqrt{2\ln n}.
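
A small simulation illustrating the \sqrt{2\ln n} growth (a sketch; sample sizes are arbitrary choices):

# Empirical mean of the max of n standard Gaussians versus sqrt(2 ln n).
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1_000, 10_000]:
    maxima = rng.standard_normal((1_000, n)).max(axis=1)
    print(n, maxima.mean(), np.sqrt(2 * np.log(n)))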

Average width of the simplex

Let x_1,\dots,x_{d+1} \in \mathbb{R}^d be the vertices of a regular simplex such that \|x_i\| = 1 for all i \in [d+1]. If \omega \in \mathbb{S}^{d-1} is chosen uniformly at random, the expectation of the difference \max_{i,j\in[d+1]} |\omega^{\mathsf{T}}(x_i-x_j)| is called the average width of the simplex. We can bound this up to a constant factor using our knowledge of Gaussians. Let H_t := \{y\in\mathbb{R}^d : \omega^{\mathsf{T}}y = t\}. The (d-2)-dimensional volume of H_t\cap \mathbb{S}^{d-1} is (1-t^2)^{(d-2)/2} times the volume of \mathbb{S}^{d-2}, by Pythagoras’ theorem. Recalling that (1+1/\lambda)^\lambda \approx e, you can prove that the distribution of \omega^{\mathsf{T}}x_i is approximately N(0,1/d). The Gaussian tail bound, together with a union bound over the d+1 vertices, now says that the average width of the simplex is O(\frac{\sqrt{\ln d}}{\sqrt d}).
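
A Monte Carlo check of this bound, using the standard construction of a regular simplex from the centered basis vectors of \mathbb{R}^{d+1} (a sketch; the dimension and sample count are arbitrary choices):

# Estimate the average width of the regular simplex, compare with sqrt(ln d / d).
import numpy as np

rng = np.random.default_rng(0)
d = 100

# Regular simplex: center the d+1 standard basis vectors of R^{d+1} and
# normalize; the vertices live in the d-dim subspace orthogonal to (1,...,1).
V = np.eye(d + 1) - np.full((d + 1, d + 1), 1 / (d + 1))
V /= np.linalg.norm(V, axis=1, keepdims=True)

widths = []
for _ in range(1_000):
    w = rng.standard_normal(d + 1)
    w -= w.mean()              # project onto the simplex's subspace
    w /= np.linalg.norm(w)     # uniform random direction in that subspace
    proj = V @ w
    widths.append(proj.max() - proj.min())

print(np.mean(widths), np.sqrt(np.log(d) / d))  # same order of magnitude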

Typed languages, units.

511 words

I recently picked up programming again. I used to program a lot before I went to university, but the constant mind-numbing programming assignments quickly put me off it. Apart from the occasional quick bug fix for software I use myself, I hadn’t done any serious coding for years.

Until recently, when I needed something coded up during my research. I decided to learn Python, and I like it. It is easy to use, the libraries are extensive and user-friendly, and ipython is a useful tool. There is just one thing that draws my ire: the weak type system. Studying math has given me an appreciation for type checking stricter than what most languages offer.

An example: my length in centimeters plus the outside temperature in °C right now equals 180. This calculation makes no sense, because the units don’t match: you can’t add centimeters to degrees Celsius. But then there’s Python, which just lets you do that.

In [1]: length = 170

In [2]: temperature = 10

In [3]: length + temperature
Out[3]: 180

Most bugs that stem from typos are of this sort. Those bugs are possible because the type system is too weak. If you have two loops, one iterating over i and one over j, basic unit-based type checking would probably flag any instance of i in a place where you should have typed j instead. If you intend to query A[i][j] then it should be possible to let i have row-index type and j have column-index type, making A[j][i] raise a type error.
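
Python’s optional static typing can already approximate this. Here is a minimal sketch using typing.NewType, where the names RowIndex and ColIndex are my own illustrative inventions; a type checker such as mypy flags the swapped call, although the interpreter itself still happily runs it.

# Distinct index types for rows and columns, checked statically by mypy.
from typing import NewType

RowIndex = NewType("RowIndex", int)
ColIndex = NewType("ColIndex", int)

def entry(A: list[list[float]], i: RowIndex, j: ColIndex) -> float:
    return A[i][j]

A = [[3.0, 4.0], [5.0, 6.0]]
i, j = RowIndex(0), ColIndex(1)

entry(A, i, j)  # fine
entry(A, j, i)  # mypy reports an error here; at runtime it sadly still works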

Another example: Let A \in \mathbb{R}^{n \times n}, x \in \mathbb{R}^n, and we’re interested in the quantity Ax \in \mathbb{R}^n. If you’re like me and you can’t remember what rows and what columns are, then that doesn’t have to impact your ability to symbolically do linear algebra: the quantities xA = A^{\mathsf{T}}x, Ax^{\mathsf{T}} and A^{-1} x don’t “compile”, so any mathematician that reads it will know you typo-ed if you wrote one of those latter expressions. All operations might be matrices acting on vectors, but the matrices A^{-1} and A^{\mathsf{T}} fundamentally take input from different copies of \mathbb{R}^n than the ones that x and x^{\mathsf{T}} live in. That is why matrix operations make sense even if the matrices aren’t square or symmetric: there is only one way to make sense of any operation. Even if you write it wrong in a proof, most people can see what the typo is. But then there’s Python.

In [4]: import numpy as np 

In [5]: x = np.array([1,2])

In [6]: A = np.array([[3,4],[5,6]])

In [7]: np.dot(A,x)
Out[7]: array([11, 17])

In [8]: np.dot(A,np.transpose(x))
Out[8]: array([11, 17])

In [9]: np.dot(x,A)
Out[9]: array([13, 16])

I am like me and I can’t remember what rows and what columns are. I would like the interpreter to tell me the correct way of doing my linear algebra. At least one of the above matrix-vector-products should throw a type error. Considering the history of type systems, it is not surprising that the first languages didn’t introduce unit-based types. Nonetheless, it is a complete mystery to me why modern languages don’t type this way.

Search the literature!

182 words

Every once in a while, I neglect to remember my favourite quote:

Six months of research can save you an afternoon in the library.

This is from a plaque in the library at work. The original quote is by the American chemist Frank Westheimer. The longer I do research, the more I see just how true this statement is.

How to use this insight in day-to-day life:

  1. Realise that an afternoon is both really short compared to six months, and really long compared to how long you typically browse the web looking for an answer.
  2. Be aware that, for an everyday question, you don’t know the right words to query the search engine with. After your first couple of tries in web search and scholarly search (the results of Google and of Google Scholar seem almost disjoint, so do try both), try to acquire the right keywords by figuring out what field your question belongs to, or by generalizing your question.
  3. Google Scholar has interesting results even on page 10. Its results are not ordered by relevance as well as a web search’s are.