Funding opportunity in academic publishing

472 words

I recently listened to a talk by Jean-Sébastien Caux, the founder, implementer and chairman of SciPost. It looks like a charity that might be worth funding.

Academic publishing is a complete disaster. The incentive structure is messed up. Papers must have more than a certain minimum of content times coolness, but simultaneously they shouldn’t be too long. This results in researchers sticking two unrelated papers together and calling them one thing, and cutting single papers up into multiple pieces to get published. If your result is too simple it will get rejected, so people make their papers more difficult on purpose. There is no pipeline for communicating minor improvements and comments on other people’s papers to the outside world.

Peer review might not literally be a farce, but it is closer to being one than anyone is really comfortable with. Because it all happens behind closed doors, peer reviews seldom contain constructive feedback, and reviewers most likely harm their own careers by spending time and effort on reviewing that could be spent doing research. People submit everything to the best journals first, trying one rung lower when their paper gets rejected. The review process can take years. Reviews are all hidden from the wider public, so the only way to judge a paper’s quality if you’re not in the field is by looking at citation counts and the journal the paper appeared in.

Publishers make a lot of profit selling the research community’s own results back to them. Journals and impact factors are silly inventions that date back to the dark ages before the internet existed, and they serve little to no use in modern times.

Enter SciPost. Imagine a love child of the free software movement, open access academic publishing, and modern discussion platform design. Submission is free. A manuscript is public from the moment one of the editors decides it is probably worth getting reviewed, with all the fancy DOIs and what-not that you could ask for. Both the content and the platform itself are licensed under free licences. Reviews are public, either signed with the reviewer’s name or anonymous, which turns out to greatly improve review quality. Reviews are citable objects with DOIs and everything. A number of reviews are invited, but anyone can submit a review if they’d like. People can post comments on reviews. Their funding comes entirely from sponsors. Their average cost per publication is under $400, way less than the article processing fees of most open access journals. They hold themselves to principles of openness that go way beyond the Fair Open Access Principles.

Right now SciPost publishes papers in physics. They want to expand to other disciplines, but money is the major bottleneck. Over the past 3 years they’ve gotten around $250k in total funding, so the marginal gains from additional funds should be pretty good.

Gaussian tail bounds

787 words

This is all basic stuff, but at the moment it is too hard to find through Google, and sometimes I just want to know the answer without having to derive it on the spot. Hence, this post. I hope I’m not ruining anyone’s course’s hand-in exercises.

The standard normal distribution N(0,1) has probability density function \frac{1}{\sqrt{2\pi}}e^{-x^2/2}. There is no way to integrate this function symbolically in a nice way, but we do at times want to (upper) bound expressions of the form \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2} \mathrm{d}t. How can we do this?

One way is to follow this approach. Since t\geq x everywhere, we can upper bound \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2} \mathrm{d}t \leq \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{t}{x} e^{-t^2/2} \mathrm{d}t = \frac{1}{x\sqrt{2\pi}}e^{-x^2/2}.

There is another tail bound which is a bit weaker for large x, but I like the proof better. We’ll give a tail bound by looking at the moment-generating function \lambda \mapsto \mathbb{E}[e^{\lambda X}], where X \sim N(0,1) is our normally distributed random variable. We can explicitly calculate this expectation and find \mathbb{E}[e^{\lambda X}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{\lambda x - x^2/2}\mathrm{d}x = \frac{1}{\sqrt{2\pi}}e^{\lambda^2/2}\int_{-\infty}^\infty e^{-(x-\lambda)^2/2}\mathrm{d}x. The last integral is just the entire Gaussian integral shifted a bit, and hence \mathbb{E}[e^{\lambda X}] = e^{\lambda^2/2}. Now we use Chernoff’s bound (an easy corollary of Markov’s inequality) to find \mathbb{P}[X \geq t] \leq \mathbb{E}[e^{\lambda X}]e^{-\lambda t}, which we minimize over the choice of \lambda by setting \lambda=t, and we conclude that \mathbb{P}[X \geq t] \leq e^{-t^2/2}.
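As a sanity check, here is a quick numerical comparison of both bounds against the exact tail (a sketch, assuming NumPy and SciPy are installed):

import numpy as np
from scipy.stats import norm

# compare the true Gaussian tail with both upper bounds
for x in [1.0, 2.0, 4.0]:
    exact = norm.sf(x)                                     # P[X >= x]
    mills = np.exp(-x**2 / 2) / (x * np.sqrt(2 * np.pi))   # first bound
    chernoff = np.exp(-x**2 / 2)                           # Chernoff bound
    print(f"x={x}: exact={exact:.3e}  mills={mills:.3e}  chernoff={chernoff:.3e}")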

Multi-variate Gaussians

Now X \in \mathbb{R}^d is N(0,I_d) normally distributed, i.e., X is a vector with iid Gaussian N(0,1) entries. What tail bounds do we get on \|X\|? We start off with Markov’s inequality again: for any 0 < \lambda < 1/2, \mathbb{P}[\|X\| > t] = \mathbb{P}[e^{\lambda\|X\|^2} > e^{\lambda t^2}] \leq \frac{\mathbb{E}[e^{\lambda\|X\|^2}]}{e^{\lambda t^2}}.

Deriving the moment-generating function \lambda \mapsto \mathbb{E}[e^{\lambda X_1^2}] of a single squared coordinate is an elementary calculation. \int_{-\infty}^\infty e^{\lambda x^2} \cdot e^{-x^2/2} \mathrm{d}x = \int_{-\infty}^\infty e^{-x^2(1-2\lambda)/2}\mathrm{d}x = \frac{\sqrt{2\pi}}{\sqrt{1-2\lambda}}.

The coordinates of X are iid, so \mathbb{E}[e^{\lambda\|X\|^2}] = \mathbb{E}[e^{\lambda X_1^2}]^d = (1-2\lambda)^{-d/2}. For \mathbb{P}[\|X\| > t\sqrt{d}], the minimizer is at \lambda=(1-1/t^2)/2, and we find, requiring t \geq 1 for the last inequality, \mathbb{P}[\|X\| > t\sqrt{d}] \leq e^{-d(t^2-2\log t - 1)/2} \leq e^{-d(t-1)^2/2}.
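A quick Monte Carlo sanity check of this bound (a sketch; the parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, t = 20, 1.2
samples = rng.standard_normal((200_000, d))
empirical = (np.linalg.norm(samples, axis=1) > t * np.sqrt(d)).mean()
bound = np.exp(-d * (t**2 - 2 * np.log(t) - 1) / 2)
print(empirical, bound)  # the bound should dominate the empirical tail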

Operator norm of Gaussian matrices

The operator norm or spectral norm of a d \times d matrix M is defined as \|M\| := \max_{x \neq 0} \frac{\|Mx\|}{\|x\|}.

Now if M were a matrix with every entry independently N(0,1), what would the largest singular value of this random Gaussian matrix be? I’ll give an easy tail bound based on a net argument.

An \eta-net, \eta > 0, on the sphere is a subset N \subset \mathbb{S}^{d-1} such that for every point x \in \mathbb{S}^{d-1} there is a net element n \in N with \|x-n\| \leq \eta, while every two net elements are at distance at least \eta from each other. A greedy algorithm can construct an \eta-net, and any such \eta-net has size at most (4/\eta)^d.1

Now let N\subset \mathbb{S}^{d-1} be a 1/2-net. By the above, the size of the net is bounded by |N| \leq 8^d.

The function x \mapsto \|Mx\| is \|M\|-Lipschitz. Hence we can bound \|M\| = \max_{x\in\mathbb{S}^{d-1}} \|Mx\| \leq \max_{x\in\mathbb{S}^{d-1}} \min_{\omega \in N} \left(\|M\omega\| + \|M\|\cdot\|x-\omega\|\right) \leq \max_{\omega\in N} \|M\omega\| + \|M\|/2. Rearranging, we have now proved that \|M\| \leq 2\max_{\omega\in N} \|M\omega\|.

Now, as M\omega is N(0,I_d) normally distributed for any fixed \omega\in\mathbb{S}^{d-1}, we can use the union bound over all points of N and conclude that, for all t \geq 1, \mathbb{P}[\|M\| \geq 2t\sqrt{d}] \leq 8^d e^{-d(t-1)^2/2}.
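The constant is of the right order: the largest singular value of a d \times d Gaussian matrix is known to concentrate around 2\sqrt{d}. A quick numerical check (sketch):

import numpy as np

rng = np.random.default_rng(0)
d = 500
M = rng.standard_normal((d, d))
print(np.linalg.norm(M, ord=2) / np.sqrt(d))  # typically close to 2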

Maximum of n Gaussians

For n independent identically distributed variables X_1,\ldots,X_n \sim N(0,1), the union bound gives \mathbb{P}[\max_{i \leq n} X_i \geq t] \leq ne^{-t^2/2}, and this bound is tight up to a constant factor.

Hence the expected maximum is \mathbb{E}[\max_{i \leq n} X_i] = O(\sqrt{\ln n}).
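A quick numerical check (sketch): the maximum of n standard Gaussians indeed lands a bit below \sqrt{2\ln n}.

import numpy as np

rng = np.random.default_rng(0)
n = 10**6
print(rng.standard_normal(n).max())  # typically around 4.8-5.0
print(np.sqrt(2 * np.log(n)))        # sqrt(2 ln n) is about 5.26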

Average width of the simplex

Let x_1,\dots,x_{d+1} \in \mathbb{R}^d be the vertices of a regular simplex such that \|x_i\| = 1 for all i \in [d+1]. If \omega \in \mathbb{S}^{d-1} is chosen uniformly at random, the expected value over \omega of \max_{i,j\in[d+1]} |\omega^{\mathsf{T}}(x_i-x_j)| is called the average width of the simplex. We can bound this up to a constant factor using our knowledge of Gaussians. Let H_t := \{y\in\mathbb{R}^d : \omega^{\mathsf{T}}y = t\}. The (d-2)-dimensional volume of H_t\cap \mathbb{S}^{d-1} is (1-t^2)^{(d-2)/2} times the volume of \mathbb{S}^{d-2}, by Pythagoras’ theorem. Recalling that (1+1/\lambda)^\lambda \approx e, you can prove that the distribution of \omega^{\mathsf{T}}x_i is approximately N(0,1/d). The Gaussian tail bound now says that the average width of the simplex is O(\frac{\sqrt{\ln d}}{\sqrt d}).
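To see the \sqrt{\ln d}/\sqrt{d} scaling empirically, here is a small simulation (a sketch; it builds the regular simplex by projecting the standard basis of \mathbb{R}^{d+1} onto the hyperplane orthogonal to the all-ones vector):

import numpy as np

rng = np.random.default_rng(0)
d = 100

# regular simplex with unit-norm vertices as the rows of X
X = np.eye(d + 1) - 1 / (d + 1)
X /= np.linalg.norm(X[0])

# uniformly random unit direction inside the simplex's hyperplane
omega = rng.standard_normal(d + 1)
omega -= omega.mean()
omega /= np.linalg.norm(omega)

proj = X @ omega
print(proj.max() - proj.min())  # width in direction omega
print(np.sqrt(np.log(d) / d))   # the O(sqrt(ln d)/sqrt(d)) scale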

  1. See e.g., Jiří Matoušek, Lectures on Discrete Geometry (Springer, 2002), page 314. The proof is based on a simple packing argument: the balls of radius \eta/2 around the net elements have to fit disjointly inside the ball of radius 1+\eta/2 centered at the origin, so any \eta-net has size at most ((1+\eta/2)/(\eta/2))^d \leq (4/\eta)^d for \eta \leq 2.

Typed languages, units.

511 words

I recently picked up programming again. I used to do it a lot before I went to university, but the constant mind-numbing programming assignments there quickly put me off. Apart from the occasional quick bug fix for software I use myself, I haven’t done any serious coding for years.

Until recently, when I needed something coded up during my research. I decided to learn Python, and I like it. It is easy to use, the libraries are extensive and user-friendly, and ipython is a useful tool. There is just one thing that draws my ire: the weak type system. Studying math has given me an appreciation for type checking that is even stricter than what most languages offer.

An example: my height in centimeters plus the outside temperature in °C right now equals 180. This calculation makes no sense, because the units don’t match: you can’t add centimeters to degrees Celsius. But then there’s Python, which just lets you do that.

In [1]: length = 170

In [2]: temperature = 10

In [3]: length + temperature
Out[3]: 180

Most bugs that stem from typos are of this sort. Those bugs are possible because the type system is too weak. If you have two loops, one iterating over i and one over j, basic unit-based type checking would probably flag any instance of i in a place where you should have typed j instead. If you intend to query A[i][j] then it should be possible to give i a row-index type and j a column-index type, making A[j][i] raise a type error.
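Python’s type hints can already emulate a bit of this. A minimal sketch (the names are mine, nothing standard): distinct index types via typing.NewType, which a static checker like mypy will enforce even though at runtime they are plain ints.

from typing import NewType

RowIndex = NewType("RowIndex", int)
ColIndex = NewType("ColIndex", int)

def entry(A: list[list[float]], i: RowIndex, j: ColIndex) -> float:
    return A[i][j]

A = [[1.0, 2.0], [3.0, 4.0]]
i, j = RowIndex(0), ColIndex(1)

entry(A, i, j)  # fine
entry(A, j, i)  # mypy reports an error here; Python itself stays silent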

Another example: Let A \in \mathbb{R}^{n \times n}, x \in \mathbb{R}^n, and we’re interested in the quantity Ax \in \mathbb{R}^n. If you’re like me and you can’t remember what rows and what columns are, then that doesn’t have to impact your ability to do linear algebra symbolically: the quantities xA = A^{\mathsf{T}}x, Ax^{\mathsf{T}} and A^{-1} x don’t “compile”, so any mathematician who reads it will know you typo-ed if you wrote one of those latter expressions. All operations might be matrices acting on vectors, but the matrices A^{-1} and A^{\mathsf{T}} fundamentally take input from different copies of \mathbb{R}^n than the ones that x and x^{\mathsf{T}} live in. That is why matrix operations make sense even if the matrices aren’t square or symmetric: there is only one way to make sense of any operation. Even if you write it wrong in a proof, most people can see what the typo is. But then there’s Python.

In [4]: import numpy as np 

In [5]: x = np.array([1,2])

In [6]: A = np.array([[3,4],[5,6]])

In [7]: np.dot(A,x)
Out[7]: array([11, 17])

In [8]: np.dot(A,np.transpose(x))
Out[8]: array([11, 17])

In [9]: np.dot(x,A)
Out[9]: array([13, 16])

I am like me and I can’t remember what rows and what columns are. I would like the interpreter to tell me the correct way of doing my linear algebra. At least one of the above matrix-vector-products should throw a type error. Considering the history of type systems, it is not surprising that the first languages didn’t introduce unit-based types. Nonetheless, it is a complete mystery to me why modern languages don’t type this way.
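For fun, here is a minimal sketch of what such checking could look like at runtime (entirely hypothetical, not any existing library’s API): vectors are tagged with the copy of \mathbb{R}^n they live in, and a matrix only accepts vectors from its domain.

import numpy as np

class Vec:
    def __init__(self, data, space):
        self.data = np.asarray(data)
        self.space = space  # the name of the copy of R^n this vector lives in

class Mat:
    def __init__(self, data, domain, codomain):
        self.data = np.asarray(data)
        self.domain, self.codomain = domain, codomain

    def __matmul__(self, v):
        # only act on vectors from our own domain
        if not isinstance(v, Vec) or v.space != self.domain:
            raise TypeError(f"expected a vector in {self.domain!r}")
        return Vec(self.data @ v.data, self.codomain)

    @property
    def T(self):
        # the transpose maps in the opposite direction
        return Mat(self.data.T, self.codomain, self.domain)

x = Vec([1, 2], space="V")
A = Mat([[3, 4], [5, 6]], domain="V", codomain="W")

y = A @ x   # fine: y lives in W
# A.T @ x   # raises TypeError: A.T wants a vector in W, but x lives in V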

Tumblr RSS under GDPR

219 words

When the GDPR became effective, Tumblr decided to break its RSS feeds for all EU residents. When you try to fetch https://username.tumblr.com/rss, they’ll serve their GDPR wall instead of your requested file. You can only grab the feed if you send the correct cookies.

Anyway, here is a hacky fix for your RSS feeds. I’m assuming you run an HTTP server yourself. I use a Raspberry Pi with lighttpd and selfoss. I’m assuming your user on the server is called beth, you want to follow the user called username on Tumblr, and your document root is /home/beth/public.

Create the folder /home/beth/public/rss. Create the file /home/beth/.bin/fetchfeeds.sh with the following contents. Duplicate the last line once for every user you want to follow, and adjust all three occurrences of username to fit.

#!/bin/bash

curl --header 'Host: username.tumblr.com' --header 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' --header 'Accept-Language: nl,en-GB;q=0.9,en;q=0.8' --header 'Cookie: rxx=1jo2mpfhsia.1c2ecvc5&v=1; pfg=1bc46aba34ffeb83e2ef0859447d282cf8a8a2a9f95200a2a705f3afebfe9bef%23%7B%22eu_resident%22%3A1%2C%22gdpr_is_acceptable_age%22%3A1%2C%22gdpr_consent_core%22%3A1%2C%22gdpr_consent_first_party_ads%22%3A1%2C%22gdpr_consent_third_party_ads%22%3A1%2C%22gdpr_consent_search_history%22%3A1%2C%22exp%22%3A1576795974%2C%22vc%22%3A%22%22%7D%237343812268; tmgioct=5c1acbc67c0f570418402840' --header 'Connection: keep-alive' 'https://username.tumblr.com/rss' -o '/home/beth/public/rss/username-tumblr.rss' -L

[The curl command was produced using the CurlWget browser extension.]

Don’t forget to fix the permissions: chmod +x ~/.bin/fetchfeeds.sh. Now put this in your cron table by using the command crontab -e: 0 * * * * ~/.bin/fetchfeeds.sh. This will execute the bash file in the first minute of every hour.

Lastly, put http://localhost/rss/username-tumblr.rss in your RSS reader.

Does blockchain make any sense?

612 words

Context: Buterin on non-financial applications of blockchain, 15 tweets. I’ll assume you have read it.

I am deeply convinced that blockchains have no use cases. Trust is provided in the real world by the option to sue people. Trust in most blockchains is misplaced because miner pools are too big and the group of dictators of any coin is small, not accountable and wrongly incentivised. Timestamping data works just as well without using a distributed blockchain. Immutability is provided by standard crypto. The oracle problem is a real fuckin’ serious problem that is only ever resolved through the force of law.

Every function of distributed blockchains can be provided by combining digital signatures, cryptographic commitment schemes, public key cryptography, hashing all data and either periodic or on-demand polling.
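For illustration, here is a minimal sketch of one of those primitives, a hash-based commitment scheme: publish the digest anywhere durable today, reveal the nonce and data later, and anyone can verify you didn’t change your story. No miners required.

import hashlib
import os

def commit(data: bytes) -> tuple[str, bytes]:
    # a random nonce keeps the commitment hiding
    nonce = os.urandom(32)
    digest = hashlib.sha256(nonce + data).hexdigest()
    return digest, nonce  # publish the digest now, keep the nonce secret

def verify(digest: str, nonce: bytes, data: bytes) -> bool:
    return hashlib.sha256(nonce + data).hexdigest() == digest

digest, nonce = commit(b"I predict X will happen")
print(verify(digest, nonce, b"I predict X will happen"))  # True
print(verify(digest, nonce, b"I predict Y will happen"))  # False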

So every time Tyler Cowen of Marginal Revolution posts a link to Buterin’s Twitter feed, my confidence in other people’s assertions dies a little bit more.

I’ll take the tweets linked up top in order.

2-5. I think he means “Cryptography allows you to encrypt data, prove data was signed by someone, etc etc… blockchains OTOH allow you to prove that a piece of data was *not* published [according to protocol]”, because the original is obviously false (my daily newspaper publishes things but is not on any blockchain). The adapted statement is also false, because SSL is based on the very notion that certificates can be revoked and that you can check that a certificate hasn’t been revoked.

6. Not just SSL; GnuPG does this very same thing as well.

7-8. Blockchains aren’t credibly neutral: every miner on a blockchain has money in it, skewing their incentives in certain directions. Trust only goes as far as you can pay. Cryptographic hashes and signatures can make any database trusted if you can sue its proprietor.

9-11. This might be true, but it is pretty much impossible that blockchain performance will ever match that of standard crypto, and Buterin conveniently manages to forget the additional cost in programmer-hours. Technological cost is human cost, of the kind that is in the most limited supply.

12. This tweet illuminates a lot. Buterin lives in countries without proper online banking and moreover hates privacy of any kind.

13-14. What does this even mean? Anything can and will get hacked. Your hard drive will stop working one day, and your backup won’t work.

15. Smaller stakes \implies smaller mining rewards \implies fewer miners \implies easier to attack. Also, applications breaking incurs a huge social cost; some people are still distrustful of others from when Google Reader (RIP) was cancelled.

I know a lot of computer scientists at my institute, including people who do research into applications of blockchains. Nobody I know believes blockchains are useful apart from black markets, tax evasion, as a novel pyramid scheme, or as a hype for getting grant money.

I get why Cowen falls for Buterin’s sweet talk: he is an economist with no knowledge of cryptography. I don’t get why more people who know anything about cryptography aren’t speaking out about the insanity of blockchain. My best guess is that everyone who knows stuff about it is either using it to make money themselves, pretending to be enthusiastic to further the pyramid scheme, or positioning themselves as a “blockchain expert” and charging high consultancy fees.

Section of doubt

I hope I’m completely wrong and I just misunderstood every statement every blockchain fanatic has ever made. My constant rejection of all their statements does make me question myself, because it seems strange that so many people can be so consistently wrong about a thing. I’d love to show epistemic humility here, but I don’t consider it to be justified.

TIL about referer spam

359 words

This is cute. Suppose you are a spammer and you want to target website owners; how do you do this? Submitting spam comments is one possibility: either your spam is posted, in which case you win, or it gets stuck in the spam filter, and the site admin will at some point scroll past it when checking whether any proper comments were misflagged, so you’ll sort of win anyway. Here is a better strategy: access their website while spoofing your referer.

What is a referer, you ask? Let’s say you are browsing bethzero.com/2018/12/01/referer-spam, and you click a link to example.org. Your browser will, in its HTTP request to example.org, mention that you got referred to their page from bethzero.com/2018/12/01/referer-spam. This is so that website owners can know where their visitors come from. Super neat, how can we use this for spamming?

The trick is to send requests with fake referer attributes. (That is the correct spelling, except in some cases, such as rel=”noreferrer”.) Request the page https://bethzero.com, and put https://bestblogideas.com as the referer. When the proprietor of bethzero.com looks at her visitor statistics, she will see that bestblogideas.com linked to her, and probably visit it herself to see what they write about her. This is where you sell her access to your cheap special $300 blogging video course.
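In Python this takes a couple of lines; a sketch using the requests library (the URLs are the ones from the example above):

import requests

# fetch the page while claiming we were referred by the spam site
requests.get(
    "https://bethzero.com",
    headers={"Referer": "https://bestblogideas.com"},
)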

Another fun way to use this is probably for doxxing. Create a special page tracking.com/<unique-code> and use that as the referer. If you make sure that no other entry points to that page exist, then every visitor will be the Beth you’re targeting. For bonus points, you put Facebook’s tracking code on the page so that you’ll forever be able to target advertisements directly to Beth.

Beth might try to outsmart you and satisfy her curiosity about her referrers by pointing her browser at tracking.com without the unique code. In that case, you can try buying a number of domain names for a kind of adaptive group-testing procedure. If you have a million dollars, you can probably do this at a bigger scale, finding the IP addresses of a good fraction of pseudonymous website owners. Looking at my own stats, I think this is really happening.

Delayed WordPress RSS feed

365 words

In the category “2-minute hacks for skilled people, 30-minute hacks for me because I can’t code for shit”, let’s replay an old RSS feed. Useful if you want to read the old posts on a blog but don’t want to do it all at once.

This way, you can relive the past of your favourite WordPress blogs. Requires a server with php installation and a feed reader that updates at least once per day. I’ve got a Raspberry Pi 3B running selfoss.

We’ll be using a fancy feature of WordPress, namely that https://example.com/yyyy/mm/dd/feed gives the RSS feed with posts from that day. Actually, every public-facing WordPress page can be appended with /feed to produce something meaningful. It will not be a truly faithful replay of old posts, because if posts get edited or deleted, we won’t get to see the original post.
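You can check this against any WordPress blog with curl (the date here is arbitrary):

$ curl -s 'https://bethzero.com/2018/12/01/feed' | head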

With this knowledge in mind, create a PHP file

$ touch public/delay.php
$ chmod 755 public/delay.php

and fill it with the following code

<?php
$url = htmlspecialchars($_GET['url']);
$years = floatval($_GET['years']);
$months = floatval($_GET['months']);
$days = floatval($_GET['days']);

// calculate the time stamp of some day in the past
// (we subtract a fixed number of seconds so that we don't
// run into issues with how many days a month has and
// because I am not good enough at coding to handle this
// the right way.)
$months_delayed = $months + 12 * $years;
$days_delayed = $days + 30 * $months_delayed;
$moment = time() - intval(60 * 60 * 24 * $days_delayed);

// redirect to the RSS file with posts from that day
header('Location: ' . $url . '/' . date('Y/m/d', $moment) . '/feed', true, 303);
die();
?>

Requesting the page http://example.org/delay.php?url=https://bethzero.com&months=1 will redirect you to an RSS file containing all posts made exactly 30 days ago. Because we’re using a temporary 303 redirect, the page that you get redirected to changes every day. If your feed reader updates every day, you should get to see every post.

One small issue happens if a day has more blog posts than can appear in the RSS feed (default is 10 in WordPress). In that case, you might miss out on the oldest posts of the day.

Bayes’ theorem and transgender lesbians

234 words

Some time ago I read a cool post on Tumblr, but I can’t find it anymore. It was about calculating P(trans|WLW), the fraction of women who love women that is transgender, from P(trans), the fraction of the general population that is transgender, P(WLW), the fraction of the population that is a woman-loving woman, and P(WLW|trans), the fraction of gynephilic trans women among trans people. Bayes’ theorem says

P(trans|WLW) = P(WLW|trans)P(trans)/P(WLW).

I remember that the resulting number was significant. As I could not find it again, here is a quick and dirty reconstruction. For every statistic, I picked the first one I found that did not seem completely unrealistic.

So Bayes’ theorem gives us P(trans|WLW) = 0.15.

Bonus: suicide attempts

While preparing this post, I stumbled upon this report. Page 8 lists:

  • P(attempted suicide) = 0.016.
  • P(attempted suicide|trans) = 0.41.

Bayes now says P(trans|attempted suicide) = 0.26. Big if true.*
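The input statistics I used aren’t listed above, so as an illustration, here is the arithmetic with hypothetical placeholder values, chosen to be of a plausible order of magnitude and to reproduce both results; only the 0.016 and the 0.41 come from the report.

# hypothetical placeholder inputs, except the two report numbers
p_trans = 0.01            # fraction of the population that is trans
p_wlw_given_trans = 0.45  # gynephilic fraction among trans people
p_wlw = 0.03              # fraction of the population that is WLW

print(p_wlw_given_trans * p_trans / p_wlw)  # 0.15

p_attempt = 0.016             # P(attempted suicide), from the report
p_attempt_given_trans = 0.41  # P(attempted suicide|trans), from the report

print(p_attempt_given_trans * p_trans / p_attempt)  # roughly 0.26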

Section of Doubt

Applying Bayes’ theorem like this seems to give unreasonably good mileage. That suggests that either social scientists aren’t allowed to combine numbers from different studies to draw conclusions from them, or asymmetric misreporting makes these calculations error-prone.

The last number above is big. It makes one wonder why so little effort is spent explicitly targeting at-risk trans people.

* Added July 16th: I just met a subject expert, she said this figure sounded about right.