HN Domain Leaderboard

How is the score calculated?

Let D be a domain. Consider the set of HN submissions for D over a time period T. Let m be the mean number of points received by these submissions, and p the 75th percentile of the number of points received by these submissions. Let n be the total number of submissions in this set. Then the score for D over the time period T is

Score = (m + 2p) log(n)
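
As a rough illustration, the formula can be sketched in Python as follows. The function name domain_score is made up for this example, and the percentile method (linear interpolation, via statistics.quantiles with method="inclusive") is an assumption, since the exact percentile definition isn't specified above.

```python
import math
from statistics import mean, quantiles


def domain_score(points):
    """Score one domain from the points of its submissions over period T."""
    n = len(points)
    if n < 2:
        # ln(1) = 0, so a single submission scores 0 whatever its points.
        # Treating an empty list the same way is an assumption of this sketch.
        return 0.0
    m = mean(points)  # mean number of points
    # 75th percentile; the interpolation method is an assumption.
    p = quantiles(points, n=4, method="inclusive")[2]
    return (m + 2 * p) * math.log(n)  # natural log


# Five submissions with 1..5 points: m = 3, p = 4, n = 5.
print(domain_score([1, 2, 3, 4, 5]))  # (3 + 2 * 4) * ln(5), about 17.7
```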

The aim of the leaderboard is to surface domains whose content the HN community consistently loves. A high-ranking domain therefore needs a decent number of submissions, and a decent proportion of those submissions have to receive a lot of points. The leaderboard only shows domains with at least 25 submissions over the time period.
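
Putting the formula and the 25-submission cutoff together, building the ranking could look something like the sketch below. It assumes the input is a list of (domain, points) pairs; the function name leaderboard, the input shape and the percentile method are all assumptions made for illustration, not a description of the actual pipeline.

```python
import math
from collections import defaultdict
from statistics import mean, quantiles

MIN_SUBMISSIONS = 25  # cutoff stated in the text


def leaderboard(submissions, min_submissions=MIN_SUBMISSIONS):
    """Rank domains by (m + 2p) * ln(n), given (domain, points) pairs."""
    points_by_domain = defaultdict(list)
    for domain, points in submissions:
        points_by_domain[domain].append(points)

    rows = []
    for domain, points in points_by_domain.items():
        n = len(points)
        # Skip domains below the cutoff; quantiles() also needs >= 2 values.
        if n < max(min_submissions, 2):
            continue
        m = mean(points)
        # 75th percentile; the interpolation method is an assumption.
        p = quantiles(points, n=4, method="inclusive")[2]
        rows.append((domain, (m + 2 * p) * math.log(n)))

    # Highest score first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical data: one domain with 25 submissions, one with only 24.
data = [("a.example", i) for i in range(25)] + [("b.example", 100)] * 24
print(leaderboard(data))  # only a.example clears the cutoff
```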

The "quality" of submissions for a domain is given by (m + 2p). Getting to the HN front page requires a decent amount of luck, and links are often submitted a few times (by different users or the same user) before they gain traction. As a result, most submissions end up with close to 0 points and a few end up with a very large number, so the distribution of points is heavily skewed. Including the 75th percentile of the points received by submissions accounts for this.
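
One way to see what the 75th percentile adds is to compare two made-up domains with the same mean. The data below is entirely hypothetical, and the percentile method is again an assumption: one domain has a single runaway hit, the other has hits on half of its submissions.

```python
from statistics import mean, quantiles


def quality(points):
    """Return (m, p, m + 2p) for a list of submission points."""
    m = mean(points)
    # 75th percentile; the interpolation method is an assumption.
    p = quantiles(points, n=4, method="inclusive")[2]
    return m, p, m + 2 * p


# Hypothetical data: 20 submissions and 1000 points in total for each domain.
one_hit = [1] * 19 + [981]
regular_hits = [1] * 10 + [99] * 10

print(quality(one_hit))       # m = 50, p = 1, quality = 52
print(quality(regular_hits))  # m = 50, p = 99, quality = 248
```

Both domains have a mean of 50 points, but the one with regular hits ends up with a much higher quality term, which matches the aim of rewarding domains where a decent proportion of submissions do well.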

The "quantity" of submissions for a domain is taken into account by log(n). The (natural) log accounts for the fact that some domains produce a much greater quantity of content than others just by virtue of what they represent. The New York Times, for example, churns out orders of magnitude more content than a very high-quality blog run by an individual. The score rewards domains that get more submissions, but the log introduces "diminishing returns" so that bigger sites don't completely dominate smaller ones just by having more content.
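
The diminishing returns are easy to check numerically. In this small sketch (the function name quantity_factor is made up for the example), every tenfold increase in submissions adds the same fixed amount, ln(10), or about 2.3, to the factor.

```python
import math


def quantity_factor(n):
    """The log(n) term of the score, using the natural log."""
    return math.log(n)


for n in (25, 250, 2500, 25000):
    print(n, round(quantity_factor(n), 2))
# 25 3.22
# 250 5.52
# 2500 7.82
# 25000 10.13
```

So a domain with 1000 times as many submissions as one sitting at the 25-submission cutoff only gets a factor roughly three times larger.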

The score is somewhat subjective, and the choice of constants in the formula (e.g. the factor 2, log base e) is fairly arbitrary. Many different formulae would account for the considerations above, and the one I've used is just one of them. However, the similar formulae I tried gave very similar results each time, so I think the ranking is fairly robust to small changes.

Where is the data from?

Google have kindly made all Hacker News data available in BigQuery, free for public use to analyse up to a terabyte of data:

Who made this?