**Figure Gallery 1:** *Citations after 5 years vs. number of unique word roots, with the 7 sub-arxivs shown separately. Black crosses: equal-population grouped medians, with 1-sigma sampling uncertainty shown by vertical lines. Blue line: best-fit "linear median" model with constant slope across all sub-arxivs. Cyan region: 1-sigma credible region of the "linear median" model.*

The figures above show the citations after five years vs. the number of unique word roots in each paper abstract, with the papers split by primary sub-arxiv. Overlaid on the points are 5 grouped medians (black crosses), each computed over an equal-sized group of papers after sorting by number of unique word roots. The horizontal bars show the range of unique word roots covered by each group, and the vertical error bars show the 1-sigma uncertainty on the medians, determined by the bootstrap resampling described earlier.
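A minimal Python sketch of the grouped medians described above, assuming the papers of one sub-arxiv are held in two NumPy arrays. The function name `grouped_medians`, its parameters, and the percentile-based 1-sigma interval are illustrative assumptions; the bootstrap used for the figures may differ in detail.

```python
import numpy as np

def grouped_medians(n_roots, citations, n_groups=5, n_boot=1000, seed=0):
    """Equal-population grouped medians with bootstrap 1-sigma intervals.

    n_roots   : number of unique word roots per abstract
    citations : citations after five years per paper
    """
    rng = np.random.default_rng(seed)
    n_roots = np.asarray(n_roots)
    citations = np.asarray(citations)

    # Sort papers by number of unique word roots, then cut the sorted
    # order into n_groups pieces whose sizes differ by at most one.
    order = np.argsort(n_roots, kind="stable")
    groups = np.array_split(order, n_groups)

    results = []
    for idx in groups:
        x, y = n_roots[idx], citations[idx]
        # Resample the group's citations with replacement and record
        # the median of every resample.
        boot = rng.choice(y, size=(n_boot, y.size), replace=True)
        boot_medians = np.median(boot, axis=1)
        lo, hi = np.percentile(boot_medians, [15.87, 84.13])
        results.append({
            "x_min": x.min(),               # left end of horizontal bar
            "x_max": x.max(),               # right end of horizontal bar
            "median": float(np.median(y)),  # position of the cross
            "lo": float(lo),                # lower 1-sigma bound
            "hi": float(hi),                # upper 1-sigma bound
        })
    return results
```

The central 68.27% of the bootstrap medians stands in for the 1-sigma interval here, which keeps the error bars asymmetric when the citation distribution is skewed.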

Qualitatively, the median citation rate climbs with the number of unique word roots in the abstract in every category. To quantify this, I adopted a linear model for the expected median citation rate as a function of the number of unique word roots. Wherever the line falls below zero, the model citation rate is set to zero.

Because some sub-arxivs are poorly represented in my sample, I decided to fix the slope of this model across *all* sub-arxivs, while allowing the constant term (the intercept) to vary for each sub-arxiv. This amounts to assuming that, while papers in a given sub-arxiv may receive fewer citations overall, the addition of more unique word roots (or more information) is equally valued across all sub-arxivs.
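The model described above (one shared slope, one constant term per sub-arxiv, floored at zero) can be sketched in a few lines of Python. The function name `model_median` and the dictionary layout for the constant terms are assumptions made for illustration, not the code used for the figures.

```python
import numpy as np

def model_median(n_roots, sub_arxiv, slope, intercepts):
    """Expected median citations after five years.

    n_roots    : number of unique word roots per abstract
    sub_arxiv  : sub-arxiv label per abstract
    slope      : single slope shared by every sub-arxiv
    intercepts : dict mapping sub-arxiv label -> constant term
    """
    n_roots = np.asarray(n_roots, dtype=float)
    b = np.array([intercepts[name] for name in sub_arxiv], dtype=float)
    # Linear trend, set to zero wherever the line would go negative.
    return np.maximum(0.0, slope * n_roots + b)
```

With a hypothetical constant term of -4 and a slope of 0.2, an abstract with 10 unique word roots sits on the zero floor, while one with 50 has a model median of 6 citations.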

Since infinitely many lines split a given 2D distribution exactly in half, the definition of a formal "linear median trendline" is subtle. Because both quantities under consideration here are discretized, it is possible to construct a relatively simple and formally meaningful median trendline definition that does not rely on any additional binning of the data. I hope to describe this definition in a future post. Note that the grouped medians shown as black crosses in the figures above have no influence on the model, and yet they generally agree very well with it (except for astro-ph.IM, which is an outlier in a number of ways...).

I determined the uncertainties on the linear median model parameters (one slope *M* and seven intercepts {*b1 ... b7*}) with a bootstrap-resampling approach: I resample with replacement from the abstracts' citation counts and lengths in each sub-arxiv, re-fit the "linear median trendline" model to the resampled data, and repeat this process many times. The posterior distribution of the model parameters is estimated from the distribution of the models fit to the resampled data.
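The resampling loop can be sketched as follows. Since the median trendline definition itself is not given here, this sketch swaps in a stand-in fit: a least-absolute-deviation line with one shared slope and one intercept per sub-arxiv, found by a grid search over the slope and ignoring the floor at zero. The names `fit_shared_slope` and `bootstrap_fit`, the slope grid, and the choice to resample (length, citations) pairs together are all assumptions.

```python
import numpy as np

def fit_shared_slope(x, y, cat, slope_grid=None):
    """Stand-in fit (NOT the post's median trendline): least absolute
    deviations with one slope for all categories, one intercept each."""
    x, y, cat = np.asarray(x, float), np.asarray(y, float), np.asarray(cat)
    if slope_grid is None:
        slope_grid = np.linspace(0.0, 1.0, 101)  # assumed search range
    cats = np.unique(cat)
    best = None
    for m in slope_grid:
        resid = y - m * x
        # For a fixed slope, the intercept minimizing the absolute
        # deviations in a category is the median of its residuals.
        b = {c: float(np.median(resid[cat == c])) for c in cats}
        cost = sum(np.abs(resid[cat == c] - b[c]).sum() for c in cats)
        if best is None or cost < best[0]:
            best = (cost, float(m), b)
    return best[1], best[2]

def bootstrap_fit(x, y, cat, n_boot=200, seed=0):
    """Re-fit the model to resampled data n_boot times."""
    rng = np.random.default_rng(seed)
    x, y, cat = np.asarray(x, float), np.asarray(y, float), np.asarray(cat)
    slopes, intercepts = [], []
    for _ in range(n_boot):
        # Resample papers with replacement inside each sub-arxiv,
        # keeping each paper's (length, citations) pair together.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(cat == c),
                       size=int((cat == c).sum()), replace=True)
            for c in np.unique(cat)
        ])
        m, b = fit_shared_slope(x[idx], y[idx], cat[idx])
        slopes.append(m)
        intercepts.append(b)
    return np.array(slopes), intercepts
```

A central value and 1-sigma range for the slope can then be read off with `np.percentile(slopes, [15.87, 50, 84.13])`, and likewise for each intercept.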

The best-fit slope across all sub-arxivs is 0.20 (+0.03/-0.02, 1-sigma) citations per unique word root. This indicates that, for abstracts longer than some minimum length, every five additional unique word roots add roughly one citation to the expected median citation count after five years.
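As a quick check of that rule of thumb, inverting the slope gives the number of additional unique word roots per additional median citation. The symbol $\Delta n$ is introduced here for illustration only, and the range simply inverts the quoted 1-sigma bounds on the slope (0.18 to 0.23):

$$
\Delta n = \frac{1}{M} = \frac{1}{0.20} = 5,
\qquad
\frac{1}{0.23} \approx 4.3 \;\le\; \Delta n \;\le\; \frac{1}{0.18} \approx 5.6
$$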

Using the constant terms of the linear median models, I ranked the absolute citation offsets between sub-arxivs. These offsets represent the median difference in citations accumulated after 5 years for two papers with otherwise identical abstract properties. The offsets are normalized to the general astro-ph repository, which has the highest median citation rate.
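Computing and ranking these offsets is a simple subtraction of constant terms, sketched below. The function name `rank_offsets` is an assumption, and the intercept values in the example are placeholders chosen for illustration, not fitted results; `astro-ph.XX` is a made-up label.

```python
def rank_offsets(intercepts, reference="astro-ph"):
    """Citation offsets of each sub-arxiv relative to a reference.

    Because the slope is shared, the difference between two constant
    terms is the difference in model median citations at any abstract
    length where neither model sits on the zero floor.
    """
    ref = intercepts[reference]
    offsets = {name: b - ref for name, b in intercepts.items()}
    # Largest offset (closest to the reference or above it) first.
    return sorted(offsets.items(), key=lambda item: item[1], reverse=True)

# Placeholder constant terms, NOT values from the fit.
example_intercepts = {"astro-ph": -2.0, "astro-ph.XX": -3.0, "astro-ph.IM": -6.5}
ranking = rank_offsets(example_intercepts)
```

With the reference chosen as the sub-arxiv with the highest constant term, every other offset comes out zero or negative.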