Override GitHub Linguist with gitattributes files

Linguist is the library GitHub uses to detect blob languages and generate language breakdown graphs. It takes the list of languages it knows from languages.yml and applies a number of methods to determine the language of each file and, from those, the overall repository breakdown.

The issue

Let’s consider the example below. The repository in question is an R project with some additional C++ and Mathematica scripts, but, most importantly, it also contains the LaTeX source files used to build a report. Looking at the language breakdown graph, we would guess it’s a TeX/LaTeX project.

Original language breakdown graph

Linguist is simply doing its job, but in this case it’s inflating the project’s TeX stats and causing it to be labeled erroneously [1].

The way to “correct” this is to use a .gitattributes file to override Linguist’s calculations.

Linguist overrides

Linguist supports a number of different custom override strategies for language definitions and file paths. Here, however, we are only interested in the linguist-vendored attribute, which excludes the matching paths from the language statistics.
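For context, here is a rough sketch of what the main override attributes look like side by side. The attribute names are the ones Linguist documents; all paths and patterns are hypothetical and only there for illustration.

```gitattributes
# Hypothetical paths, for illustration only.

# Classify and highlight matching files as a specific language.
*.Rmd linguist-language=R

# Exclude third-party code from the language stats.
vendor/* linguist-vendored

# Exclude generated files from the stats (they are also hidden in diffs).
dist/* linguist-generated

# Exclude documentation from the stats.
docs/* linguist-documentation

# Count files whose language is normally left out of the stats (data or prose).
*.yml linguist-detectable
```

An attribute can also be unset with a leading minus (for example, -linguist-vendored), which is useful when Linguist treats a path as vendored by default and you want it counted.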

In particular, in the example above, we can use the linguist-vendored attribute to mark the tex path as vendored:

tex/* linguist-vendored

Below we can see the language breakdown graph correctly shows that the repository is mainly an R project.

Corrected language breakdown graph
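One caveat, assuming Linguist resolves these patterns the same way Git does: in a .gitattributes file, a pattern such as tex/* only matches the direct children of tex/, and attributes set on a directory do not propagate to its contents. If the LaTeX sources lived in nested folders (the subfolder below is hypothetical), a recursive pattern would be a possible variant:

```gitattributes
# Matches every file under tex/ at any depth,
# e.g. a hypothetical tex/sections/intro.tex.
tex/** linguist-vendored
```

To see whether a given file picks up the attribute, git check-attr linguist-vendored -- <path> reports its value for that path.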

Additional examples

After learning about Linguist overrides, I’ve been playing around with some use cases for my projects and I’ve come up with a couple of example .gitattributes files that could be handy in different scenarios.

R Markdown notebooks

*.Rmd linguist-language=R
*.nb.html linguist-vendored

Note that the first line may inflate your R stats, as the percentages are calculated from the number of bytes of code in each language. Feel free to delete it or comment it out.
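As a small sketch of that second option: lines starting with # are comments in a .gitattributes file, so the variant below keeps the rule around for reference while only excluding the rendered notebooks from the stats.

```gitattributes
# Disabled: would count every R Markdown file as R.
# *.Rmd linguist-language=R
*.nb.html linguist-vendored
```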

Hugo websites

themes/* linguist-vendored

{xaringan} presentations

slides/libs/* linguist-vendored

  1. Understandably, this may not be a problem for some people. However, I think there’s a strong case to be made for correctly labelling your repositories, as GitHub is increasingly being used as a way to show your portfolio.

Alfredo Hernández
Physicist and Data Scientist

I have a passion for technology, maths, and design.
