Override GitHub Linguist with gitattributes files
Linguist is, in short, the library used on GitHub to detect blob languages and generate language breakdown graphs. It takes the list of languages it knows from languages.yml and uses a number of methods to try and determine the language used by each file, and the overall repository breakdown.
The issue
Let's consider the example below. The repository in question is an R project with some additional C++ and Mathematica scripts, but most importantly the repository also contains the source LaTeX files to build a report. Having a look at the language breakdown graph we would guess it's a TeX/LaTeX project.
Linguist is simply doing its job, but in this case it's inflating the project's TeX stats and causing it to be erroneously[1] labeled.
The way to "correct" this is to use a .gitattributes file to override Linguist's calculations.
Linguist overrides
Linguist supports a number of different custom override strategies for language definitions and file paths. Here, we are interested in the linguist-vendored attribute, which excludes the matching paths from the language statistics.
In particular, in the example above, we can use the linguist-vendored attribute to vendor the tex path.
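A minimal sketch of such a .gitattributes file, assuming the LaTeX sources live in a top-level tex/ directory (the directory name and the recursive pattern are my assumptions here, so adjust them to your layout):

```gitattributes
# Treat everything under tex/ as vendored so it is left out of the language stats.
# Assumption: the report sources live in tex/ at the repository root.
tex/** linguist-vendored
```

The file goes in the repository root, and the breakdown should update once the change is pushed to GitHub.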
With the override in place, the language breakdown graph correctly shows that the repository is mainly an R project.
Additional examples
After learning about Linguist overrides, I've been playing around with some use cases for my projects and I've come up with a couple of example .gitattributes files that could be handy in different scenarios.
R Markdown notebooks
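A possible sketch for a notebook repository. Both patterns are assumptions about a typical setup: .Rmd sources that you want counted as R, and rendered .nb.html files committed next to them.

```gitattributes
# Assumption: count R Markdown sources as R code in the language stats.
*.Rmd linguist-language=R
# Assumption: rendered notebooks are committed; leave them out of the stats.
*.nb.html linguist-vendored
```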
Note that the first line may inflate your R stats, as the percentages are calculated from the number of bytes of code in each language. Feel free to delete it or comment it out.
Hugo websites
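A possible sketch for a Hugo site, assuming the default Hugo layout where third-party themes sit in themes/ and the built site is written to public/ (only relevant if you commit that folder):

```gitattributes
# Assumption: themes/ holds third-party theme code that is not part of your own work.
themes/** linguist-vendored
# Assumption: public/ holds the generated site and is committed to the repository.
public/** linguist-generated
```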
{xaringan} presentations
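A possible sketch for a slide deck, assuming the slides are rendered to HTML next to the .Rmd source and that the JavaScript and CSS dependencies end up in a libs/ directory (both are assumptions, so check your own output folders):

```gitattributes
# Assumption: the rendered slides are build output, not hand-written HTML.
*.html linguist-generated
# Assumption: libs/ holds the dependencies copied in when the slides are rendered.
libs/** linguist-vendored
```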
[1] Understandably, this may not be a problem for some people. However, I think there's a strong case to be made for correctly labelling your repositories, as GitHub is increasingly being used as a way to show your portfolio.