Override GitHub Linguist with gitattributes files
Linguist is, in short, the library GitHub uses to detect blob languages and generate language breakdown graphs. It takes the list of languages it knows from languages.yml and uses a number of methods to try to determine the language of each file and the overall repository breakdown.
Let’s consider the example below. The repository in question is an R project with some additional C++ and Mathematica scripts, but most importantly the repository also contains the source LaTeX files to build a report. Having a look at the language breakdown graph we would guess it’s a TeX/LaTeX project.
Linguist is simply doing its job, but in this case it’s inflating the project’s TeX stats and causing it to be erroneously [1] labeled.
Linguist supports a number of different custom override strategies for language definitions and file paths. However, we are interested in overriding the
In particular, in the example above, we can use the
linguist-vendored attribute to vendor the
Below we can see the language breakdown graph correctly shows that the repository is mainly an R project.
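As a rough sketch of the kind of override described above, and assuming purely for illustration that the LaTeX sources live in a directory called report/ (a hypothetical path, not taken from the example repository), the .gitattributes file could contain something like:

```gitattributes
# Hypothetical path: adjust it to wherever the LaTeX sources actually live.
# Files marked as vendored are left out of the language statistics.
report/** linguist-vendored
```

With an entry like this, the TeX bytes under that directory no longer count towards the breakdown, so the remaining languages determine how the repository is labeled.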
After learning about Linguist overrides, I’ve been playing around with some use cases for my projects and I’ve come up with a couple of example .gitattributes files that could be handy in different scenarios.
R Markdown notebooks
*.Rmd linguist-language=R
*.nb.html linguist-vendored
Note that the first line may inflate your R stats, as the percentages are calculated from the number of bytes of code in each language. Feel free to delete it or comment it out.
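If you would rather keep the override around without applying it, one option is to comment it out. This is a minimal sketch based on the two-line file above; lines starting with # are treated as comments in .gitattributes files:

```gitattributes
# Disabled: uncomment to count R Markdown sources as R.
# *.Rmd linguist-language=R
*.nb.html linguist-vendored
```

This way the rendered notebooks stay out of the statistics, while the .Rmd files fall back to Linguist's default handling.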
[1] Understandably, this may not be a problem for some people. However, I think there’s a strong case to be made for correctly labelling your repositories, as GitHub is increasingly being used as a way to show your portfolio.