How Does VSCode Language Detection Work?
VSCode version 1.60 introduced a feature that detects the programming language of a file from its content. This is handy for untitled files and files without an extension, where there is no file name to go by. But how is it implemented?
Overview
The language detection feature is based on a machine learning model from the guesslang project.
To run the model in Node.js or the browser, the VSCode team uses TensorFlow.js. They load the pre-trained model and wrap it in the vscode-languagedetection package.
Moreover, to further improve the precision of language detection, VSCode employs a private library named vscode-regexp-languagedetection. This improves accuracy by taking into account the files recently opened in your workspace.
Details
The language detection feature is primarily implemented in the languageDetectionSimpleWorker.detectLanguage method in VSCode. By default, the method uses the guesslang model to detect the language. If that yields no result, it falls back to a private regular expression model, which detects the language with the help of your recently opened workspace files.
Here is a simplified version of the detectLanguage method:
// https://github.com/microsoft/vscode/blob/19ecb4b8337d0871f0a204853003a609d716b04e/src/vs/workbench/services/languageDetection/browser/languageDetectionSimpleWorker.ts#L39-L81
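The fallback order described above can be sketched roughly as follows. This is not the VSCode source: the Resolver type, detectWithFallback, and the two stand-in resolvers are hypothetical names made up for illustration, and everything is kept synchronous for brevity.

```typescript
// Hypothetical resolver type: returns a language id, or undefined when unsure.
type Resolver = (text: string) => string | undefined;

// Try each resolver in order and return the first language that is found.
function detectWithFallback(text: string, resolvers: Resolver[]): string | undefined {
  for (const resolve of resolvers) {
    const language = resolve(text);
    if (language !== undefined) {
      return language;
    }
  }
  return undefined;
}

// Toy stand-in for the neural (guesslang) model. Assumption: the real model
// scores many languages; this one only recognizes a Python-style function.
const neuralStandIn: Resolver = (text) =>
  /\bdef\s+\w+\(/.test(text) ? "python" : undefined;

// Toy stand-in for the regular expression model.
const regexpStandIn: Resolver = (text) =>
  /^\s*SELECT\b/i.test(text) ? "sql" : undefined;

// Neural model first, regular expression model second.
const detected = detectWithFallback("SELECT * FROM users", [neuralStandIn, regexpStandIn]);
```

Passing the resolvers as an ordered list keeps the fallback logic in one place, so swapping the order only means reordering the array.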
The detectLanguagesImpl method uses vscode-languagedetection to get a confidence score for each language, adjusts those scores based on VS Code’s language usage, and finally returns the most likely language.
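As a minimal sketch of that idea, assume the model returns a list of language ids with confidence scores and that the adjustment is a simple additive offset per language. Both the ModelResult shape and the adjustments table are assumptions for illustration; the actual adjustment logic in VSCode may differ.

```typescript
// Assumed result shape: one entry per candidate language.
interface ModelResult {
  languageId: string;
  confidence: number;
}

// Hypothetical adjustment table: positive values boost a language,
// negative values penalize it, missing entries leave the score unchanged.
function pickMostLikely(
  results: ModelResult[],
  adjustments: Record<string, number>,
): ModelResult | undefined {
  let best: ModelResult | undefined;
  for (const result of results) {
    const adjusted = result.confidence + (adjustments[result.languageId] ?? 0);
    if (best === undefined || adjusted > best.confidence) {
      best = { languageId: result.languageId, confidence: adjusted };
    }
  }
  return best;
}

// JavaScript scores slightly higher on its own, but the boost for
// TypeScript tips the final decision.
const winner = pickMostLikely(
  [
    { languageId: "javascript", confidence: 0.45 },
    { languageId: "typescript", confidence: 0.4 },
  ],
  { typescript: 0.1 },
);
```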
languageDetectionWorkerServiceImpl listens to the workspace and stores the languages of all your recently opened files. These are then used to calculate a language bias for the regular expression model.
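One plausible way to turn such a history into a bias is to weight each language by how often it appears among the recently opened files. The function below is a hypothetical sketch under that assumption, not the formula VSCode uses.

```typescript
// Turn a list of recently seen language ids into weights between 0 and 1.
// Assumption: the bias is the share of recent files in each language.
function computeLanguageBiases(recentLanguages: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const languageId of recentLanguages) {
    counts[languageId] = (counts[languageId] ?? 0) + 1;
  }

  const biases: Record<string, number> = {};
  for (const languageId of Object.keys(counts)) {
    biases[languageId] = (counts[languageId] ?? 0) / recentLanguages.length;
  }
  return biases;
}

// Three of the four recent files were TypeScript, one was Python.
const biases = computeLanguageBiases(["typescript", "typescript", "python", "typescript"]);
```

A language that dominates the recent history gets a weight close to 1, so an ambiguous snippet is more likely to be resolved in favor of what you have been working on.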
That covers how the language detection feature is implemented in VSCode. It is not complex, but it contains many subtle details.
Bonus
Besides guesslang, there are other libraries for detecting a file's language. For example, Magika is a similar tool developed by Google. It detects a file's content type from its contents and covers common file types in general, not only programming languages.