


I want to create a tokenizer for source files (e.g. Java or C++) in Python. I came across Pygments and in particular these lexers. I could not find examples in the documentation or online of how to use the lexers.


I am wondering whether it is actually possible to use Pygments from Python to get the tokens and their positions for a given source file.


I am struggling with the very basics here, so if someone could offer even a small chunk of code showing the above, it would be much appreciated.



If you look at the source of Pygments' highlight function, essentially what it does is pass the source text to a lexer instance via the get_tokens method, which yields the tokens as (token type, text) pairs. Those tokens are then passed to the formatter. Since you want the tokens without the formatting, you only need to do the first part.


So to use the C++ lexer (where src is a string containing your C++ source code):

from pygments.lexers.c_cpp import CppLexer

lexer = CppLexer()
# get_tokens yields (token_type, text) pairs lazily; wrap in list() to keep them
tokens = lexer.get_tokens(src)
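
To make the snippet above concrete, here is a minimal runnable sketch. The sample string `src` is made up for illustration; everything else uses the same `CppLexer` and `get_tokens` calls as above.

```python
from pygments.lexers.c_cpp import CppLexer

# Hypothetical sample input; any C++ source string works here.
src = "int main() { return 0; }\n"

lexer = CppLexer()

# get_tokens() is lazy, so list() is used to materialize the pairs.
tokens = list(lexer.get_tokens(src))

for token_type, text in tokens:
    # token_type is a Pygments token type such as Token.Keyword,
    # text is the exact slice of source it covers.
    print(token_type, repr(text))
```

Joining the `text` parts back together reproduces the input, which is a handy sanity check that nothing was dropped.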


Of course, instead of importing the desired lexer directly, you could look it up or guess it using one of get_lexer_by_name, get_lexer_for_filename, get_lexer_for_mimetype, guess_lexer, or guess_lexer_for_filename. For example:

from pygments.lexers import get_lexer_by_name

# get_lexer_by_name returns a ready-to-use lexer instance, not a class
lexer = get_lexer_by_name('c++')
tokens = lexer.get_tokens(src)
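
If you have file names rather than language names, a lookup by file name might look like the sketch below. The helper `lexer_for` and the file name `example.java` are hypothetical; `get_lexer_for_filename` and the `ClassNotFound` exception are part of Pygments. As far as I know the lookup only matches the name against each lexer's file patterns, so the file does not need to exist.

```python
from pygments.lexers import get_lexer_for_filename
from pygments.util import ClassNotFound


def lexer_for(path):
    """Return a lexer instance chosen from the file name, or None if unknown."""
    try:
        return get_lexer_for_filename(path)
    except ClassNotFound:
        # Raised when no registered lexer claims this file name pattern.
        return None


lexer = lexer_for("example.java")
print(lexer.name if lexer else "no lexer found")
```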


Whether the returned tokens will provide you with what you want is another matter. You'll have to try it and see.
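
Since the question also asks for positions: get_tokens does not report them, but lexers also have a get_tokens_unprocessed method that yields (index, token type, text) triples, where index is the character offset into the text you passed in. Below is a rough sketch that converts those offsets into 1-based line and column numbers. The helper name `tokens_with_positions` and the sample source are my own, and this is just one possible way to do the conversion.

```python
import bisect

from pygments.lexers import get_lexer_by_name


def tokens_with_positions(src, lexer_name):
    """Return a list of (line, column, token_type, text), both 1-based."""
    lexer = get_lexer_by_name(lexer_name)
    # Character offsets at which each line begins.
    line_starts = [0] + [i + 1 for i, ch in enumerate(src) if ch == "\n"]
    result = []
    for index, token_type, text in lexer.get_tokens_unprocessed(src):
        # Number of line starts at or before index gives the 1-based line.
        line = bisect.bisect_right(line_starts, index)
        column = index - line_starts[line - 1] + 1
        result.append((line, column, token_type, text))
    return result


# Hypothetical sample input.
src = "int x;\nint y;\n"
for line, column, token_type, text in tokens_with_positions(src, "cpp"):
    print(line, column, token_type, repr(text))
```

Note that get_tokens_unprocessed skips the input preprocessing that get_tokens applies (such as newline normalization), so the offsets refer to the string exactly as you passed it in.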

