
Externalized syntax highlighting

Open TieDyedDevil opened this issue 8 years ago • 4 comments

Please consider this a request for discussion and information.

I've been trying to find a decent syntax highlighter that I can run as a filter (or, I suppose, take a file as input and write to standard output), applying colorization only if the output is to a terminal. I found a couple of such utilities in the Fedora distro: GNU source-highlight and another named highlight. Frankly, vis's Lua (LPeg) syntax highlighting is much better than either of the standalone tools, in both language coverage and quality of highlighting. I suppose I could add missing languages to one of the highlight utilities, but the quality of their as-shipped lexers suggests that the programs either don't have a parser rich enough to handle much beyond keyword highlighting, or are so difficult to use that even their maintainers won't bother to do a better job.

Since I know how to write LPEG lexers for vis, I was wondering how I should leverage that. Would I be best served to start with the LPEG source repo, or might there be a clever way to cause vis to behave as a filter?

TieDyedDevil avatar Sep 01 '17 01:09 TieDyedDevil

You will need the Scintillua parts imported by vis (i.e. the lua/lexers folder) and will have to provide a console-highlighting version of the WIN_HIGHLIGHT event implementation.

The following should be a reasonable starting point; you will have to plug in suitable style definitions.

#!/usr/bin/env lua

if #arg < 2 then
	print("usage: lexer-name files ...")
	os.exit(1)
end

local syntax = arg[1]
local lexers = require('lexer')
local lexer = lexers.load(syntax)

if not lexer then
	print(string.format("Failed to load lexer: `%s'", syntax))
	os.exit(1)
end

local token_styles = {
	-- TODO adapt for console output
	['default'] = 'back:black,fore:white',
	['nothing'] = 'back:black',
	['class'] = 'fore:yellow,bold',
	['comment'] = 'fore:blue,bold',
	['constant'] = 'fore:cyan,bold',
	['definition'] = 'fore:blue,bold',
	['error'] = 'fore:red,italics',
	['function'] = 'fore:blue,bold',
	['keyword'] = 'fore:yellow,bold',
	['label'] = 'fore:green,bold',
	['number'] = 'fore:red,bold',
	['operator'] = 'fore:cyan,bold',
	['regex'] = 'fore:green,bold',
	['string'] = 'fore:red,bold',
	['preprocessor'] = 'fore:magenta,bold',
	['tag'] = 'fore:red,bold',
	['type'] = 'fore:green,bold',
	['variable'] = 'fore:blue,bold',
	['whitespace'] = '',
	['embedded'] = 'back:blue,bold',
	['identifier'] = 'fore:white',
}

for i = 2, #arg do
	local filename = arg[i]
	local file = assert(io.open(filename, "r"))
	local text = file:read("*all")
	file:close()
	local tokens = lexer:lex(text, 1)
	local token_start = 1
	
	for j = 1, #tokens, 2 do
		-- lexer:lex() reports the position just past each token's last byte
		local token_end = tokens[j+1] - 1
		local name = tokens[j]
		local style = token_styles[name]
		if style ~= nil then
		--	TODO
		--	io.write(style)
		end
		io.write(text:sub(token_start, token_end))
		token_start = token_end + 1
	end
end

martanne avatar Sep 03 '17 20:09 martanne

This is quite helpful. Thank you.

TieDyedDevil avatar Sep 04 '17 16:09 TieDyedDevil

Here's a first functioning version. It requires lua-term (https://github.com/hoelzro/lua-term) for terminal color escape sequences.

As in Marc's sample code, the lexer name is passed as the first argument on the command line. I'd like to leverage vis's filetype plugin, but that will involve some rearrangement of vis's Lua code in order to separate the filetype metadata from the vis-specific code.
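One possible shape for that filetype lookup, assuming a hand-made extension table rather than vis's actual filetype metadata (the table and the lexer_for name below are hypothetical):

```lua
-- Hypothetical sketch: pick a lexer from the file extension instead of a
-- command-line argument. This extension table is made up for illustration;
-- it is not vis's real filetype metadata.
local ext_to_lexer = {
	lua = 'lua', c = 'ansi_c', h = 'ansi_c', py = 'python', sh = 'bash',
}

-- lexer_for returns a lexer name for filename, or nil when unknown.
local function lexer_for(filename)
	local ext = filename:match('%.([%w_]+)$')
	return ext and ext_to_lexer[ext] or nil
end
```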

Please note the comment regarding default style. I don't know whether individually highlighting each default byte is intentional...
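The workaround in the script below only re-emits the escape sequence at the start of a run of identical tokens; an equivalent way to think about it is coalescing adjacent tokens that share a name before emitting anything, so a byte-wise run of 'default' becomes one span. A small self-contained illustration (the sample token stream in the test is made up, but the {name1, end1, name2, end2, ...} shape matches what lexer:lex() returns):

```lua
-- coalesce merges adjacent entries of a {name, end, name, end, ...} token
-- stream that share the same name, keeping only the final end position of
-- each run.
local function coalesce(tokens)
	local out = {}
	for i = 1, #tokens, 2 do
		local name, finish = tokens[i], tokens[i + 1]
		if out[#out - 1] == name then
			out[#out] = finish -- extend the previous run
		else
			out[#out + 1] = name
			out[#out + 1] = finish
		end
	end
	return out
end
```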

I've done what I think is the right thing w.r.t. finding the vis installation. Have I missed any cases?

#! /usr/bin/env lua

-- Standalone syntax highlighter uses the lexers provided by `vis`.

if #arg < 2 then
	print('usage: ' .. arg[0] .. ' LEXER-NAME FILE...')
	os.exit(1)
end

local vis_path = os.getenv('VIS_PATH')
if vis_path ~= nil then
	package.path = package.path .. ';' .. vis_path .. '/?.lua'
end
package.path = package.path .. ';/usr/local/share/vis/?.lua'
package.path = package.path .. ';/usr/share/vis/?.lua'

local syntax = arg[1]
local lexers = require('lexer')
local lexer = lexers.load(syntax)

if not lexer then
	print(string.format('Failed to load lexer: `%s`', syntax))
	os.exit(1)
end

local term = require('term')
local colors = term.colors

local token_styles = {
	-- bold => bright
	-- italics => underscore
	['default'] = colors.onblack .. colors.white,
	['nothing'] = colors.onblack,
	['class'] = colors.yellow .. colors.bright,
	['comment'] = colors.blue .. colors.bright,
	['constant'] = colors.cyan .. colors.bright,
	['definition'] = colors.blue .. colors.bright,
	['error'] = colors.red .. colors.underscore,
	['function'] = colors.blue .. colors.bright,
	['keyword'] = colors.yellow .. colors.bright,
	['label'] = colors.green .. colors.bright,
	['number'] = colors.red .. colors.bright,
	['operator'] = colors.cyan .. colors.bright,
	['regex'] = colors.green .. colors.bright,
	['string'] = colors.red .. colors.bright,
	['preprocessor'] = colors.magenta .. colors.bright,
	['tag'] = colors.red .. colors.bright,
	['type'] = colors.green .. colors.bright,
	['variable'] = colors.blue .. colors.bright,
	['whitespace'] = '',
	['embedded'] = colors.onblue .. colors.bright,
	['identifier'] = colors.white,
}

for i = 2, #arg do
	local filename = arg[i]
	local file = assert(io.open(filename, 'r'))
	local text = file:read('*all')
	file:close()
	local tokens = lexer:lex(text, 1)
	local token_start = 1
	local last = ''
	
	for j = 1, #tokens, 2 do
		local token_end = tokens[j+1] - 1
		local name = tokens[j]
		local style = token_styles[name]
		if style ~= nil then
			-- Whereas the lexer reports all other syntaxes over
			-- the entire span of a token, it reports 'default'
			-- byte-by-byte. We emit only the first 'default' of
			-- a series in order to properly display multibyte
			-- UTF-8 characters.
			if not (last == 'default' and name == 'default') then
				io.write(tostring(style))
			end
			last = name
		end
		io.write(text:sub(token_start, token_end))
		token_start = token_end + 1
	end
	io.write(tostring(colors.reset)) -- restore default attributes after each file
end
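The opening comment asked for colorization only when the output is a terminal; lua-term exposes isatty for that check. A minimal sketch under that assumption (emit and use_color are hypothetical names, and the pcall keeps it runnable when lua-term is absent):

```lua
-- Hedged sketch: emit escape sequences only when stdout is a terminal.
-- term.isatty comes from lua-term; without the module we fall back to
-- plain, uncolored output.
local ok, term = pcall(require, 'term')
local use_color = ok and term.isatty(io.stdout) or false

-- emit writes text, prefixed by its style only when coloring is enabled.
local function emit(style, text)
	if use_color and style ~= nil and style ~= '' then
		io.write(tostring(style))
	end
	io.write(text)
end
```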

TieDyedDevil avatar Sep 04 '17 19:09 TieDyedDevil

Mega necrobump, but I expanded this concept into a standalone project for anyone interested.

jpe90 avatar Jul 25 '22 10:07 jpe90

@rnpnr This should probably be closed, right?

mcepl avatar Aug 10 '23 20:08 mcepl

Yes. I think the linked standalone project, or the inline code above, is sufficient. There is also #617, which could just be copied locally.

rnpnr avatar Aug 10 '23 21:08 rnpnr