| Class | MediaWikiLexer |
| In: | lib/mediacloth/mediawikilexer.rb |
| Parent: | Object |
The lexer for the MediaWiki markup language.
Standalone usage:
input = File.read("somefile")
lexer = MediaWikiLexer.new
lexer.tokenize(input)
Inside a RACC-generated parser:
...
---- inner ----
attr_accessor :lexer

def parse(input)
  lexer.tokenize(input)
  return do_parse
end

def next_token
  return @lexer.lex
end
...
parser = MediaWikiParser.new
parser.lexer = MediaWikiLexer.new
parser.parse(input)
Initializes the lexer with a match table.
The match table tells the lexer which method to invoke for a given input character during the "tokenize" phase.
# File lib/mediacloth/mediawikilexer.rb, line 30
def initialize
  @position = 0
  @pair_stack = [[false, false]] #stack of tokens for which a pair should be found
  @list_stack = []
  @lexer_table = Hash.new(method(:match_other))
  @lexer_table["'"] = method(:match_italic_or_bold)
  @lexer_table["="] = method(:match_section)
  @lexer_table["["] = method(:match_link_start)
  @lexer_table["]"] = method(:match_link_end)
  @lexer_table[" "] = method(:match_space)
  @lexer_table["*"] = method(:match_list)
  @lexer_table["#"] = method(:match_list)
  @lexer_table[";"] = method(:match_list)
  @lexer_table[":"] = method(:match_list)
  @lexer_table["-"] = method(:match_line)
  @lexer_table["~"] = method(:match_signature)
  @lexer_table["h"] = method(:match_inline_link)
end
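The match-table idea can be illustrated in isolation. The following is a minimal sketch, not MediaCloth code: the class TinyDispatcher, its dispatch method, and the symbols its handlers return are hypothetical. It only shows how Hash.new with a default Method object sends unregistered characters to a fallback handler.

```ruby
# Hypothetical miniature of the match-table idea; not the MediaCloth class.
class TinyDispatcher
  def initialize
    # Hash.new(default) returns the default for any key that was never set,
    # so unregistered characters fall through to match_other.
    @table = Hash.new(method(:match_other))
    @table["="] = method(:match_section)
    @table["*"] = method(:match_list)
  end

  # Look up the handler for a single character and invoke it.
  def dispatch(char)
    @table[char].call
  end

  private

  # The return values below are made up for this sketch.
  def match_section
    :SECTION
  end

  def match_list
    :LIST
  end

  def match_other
    :TEXT
  end
end

d = TinyDispatcher.new
p d.dispatch("=")  # a handler registered explicitly
p d.dispatch("x")  # no entry for "x", so the default handler runs
```

Because the default is a single shared Method object, adding support for a new character only requires one more table entry; the lookup code itself does not change.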
Returns the next token from the stream. Useful for RACC parsers.
# File lib/mediacloth/mediawikilexer.rb, line 93
def lex
  token = @tokens[@position]
  @position += 1
  return token
end
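To show how a caller might consume tokens one at a time, here is a small sketch under stated assumptions: TokenCursor and drain are hypothetical names, and the token symbols in the sample array are made up. Only the body of lex mirrors the method above.

```ruby
# Hypothetical stand-in for a lexer that has already tokenized its input.
class TokenCursor
  def initialize(tokens)
    @tokens = tokens
    @position = 0
  end

  # Same shape as the lex method above: return the current token, then advance.
  def lex
    token = @tokens[@position]
    @position += 1
    token
  end
end

# Pull tokens until the [false, false] end-of-input marker is reached.
# The nil check guards against reading past the end of the array.
def drain(cursor)
  seen = []
  loop do
    token = cursor.lex
    break if token.nil? || token[0] == false
    seen << token
  end
  seen
end

cursor = TokenCursor.new([[:TEXT, "hello"], [:TEXT, "world"], [false, false]])
p drain(cursor)
```

Note that indexing an array past its last element returns nil in Ruby, so a lex written this way yields nil once the [false, false] marker has been consumed.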
Transforms the input stream (a string) into a stream of tokens. Tokens are collected into an array of the form [ [TOKEN_SYMBOL, TOKEN_VALUE], …, [false, false] ]. This array can be fed token by token to a RACC-based parser without modification. The last token, [false, false], indicates EOF.
# File lib/mediacloth/mediawikilexer.rb, line 53
def tokenize(input)
  @tokens = []
  @cursor = 0
  @text = input
  @next_token = []

  #This tokenizer algorithm assumes that everything that is not
  #matched by the lexer is going to be :TEXT token. Otherwise it's usual
  #lexer algo which call methods from the match table to define next tokens.
  while (@cursor < @text.length)
    @current_token = [:TEXT, ''] unless @current_token
    @token_start = @cursor
    @char = @text[@cursor, 1]

    if @lexer_table[@char].call == :TEXT
      @current_token[1] += @text[@token_start, 1]
    else
      #skip empty :TEXT tokens
      @tokens << @current_token unless empty_text_token?
      @next_token[1] = @text[@token_start, @cursor - @token_start]
      @tokens << @next_token
      #hack to enable sub-lexing!
      if @sub_tokens
        @tokens += @sub_tokens
        @sub_tokens = nil
      end
      #end of hack!
      @current_token = nil
      @next_token = []
    end
  end
  #add the last TEXT token if it exists
  @tokens << @current_token if @current_token and not empty_text_token?

  #RACC wants us to put this to indicate EOF
  @tokens << [false, false]
  @tokens
end
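The accumulate-and-flush pattern used by tokenize can be seen more easily in a stripped-down form. The sketch below is a simplified assumption, not the MediaCloth algorithm: tiny_tokenize is a hypothetical function, "=" is the only special character it knows, and :EQ is a made-up token name. It keeps the three behaviours described above: unmatched characters collect into a :TEXT token, empty :TEXT tokens are skipped, and [false, false] closes the array.

```ruby
# Simplified, hypothetical tokenizer: "=" is the only special character and
# :EQ is a made-up token name. Everything else accumulates into :TEXT.
def tiny_tokenize(input)
  tokens = []
  text = String.new
  input.each_char do |char|
    if char == "="
      # Flush pending text first, skipping empty :TEXT tokens.
      tokens << [:TEXT, text] unless text.empty?
      tokens << [:EQ, char]
      text = String.new
    else
      text << char
    end
  end
  # Add the trailing :TEXT token if any text is left over.
  tokens << [:TEXT, text] unless text.empty?
  # End-of-input marker in the form RACC expects.
  tokens << [false, false]
  tokens
end

p tiny_tokenize("a=bc")
```

The real method differs in that each handler from the match table decides how far to advance the cursor and which token symbol to produce; this sketch hard-codes both to keep the control flow visible.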