Class MediaWikiLexer
In: lib/mediacloth/mediawikilexer.rb
Parent: Object

The lexer for the MediaWiki markup language.

Standalone usage:

 # File.read opens, reads, and closes the file in one call
 input = File.read("somefile")
 lexer = MediaWikiLexer.new
 lexer.tokenize(input)

Inside a RACC-generated parser:

 ...
 ---- inner ----
 attr_accessor :lexer
 def parse(input)
     lexer.tokenize(input)
     return do_parse
 end
 def next_token
     return @lexer.lex
 end
 ...
 parser = MediaWikiParser.new
 parser.lexer = MediaWikiLexer.new
 parser.parse(input)

Methods

lex   new   tokenize  

Public Class methods

Initializes the lexer with a match table.

The match table tells the lexer which method to invoke for a given input character during the tokenize phase. Characters without an entry fall back to match_other.

[Source]

    # File lib/mediacloth/mediawikilexer.rb, line 30
    def initialize
        @position = 0
        @pair_stack = [[false, false]] # stack of tokens for which a pair should be found
        @list_stack = []
        @lexer_table = Hash.new(method(:match_other))
        @lexer_table["'"] = method(:match_italic_or_bold)
        @lexer_table["="] = method(:match_section)
        @lexer_table["["] = method(:match_link_start)
        @lexer_table["]"] = method(:match_link_end)
        @lexer_table[" "] = method(:match_space)
        @lexer_table["*"] = method(:match_list)
        @lexer_table["#"] = method(:match_list)
        @lexer_table[";"] = method(:match_list)
        @lexer_table[":"] = method(:match_list)
        @lexer_table["-"] = method(:match_line)
        @lexer_table["~"] = method(:match_signature)
        @lexer_table["h"] = method(:match_inline_link)
    end
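
The match table is a Hash whose default value is a Method object, so any character without an explicit entry is routed to the fallback handler. The following is a minimal sketch of that dispatch pattern only; TinyDispatch, classify, and the :LIST return value are hypothetical names chosen for illustration and are not part of MediaCloth.

```ruby
# Hypothetical miniature of the dispatch-table idea; not MediaCloth code.
class TinyDispatch
  def initialize
    # Characters without an explicit entry fall back to match_other.
    @table = Hash.new(method(:match_other))
    @table["*"] = method(:match_list)
    @table["#"] = method(:match_list)
  end

  # Look up the handler registered for one character and invoke it.
  def classify(char)
    @table[char].call
  end

  private

  # Default handler: treat the character as plain text.
  def match_other
    :TEXT
  end

  # Handler shared by the list marker characters.
  def match_list
    :LIST
  end
end

dispatch = TinyDispatch.new
star_kind  = dispatch.classify("*") # routed to match_list
other_kind = dispatch.classify("x") # routed to the default, match_other
```

Storing Method objects rather than symbols lets the lookup result be invoked directly with call, and several characters can share one handler.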

Public Instance methods

Returns the next token from the stream. Useful for RACC parsers.

[Source]

    # File lib/mediacloth/mediawikilexer.rb, line 93
    def lex
        token = @tokens[@position]
        @position += 1
        return token
    end
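
To illustrate how a consumer can pull tokens one at a time, here is a small sketch that reads from a cursor shaped like lex until the [false, false] end marker appears. TokenCursor, drain, and the :MARK symbol are hypothetical stand-ins invented for this example; the real parser is driven by RACC's do_parse instead of a hand-written loop.

```ruby
# Hypothetical stand-in for an already tokenized lexer; not MediaCloth's class.
class TokenCursor
  def initialize(tokens)
    @tokens = tokens
    @position = 0
  end

  # Same shape as lex above: return the current token, then advance.
  def lex
    token = @tokens[@position]
    @position += 1
    token
  end
end

# Pull tokens until the end marker is reached.
# A false symbol is the [false, false] marker; nil means the array ran out.
def drain(cursor)
  seen = []
  loop do
    symbol, value = cursor.lex
    break unless symbol
    seen << [symbol, value]
  end
  seen
end

cursor = TokenCursor.new([[:TEXT, "hi"], [:MARK, "'''"], [false, false]])
drained = drain(cursor)
```

Because the position only moves forward, a call to lex made after the end marker has been consumed indexes past the end of the array and returns nil.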

Transforms the input string into a stream of tokens. Tokens are collected into an array of the form [ [TOKEN_SYMBOL, TOKEN_VALUE], …, [false, false] ]. This array can be fed token by token to a RACC-based parser without modification. The final token, [false, false], indicates EOF.

[Source]

    # File lib/mediacloth/mediawikilexer.rb, line 53
    def tokenize(input)
        @tokens = []
        @cursor = 0
        @text = input
        @next_token = []

        # This tokenizer assumes that everything not matched by the lexer
        # becomes a :TEXT token. Otherwise it is the usual lexer algorithm,
        # which calls methods from the match table to define the next tokens.
        while (@cursor < @text.length)
            @current_token = [:TEXT, ''] unless @current_token
            @token_start = @cursor
            @char = @text[@cursor, 1]

            if @lexer_table[@char].call == :TEXT
                @current_token[1] += @text[@token_start, 1]
            else
                # skip empty :TEXT tokens
                @tokens << @current_token unless empty_text_token?
                @next_token[1] = @text[@token_start, @cursor - @token_start]
                @tokens << @next_token
                # hack to enable sub-lexing!
                if @sub_tokens
                    @tokens += @sub_tokens
                    @sub_tokens = nil
                end
                # end of hack!
                @current_token = nil
                @next_token = []
            end
        end
        # add the last :TEXT token if it exists
        @tokens << @current_token if @current_token and not empty_text_token?

        # RACC wants us to put this to indicate EOF
        @tokens << [false, false]
        @tokens
    end
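
The core idea of tokenize, accumulating unmatched characters into a :TEXT token and flushing it when a special token is recognized, can be sketched in isolation. This is a simplified illustration under stated assumptions, not the library's implementation: tiny_tokenize and the :STAR symbol are hypothetical, only "*" is treated as special, and sub-lexing is left out.

```ruby
# Hypothetical simplified tokenizer: only "*" is special, everything else is text.
def tiny_tokenize(text)
  tokens = []
  pending = nil # the :TEXT token currently being accumulated, if any

  text.each_char do |char|
    if char == "*"
      # Flush accumulated text before emitting the special token.
      tokens << pending if pending
      pending = nil
      tokens << [:STAR, char]
    else
      # Start a new :TEXT token on demand and append the character to it.
      pending ||= [:TEXT, String.new]
      pending[1] << char
    end
  end

  # Add the trailing :TEXT token, if one is still open.
  tokens << pending if pending
  # End-of-input marker in the [false, false] form described above.
  tokens << [false, false]
  tokens
end

result = tiny_tokenize("ab*c")
```

Creating the pending token lazily means empty :TEXT tokens never need to be filtered out, which is a different design from the empty_text_token? check in the listing above.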
