This lexer lets you define regex-based rules in a very declarative way using an F# computation expression.
F#
open Lexer

let definitions =
    lexerDefinitions {
        do! addNextlineDefinition "NEWLINE"  @"(\n\r)|\n|\r"
        do! addIgnoreDefinition   "WS"       @"\s"
        do! addDefinition         "LET"      "let"
        do! addDefinition         "ID"       "(?i)[a-z][a-z0-9]*"
        do! addDefinition         "FLOAT"    @"[0-9]+\.[0-9]+"
        do! addDefinition         "INT"      "[0-9]+"
        do! addDefinition         "OPERATOR" @"[+*=!/&|<>\^\-]+"
    }
Given these definitions, you can then run the lexer:
F#
open Lexer

let lex input =
    try
        let y = Lexer.tokenize definitions input
        printfn "%A" y
    with e -> printf "%s" e.Message

lex "let a = 5"
This produces the following output:
F#
seq [{name = "LET"; text = "let"; pos = 0; column = 0; line = 0;};
     {name = "ID"; text = "a"; pos = 4; column = 4; line = 0;};
     {name = "OPERATOR"; text = "="; pos = 6; column = 6; line = 0;};
     {name = "INT"; text = "5"; pos = 8; column = 8; line = 0;}]
The code for the lexer is split into three parts. The first part is a state monad built on an F# computation expression. This is what enables the declarative style (seen above) for setting up the lexer rules.
F#
module StateMonad

type State<'s, 'a> = State of ('s -> 'a * 's)

let runState (State f) = f

type StateBuilder() =
    member b.Return(x) = State(fun s -> (x, s))
    member b.Delay(f) = f() : State<'s, 'a>
    member b.Zero() = State(fun s -> ((), s))
    member b.Bind(State p, rest) =
        State(fun s -> let v, s2 = p s in (runState (rest v)) s2)
    member b.Get() = State(fun s -> (s, s))
    member b.Put s = State(fun _ -> ((), s))

// The builder instance used by the state { ... } blocks below
// (implied by their usage later in the post)
let state = StateBuilder()
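To see the state monad on its own before it is put to work, here is a small illustrative example (my own sketch, not part of the lexer code) that reads, updates, and returns the threaded state:

F#
open StateMonad

// Returns the current state as the value and stores an incremented copy back
let increment =
    state {
        let! current = state.Get()   // Get yields the state itself as the bound value
        do! state.Put (current + 1)  // Put replaces the state
        return current
    }

// runState unwraps the 's -> ('a * 's) function; applied to an initial
// state of 5 it yields (5, 6): the returned value and the updated state
let value, finalState = (runState increment) 5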
The second part is the set of combinators used to define the lexer rules. There are three main combinators: addDefinition lets you define a name/regex pair, addIgnoreDefinition lets you define characters the lexer should ignore, and addNextlineDefinition lets you define which characters mark a new line.
F#
module Lexer

open StateMonad

type LexDefinitions =
    { regexes: string list;
      names: string list;
      nextlines: bool list;
      ignores: bool list; }

let buildDefinition name pattern nextLine ignore =
    state {
        let! x = state.Get()
        do! state.Put { regexes = x.regexes @ [sprintf @"(?<%s>%s)" name pattern];
                        names = x.names @ [name];
                        nextlines = x.nextlines @ [nextLine];
                        ignores = x.ignores @ [ignore] }
    }

let addDefinition name pattern = buildDefinition name pattern false false
let addIgnoreDefinition name pattern = buildDefinition name pattern false true
let addNextlineDefinition name pattern = buildDefinition name pattern true true

// Alias used in the first example, so rule sets read declaratively
let lexerDefinitions = state
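To make what the combinators build concrete, here is an illustrative sketch (the names defs and built are my own, not from the post) that runs two of them against an empty record; each call appends one entry to every list, with the pattern wrapped in a .NET named group:

F#
let defs =
    state {
        do! addDefinition       "INT" "[0-9]+"
        do! addIgnoreDefinition "WS"  @"\s"
    }

// Run the state computation against an empty starting record
let _, built = (runState defs) { regexes = []; names = []; nextlines = []; ignores = [] }
// built.regexes   = [@"(?<INT>[0-9]+)"; @"(?<WS>\s)"]
// built.names     = ["INT"; "WS"]
// built.nextlines = [false; false]
// built.ignores   = [false; true]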
The last part is the code that performs the actual tokenization. It uses Seq.unfold to create the list of tokens. Unfold is a function that takes a single item and generates a list of new items from it; it is the opposite of Seq.fold, which takes a list of items and reduces it to a single item. The tokenize function uses Seq.unfold to generate each token while keeping track of the current line number, the position within that line, and the position within the input string.
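Before the full listing, a toy illustration of Seq.unfold on its own (separate from the lexer): each step receives the current state and returns Some(element, nextState) to keep generating, or None to stop:

F#
// Generates the sequence 5, 4, 3, 2, 1; the state is the counter itself,
// and returning None ends the sequence once it reaches zero
let countdown = Seq.unfold (fun n -> if n = 0 then None else Some(n, n - 1)) 5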
F#
open System
open System.Text.RegularExpressions

type Token = { name: string; text: string; pos: int; column: int; line: int }

let createLexDefs pb =
    (runState pb) { regexes = []; names = []; nextlines = []; ignores = [] } |> snd

let tokenize lexerBuilder (str: string) =
    let patterns = createLexDefs lexerBuilder
    let combinedRegex =
        Regex(List.fold (fun acc reg -> acc + "|" + reg)
                        (List.head patterns.regexes)
                        (List.tail patterns.regexes))
    let nextlineMap = List.zip patterns.names patterns.nextlines |> Map.ofList
    let ignoreMap = List.zip patterns.names patterns.ignores |> Map.ofList
    let tokenizeStep (pos, line, lineStart) =
        if pos >= str.Length then None
        else
            let getMatchedGroupName (grps: GroupCollection) names =
                List.find (fun (name: string) -> grps.[name].Length > 0) names
            match combinedRegex.Match(str, pos) with
            | mt when mt.Success && pos = mt.Index ->
                let groupName = getMatchedGroupName mt.Groups patterns.names
                let column = mt.Index - lineStart
                let nextPos = pos + mt.Length
                let (nextLine, nextLineStart) =
                    if nextlineMap.Item groupName then (line + 1, nextPos)
                    else (line, lineStart)
                let token =
                    if ignoreMap.Item groupName then None
                    else Some { name = groupName; text = mt.Value; pos = pos; line = line; column = column }
                Some(token, (nextPos, nextLine, nextLineStart))
            | _ ->
                // Show up to five characters of context around the error
                let textAroundError = str.Substring(pos, min 5 (str.Length - pos))
                raise (ArgumentException(sprintf "Lexing error in line: %d and column: %d near text: %s" line (pos - lineStart) textAroundError))
    Seq.unfold tokenizeStep (0, 0, 0)
    |> Seq.filter (fun x -> x.IsSome)
    |> Seq.map (fun x -> x.Value)
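It may help to see what combinedRegex actually contains. Hand-expanded for the definitions at the top of the post (my own reconstruction), it is a single alternation of named groups; since .NET tries alternatives left to right, earlier definitions win over later ones, which the last unit test below relies on:

F#
// Hand-expanded equivalent of combinedRegex for the NEWLINE/WS/LET/... rules
let combinedPattern =
    @"(?<NEWLINE>(\n\r)|\n|\r)|(?<WS>\s)|(?<LET>let)|(?<ID>(?i)[a-z][a-z0-9]*)" +
    @"|(?<FLOAT>[0-9]+\.[0-9]+)|(?<INT>[0-9]+)|(?<OPERATOR>[+*=!/&|<>\^\-]+)"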
Finally, here are the unit tests, written using XUnit.Net:
F#
module LexerFacts

open Xunit
open StateMonad
open Lexer
open System.Linq

let simpleDefs =
    state {
        do! addNextlineDefinition "NextLine"      "/"
        do! addIgnoreDefinition   "IgnoredSymbol" "=+"
        do! addDefinition         "String"        "[a-zA-Z]+"
        do! addDefinition         "Number"        "\d+"
        do! addDefinition         "Name"          "Matt"
    }

[<Fact>]
let Will_return_no_tokens_for_empty_string() =
    let tokens = Lexer.tokenize simpleDefs ""
    Assert.Equal(0, tokens.Count())

[<Fact>]
let Will_throw_exception_for_invalid_token() =
    let tokens = Lexer.tokenize simpleDefs "-"
    let ex = Assert.ThrowsDelegateWithReturn(fun () -> upcast tokens.Count()) |> Record.Exception
    Assert.NotNull(ex)
    Assert.True(ex :? System.ArgumentException)

[<Fact>]
let Will_ignore_symbols_defined_as_ignore_symbols() =
    let tokens = Lexer.tokenize simpleDefs "========="
    Assert.Equal(0, tokens.Count())

[<Fact>]
let Will_get_token_with_correct_position_and_type() =
    let tokens = Lexer.tokenize simpleDefs "1one=2=two"
    Assert.Equal("Number", tokens.ElementAt(2).name)
    Assert.Equal("2", tokens.ElementAt(2).text)
    Assert.Equal(5, tokens.ElementAt(2).pos)
    Assert.Equal(5, tokens.ElementAt(2).column)
    Assert.Equal(0, tokens.ElementAt(2).line)

[<Fact>]
let Will_tokenize_string_with_alternating_numbers_and_strings() =
    let tokens = Lexer.tokenize simpleDefs "1one2two"
    Assert.Equal("1", tokens.ElementAt(0).text)
    Assert.Equal("one", tokens.ElementAt(1).text)
    Assert.Equal("2", tokens.ElementAt(2).text)
    Assert.Equal("two", tokens.ElementAt(3).text)

[<Fact>]
let Will_increment_line_with_newline_symbol() =
    let tokens = Lexer.tokenize simpleDefs "1one/2two"
    Assert.Equal("Number", tokens.ElementAt(2).name)
    Assert.Equal("2", tokens.ElementAt(2).text)
    Assert.Equal(5, tokens.ElementAt(2).pos)
    Assert.Equal(0, tokens.ElementAt(2).column)
    Assert.Equal(1, tokens.ElementAt(2).line)

[<Fact>]
let Will_give_priority_to_lexer_definitions_defined_earlier() =
    let tokens = Lexer.tokenize simpleDefs "Matt"
    Assert.Equal("String", tokens.ElementAt(0).name)