algorithm - Parsing Gedcom into an SQLite database

I am a hobbyist Xojo user. I want to import Gedcom files into my program, specifically into an SQLite database.

Structure of the database

Tables

Persons

 - ID: Integer
 - Gender: Varchar // M, F or U
 - Surname: Varchar
 - Givenname: Varchar

Relationships

 - ID: Integer
 - Husband: Integer
 - Wife: Integer

Children

 - ID: Integer
 - PersonID: Integer
 - FamilyID: Integer
 - Order: Integer

PersonEvents

 - ID: Integer
 - PersonID: Integer
 - EventType: Varchar // e.g. BIRT, DEAT, BURI, CHR
 - Date: Varchar
 - Description: Varchar
 - Order: Integer

RelationshipEvents

 - ID: Integer
 - RelationshipID: Integer
 - EventType: Varchar // e.g. MARR, DIV, DIVF
 - Date: Varchar
 - Description: Varchar
 - Order: Integer
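
For illustration, the structure above could be created from Xojo roughly as follows. This is a minimal sketch, not the asker's actual code: the English table names are assumed names for the five tables listed above, the function name CreateGedcomTables is hypothetical, ID is assumed to be the primary key, and Description is stored as text in both event tables. The function expects an already connected SQLiteDatabase. "Order" is quoted because ORDER is a reserved word in SQL.

```xojo
// Sketch only: creates the five tables if they do not exist yet.
// Table names, PRIMARY KEY on ID and the function name are assumptions.
Function CreateGedcomTables(db As SQLiteDatabase) As Boolean
  Dim statements() As String
  statements.Append("CREATE TABLE IF NOT EXISTS Persons (ID INTEGER PRIMARY KEY, Gender VARCHAR, Surname VARCHAR, Givenname VARCHAR)")
  statements.Append("CREATE TABLE IF NOT EXISTS Relationships (ID INTEGER PRIMARY KEY, Husband INTEGER, Wife INTEGER)")
  statements.Append("CREATE TABLE IF NOT EXISTS Children (ID INTEGER PRIMARY KEY, PersonID INTEGER, FamilyID INTEGER, ""Order"" INTEGER)")
  statements.Append("CREATE TABLE IF NOT EXISTS PersonEvents (ID INTEGER PRIMARY KEY, PersonID INTEGER, EventType VARCHAR, Date VARCHAR, Description VARCHAR, ""Order"" INTEGER)")
  statements.Append("CREATE TABLE IF NOT EXISTS RelationshipEvents (ID INTEGER PRIMARY KEY, RelationshipID INTEGER, EventType VARCHAR, Date VARCHAR, Description VARCHAR, ""Order"" INTEGER)")

  For Each sql As String In statements
    db.SQLExecute(sql)
    If db.Error Then
      // Log the problem and give up on the first failing statement
      System.DebugLog(db.ErrorMessage)
      Return False
    End If
  Next
  Return True
End Function
```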

I have written a working Gedcom line parser. It splits a Gedcom line into:

 - Level As Integer
 - Reference As String // optional
 - Tag As String
 - Value As String // optional

I load the Gedcom file through a TextInputStream (this works fine). Now I need to parse each line.

Sample Gedcom individual

0 @I1@ INDI
1 NAME George /Clooney/
2 GIVN George
2 SURN Clooney
1 BIRT
2 DATE 6 MAY 1961
2 PLAC Lexington, Fayette County, Kentucky, USA

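As you can see, the level numbers show us the "tree structure". So I think the best and simplest approach is to parse the file into separate objects (PersonObj, RelationshipObj, EventObj, etc.) held as JSONItems, because that makes it easy to get a node's children. Later I can simply read the nodes and their child nodes to create the database entries. But I don't know how to write such an algorithm.

Can anyone help me?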

Best answer

To parse Gedcom lines at a good speed, try the following approach:

Read the whole file into one string and split it into lines:

dim f as FolderItem = ...
dim fileContent as String = TextInputStream.Open(f).ReadAll
fileContent = fileContent.DefineEncoding (Encodings.WindowsLatin1)
dim lines() as String = ReplaceLineEndings(fileContent,EndOfLine).Split(EndOfLine)

Use a RegEx to parse each line and extract its three columns:

dim re as new RegEx
re.SearchPattern = "^(\d+) ([^ ]+)(.*)$"
for each line as String in lines
  dim rm as RegExMatch = re.Search (line)
  if rm = nil then
    // nothing found in this line. Is this correct?
    break // Break stops in the debugger (debug runs only); it does not exit the loop
    continue // -> onward with next line
  end
  dim level as Integer = rm.SubExpressionString(1).Val
  dim code as String = rm.SubExpressionString(2)
  dim value as String = rm.SubExpressionString(3).Trim
  // ... process the level, code and value
next

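The RegEx search pattern means: look for the start of the line ("^"), then one or more digits ("\d+"), a blank, one or more non-blank characters ("[^ ]+"), and finally any remaining characters (".*") up to the end of the string ("$"). The parentheses around each of these groups are what allows SubExpressionString() to extract their results afterwards.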
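
One thing to watch for: on a record line such as "0 @I1@ INDI", the pattern above captures the reference "@I1@" as the second group and leaves the tag "INDI" in the third. A possible variant, sketched here under the assumption that references always have the form @...@ and contain no blanks, adds an optional group for the reference so that the four fields from the question (level, reference, tag, value) come out separately:

```xojo
Dim re As New RegEx
// Groups: level, optional @reference@, tag, rest of the line as value
re.SearchPattern = "^\s*(\d+) +(?:(@[^@ ]+@) +)?([^ ]+) ?(.*)$"

Dim rm As RegExMatch = re.Search("0 @I1@ INDI")
If rm <> Nil Then
  Dim level As Integer = rm.SubExpressionString(1).Val  // 0
  Dim reference As String = rm.SubExpressionString(2)   // "@I1@"
  Dim tag As String = rm.SubExpressionString(3)         // "INDI"
  Dim value As String = rm.SubExpressionString(4).Trim  // ""
End If
```

For a line without a reference, such as "1 NAME George /Clooney/", the second group does not take part in the match and SubExpressionString(2) is expected to come back empty. That behaviour is worth verifying in your Xojo version.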

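The rm = nil check triggers whenever a line does not contain at least one digit, a space and at least one more character. This can happen if the Gedcom file is malformed or contains empty lines.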
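
As for the tree structure the question asks about, one common way is to keep a stack of the currently open nodes, indexed by level: a line at level n closes every open node at level n or deeper and then becomes a child of the node at level n - 1. The following is a minimal sketch of that idea, not a tested implementation. The names BuildGedcomTree and CloseNode and the JSON keys are made up for this example, and the search pattern is a variant of the one above that also splits off an optional @reference@.

```xojo
// Sketch only: builds a JSONItem tree from raw Gedcom lines.
// The keys "level", "reference", "tag", "value" and "children" are assumed names.
Function BuildGedcomTree(lines() As String) As JSONItem
  Dim re As New RegEx
  re.SearchPattern = "^\s*(\d+) +(?:(@[^@ ]+@) +)?([^ ]+) ?(.*)$"

  // nodes(0) is an artificial root; nodes(n + 1) is the open node at level n.
  // kids(i) collects the finished children of nodes(i).
  Dim nodes() As JSONItem
  Dim kids() As JSONItem
  Dim root As New JSONItem
  root.Value("tag") = "ROOT"
  nodes.Append(root)
  kids.Append(New JSONItem)

  For Each line As String In lines
    Dim rm As RegExMatch = re.Search(line)
    If rm Is Nil Then Continue // skip empty or malformed lines

    Dim level As Integer = rm.SubExpressionString(1).Val

    // A line at level n closes every open node at level n or deeper.
    While nodes.Ubound > level
      CloseNode(nodes, kids)
    Wend

    Dim node As New JSONItem
    node.Value("level") = level
    node.Value("reference") = rm.SubExpressionString(2) // expected to be empty without @reference@
    node.Value("tag") = rm.SubExpressionString(3)
    node.Value("value") = rm.SubExpressionString(4).Trim
    nodes.Append(node)
    kids.Append(New JSONItem)
  Next

  // Close whatever is still open, then hand the top-level records to the root.
  While nodes.Ubound > 0
    CloseNode(nodes, kids)
  Wend
  If kids(0).Count > 0 Then root.Value("children") = kids(0)
  Return root
End Function

// Pops the deepest open node, attaches its collected children,
// and files the finished node under its parent.
Sub CloseNode(nodes() As JSONItem, kids() As JSONItem)
  Dim done As JSONItem = nodes.Pop
  Dim doneKids As JSONItem = kids.Pop
  If doneKids.Count > 0 Then done.Value("children") = doneKids
  kids(kids.Ubound).Append(done)
End Sub
```

For the sample individual from the question, the result would be a root whose single child is the INDI node (reference @I1@), which in turn has NAME (with GIVN and SURN) and BIRT (with DATE and PLAC) as children. Each node is attached to its parent only once it is complete, so the sketch does not depend on whether JSONItem keeps children by reference or by copy. From there, a second pass over the root's children can create one database row per INDI or FAM record.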

Hope this helps.
