Question
I need to merge more than 1000 .ttl files into a single file database. How can I merge them while filtering the data in the source files, so that only the data I need ends up in the target file?
Thanks
Recommended answer
There are a number of options, but the simplest is probably to use a Turtle parser to read all the files, and let that parser pass its output to a handler that does the filtering before in turn passing the data to a Turtle writer.
Something like this would probably work (using RDF4J):
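// Needs java.io.* plus Model and Statement from org.eclipse.rdf4j.model,
// and RDFFormat, RDFWriter and Rio from org.eclipse.rdf4j.rio.
// outFile is an OutputStream (or Writer) opened on the target file.
RDFWriter writer = Rio.createWriter(RDFFormat.TURTLE, outFile);
writer.startRDF();
for (File file : inputFiles) { // inputFiles: placeholder for your 1000+ input files
    // Parse one file at a time so only a single file is held in memory.
    try (InputStream in = new FileInputStream(file)) {
        Model data = Rio.parse(in, "", RDFFormat.TURTLE);
        for (Statement st : data) {
            if (keep(st)) { // keep(...): placeholder for your own filter condition
                writer.handleStatement(st);
            }
        }
    }
}
writer.endRDF();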
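The loop above leaves open how the list of input files is obtained. As a minimal sketch, assuming all the .ttl files sit directly in one directory (the class name `TurtleFiles` and method name `listTurtleFiles` are hypothetical, not part of RDF4J), the standard library is enough to collect them:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of RDF4J: collects the Turtle files to merge.
final class TurtleFiles {
    // Returns the regular files directly inside dir whose name ends in ".ttl"
    // (case-insensitive), sorted so that repeated runs process them in the same order.
    static List<Path> listTurtleFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".ttl"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Each returned `Path` can be turned into a `File` with `toFile()` if the parsing code expects one. If the files are spread over subdirectories, `Files.walk` could be used in place of `Files.list`.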
Alternatively, just load all the files into an RDF Repository and use SPARQL queries to get the data out and save it to an output file. Or, if you prefer, use SPARQL updates to remove the data you don't want before exporting the entire repository to a file.
Something along these lines (again using RDF4J):
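// Needs java.io.File, Repository and RepositoryConnection from
// org.eclipse.rdf4j.repository, and RDFFormat and Rio from org.eclipse.rdf4j.rio.
Repository rep = ... // your RDF repository, e.g. an in-memory store or native RDF database
try (RepositoryConnection conn = rep.getConnection()) {
    // load all files into the database
    for (File file : inputFiles) { // inputFiles: placeholder for your input files
        conn.add(file, "", RDFFormat.TURTLE);
    }
    // do a SPARQL update to remove all instances of ex:Foo
    // (the ex: namespace below is an assumed example; substitute your own)
    conn.prepareUpdate("PREFIX ex: <http://example.org/> DELETE WHERE { ?s a ex:Foo; ?p ?o }").execute();
    // export to file; outFile is an OutputStream (or Writer) on the target file
    conn.export(Rio.createWriter(RDFFormat.TURTLE, outFile));
} finally {
    rep.shutDown();
}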
Depending on the amount of data and the size of your files, you may need to extend this basic setup a bit (for example, by using explicit transactions instead of just letting the connection auto-commit). But hopefully you get the general idea.
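One possible way to structure explicit transactions is to commit once per batch of files rather than once per file. The sketch below covers only the batching step, using the standard library; the class name `Batches`, the method name `partition`, and any particular batch size are assumptions for illustration, and the right size depends on your data and store.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of RDF4J: splits a list into fixed-size batches.
final class Batches {
    // Splits items into consecutive batches of at most batchSize elements.
    // The last batch may be smaller; an empty input yields no batches.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

With RDF4J, each batch would then be wrapped in one transaction: call `conn.begin()`, add every file of the batch with `conn.add(...)`, and finish with `conn.commit()`, calling `conn.rollback()` if loading a file fails.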