Parsing more than a million XML files

Problem description

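I am looking for the best way to parse more than a million XML files ranging in size from 2 KB to 10 MB. Altogether the files add up to roughly 500 GB. The application collects data from various nodes throughout each file and pushes it into a Postgres database schema. I wrote the Python code for this using etree long ago, when the number of XML files was much smaller, but a full run now takes close to a week. Any thoughts on the best strategy for scaling this up? Getting it done in a day or two would be a massive improvement.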

Things I have tried:

    import queue
    import threading
    from xml.dom import minidom


    class ParseJob(threading.Thread):
        """Worker thread that parses files taken from a shared queue."""

        def __init__(self, path_queue, stopper):
            super().__init__()
            self.path_queue = path_queue
            self.stopper = stopper

        def run(self):
            # Keep pulling paths until the queue is empty or a stop is requested.
            while not self.stopper.is_set():
                try:
                    path = self.path_queue.get_nowait()
                except queue.Empty:
                    break
                with open(path, 'rb') as new:
                    xmldoc = minidom.parse(new)
                parseFunc(xmldoc)
                self.path_queue.task_done()


    def parseFunc(xmldoc):
        pass  # do all the parsing


    def main():
        path_queue = queue.Queue()
        paths = []  # paths to the XML files
        for path in paths:
            path_queue.put(path)
        stopper = threading.Event()
        num_workers = 8
        threads = []
        for i in range(num_workers):
            job = ParseJob(path_queue, stopper)
            threads.append(job)
            job.start()
        path_queue.join()


    main()

Solution
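
One direction that may help, offered as a sketch rather than a confirmed fix: in CPython, threads cannot run Python code in parallel because of the global interpreter lock, so eight parser threads are likely to behave much like one. Spreading the files over worker processes avoids that limit. The example below assumes the per-file work can be written as a function that takes a path and returns plain tuples. The names `extract_rows` and `parse_all`, and the rule of collecting every leaf element, are hypothetical placeholders for whatever `parseFunc` really does.

```python
import multiprocessing
import sys
import xml.etree.ElementTree as ET


def extract_rows(path):
    """Parse one XML file and return a list of (path, tag, text) tuples.

    Hypothetical stand-in for parseFunc: it collects every leaf element
    that has non-blank text.
    """
    rows = []
    # iterparse reads the file incrementally instead of building the full tree first.
    for _event, elem in ET.iterparse(path, events=("end",)):
        if len(elem) == 0 and elem.text and elem.text.strip():
            rows.append((path, elem.tag, elem.text.strip()))
        elem.clear()  # drop the processed element's content to keep memory low
    return rows


def parse_all(paths, workers=8, chunksize=64):
    """Yield rows from all files, parsing them in separate processes."""
    with multiprocessing.Pool(processes=workers) as pool:
        # Paths are handed out in chunks; results arrive in completion order.
        for rows in pool.imap_unordered(extract_rows, paths, chunksize=chunksize):
            yield from rows


if __name__ == "__main__":
    paths = sys.argv[1:]  # XML file paths given on the command line
    if paths:
        for row in parse_all(paths):
            print(row)  # a real run would batch these rows for the database
```

Returning rows to the parent process keeps all database writes in one place, which makes it easier to group them into large batches instead of issuing one insert per node. Whether parsing or the database is the actual bottleneck is worth measuring on a sample of files before committing to either change.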
