Problem description
I have many folders where each has a couple of pdf files (other file types like .xlsx or .doc are there as well). My goal is to extract the pdf's text for each folder and create a data frame where each record is the "Folder Name" and each column represents text content of each pdf file in that folder in string form.
I managed to extract text from one pdf file with the tika package (code below), but I cannot write a loop that iterates over the other PDFs in the folder, or over the other folders, to build a structured dataframe.
# import parser object from tika
from tika import parser
# opening pdf file
parsed_pdf = parser.from_file("ducument_1.pdf")
# saving content of pdf
# you can also bring text only, by parsed_pdf['text']
# parsed_pdf['content'] returns string
data = parsed_pdf['content']
# Printing of content
print(data)
# <class 'str'>
print(type(data))
The desired output should look like this:
Folder Name | pdf1 | pdf2 |
---|---|---|
17534 | text of pdf1 | text of pdf2 |
63546 | text of pdf1 | text of pdf2 |
26374 | text of pdf1 | - |
Recommended answer
If you want to find all the PDFs in a directory and its subdirectories, you can use os.walk and glob; see "Recursive sub folder search and return files in a list python". I've gone for a slightly longer form so that beginners can follow what is happening.
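For comparison, the same recursive search can be written more compactly with the standard-library pathlib module. This is only a sketch: the helper name find_pdfs and the temporary folder layout are made up for illustration, and the pattern "*.pdf" is case-sensitive on most Linux systems, so files ending in ".PDF" would be missed.

```python
import tempfile
from pathlib import Path


def find_pdfs(root="."):
    """Return the paths of all .pdf files under root (recursively), sorted."""
    return sorted(str(p) for p in Path(root).rglob("*.pdf") if p.is_file())


# Small demonstration on a throwaway directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "17534"
    folder.mkdir()
    (folder / "a.pdf").touch()       # should be found
    (folder / "notes.xlsx").touch()  # should be ignored
    found = find_pdfs(tmp)
    print(found)
```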
Then, for each file, call Apache Tika and save the result to the next row of the pandas DataFrame.
#!/usr/bin/python3
import os, glob
from tika import parser
from pandas import DataFrame
# What file extension to find, and where to look from
ext = "*.pdf"
PATH = "."
# Find all the files with that extension
files = []
for dirpath, dirnames, filenames in os.walk(PATH):
    files += glob.glob(os.path.join(dirpath, ext))
# Create a Pandas Dataframe to hold the filenames and the text
df = DataFrame(columns=("filename","text"))
# Process each file in turn, parsing with Tika and storing in the dataframe
for idx, filename in enumerate(files):
    data = parser.from_file(filename)
    text = data["content"]
    df.loc[idx] = [filename, text]
# For debugging, print what we found
print(df)
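The dataframe above has one row per file, whereas the question asks for one row per folder with one column per PDF. One possible way to get there is to pivot the long table, sketched below. The helper name to_wide, the column names "folder" and "slot", and the sample rows are assumptions for illustration. The sample data stands in for the output of the Tika loop so that the snippet runs without Tika installed.

```python
import os

import pandas as pd


def to_wide(df):
    """Pivot a long (filename, text) frame into one row per folder."""
    out = df.copy()
    # The folder name is the last component of each file's directory
    out["folder"] = out["filename"].map(
        lambda p: os.path.basename(os.path.dirname(p))
    )
    out = out.sort_values(["folder", "filename"])
    # Number the PDFs within each folder: pdf1, pdf2, ...
    out["slot"] = "pdf" + (out.groupby("folder").cumcount() + 1).astype(str)
    wide = out.pivot(index="folder", columns="slot", values="text")
    wide.columns.name = None
    return wide


# Stand-in for the dataframe built by the Tika loop (made-up rows)
long_df = pd.DataFrame({
    "filename": ["./17534/a.pdf", "./17534/b.pdf", "./26374/c.pdf"],
    "text": ["text of a", "text of b", "text of c"],
})
wide_df = to_wide(long_df)
print(wide_df)
```

Folders with fewer PDFs than the widest folder get missing values (NaN) in the unused columns. The columns are sorted as strings, so with ten or more PDFs in a folder "pdf10" would sort before "pdf2" and may need reordering.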