Problem description
I'm trying to summarize a data analysis project that spans many IPython / Jupyter notebooks, each of which is fairly long. One thing that would help is knowing, at a minimum, which pickles each notebook reads as "input" and which it writes as "output".
What's the cleanest/quickest/most efficient way to do this?
I'm not sure this is the best way to do it, but it's at least one way:
def summerize_pickles(notebook_path):
    from IPython.nbformat import current as nbformat
    import re

    with open(notebook_path) as fh:
        nb = nbformat.reads_json(fh.read())

    list_of_input_pickles = []
    list_of_output_pickles = []

    for cell in nb["worksheets"][0]["cells"]:
        # Skip cells that aren't code, and code cells that never mention "pickle".
        if cell["cell_type"] != "code" or cell["input"].find("pickle") == -1:
            continue

        # A cell may contain several lines, so iterate line by line.
        for line in cell["input"].splitlines():
            # Skip lines with no mention of "pickle" to reduce the number of searches.
            if line.find("pickle") == -1:
                continue

            ############################################################
            code_type = str()
            if line.find("pickle.dump") != -1 or line.find(".to_pickle") != -1:
                code_type = "output"
            elif line.find("pickle.load") != -1 or line.find(".read_pickle") != -1:
                code_type = "input"
            else:
                continue  # Skips lines like "import cPickle as pickle".

            ############################################################
            # Grab everything between double quotes. See:
            # http://stackoverflow.com/questions/171480/regex-grabbing-values-between-quotation-marks
            filename = re.findall(r'"(.*?)"', line)

            ############################################################
            if code_type == "input":
                list_of_input_pickles.append(filename[0])
            elif code_type == "output":
                list_of_output_pickles.append(filename[0])

    pickles_dict = {"input_pickles": list_of_input_pickles,
                    "output_pickles": list_of_output_pickles}
    return pickles_dict
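
The function above reads the older notebook layout (nb["worksheets"][0]["cells"], with code stored under "input"). As a rough sketch only, the same line-by-line scan could probably be applied to newer notebook files, assuming they are nbformat v4 JSON with a top-level "cells" list and code stored under "source". The name summarize_pickles_v4 and the module-level constants are hypothetical, the file is read with the standard json module instead of a notebook library, and the regex is loosened to accept single as well as double quotes.

```python
import json
import re

# Hypothetical names for illustration; assumes nbformat v4 JSON
# (top-level "cells", code stored under "source").
QUOTED = re.compile(r"""(["'])(.*?)\1""")  # first single- or double-quoted literal
OUTPUT_MARKERS = ("pickle.dump", ".to_pickle")
INPUT_MARKERS = ("pickle.load", ".read_pickle")


def summarize_pickles_v4(notebook_path):
    """Return the quoted file names that appear on pickle read/write lines."""
    with open(notebook_path, encoding="utf-8") as fh:
        nb = json.load(fh)

    found = {"input_pickles": [], "output_pickles": []}
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be stored as a list of lines or as a single string.
        if isinstance(source, list):
            source = "".join(source)
        for line in source.splitlines():
            if any(marker in line for marker in OUTPUT_MARKERS):
                key = "output_pickles"
            elif any(marker in line for marker in INPUT_MARKERS):
                key = "input_pickles"
            else:
                continue  # e.g. "import pickle"
            match = QUOTED.search(line)
            if match:  # skip lines where the file name is a variable, not a literal
                found[key].append(match.group(2))
    return found
```

Calling summarize_pickles_v4("some_notebook.ipynb") (path is a placeholder) would return a dict with the same two keys as the original function. Like the original, this is a text scan rather than a parse of the code, so it only sees file names written as string literals on the same line as the pickle call.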