I uploaded 130,000 JSON files.
I do this with Python:
import os
import json
import pandas as pd
import time

path = "/my_path/"
filename_ending = '.json'

json_list = []
json_files = [file for file in os.listdir(f"{path}") if file.endswith(filename_ending)]

start = time.time()
for jf in json_files:
    with open(f"{path}/{jf}", 'r') as f:
        json_data = json.load(f)
    json_list.append(json_data)
end = time.time()
It takes 60 seconds.
I do this with multiprocessing:

import os
import json
import pandas as pd
from multiprocessing import Pool
import time

path = "/my_path/"
filename_ending = '.json'

json_files = [file for file in os.listdir(f"{path}") if file.endswith(filename_ending)]

def read_data(name):
    with open(f"/my_path/{name}", 'r') as f:
        json_data = json.load(f)
    return json_data

if __name__ == '__main__':
    start = time.time()
    pool = Pool(processes=os.cpu_count())
    x = pool.map(read_data, json_files)
    end = time.time()

It takes 53 seconds.
I do this with Ray:

import os
import json
import pandas as pd
from multiprocessing import Pool
import time
import ray

path = "/my_path/"
filename_ending = '.json'

json_files = [file for file in os.listdir(f"{path}") if file.endswith(filename_ending)]

start = time.time()

ray.shutdown()
ray.init(num_cpus=os.cpu_count()-1)

@ray.remote
def read_data(name):
    with open(f"/my_path/{name}", 'r') as f:
        json_data = json.load(f)
    return json_data

all_data = []
for jf in json_files:
    all_data.append(read_data.remote(jf))

final = ray.get(all_data)
end = time.time()

It takes 146 seconds.
My question is: why does Ray take so much time? Is it because:
1) Ray is relatively slow for relatively small amounts of data?
2) I am doing something wrong in my code?
3) Ray is just not that useful?

Best answer
I would say that hypothesis 1) is probably the closest to the truth. Ray seems to be a powerful library, but all you are doing here is reading a bunch of files. Is your code just an example for benchmarking purposes, or part of a bigger program? If it is the latter, it might be interesting to have your benchmark code reflect that.
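One way to check hypothesis 1) directly (this little experiment is an editorial addition, not part of the original answer) is to time empty remote tasks; whatever per-task cost it reports is paid roughly 130,000 times in the benchmark above:

import time
import ray

ray.init()

@ray.remote
def noop():
    # An empty task: any time it takes is pure scheduling/serialization overhead.
    return None

n_tasks = 10_000  # arbitrary sample size for the measurement
start = time.perf_counter()
ray.get([noop.remote() for _ in range(n_tasks)])
elapsed = time.perf_counter() - start
print(f"~{elapsed / n_tasks * 1e3:.2f} ms of overhead per task")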
It's not a big deal, but I tweaked your 3 programs so that they should be at least slightly more efficient.
import os
import json

folder_path = "/my_path/"
filename_ending = '.json'

json_files = (os.path.join(folder_path, fp) for fp in os.listdir(f"{folder_path}") if fp.endswith(filename_ending))

def load_json_from_file(file_path):
    with open(file_path, 'r') as file_1:
        return json.load(file_1)

json_list = [load_json_from_file(curr_fp) for curr_fp in json_files]
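The snippets in the question import pandas but never use it; if the end goal is a DataFrame, a step along these lines could follow any of the three versions (this is an assumption about intent, not something the original answer shows, and it presumes each file holds a single flat JSON object):

import pandas as pd

# Hypothetical follow-up: turn the loaded objects into a DataFrame.
# Assumes every file contains one flat JSON object (a dict of scalars);
# nested structures would need pd.json_normalize(json_list) instead.
df = pd.DataFrame(json_list)
print(df.shape)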
import os
import json
import multiprocessing as mp

folder_path = "/my_path/"
filename_ending = '.json'

json_files = (os.path.join(folder_path, fp) for fp in os.listdir(f"{folder_path}") if fp.endswith(filename_ending))

def load_json_from_file(file_path):
    with open(file_path, 'r') as file_1:
        return json.load(file_1)

with mp.Pool() as pool:
    json_list = pool.map(load_json_from_file, json_files)
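With roughly 130,000 tiny tasks, a fair share of the Pool version's time goes into inter-process communication rather than parsing. One knob worth experimenting with (not part of the original answer) is the chunksize argument of pool.map, which hands each worker a batch of paths at a time; a minimal sketch, with the chunk size picked arbitrarily:

import os
import json
import multiprocessing as mp

folder_path = "/my_path/"
filename_ending = '.json'

json_files = [os.path.join(folder_path, fp)
              for fp in os.listdir(folder_path)
              if fp.endswith(filename_ending)]

def load_json_from_file(file_path):
    with open(file_path, 'r') as file_1:
        return json.load(file_1)

if __name__ == '__main__':
    with mp.Pool() as pool:
        # chunksize=256 is an arbitrary starting point; larger chunks mean
        # fewer round-trips between the parent process and the workers.
        json_list = pool.map(load_json_from_file, json_files, chunksize=256)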
import os
import json
import ray

folder_path = "/my_path/"
filename_ending = '.json'

@ray.remote
def load_json_from_file(file_path):
    with open(file_path, 'r') as file_1:
        return json.load(file_1)

json_files = (os.path.join(folder_path, fp) for fp in os.listdir(f"{folder_path}") if fp.endswith(filename_ending))

ray.init()

futures_list = [load_json_from_file.remote(curr_fp) for curr_fp in json_files]

json_list = ray.get(futures_list)
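If hypothesis 1) is right and per-task overhead is what dominates, the usual mitigation in Ray is to submit a batch of files per remote task rather than one task per file, so the scheduling cost is paid once per batch. This variant is an editorial sketch, not part of the original answer, and the batch size is arbitrary:

import os
import json
import ray

folder_path = "/my_path/"
filename_ending = '.json'
batch_size = 1000  # arbitrary; tune to trade task overhead against load balance

json_files = [os.path.join(folder_path, fp)
              for fp in os.listdir(folder_path)
              if fp.endswith(filename_ending)]

@ray.remote
def load_json_batch(file_paths):
    # One remote task parses many files, so Ray's per-task overhead is
    # incurred once per batch instead of once per file.
    loaded = []
    for file_path in file_paths:
        with open(file_path, 'r') as file_1:
            loaded.append(json.load(file_1))
    return loaded

ray.init()

batches = [json_files[i:i + batch_size] for i in range(0, len(json_files), batch_size)]
futures_list = [load_json_batch.remote(curr_batch) for curr_batch in batches]
json_list = [item for batch in ray.get(futures_list) for item in batch]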
Let me know if you have any questions. If you can run the benchmarks again, I would be curious to know what the difference is, if any.
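Since the tweaked snippets above drop the timing code from the question, re-running the comparison needs a small harness; a minimal sketch (an editorial addition, using time.perf_counter, which is better suited to measuring durations than time.time):

import time

def timed(label, fn):
    # Runs fn() once and reports the wall-clock duration and result size.
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f} s ({len(result)} files loaded)")
    return result

# Example usage; load_all_sequential is a hypothetical wrapper around one of
# the snippets above:
# json_list = timed("sequential", load_all_sequential)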
On the topic of "python - Ray is much slower than both Python and multiprocessing", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58702492/