I've started learning Spark with pyspark and would like to know what the following log message means:

UserWarning: Please install psutil to have better support with spilling

The operation that triggers the spill is a join between two RDDs:
print(user_types.join(user_genres).collect())
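For reference, a minimal, self-contained version of that join looks roughly like the sketch below (the data and the SparkContext setup here are placeholders, not the original ones):

# Minimal sketch of the kind of join that triggers the warning; the
# contents of user_types and user_genres are invented placeholders.
from pyspark import SparkContext

sc = SparkContext("local", "join-example")
user_types = sc.parallelize([(1, "admin"), (2, "guest")])
user_genres = sc.parallelize([(1, "rock"), (2, "jazz")])
print(user_types.join(user_genres).collect())
# e.g. [(1, ('admin', 'rock')), (2, ('guest', 'jazz'))] (order may vary)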
This may sound somewhat obvious, but my first question is: what does the spilling actually involve here? I did install psutil, and the warning went away, but I would like to understand what exactly is happening. There is a very similar question here, but the OP was mostly asking how to install psutil.
Best answer
Spilling here means writing the in-memory data frames to disk, which degrades the performance of pyspark, since writing to disk is slow.
Why psutil
To check the used memory of the node.
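As a quick illustration of what psutil provides (this is not pyspark code, just a hedged sketch of the underlying API that shuffle.py relies on), the resident memory of the current process can be read like this:

# Read the resident set size (RSS) of the current process with psutil
# and convert it from bytes to MB; shuffle.py does essentially this.
import os
import psutil

process = psutil.Process(os.getpid())
print(process.memory_info().rss >> 20)  # used memory in MB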
Here is the original snippet from the pyspark source file shuffle.py (linked here) that emits the warning. The code below defines a function that returns the used memory in MB, using psutil when it is available and falling back to /proc on Linux or the resource module on macOS otherwise.
Import psutil and define get_used_memory
import os
import platform
import warnings  # these are imported at the top of shuffle.py

try:
    import psutil

    def get_used_memory():
        """ Return the used memory in MB """
        process = psutil.Process(os.getpid())
        if hasattr(process, "memory_info"):
            info = process.memory_info()
        else:
            info = process.get_memory_info()
        return info.rss >> 20

except ImportError:

    def get_used_memory():
        """ Return the used memory in MB """
        if platform.system() == 'Linux':
            for line in open('/proc/self/status'):
                if line.startswith('VmRSS:'):
                    return int(line.split()[1]) >> 10
        else:
            warnings.warn("Please install psutil to have better "
                          "support with spilling")
            if platform.system() == "Darwin":
                import resource
                rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                return rss >> 20
            # TODO: support windows
        return 0
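A note on the bit shifts above: psutil reports rss in bytes, so shifting right by 20 divides by 2**20 to get MB; the VmRSS value in /proc/self/status is in kB, so it is shifted by 10; and ru_maxrss on macOS is reported in bytes, hence the same 20-bit shift. A quick worked example:

# The shifts are just integer division by powers of two:
rss_bytes = 734003200      # hypothetical RSS in bytes (psutil / macOS path)
print(rss_bytes >> 20)     # 700, i.e. rss_bytes // 2**20 -> MB

vmrss_kb = 716800          # hypothetical VmRSS value in kB (/proc path)
print(vmrss_kb >> 10)      # 700, i.e. vmrss_kb // 2**10 -> MB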
Write to disk
The following code makes the call to write the data frames to disk if the used memory of the node is greater than a preset limit.
def mergeCombiners(self, iterator, check=True):
    """ Merge (K,V) pair by mergeCombiner """
    iterator = iter(iterator)
    # speedup attribute lookup
    d, comb, batch = self.data, self.agg.mergeCombiners, self.batch
    c = 0
    for k, v in iterator:
        d[k] = comb(d[k], v) if k in d else v
        if not check:
            continue
        c += 1
        # check the memory once every self.batch items; if the limit is
        # exceeded, spill to disk and finish merging the rest of the
        # iterator in partitioned mode
        if c % batch == 0 and get_used_memory() > self.memory_limit:
            self._spill()
            self._partitioned_mergeCombiners(iterator, self._next_limit())
            break
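A side note on the self.memory_limit checked above: in the pyspark versions this snippet comes from, the Python worker's memory budget appears to be taken from the spark.python.worker.memory setting (512m by default), so raising it makes spilling less frequent at the cost of more memory per worker. A minimal sketch, assuming that setting:

# Sketch: raise the Python worker memory budget so the merge spills to
# disk less often. Assumes the standard spark.python.worker.memory
# setting; adjust the value to what the node can actually afford.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.python.worker.memory", "2g")
sc = SparkContext(conf=conf)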
Spill
This code actually writes, i.e. spills, the data frames to disk if the used memory is greater than the preset limit.
def _spill(self):
    """
    dump already partitioned data into disks.

    It will dump the data in batch for better performance.
    """
    global MemoryBytesSpilled, DiskBytesSpilled
    path = self._get_spill_dir(self.spills)
    if not os.path.exists(path):
        os.makedirs(path)

    used_memory = get_used_memory()
    if not self.pdata:
        # The data has not been partitioned yet: iterate the dataset
        # once and write the items into different files, using no
        # additional memory. This branch only runs the first time the
        # memory goes above the limit.

        # open all the files for writing
        streams = [open(os.path.join(path, str(i)), 'w')
                   for i in range(self.partitions)]

        for k, v in self.data.iteritems():
            h = self._partition(k)
            # put one item per batch to keep it compatible with load_stream;
            # dumping them in larger batches would increase the memory use
            self.serializer.dump_stream([(k, v)], streams[h])

        for s in streams:
            DiskBytesSpilled += s.tell()
            s.close()

        self.data.clear()
        self.pdata = [{} for i in range(self.partitions)]

    else:
        for i in range(self.partitions):
            p = os.path.join(path, str(i))
            with open(p, "w") as f:
                # dump items in batch
                self.serializer.dump_stream(self.pdata[i].iteritems(), f)
            self.pdata[i].clear()
            DiskBytesSpilled += os.path.getsize(p)

    self.spills += 1
    gc.collect()  # release the memory as much as possible
    MemoryBytesSpilled += (used_memory - get_used_memory()) << 20
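To make the control flow of _spill() concrete, here is a small standalone sketch of the same idea in plain Python (using pickle instead of pyspark's serializer, with made-up paths and partition count): hash-partition the in-memory pairs, write one file per partition, then drop the in-memory copy.

# Standalone illustration of the spill pattern: partition (key, value)
# pairs by hash, write each partition to its own file, then clear the
# in-memory dict so the garbage collector can reclaim it.
import gc
import os
import pickle

def spill(data, spill_dir, partitions=4):
    os.makedirs(spill_dir, exist_ok=True)
    streams = [open(os.path.join(spill_dir, str(i)), "wb")
               for i in range(partitions)]
    for k, v in data.items():
        h = hash(k) % partitions         # choose the partition file for this key
        pickle.dump((k, v), streams[h])  # one record at a time, like dump_stream
    for s in streams:
        s.close()
    data.clear()                         # release the in-memory copy
    gc.collect()

spill({"a": 1, "b": 2, "c": 3}, "/tmp/spill_demo")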
On the topic of "python - What does pyspark need psutil for? (faced with "UserWarning: Please install psutil to have better support with spilling")?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51226469/