I started learning Spark with pyspark, and I am wondering what the following log message means:

UserWarning: Please install psutil to have better support with spilling

The operation that triggers the spilling is a join between two RDDs:

print(user_types.join(user_genres).collect())

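For reference, here is a minimal, self-contained version of that kind of join; the RDD contents are made up, since the original data is not shown:

from pyspark import SparkContext

sc = SparkContext("local[2]", "join-demo")  # small local context, just for illustration

# Two tiny pair RDDs keyed by user id; the real user_types / user_genres are presumably much larger
user_types = sc.parallelize([(1, "premium"), (2, "free")])
user_genres = sc.parallelize([(1, "rock"), (2, "jazz")])

print(user_types.join(user_genres).collect())
# e.g. [(1, ('premium', 'rock')), (2, ('free', 'jazz'))]
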
This may sound somewhat obvious, but my first question is what this "spilling" actually means here.

I did install psutil and the warning went away, but I would like to understand what is actually going on. There is a very similar question here, but the OP there was mostly asking how to get psutil installed.

Best Answer

Spilling here means writing the in-memory data to disk, which degrades pyspark's performance, since writing to disk is slow.
Why psutil?
It is used to check the used memory of the node.
This is the original snippet from the pyspark source file shuffle.py (taken from here) that raises the warning. The code below defines a function that returns the used memory when psutil is available or when the system is Linux.
Importing psutil and defining get_used_memory:

# imports needed by this excerpt (they sit at the top of shuffle.py)
import os
import platform
import warnings

try:
    import psutil
    def get_used_memory():
        """ Return the used memory in MB """
        process = psutil.Process(os.getpid())
        if hasattr(process, "memory_info"):
            info = process.memory_info()
        else:
            info = process.get_memory_info()
        return info.rss >> 20
except ImportError:
    def get_used_memory():
        """ Return the used memory in MB """
        if platform.system() == 'Linux':
            for line in open('/proc/self/status'):
                if line.startswith('VmRSS:'):
                    return int(line.split()[1]) >> 10
        else:
            warnings.warn("Please install psutil to have better "
                          "support with spilling")
            if platform.system() == "Darwin":
                import resource
                rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                return rss >> 20
            # TODO: support windows
        return 0
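
A note on the units: psutil reports rss in bytes, so the >> 20 shift converts it to MiB, while VmRSS in /proc/self/status is reported in kB, hence the >> 10. The psutil call, shown standalone (the value will of course vary per machine):

import os

import psutil

# Resident set size of the current Python process, in bytes;
# shifting right by 20 bits divides by 2**20, i.e. converts bytes to MiB.
rss_bytes = psutil.Process(os.getpid()).memory_info().rss
print(rss_bytes >> 20, "MiB currently used by this process")
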
Writing to disk
If the node's used memory is greater than the preset limit, the following code triggers writing the in-memory data to disk.
def mergeCombiners(self, iterator, check=True):
        """ Merge (K,V) pair by mergeCombiner """
        iterator = iter(iterator)
        # speedup attribute lookup
        d, comb, batch = self.data, self.agg.mergeCombiners, self.batch
        c = 0
        for k, v in iterator:
            d[k] = comb(d[k], v) if k in d else v
            if not check:
                continue
            c += 1
            if c % batch == 0 and get_used_memory() > self.memory_limit:
                self._spill()
                self._partitioned_mergeCombiners(iterator, self._next_limit())
                break

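Stripped of the pyspark specifics, the pattern above is: merge pairs into a dict and, every batch items, compare the process memory against a limit and spill if it is exceeded. Below is a simplified standalone sketch of that idea; merge_with_spill, used_mb and the default limits are made up, and it spills the whole dict to a single temp file instead of one file per partition as pyspark does:

import os
import pickle
import tempfile

import psutil


def used_mb():
    # Same check pyspark's get_used_memory() performs: resident set size in MiB.
    return psutil.Process(os.getpid()).memory_info().rss >> 20


def merge_with_spill(pairs, combine, memory_limit_mb=512, batch=1000):
    """Merge (K, V) pairs into a dict, spilling to a temp file when memory exceeds the limit."""
    data, spill_files = {}, []
    for n, (k, v) in enumerate(pairs, start=1):
        data[k] = combine(data[k], v) if k in data else v
        # Only look at memory every `batch` items, since the check itself has a cost.
        if n % batch == 0 and used_mb() > memory_limit_mb:
            fd, path = tempfile.mkstemp(suffix=".spill")
            with os.fdopen(fd, "wb") as f:
                pickle.dump(data, f)     # dump the partial result to disk ...
            spill_files.append(path)
            data.clear()                 # ... and release the in-memory copy
    return data, spill_files
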
And this is the code that actually writes the data to disk when the used memory is greater than the preset limit.
def _spill(self):
        """
        dump already partitioned data into disks.
        It will dump the data in batch for better performance.
        """
        global MemoryBytesSpilled, DiskBytesSpilled
        path = self._get_spill_dir(self.spills)
        if not os.path.exists(path):
            os.makedirs(path)
        used_memory = get_used_memory()
        if not self.pdata:
            # The data has not been partitioned, it will iterator the
            # dataset once, write them into different files, has no
            # additional memory. It only called when the memory goes
            # above limit at the first time.
            # open all the files for writing
            streams = [open(os.path.join(path, str(i)), 'w')
                       for i in range(self.partitions)]
            for k, v in self.data.iteritems():
                h = self._partition(k)
                # put one item in batch, make it compatible with load_stream
                # it will increase the memory if dump them in batch
                self.serializer.dump_stream([(k, v)], streams[h])
            for s in streams:
                DiskBytesSpilled += s.tell()
                s.close()
            self.data.clear()
            self.pdata = [{} for i in range(self.partitions)]
        else:
            for i in range(self.partitions):
                p = os.path.join(path, str(i))
                with open(p, "w") as f:
                    # dump items in batch
                    self.serializer.dump_stream(self.pdata[i].iteritems(), f)
                self.pdata[i].clear()
                DiskBytesSpilled += os.path.getsize(p)
        self.spills += 1
        gc.collect()  # release the memory as much as possible
        MemoryBytesSpilled += (used_memory - get_used_memory()) << 20

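The memory_limit those checks compare against comes from the spark.python.worker.memory setting (512m by default), so besides installing psutil, another knob when spilling hurts a job is to raise that limit, assuming the workers actually have the RAM. A sketch; the app name and the 1g value are arbitrary:

from pyspark import SparkConf, SparkContext

# Raise the per-Python-worker aggregation memory limit so spilling kicks in later.
conf = (SparkConf()
        .setAppName("join-example")
        .set("spark.python.worker.memory", "1g"))
sc = SparkContext(conf=conf)
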
Regarding python - What does pyspark need psutil for? (faced "UserWarning: Please install psutil to have better support with spilling")?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51226469/
