Problem description
I've got a lousy HTTPD access_log and just want to skip the "lousy" lines.
In Scala this is straightforward:
import scala.util.Try
val log = sc.textFile("access_log")
log.map(_.split(' ')).map(a => Try(a(8))).filter(_.isSuccess).map(_.get).map(code => (code,1)).reduceByKey(_ + _).collect()
For Python I've got the following solution, explicitly defining a function instead of using the "lambda" notation:
log = sc.textFile("access_log")
def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'
log.map(lambda s : s.split(' ')).map(wrapException).filter(lambda s : s!='error').map(lambda code : (code,1)).reduceByKey(lambda acu,value : acu + value).collect()
Is there a better way to do this (e.g. like in Scala) in pyspark?
Thanks a lot!
Recommended answer
"Better" is a subjective term, but there are a few approaches you can try.
In this particular case the simplest thing you can do is to avoid exceptions altogether. All you need is a flatMap and some slicing:
log.flatMap(lambda s : s.split(' ')[8:9])
As you can see, this means there is no need for exception handling or a subsequent filter.
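For completeness, a minimal sketch (not part of the original answer) of how the full status-code count from the question could look with this approach:
# Slicing with [8:9] yields a one-element list for well-formed lines
# and an empty list otherwise, so flatMap silently drops the bad lines.
log.flatMap(lambda s: s.split(' ')[8:9]) \
   .map(lambda code: (code, 1)) \
   .reduceByKey(lambda acu, value: acu + value) \
   .collect()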
The previous idea can be extended with a simple wrapper:
def seq_try(f, *args, **kwargs):
    try:
        return [f(*args, **kwargs)]
    except:
        return []
and example usage:
from operator import div # FYI operator provides getitem as well.
rdd = sc.parallelize([1, 2, 0, 3, 0, 5, "foo"])
rdd.flatMap(lambda x: seq_try(div, 1., x)).collect()
## [1.0, 0.5, 0.3333333333333333, 0.2]
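Applied back to the access_log example from the question, a sketch (assuming the same log RDD as above; not part of the original answer) could look like this:
from operator import getitem

# seq_try returns [parts[8]] on success and [] on failure,
# so no explicit filter step is needed.
log.flatMap(lambda s: seq_try(getitem, s.split(' '), 8)) \
   .map(lambda code: (code, 1)) \
   .reduceByKey(lambda acu, value: acu + value) \
   .collect()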
Finally, a more OO approach:
import inspect as _inspect

class _Try(object): pass

class Failure(_Try):
    def __init__(self, e):
        if Exception not in _inspect.getmro(e.__class__):
            msg = "Invalid type for Failure: {0}"
            raise TypeError(msg.format(e.__class__))
        self._e = e
        self.isSuccess = False
        self.isFailure = True
    def get(self): raise self._e
    def __repr__(self):
        return "Failure({0})".format(repr(self._e))

class Success(_Try):
    def __init__(self, v):
        self._v = v
        self.isSuccess = True
        self.isFailure = False
    def get(self): return self._v
    def __repr__(self):
        return "Success({0})".format(repr(self._v))

def Try(f, *args, **kwargs):
    try:
        return Success(f(*args, **kwargs))
    except Exception as e:
        return Failure(e)
and example usage:
tries = rdd.map(lambda x: Try(div, 1.0, x))
tries.collect()
## [Success(1.0),
## Success(0.5),
## Failure(ZeroDivisionError('float division by zero',)),
## Success(0.3333333333333333),
## Failure(ZeroDivisionError('float division by zero',)),
## Success(0.2),
## Failure(TypeError("unsupported operand type(s) for /: 'float' and 'str'",))]
tries.filter(lambda x: x.isSuccess).map(lambda x: x.get()).collect()
## [1.0, 0.5, 0.3333333333333333, 0.2]
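With this wrapper, the pipeline from the question can mirror the Scala version almost line for line; a sketch (again assuming the same log RDD, not part of the original answer):
from operator import getitem

# Wrap the indexing in Try, keep the successes, then count status codes.
log.map(lambda s: s.split(' ')) \
   .map(lambda a: Try(getitem, a, 8)) \
   .filter(lambda x: x.isSuccess) \
   .map(lambda x: x.get()) \
   .map(lambda code: (code, 1)) \
   .reduceByKey(lambda acu, value: acu + value) \
   .collect()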
You can even use pattern matching with multipledispatch:
from multipledispatch import dispatch
from operator import getitem
@dispatch(Success)
def check(x): return "Another great success"
@dispatch(Failure)
def check(x): return "What a failure"
a_list = [1, 2, 3]
check(Try(getitem, a_list, 1))
## 'Another great success'
check(Try(getitem, a_list, 10))
## 'What a failure'
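As a small illustrative extension (not part of the original answer), the same dispatched check can be applied to the tries RDD from the earlier example once its results are collected locally:
# Classify each collected Try result by its type via multipledispatch.
[check(t) for t in tries.collect()]
## ['Another great success', 'Another great success', 'What a failure',
##  'Another great success', 'What a failure', 'Another great success', 'What a failure']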
If you like this approach, I've pushed a slightly more complete implementation to GitHub and pypi.