python - Probability of a value in Pandas

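I'm trying to find the probability of a given word within a DataFrame, but my current setup raises AttributeError: 'Series' object has no attribute 'columns'. I'm hoping you can help me find where the error is.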

I start with a DataFrame that looks like the one below and use the function that follows to turn it into a total count for each word.

query          count
foo bar        10
super          8
foo            4
super foo bar  2

The function is as follows:

def _words(df):
    return df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

This produces the following df (note that 'foo' is 16 because it appears 16 times across the whole df):

bar      12
foo      16
super    10
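
For anyone who wants to reproduce the totals above, here is a minimal self-contained sketch. The sample frame is rebuilt from the table in the question, and the variable name totals is my own choice, not something from the original post.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    'query': ['foo bar', 'super', 'foo', 'super foo bar'],
    'count': [10, 8, 4, 2],
})

def _words(df):
    # get_dummies builds one 0/1 indicator column per word;
    # transposing and taking the dot product with 'count'
    # sums the counts of every row that contains the word.
    return df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

totals = _words(df)
print(totals)
```

Note that the result is a Series indexed by word, not a DataFrame, so it has no columns to select or group by.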

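The problem comes when I try to find the probability of a given keyword within the df, which currently has no column name attached. Below is what I'm working with at the moment, but it throws the "AttributeError: 'Series' object has no attribute 'columns'" error.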

def _probability(df, query):
  return df[query] / df.groupby['count'].sum()

My hope is that calling _probability(df, 'foo') will return 0.421052632 (16 / (12 + 16 + 10)). Thanks in advance!

Best answer

You can throw a pipe on the end of it:

df['query'].str.get_dummies(sep=' ').T.dot(df['count']).pipe(lambda x: x / x.sum())

bar      0.315789
foo      0.421053
super    0.263158
dtype: float64

From scratch:
This is more complicated, but faster.

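import numpy as np
import pandas as pd

q = df['query'].values
# Repeat each count once per word in its query (number of spaces + 1).
# np.char.count is the public name for numpy.core.defchararray.count.
c = df['count'].values.repeat(np.char.count(q.astype(str), ' ') + 1)
# Integer code for each word (f) and the unique words (u)
f, u = pd.factorize(np.array(' '.join(q.tolist()).split(), dtype=object))
# Weighted bincount sums the repeated counts per word
b = np.bincount(f, c)
pd.Series(b / b.sum(), u)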

foo      0.421053
bar      0.315789
super    0.263158
dtype: float64
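
To make the second approach easier to follow, here is a self-contained sketch that names each intermediate step. The sample frame is rebuilt from the question, the names n_words, weights, words, codes and uniques are my own, and I use np.char.count as the public spelling of the count function imported above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    'query': ['foo bar', 'super', 'foo', 'super foo bar'],
    'count': [10, 8, 4, 2],
})

q = df['query'].values

# Words per query = number of spaces + 1
n_words = np.char.count(q.astype(str), ' ') + 1

# Repeat each row's count once per word, so weights lines up with words
weights = df['count'].values.repeat(n_words)

# Flatten all queries into a single list of words
words = ' '.join(q.tolist()).split()

# codes[i] is the integer id of words[i]; uniques holds the distinct
# words in order of first appearance
codes, uniques = pd.factorize(np.array(words, dtype=object))

# Sum the weights that share the same code
b = np.bincount(codes, weights)

probs = pd.Series(b / b.sum(), index=uniques)
print(probs)
```

The words come out as foo, bar, super here because factorize keeps the order of first appearance, whereas get_dummies sorts its columns alphabetically. The probabilities themselves are the same in both approaches.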

Regarding python - Probability of a value in Pandas, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46655202/