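I am following a tutorial on multi-class text classification, and I am trying to find a way to predict the tags of a recipe with a supervised approach, using a JSON file in the following format: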

{
"title": "Turtle Cheesecake",
"summary": "Cheesecake is a staple at the Market, but it’s different nearly every day because we vary the toppings, crusts, and flavorings. Cookie crusts are particularly good with cheesecakes. If you prefer your cheesecake plain, just serve it without the topping",
"ingr": [
  "1½ cups graham cracker crumbs",
  "½ cup finely chopped pecans (pulse in a food processor several times)",
  "6 tablespoons ( ¾ stick) unsalted butter, melted",
  "1½ pounds cream cheese, softened",
  "¾ cup sugar",
  "2 tablespoons all purpose flour",
  "3 large eggs",
  "1large egg yolk",
  "½ cup heavy cream",
  "2 teaspoons pure vanilla extract",
  "1 cup sugar",
  "1 cup heavy cream",
  "½ teaspoon pure vanilla extract",
  "½ cup coarsely chopped pecans, toasted",
  "2 ounces semisweet chocolate, melted"
],
"prep": "To Make the Crust:\n\n\n\n Grease a 9-inch springform pan. Wrap the outside of the pan, including the bottom, with a large square of aluminum foil. Set aside.\n\n\n\...",
"tag": [
  "Moderate",
  "Casual Dinner Party",
  "Family Get-together",
  "Formal Dinner Party",
  "dessert",
  "dinner",
  "cake",
  "cheesecake",
  "dessert"
]
}


This is the code I am running, which raises a TypeError:

import pandas as pd

df = pd.read_json('tagged-sample.json')
######################### Data Exploration #######################

from io import StringIO

col = ['tag', 'summary']
df = df[col]
df = df[pd.notnull(df['summary'])]

df.columns = ['tag', 'summary']

df['category_id'] = df['tag'].factorize()[0]


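What can I do to use pandas.factorize on the "tag" column loaded from the JSON file? The tutorial does this on a CSV file, which may be the difference.
This is the error: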

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-d471748e6818> in <module>()
     12 df.columns = ['tag', 'summary']
     13
---> 14 df['category_id'] = df['tag'].factorize()[0]
     15
     16 #[['tag', 'category_id']].sort_values('category_id')

~\Anaconda3\lib\site-packages\pandas\core\base.py in factorize(self, sort, na_sentinel)
   1155     @Appender(algorithms._shared_docs['factorize'])
   1156     def factorize(self, sort=False, na_sentinel=-1):
-> 1157         return algorithms.factorize(self, sort=sort, na_sentinel=na_sentinel)
   1158
   1159     _shared_docs['searchsorted'] = (

~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    175                 else:
    176                     kwargs[new_arg_name] = new_arg_value
--> 177             return func(*args, **kwargs)
    178         return wrapper
    179     return _deprecate_kwarg

~\Anaconda3\lib\site-packages\pandas\core\algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    628                                            na_sentinel=na_sentinel,
    629                                            size_hint=size_hint,
--> 630                                            na_value=na_value)
    631
    632     if sort and len(uniques) > 0:

~\Anaconda3\lib\site-packages\pandas\core\algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
    474     uniques = vec_klass()
    475     labels = table.get_labels(values, uniques, 0, na_sentinel,
--> 476                               na_value=na_value)
    477
    478     labels = _ensure_platform_int(labels)

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels()

TypeError: unhashable type: 'list'

Best answer

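If you call pd.factorize(s), where s is a pandas Series, every element of the Series must be hashable. In the question, each value in the 'tag' column is a Python list, and lists are not hashable, hence the TypeError.

For example: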

>>> s = pd.Series([1, 2, [3, 4, 5]])
>>> s
0            1
1            2
2    [3, 4, 5]
dtype: object
>>> pd.factorize(s)  # this will raise

>>> pd.factorize(s.drop(2))  # this is okay
(array([0, 1]), Int64Index([1, 2], dtype='int64'))
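
To find out which elements are the problem before calling factorize, one option is to test each element for hashability. The sketch below is only an illustration: it reuses the example Series from above and relies on pandas.api.types.is_hashable, and the variable names mask and unhashable are made up for this example.

```python
import pandas as pd
from pandas.api.types import is_hashable

# Same example Series as above: two ints and one list.
s = pd.Series([1, 2, [3, 4, 5]])

# is_hashable tries hash(x) and returns False if that raises TypeError.
mask = s.apply(is_hashable)

# Keep only the elements that would make factorize fail.
unhashable = s[~mask]
print(unhashable)
```

On the DataFrame from the question, the same check applied to df['tag'] should flag every row whose tag value is a list.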


One way to work around this (I am not sure what your end goal is) is to convert the list elements to tuples, which are hashable:

>>> s.apply(lambda x: tuple(x) if isinstance(x, list) else x)
0            1
1            2
2    (3, 4, 5)
dtype: object

>>> pd.factorize(s.apply(lambda x: tuple(x) if isinstance(x, list) else x))
(array([0, 1, 2]), Index([1, 2, (3, 4, 5)], dtype='object'))
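
Applied to the DataFrame from the question, the same idea might look like the sketch below. The small frame here is a hypothetical stand-in for the contents of 'tagged-sample.json' (the rows and summaries are invented for illustration); only the column names 'tag', 'summary', and 'category_id' come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the frame loaded with pd.read_json(...).
df = pd.DataFrame({
    "summary": ["A rich cheesecake", "A quick soup", "Another cheesecake"],
    "tag": [["dessert", "cake"], ["dinner"], ["dessert", "cake"]],
})

# Lists are unhashable, so turn each list into a tuple first.
df["tag"] = df["tag"].apply(lambda x: tuple(x) if isinstance(x, list) else x)

# Now every element is hashable and factorize works.
df["category_id"] = df["tag"].factorize()[0]

print(df[["tag", "category_id"]])
```

Note that with this approach each distinct combination of tags becomes one category, so two recipes share a category_id only if their tag lists are identical and in the same order. Whether that is what you want depends on the goal: if each recipe should carry several independent labels, a multi-label encoding is probably a better fit than factorize.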

10-08 20:07