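I have the data frame below. I want to create one column for each value in the categories column (e.g. Sandwiches, Restaurants, ...). Each column would hold 0 or 1, indicating whether the record has that value. Is this something I can do with get_dummies, or can someone suggest another way?
Code: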
print df1[1:3]
Sample data:
address \
4 4719 N 20Th St
14 9616 E Independence Blvd
attributes business_id \
4 {u'GoodForMeal': {u'dessert': False, u'latenig... duHFBe87uNSXImQmvBh87Q
14 {u'Alcohol': u'full_bar', u'HasTV': True, u'No... SDMRxmcKPNt1AHPBKqO64Q
categories city \
4 [Sandwiches, Restaurants] Phoenix
14 [Burgers, Bars, Restaurants, Sports Bars, Nigh... Matthews
hours is_open latitude \
4 {} 0 33.505928
14 {u'Monday': u'11:00-0:00', u'Tuesday': u'11:00... 1 35.135196
longitude name neighborhood postal_code review_count stars state
4 -112.038847 Blimpie 85016 10 4.5 AZ
14 -80.714683 Applebee's 28105 21 2.0 NC
Update:
testdummies = pd.concat(df1["categories"],pd.get_dummies(df1["categories"]))
testdummies.head()
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-1dae1382c2ba> in <module>()
1 # 13) create dummy variables for Categories
2
----> 3 testdummies = pd.concat(df1["categories"],pd.get_dummies(df1["categories"]))
4 testdummies.head()
/Users/anaconda/lib/python2.7/site-packages/pandas/core/reshape.pyc in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
1102 else:
1103 result = _get_dummies_1d(data, prefix, prefix_sep, dummy_na,
-> 1104 sparse=sparse, drop_first=drop_first)
1105 return result
1106
/Users/anaconda/lib/python2.7/site-packages/pandas/core/reshape.pyc in _get_dummies_1d(data, prefix, prefix_sep, dummy_na, sparse, drop_first)
1109 sparse=False, drop_first=False):
1110 # Series avoids inconsistent NaN handling
-> 1111 codes, levels = _factorize_from_iterable(Series(data))
1112
1113 def get_empty_Frame(data, sparse):
/Users/anaconda/lib/python2.7/site-packages/pandas/core/categorical.pyc in _factorize_from_iterable(values)
2038 codes = values.codes
2039 else:
-> 2040 cat = Categorical(values, ordered=True)
2041 categories = cat.categories
2042 codes = cat.codes
/Users/anaconda/lib/python2.7/site-packages/pandas/core/categorical.pyc in __init__(self, values, categories, ordered, name, fastpath)
288 codes, categories = factorize(values, sort=True)
289 except TypeError:
--> 290 codes, categories = factorize(values, sort=False)
291 if ordered:
292 # raise, as we don't have a sortable data structure and so
/Users/anaconda/lib/python2.7/site-packages/pandas/core/algorithms.pyc in factorize(values, sort, order, na_sentinel, size_hint)
311 table = hash_klass(size_hint or len(vals))
312 uniques = vec_klass()
--> 313 labels = table.get_labels(vals, uniques, 0, na_sentinel, True)
314
315 labels = _ensure_platform_int(labels)
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_labels (pandas/hashtable.c:15447)()
TypeError: unhashable type: 'list'
Update:
Code:
bus_rev_cat = pd.get_dummies(bus_rev['categories'].apply(pd.Series))
bus_rev2 = pd.concat([bus_rev,bus_rev_cat],axis=1)
print(bus_rev2[1:10])
Sample Data:
user_id business_id stars_x \
1 CxDOIDnH8gp9KXzpBHJYXw XSiqtcVEsP6dLOL7ZA9OxA 4
2 CxDOIDnH8gp9KXzpBHJYXw v95ot_TNwTk1iJ5n56dR0g 3
3 CxDOIDnH8gp9KXzpBHJYXw uloYxyRAMesZzI99mfNInA 2
4 CxDOIDnH8gp9KXzpBHJYXw gtcsOodbmk4E0TulYHnlHA 4
5 CxDOIDnH8gp9KXzpBHJYXw lOd50CiDJeNWmN_KsvR2rg 3
6 CxDOIDnH8gp9KXzpBHJYXw 7hUp4XxmUCGqvPFAM8IJww 3
7 CxDOIDnH8gp9KXzpBHJYXw Ze4VPogvcD7inc3QuvY_yg 2
8 CxDOIDnH8gp9KXzpBHJYXw txAKid34IUd9spo6MLF_Sw 3
9 CxDOIDnH8gp9KXzpBHJYXw oiknQaNH9cGC6UBWC8S_Zg 3
address attributes \
1 522 Yonge Street {u'BusinessParking': {u'garage': False, u'stre...
2 1661 Denison Street {u'BusinessParking': {u'garage': False, u'stre...
3 4101 Rutherford Road {u'BusinessParking': {u'garage': False, u'stre...
4 815 W Bloor Street {u'Alcohol': u'full_bar', u'HasTV': False, u'N...
5 114 Laird Drive {u'GoodForMeal': {u'dessert': False, u'latenig...
6 300 Borough Dr, 215 {u'BusinessParking': {u'garage': False, u'stre...
7 5117 Sheppard Avenue E {u'BusinessParking': {u'garage': False, u'stre...
8 205 Main St {u'BusinessParking': {u'garage': False, u'stre...
9 6347 Yonge Street {u'GoodForMeal': {u'dessert': False, u'latenig...
categories city \
1 [Restaurants, Ramen, Japanese] Toronto
2 [Chinese, Seafood, Restaurants] Markham
3 [Italian, Restaurants] Woodbridge
4 [Food, Coffee & Tea, Sandwiches, Cafes, Cockta... Toronto
5 [Japanese, Sushi Bars, Restaurants] East York
6 [Restaurants, Canadian (New), Steakhouses, Ame... Scarborough
7 [Canadian (New), Restaurants, Breakfast & Brunch] Toronto
8 [Italian, Restaurants, Canadian (New)] Markham
9 [Restaurants, Korean] Toronto
hours is_open latitude \
1 {u'Monday': u'11:00-22:00', u'Tuesday': u'11:0... 1 43.663689
2 {} 0 43.834295
3 {u'Monday': u'12:00-22:00', u'Tuesday': u'12:0... 1 43.823486
4 {u'Monday': u'12:00-2:00', u'Tuesday': u'12:00... 1 43.662726
5 {u'Tuesday': u'17:00-22:00', u'Friday': u'17:0... 0 43.706665
6 {u'Monday': u'11:00-0:00', u'Tuesday': u'11:00... 1 43.776146
7 {u'Monday': u'0:00-0:00', u'Tuesday': u'0:00-0... 1 43.793599
8 {} 1 43.868463
9 {} 0 43.796237
... 6_Pizza 6_Restaurants 7_Bars 7_Canadian (New) 7_French \
1 ... 0 0 0 0 0
2 ... 0 0 0 0 0
3 ... 0 0 0 0 0
4 ... 0 0 1 0 0
5 ... 0 0 0 0 0
6 ... 0 0 0 0 0
7 ... 0 0 0 0 0
8 ... 0 0 0 0 0
9 ... 0 0 0 0 0
7_Restaurants 8_Mediterranean 8_Nightlife 8_Southern 8_Specialty Food
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 1 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
[9 rows x 149 columns]
Best answer
You can use get_dummies to do what you want:
import pandas as pd
df = pd.DataFrame({"Categorical": ["a", "b", "c", "a"]})
df
>>> Categorical
0 a
1 b
2 c
3 a
pd.concat([df, pd.get_dummies(df["Categorical"])], axis=1)
>>> Categorical a b c
0 a 1 0 0
1 b 0 1 0
2 c 0 0 1
3 a 1 0 0
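Note that the example above uses a column of plain strings, whereas the categories column in the question holds a list in every cell. Lists are not hashable, which is what raises TypeError: unhashable type: 'list' when the column is passed straight to pd.get_dummies. Below is a minimal sketch of two possible workarounds, not necessarily what the original answerer had in mind. It assumes every cell is a list of strings and that no category name contains the "|" character; the small data frame is made up for illustration, and the second option assumes pandas 0.25 or newer, where Series.explode is available.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: each cell of "categories" is a list.
df = pd.DataFrame({
    "name": ["Blimpie", "Applebee's"],
    "categories": [
        ["Sandwiches", "Restaurants"],
        ["Burgers", "Bars", "Restaurants"],
    ],
})

# Option 1: join each list into a single "|"-delimited string, then let
# Series.str.get_dummies split it back into one 0/1 column per category.
# Assumption: "|" never appears inside a category name.
dummies = df["categories"].str.join("|").str.get_dummies(sep="|")

# Option 2 (assumes pandas >= 0.25): put each list element on its own row,
# build dummies per element, then collapse back to one row per original index.
exploded = df["categories"].explode()
dummies_alt = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

# Attach the indicator columns to the original frame.
result = pd.concat([df, dummies], axis=1)
print(result)
```

Either option yields one column per distinct category with no positional prefix, unlike the apply(pd.Series) attempt in the question, which produces names such as 6_Pizza and 7_Bars because it makes dummies separately for each list position.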
This question, on creating binary indicator variables for column values in Python, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/46940960/