I have a table of containers of various types (df_1). I have another table of their contents (df_2). I want to assess which rows of df_1 are more likely to be classified as their true type, based on whether each row's contents are typical of that type of container.
import pandas as pd

df_1 = pd.DataFrame({'Container': [1,2,3,4,5,6,7,8],
                     'Type': ['Box','Bag','Bin','Bag','Bin','Box','Bag','Bin']})
df_2 = pd.DataFrame({'Container': [1,1,1,1,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,7,7,7,8],
                     'Item': ['Ball','Ball','Brain','Ball','Ball','Baloon','Brain','Ball','Ball','Baloon','Brain','Ball','Ball','Baloon','Brain','Ball','Ball','Baloon','Bomb','Ball','Ball','Baloon','Brain','Ball','Ball','Bomb']})
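For reference, a quick way to see each item next to its container's declared type is to merge the two frames on Container. This is just a convenience sketch (the name combined is mine), not part of the question:

# Contents of each container alongside its declared type
combined = df_2.merge(df_1, on='Container')
print(combined.head())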
Best Answer
The following approach considers whether the contents of each container are typical of its type. It gives equal weight to items that are found in other containers of that type (positive) and items that are not found in them (negative). For example, a Box holding three items that also appear in other Boxes and one item that does not scores 3 - 1 = 2. It ignores how often an item is found in the other containers, and it also ignores whether the contents are typical of another type of container.
I think this approach would scale.
import pandas as pd
import numpy as np

# List of how typical the contents of each container are, given the type of container
x = []

# Join the container types onto the contents
df_J = df_1.set_index('Container').join(df_2.set_index('Container'))
df_J['Container'] = df_J.index
df_J.index = range(len(df_J.index))
df_J['Thing'] = 1

# Type of each container
Q_B = pd.DataFrame(df_1.Container).set_index('Container')
Q_B['Type'] = df_1.set_index('Container').Type
Di_Q_B = dict(zip(Q_B.index, Q_B.Type))

# Compare each container against all of the other containers
for Container in df_1.Container:
    # Test data: everything in the container
    Te_C = df_2[df_2['Container'] == Container].copy()
    del Te_C['Container']
    # Everything in all of the other containers
    Tr_C = df_J[df_J['Container'] != Container]
    # Training data: everything in all of the other containers of that type
    Tr_E = Tr_C[Tr_C['Type'] == Di_Q_B[Container]]
    # Table of how many of each item is in each container
    S_Tr = pd.pivot_table(Tr_E, values='Thing', index=Tr_E.Container, columns='Item', aggfunc='sum').fillna(0)
    # Table of whether each item is in each container
    Q_Tr = S_Tr.apply(np.sign)
    # Table of how many containers in the training data contain each item
    X_Tr = Q_Tr.sum(axis=0)
    Y_Tr = pd.DataFrame(X_Tr)
    # Table of whether any containers in the training data contain each item
    Z_Tr = Y_Tr.apply(np.sign)
    # List of which items are in the training data
    Train = list(Z_Tr.index)
    # Identify which of the items in the container are typical
    Te_C['Typical'] = Te_C['Item'].map(lambda a: a in Train)
    # Count how many typical items are in the container
    u = Te_C['Typical'].sum()
    # Count how many atypical items are in the container
    v = len(Te_C.index) - u
    # Gauge how typical the contents of the container are (giving equal weight to typical and atypical items)
    w = u - v
    x.append(w)

# How typical the contents of each container are, given the type of container
df_1['Pa_Ty'] = x
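Tracing the loop by hand on the sample data above, df_1 should come out roughly as follows (a higher Pa_Ty means the contents look more typical for the container's declared type):

   Container Type  Pa_Ty
0          1  Box      2
1          2  Bag      3
2          3  Bin      2
3          4  Bag      4
4          5  Bin      4
5          6  Box      1
6          7  Bag      3
7          8  Bin      1

The same score can also be computed without the explicit loop. The sketch below is my own compact rewrite of the idea (the names score and Pa_Ty_alt are mine, not from the answer): for each container, count its items that appear in other containers of the same type, minus the items that do not.

import pandas as pd

# Contents joined to container types
df_m = df_2.merge(df_1, on='Container')

def score(group):
    # Items found in *other* containers of this container's type
    container = group.name
    ctype = group['Type'].iloc[0]
    others = df_m[(df_m['Type'] == ctype) & (df_m['Container'] != container)]
    known = set(others['Item'])
    # +1 for every item seen elsewhere in this type, -1 for every item that is not
    typical = group['Item'].isin(known)
    return int(typical.sum() - (~typical).sum())

scores = df_m.groupby('Container')[['Item', 'Type']].apply(score)
df_1['Pa_Ty_alt'] = df_1['Container'].map(scores)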
Regarding python - quality of discrete data, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57255055/