Problem description
I'm running into a challenge using the FuzzyWuzzy library to store all my results in a data frame column (I'm guessing it might require a loop?). I've been scratching my head over this all day, so I'd like to see if any of you can help me with a solution. It would be super helpful!
As an example of what I'm trying to do, here are two data frame tables:
Master table
+----+-----------------+
| ID | ITEM            |
+----+-----------------+
| 1  | Pepperoni Pizza |
| 2  | Cheese Pizza    |
| 3  | Chicken Salad   |
| 4  | Plain Salad     |
+----+-----------------+
Lookup table
+--------------+---+
| LOOKUP VALUE | - |
+--------------+---+
| Cheese       | - |
| Salad        | - |
+--------------+---+
Essentially, I'm trying to match the lookup table's values against the entire list of values in the Master table and store the results in a third table.
Here is what I'd like the final output to look like:
+--------------+----------------------------+-------------------+
| LOOKUP VALUE | MATCHED VALUES             | MATCHED VALUE IDS |
+--------------+----------------------------+-------------------+
| Cheese       | Cheese Pizza               | 2                 |
| Salad        | Chicken Salad, Plain Salad | 3,4               |
+--------------+----------------------------+-------------------+
I know the very basics of FuzzyWuzzy; here's how I started:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
choices = ["Pepperoni Pizza","Cheese Pizza","Chicken Salad", "Plain Salad"]
process.extract("salad",choices,limit=2)
Output = [('Chicken Salad', 90), ('Plain Salad', 90)]
Great, but how do you do that in a systematic way, running all my lookup values against all the values in the master table?
Thanks so much for reading!
Recommended answer
It's not a good idea to store lists in a DataFrame; I suggest storing every match as its own row in the DataFrame instead. Here is the code:
import io

import pandas as pd
from fuzzywuzzy import process

master = pd.read_csv(io.StringIO("""ID,ITEM
1,Pepperoni Pizza
2,Cheese Pizza
3,Chicken Salad
4,Plain Salad"""))
lookups = ["Cheese", "Salad"]
# Map each ID to its ITEM; with a dict, process.extract returns (value, score, key) tuples.
choices = master.set_index("ID").ITEM.to_dict()
# One row per (lookup, match): prepend the lookup value to each result tuple.
res = [(lookup,) + item for lookup in lookups for item in process.extract(lookup, choices, limit=2)]
df = pd.DataFrame(res, columns=["lookup", "matched", "score", "id"])
df
Output:
   lookup        matched  score  id
0  Cheese   Cheese Pizza     90   2
1  Cheese  Chicken Salad     45   3
2   Salad  Chicken Salad     90   3
3   Salad    Plain Salad     90   4
Basically, I create a choices dict from master to match against, then loop over lookups and collect the results in a list. Finally, I convert the list to a DataFrame.
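If you still want the one-row-per-lookup layout from the question, the match rows can be filtered and grouped afterwards. The following is only a rough sketch in plain Python: the rows are copied from the output shown above, while MIN_SCORE and group_matches are hypothetical names introduced here, and the cutoff of 80 is an assumption you would want to tune for your own data.

```python
from collections import defaultdict

# Rows copied from the output above: (lookup, matched, score, id).
res = [
    ("Cheese", "Cheese Pizza", 90, 2),
    ("Cheese", "Chicken Salad", 45, 3),
    ("Salad", "Chicken Salad", 90, 3),
    ("Salad", "Plain Salad", 90, 4),
]

MIN_SCORE = 80  # hypothetical cutoff, not part of the original answer


def group_matches(rows, min_score=MIN_SCORE):
    """Collect matched values and ids per lookup, dropping weak matches."""
    grouped = defaultdict(lambda: {"values": [], "ids": []})
    for lookup, matched, score, item_id in rows:
        if score >= min_score:
            grouped[lookup]["values"].append(matched)
            grouped[lookup]["ids"].append(str(item_id))
    # Dicts keep insertion order, so lookups come out in first-seen order.
    return [
        (lookup, ", ".join(g["values"]), ",".join(g["ids"]))
        for lookup, g in grouped.items()
    ]


summary = group_matches(res)
for row in summary:
    print(row)
```

Passing summary to pd.DataFrame with three column names of your choice should give a table shaped like the one in the question. Note that a lookup whose matches all fall below the cutoff disappears from the result entirely, which may or may not be what you want.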