Creating a column of match results in a data frame with FuzzyWuzzy

Problem description

I'm running into a challenge using the FuzzyWuzzy library to store all my results in a data frame column (I'm guessing it might require a loop?). I've been scratching my head over this all day, so now I'd like to see if any of you can help me with a solution. It would be super helpful!

As an example of what I'm trying to do, here are two data frame tables:

Master table

+----+-----------------+
| ID |      ITEM       |
+----+-----------------+
|    |                 |
| 1  | Pepperoni Pizza |
|    |                 |
| 2  | Cheese Pizza    |
|    |                 |
| 3  | Chicken Salad   |
|    |                 |
| 4  | Plain Salad     |
+----+-----------------+

Lookup table

+--------------+---+
| LOOKUP VALUE | - |
+--------------+---+
|              |   |
| Cheese       | - |
|              |   |
| Salad        | - |
+--------------+---+

Essentially, I'm trying to match each of the lookup table's values against the entire list of values in the master table, and store the results in a third table.

Here's what I'd like the final output to look like:

+--------------+----------------------------+-------------------+
| LOOKUP VALUE |       MATCHED VALUES       | MATCHED VALUE IDS |
+--------------+----------------------------+-------------------+
|              |                            |                   |
| Cheese       | Cheese Pizza               | 2                 |
|              |                            |                   |
| Salad        | Chicken Salad, Plain Salad | 3,4               |
+--------------+----------------------------+-------------------+

I know the very basics of FuzzyWuzzy. Here's how I started:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

choices = ["Pepperoni Pizza","Cheese Pizza","Chicken Salad", "Plain Salad"]
process.extract("salad",choices,limit=2)

Output = [('Chicken Salad', 90), ('Plain Salad', 90)]

Great, but how do I do that systematically, running all my lookup values against all the values in the master table?

Thanks a lot for reading!

Recommended answer

Storing lists in a DataFrame is not a good idea; I suggest storing every match as its own row instead. Here is the code:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

import pandas as pd
import io

master = pd.read_csv(io.StringIO("""ID,ITEM
1,Pepperoni Pizza
2,Cheese Pizza
3,Chicken Salad
4,Plain Salad"""))

lookups = ["Cheese", "Salad"]

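# Map ID -> ITEM; given a dict, process.extract returns (value, score, key) tuples
choices = master.set_index("ID").ITEM.to_dict()

# Build one (lookup, matched, score, id) row per match
res = [(lookup,) + item for lookup in lookups for item in process.extract(lookup, choices, limit=2)]
df = pd.DataFrame(res, columns=["lookup", "matched", "score", "id"])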
df

Output:

   lookup        matched  score  id
0  Cheese   Cheese Pizza     90   2
1  Cheese  Chicken Salad     45   3
2   Salad  Chicken Salad     90   3
3   Salad    Plain Salad     90   4

Basically, I create a choices dict from master to match against, then loop over the lookups and collect the results in a list. Finally, I convert the list to a DataFrame.
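
If you still want the one-row-per-lookup layout from the question, the row-per-match frame can be collapsed afterwards. The sketch below is only one possible way to do that and is not part of the original answer. It rebuilds the output table by hand, drops weak matches using an assumed cutoff (MIN_SCORE = 80 is a made-up value, chosen here so that the 45-point "Chicken Salad" hit for "Cheese" is removed), and joins the remaining matches per lookup. The names MIN_SCORE, summary, matched_values and matched_value_ids are illustrative.

```python
import pandas as pd

# Row-per-match results, copied by hand from the answer's output table.
df = pd.DataFrame(
    [
        ("Cheese", "Cheese Pizza", 90, 2),
        ("Cheese", "Chicken Salad", 45, 3),
        ("Salad", "Chicken Salad", 90, 3),
        ("Salad", "Plain Salad", 90, 4),
    ],
    columns=["lookup", "matched", "score", "id"],
)

# Assumed threshold, not from the original answer; tune it for your data.
MIN_SCORE = 80

# Keep only matches at or above the threshold.
good = df[df["score"] >= MIN_SCORE]

# Collapse to one row per lookup value, joining names and ids into strings.
summary = (
    good.groupby("lookup", sort=False)
    .agg(
        matched_values=("matched", lambda s: ", ".join(s)),
        matched_value_ids=("id", lambda s: ",".join(str(i) for i in s)),
    )
    .reset_index()
)

print(summary)
```

Keeping the row-per-match frame as the working data and building this summary only for display means you can still filter, sort, or join on score and id before collapsing.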


08-18 19:34