python-3.x - Extracting skills from job descriptions using TF-IDF or Word2Vec

I need to extract the skills an applicant must have from the available job descriptions and store them in a new column.
The DataFrame X looks like this:

Job_ID        Job_Desc
1             Applicant should posses technical capabilities including proficient knowledge of python and SQL
2             Applicant should posses technical capabilities including proficient knowledge of python and SQL and R

The resulting output should look like this:

Job_ID       Skills
1            Python,SQL
2            Python,SQL,R

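I have used a count vectorizer with tf-idf to get the most important words in the Job_Desc column, but I still cannot get the desired skills data in the output. Can this be achieved somehow with Word2Vec, using a skip-gram or CBOW model?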

My code looks like this:

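from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Build term counts, ignoring terms that appear in more than 50% of documents.
# Assumes X is the DataFrame shown above, so the text column is passed explicitly.
cv = CountVectorizer(max_df=0.50)
word_count_vector = cv.fit_transform(X["Job_Desc"])

# Learn IDF weights from the count matrix
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)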

def sort_coo(coo_matrix):
    """Sort (column index, score) pairs by descending score."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Get the feature names and tf-idf scores of the top n items."""

    # Use only the top n items from the vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        # Keep track of each feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # Map each feature name to its score
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]

    return results

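# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
feature_names = cv.get_feature_names_out()
doc = X["Job_Desc"].iloc[0]

# Score the first job description and keep its top 10 terms
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
sorted_items = sort_coo(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

print("\n=====Title=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k, keywords[k])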

Best answer

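I can't think of a way that TF-IDF, Word2Vec, or other simple/unsupervised algorithms could, on their own, identify the kinds of "skills" you need.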

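You'll likely need a large hand-curated list of skills, at the very least as a way to automatically evaluate any method that claims to extract skills.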
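As a minimal sketch of how a curated list could be applied directly, the snippet below matches a tiny, hypothetical skill list against each description using only the standard library. The names `SKILLS`, `build_skill_patterns` and `extract_skills` are illustrative and not part of the original question or answer, and a real list would be far larger and would need to handle aliases and ambiguous terms.

```python
import re

# Hypothetical hand-curated skill list (assumption); a real one would be much larger.
SKILLS = ["Python", "SQL", "R"]

def build_skill_patterns(skills):
    """Compile one case-insensitive, whole-token pattern per skill."""
    patterns = {}
    for skill in skills:
        # Lookarounds rather than \b, so a skill is only matched when it is
        # not embedded inside a longer word (e.g. "R" inside "proficient").
        patterns[skill] = re.compile(
            r"(?<!\w)" + re.escape(skill) + r"(?!\w)", re.IGNORECASE
        )
    return patterns

def extract_skills(text, patterns):
    """Return the curated skills found in text, joined in list order."""
    return ",".join(skill for skill, pat in patterns.items() if pat.search(text))

# The two sample descriptions from the question, keyed by Job_ID.
job_descs = {
    1: "Applicant should posses technical capabilities including "
       "proficient knowledge of python and SQL",
    2: "Applicant should posses technical capabilities including "
       "proficient knowledge of python and SQL and R",
}

patterns = build_skill_patterns(SKILLS)
for job_id, desc in job_descs.items():
    print(job_id, extract_skills(desc, patterns))
```

With the DataFrame from the question, the same function could fill the new column via `X["Skills"] = X["Job_Desc"].apply(lambda t: extract_skills(t, patterns))`. Note that case-insensitive matching of one-letter skills such as "R" may produce false positives on real text.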

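With a curated list in hand, something like Word2Vec might then help suggest synonyms, alternate forms, or related skills. (For a known skill X and a large Word2Vec model trained on your text, terms similar to X are likely to be similar skills, but that is not guaranteed, so you would probably still need human review/curation.)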
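The suggestion step might be sketched as follows, assuming gensim 4.x is installed. The function names, the lowercasing of skills, and every hyperparameter value here are illustrative assumptions rather than anything given in the answer; `sg=1` selects skip-gram and `sg=0` selects CBOW, the two variants the question asks about.

```python
def train_word2vec(tokenized_docs, use_skipgram=True):
    """Train a Word2Vec model on lists of lowercase tokens (assumes gensim 4.x)."""
    # Imported lazily so that suggest_related_terms() below can also be used
    # with vectors loaded from elsewhere.
    from gensim.models import Word2Vec

    return Word2Vec(
        sentences=tokenized_docs,
        vector_size=100,               # illustrative value, not tuned
        window=5,                      # illustrative value, not tuned
        min_count=2,                   # drop terms seen fewer than 2 times
        sg=1 if use_skipgram else 0,   # 1 = skip-gram, 0 = CBOW
        workers=1,
        seed=0,
    )

def suggest_related_terms(wv, known_skills, topn=5):
    """For each known skill in the vocabulary, list its nearest neighbours.

    wv is expected to behave like gensim's KeyedVectors (model.wv).
    The output is a list of candidates for human review, not confirmed skills.
    """
    suggestions = {}
    for skill in known_skills:
        key = skill.lower()  # assumes the training text was lowercased
        if key not in wv.key_to_index:
            continue  # skill never seen, or pruned by min_count
        neighbours = wv.most_similar(key, topn=topn)
        suggestions[skill] = [term for term, _score in neighbours]
    return suggestions
```

A possible call would be `suggest_related_terms(train_word2vec(tokenized_docs).wv, ["Python", "SQL"])`, where `tokenized_docs` is a hypothetical list of token lists built from a large set of job descriptions. With only a handful of documents the neighbours would be essentially noise.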

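With a large enough dataset mapping texts to outcomes (for example, candidate description texts such as resumes mapped to whether a human reviewer chose the candidate for an interview, hired them, or whether they succeeded in the job), you might be able to identify terms that are highly predictive of fit for a certain job role. Those terms might often be de facto "skills". But discovering those correlations could be a much larger learning project.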

This question and answer originally appeared on Stack Overflow: https://stackoverflow.com/questions/60154561/