I am trying to extract (Indian) names from an unstructured string.
Here is my code:
text = "Balaji Chandrasekaran Bangalore | Senior Business Analyst/ Lead Business Analyst An accomplished Senior Business Analyst with a track record of handling complex projects in given period of time, exceeding above the expectation. Successful at developing product road maps and leading cross-functional software teams from prototype to release. Professional Competencies Systems Development Life Cycle (SDLC) Agile methodologies Business process improvement Requirements gathering & Analysis Project Management UML Specification UI & UX (Wireframe Designing) Functional Specification Test Scenario Creation SharePoint Admin Work History Senior Business Analyst (Aug 2012 Current) YouBox Technology pvt ltd, Chennai Translating business goals, feature concepts and customer needs into prioritized product requirements and use cases. Expertized in designing innovative wireframes combining user experience analysis and technology models. Extensive Experience in implementing soft wares for Shipping/Logistics firms to handle CRM, Finance, Logistics, Operations, Intermodal, and documentation. Strong interpersonal skills, highly adept at diplomatically facilitating discussions and negotiations with stakeholders. Education Bachelor of Engineering: Electronics & Communication, 2011 CES Tech Hosur Accomplishment Successful onsite implementation at various locations around the globe for Europe Shipping Company. - (Pre Study, General Design, and Functional Specification) Organized Business Analyst Forum and conducted various activities to develop skill sets of Business Analysts."
import nltk

pronouns = []  # collects every chunk labelled PERSON
if text != "":
    # This grammar labels every single proper noun (NNP) as a PERSON chunk
    grammar = """PERSON: {<NNP>}"""
    chunkParser = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunkParser.parse(tagged)
    for subtree in tree.subtrees():
        if subtree.label() == "PERSON":
            pronouns.append(' '.join([c[0] for c in subtree]))
    print(pronouns)
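['Balaji', 'Chandrasekaran', 'Bangalore', '|', 'Senior', 'Business',
'Analyst', '/', 'Lead', 'Business', 'Analyst', 'Senior', 'Business',
'Analyst', 'Successful', 'Development', 'Life', 'Cycle', 'SDLC',
'Agile', 'Business', 'Requirements', 'Analysis', 'Project',
'Management', 'UML', 'Specification', 'UI', 'UX', 'Wireframe',
'Designing', 'Functional', 'Specification', 'Test', 'Scenario',
'Creation', 'SharePoint', 'Admin', 'Work', 'History', 'Senior',
'Business', 'Analyst', 'Aug', 'Current', 'Technology', 'Chennai',
'Translating', 'CRM', 'Finance', 'Logistics', 'Operations',
'Intermodal', 'Education', 'Bachelor', 'Engineering', 'Electronics',
'Communication', 'Accomplishment', 'Successful', 'Mediterranean',
'Shipping', 'Company', 'MSC', 'Georgia', 'MSC', 'Cambodia', 'MSC', 'MSC',
'South', 'Successful', 'Stake', 'MSC', 'Geneva', 'Switzerland', 'Pre',
'Study', 'General', 'Design', 'Functional', 'Specification', 'O',
'Business', 'Analyst', 'Forum', 'Business']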
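But what I actually need is just Balaji Chandrasekaran. I even tried the Stanford NER library, but it did not pick out Balaji Chandrasekaran either.
Can anyone help me extract names from an unstructured string, or recommend some good tutorials?
Thanks in advance.
Best answer
As I said in the comments, you have to create your own corpus of Indian names and test your text against it. The NLTK book shows how to do this in Chapter 2 (Section 1.9, to be precise).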
import nltk
from nltk.corpus import PlaintextCorpusReader

# You can use a regular expression to find the files, or pass a list of files
files = r".*\.txt"
# "/path/" is a placeholder for the directory that holds your name lists
new_corpus = PlaintextCorpusReader("/path/", files)
corpus = nltk.Text(new_corpus.words())
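Testing the text against such a corpus could look roughly like the sketch below. It is only an illustration, not a definitive implementation: the KNOWN_NAMES set and the extract_names helper are hypothetical, and the names are hard-coded so the example runs without any corpus files. In practice the set would probably be built from the corpus words, for example set(w.lower() for w in new_corpus.words()).

```python
import re

# Hypothetical name list (an assumption for this sketch); in practice it
# would come from your own corpus of Indian names.
KNOWN_NAMES = {"balaji", "chandrasekaran", "priya", "ramesh"}


def extract_names(text, known_names):
    """Return runs of consecutive tokens that all appear in known_names."""
    # Simple alphabetic tokenization; punctuation and digits are dropped.
    tokens = re.findall(r"[A-Za-z]+", text)
    names, current = [], []
    for token in tokens:
        if token.lower() in known_names:
            # Still inside a run of known names, keep collecting.
            current.append(token)
        elif current:
            # The run ended, so store it as one full name.
            names.append(" ".join(current))
            current = []
    if current:
        # Flush a run that reaches the end of the text.
        names.append(" ".join(current))
    return names


sample = "Balaji Chandrasekaran Bangalore | Senior Business Analyst"
print(extract_names(sample, KNOWN_NAMES))  # ['Balaji Chandrasekaran']
```

Joining adjacent matches keeps the first name and surname together, which the single-NNP grammar in the question cannot do. The quality of the result depends entirely on how complete the name list is.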
See also: Creating a new corpus with NLTK