问题描述
问题:我有一个Access数据库的地址字段,该数据库已转换为Sql Server2005.此字段在一个字段中包含所有内容.我需要将地址的各个部分解析为归一化表中的相应字段.我需要对大约4,000条记录执行此操作,并且它必须是可重复的.
Problem: I have an address field from an Access database which has been converted to Sql Server 2005. This field has everything all in one field. I need to parse out the individual sections of the address into their appropriate fields in a normalized table. I need to do this for approximately 4,000 records and it needs to be repeatable.
假设:
-
假设在美国的地址(目前)
Assume an address in the US (for now)
假定输入字符串有时会包含一个收件人(被寻址的人)和/或第二个街道地址(例如Suite B)
assume that the input string will sometimes contain an addressee (the person being addressed) and/or a second street address (i.e. Suite B)
状态可以缩写
邮政编码可以是标准的5位数或zip + 4
zip code could be standard 5 digit or zip+4
在某些情况下有错别字
更新:针对所提出的问题,并不是普遍遵循标准,我需要存储单个值,而不仅是地址编码,而且错误意味着错别字(已在上文更正)
UPDATE: In response to the questions posed, standards were not universally followed, I need need to store the individual values, not just geocode and errors means typo (corrected above)
样本数据:
-
A. P·卡罗尔Son 2299 Lewes-Georgetown Hwy,乔治敦,DE 19947
A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947
格林威治州19950肖尼路11522号
11522 Shawnee Road, Greenwood DE 19950
S.W.国王路144号多佛,DE 19901
144 Kings Highway, S.W. Dover, DE 19901
集成常量.服务2 Penns Way Suite 405 New Castle,德国19720
Intergrated Const. Services 2 Penns Way Suite 405 New Castle, DE 19720
Humes Realty 33 Bridle Ridge Court,刘易斯,DE 19958
Humes Realty 33 Bridle Ridge Court, Lewes, DE 19958
尼科尔斯(Nichols)挖掘2742 Pulaski Hwy Newark,德国19711
Nichols Excavation 2742 Pulaski Hwy Newark, DE 19711
19904年在士麦那的2284 Bryn Zion Road
2284 Bryn Zion Road, Smyrna, DE 19904
VEI Dover Crossroads,LLC蛇形路1500号,巴尔的摩MD 21套房100
VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21
580 North Dupont Highway Dover,DE 19901
580 North Dupont Highway Dover, DE 19901
P.O. Box 778 Dover,DE 19903
P.O. Box 778 Dover, DE 19903
推荐答案
我在这种解析上做了很多工作.由于存在错误,您将无法获得100%的准确度,但是您可以通过一些方法来获得大部分结果,然后进行可视BS测试.这是解决该问题的一般方法.它不是代码,因为编写代码是很学术的,没有怪异,只有很多字符串处理.
I've done a lot of work on this kind of parsing. Because there are errors you won't get 100% accuracy, but there are a few things you can do to get most of the way there, and then do a visual BS test. Here's the general way to go about it. It's not code, because it's pretty academic to write it, there's no weirdness, just lots of string handling.
(现在您已经发布了一些示例数据,我做了一些小的更改)
(Now that you've posted some sample data, I've made some minor changes)
- 后退.从邮政编码附近开始,该邮政编码将以两种已知格式之一开始,即接近结尾处:XXXXX或XXXXX-XXXX.如果未显示,则可以假定您位于下面的城市州部分中.
- 在zip之前的下一个内容将是状态,它可以是两个字母的格式,也可以是单词.您也知道这些将是什么-只有50个.另外,您可以使单词听起来清晰,以帮助弥补拼写错误.
- 在那之前是城市,并且可能与州在同一行.您可以使用邮政编码数据库来基于邮政编码检查城市和州,或者至少将其用作BS检测器.
- 街道地址通常为一两行.第二行通常是套房编号(如果有),但也可以是邮政信箱.
- 在第一行或第二行中检测名称几乎是不可能的,但是如果它没有前缀数字(或者如果前缀有"attn:"或"attention to:"则可能会给您)提示是名称还是地址行.
- Work backward. Start from the zip code, which will be near the end, and in one of two known formats: XXXXX or XXXXX-XXXX. If this doesn't appear, you can assume you're in the city, state portion, below.
- The next thing, before the zip, is going to be the state, and it'll be either in a two-letter format, or as words. You know what these will be, too -- there's only 50 of them. Also, you could soundex the words to help compensate for spelling errors.
- before that is the city, and it's probably on the same line as the state. You could use a zip-code database to check the city and state based on the zip, or at least use it as a BS detector.
- The street address will generally be one or two lines. The second line will generally be the suite number if there is one, but it could also be a PO box.
- It's going to be near-impossible to detect a name on the first or second line, though if it's not prefixed with a number (or if it's prefixed with an "attn:" or "attention to:" it could give you a hint as to whether it's a name or an address line.
我希望这会有所帮助.
这篇关于从字符串中解析可用的街道地址,城市,州,邮政编码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!