This article covers how to get only the company address block from a website's contact page.
Problem description
How can I get only the company address block from a website's contact page?
I have tried this:
// Requires the HtmlAgilityPack package; these methods belong inside a WinForms Form class.
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public void Extract_all_text_from_webpage(string filename)
{
    HtmlDocument document = new HtmlDocument();
    document.Load(new MemoryStream(File.ReadAllBytes(filename)));
    // Extract the cleaned text once and reuse it.
    string text = ExtractViewableTextCleaned(document.DocumentNode);
    textBox1.Text += Environment.NewLine + text;
    // if (_addressDictionaries.AddressDictDuplicates.Contains(text))
    {
        listBox1.Items.Add(Environment.NewLine + text);
    }
}
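// Assumed definition: the field is used below but was not included in the post.
private static readonly Regex _removeRepeatedWhitespaceRegex = new Regex(@"\s{2,}");
public static string ExtractViewableTextCleaned(HtmlNode node)
{
    string textWithLotsOfWhiteSpaces = ExtractViewableText(node);
    // Strip the &nbsp; and &copy; entities, which HtmlAgilityPack leaves undecoded in text nodes.
    return _removeRepeatedWhitespaceRegex.Replace(textWithLotsOfWhiteSpaces, " ")
        .Replace("&nbsp;", "")
        .Replace("&copy;", "");
}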
public static string ExtractViewableText(HtmlNode node)
{
    StringBuilder sb = new StringBuilder();
    ExtractViewableTextHelper(sb, node);
    return sb.ToString();
}
private static void ExtractViewableTextHelper(StringBuilder sb, HtmlNode node)
{
    // Skip script, style, and link (<a>) subtrees entirely.
    if (node.Name != "script" && node.Name != "style" && node.Name != "a")
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            AppendNodeText(sb, node);
        }
        // Recurse into the children in document order.
        foreach (HtmlNode child in node.ChildNodes)
        {
            ExtractViewableTextHelper(sb, child);
        }
    }
}
private static void AppendNodeText(StringBuilder sb, HtmlNode node)
{
    string text = ((HtmlTextNode)node).Text;
    if (!string.IsNullOrWhiteSpace(text))
    {
        // Start each text node on its own line.
        sb.Append(Environment.NewLine + text);
        // If the last char isn't whitespace, add a space; otherwise words
        // separated only by tags would run together.
        if (!char.IsWhiteSpace(text[text.Length - 1]))
        {
            sb.Append(" ");
        }
    }
}
Recommended answer
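One possible approach, offered as a rough sketch rather than a definitive answer: take the line-per-text-node output of ExtractViewableText (not the cleaned version, which collapses the line breaks) and keep the longest run of consecutive lines that look like part of a postal address. The AddressHeuristics class, its method names, and the two patterns (US ZIP or UK-style postcode, plus a short list of English street words) are all assumptions made for illustration; real contact pages will need patterns tuned to the countries involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of HtmlAgilityPack or the original post.
public static class AddressHeuristics
{
    // Assumed signal 1: a US ZIP code (12345 or 12345-6789) or a UK-style postcode.
    private static readonly Regex PostalCode = new Regex(
        @"\b\d{5}(-\d{4})?\b|\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b");

    // Assumed signal 2: a common English street-type word.
    private static readonly Regex StreetWord = new Regex(
        @"\b(street|st|road|rd|avenue|ave|suite|floor|blvd|boulevard|lane|drive)\b",
        RegexOptions.IgnoreCase);

    // A line counts as address-like if it holds a postal code,
    // or a street word together with at least one digit.
    public static bool LooksLikeAddressLine(string line)
    {
        if (string.IsNullOrWhiteSpace(line) || line.Length > 120)
        {
            return false;
        }
        bool hasDigit = line.Any(char.IsDigit);
        return PostalCode.IsMatch(line) || (hasDigit && StreetWord.IsMatch(line));
    }

    // Returns the longest run of consecutive address-like lines, or an empty list.
    public static List<string> FindAddressBlock(string pageText)
    {
        var best = new List<string>();
        var current = new List<string>();
        string[] lines = pageText.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (LooksLikeAddressLine(line))
            {
                current.Add(line);
                if (current.Count > best.Count)
                {
                    best = new List<string>(current);
                }
            }
            else if (line.Length > 0)
            {
                // A non-address line ends the current run.
                current.Clear();
            }
        }
        return best;
    }
}
```

It could be called as AddressHeuristics.FindAddressBlock(ExtractViewableText(document.DocumentNode)). The heuristic does not capture the company name line above the address, and it can misfire on any line that happens to contain a five-digit number, so treat the result as a candidate to review rather than a guaranteed match.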