Problem description
Is it possible to scrape data generated by a dynamic web page? For example, this website generates the <font> tag with some JavaScript, which is:
document.write("<font class=spy2>:<\/font>"+(v2j0j0^o5r8)+(r8d4x4^y5i9)+(b2r8e5^u1p6)+(r8d4x4^y5i9))
The values change on each page refresh. Each generated code represents a digit 0-9, for example (code1)+(code2)+(code3)+(code4), and on the back end some kind of parser is written that understands the codes and generates the numbers accordingly.
Once the page is rendered, if for example code1 was set somewhere to the digit 4, then wherever the digit 4 is generated, it comes from this code after it has been parsed.
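To make the encoding concrete, here is a minimal sketch of how such an expression evaluates. In JavaScript, `^` is bitwise XOR, so each `(a^b)` term becomes an integer once both variables are known, and string concatenation joins the results. The variable values below are invented for illustration; on the real page they would be defined by a script that changes on each refresh. The `SpyCodeDecoder` class and its `Decode` method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class SpyCodeDecoder
{
    // Matches one "(name1^name2)" term of the document.write expression.
    private static readonly Regex Term = new Regex(@"\((\w+)\^(\w+)\)");

    // Evaluates every XOR term and concatenates the results,
    // the way JavaScript string concatenation would.
    public static string Decode(string expression, IDictionary<string, int> variables)
    {
        var digits = new StringBuilder();
        foreach (Match term in Term.Matches(expression))
        {
            int left = variables[term.Groups[1].Value];
            int right = variables[term.Groups[2].Value];
            digits.Append(left ^ right);
        }
        return digits.ToString();
    }

    public static void Main()
    {
        // Hypothetical values; the real ones come from the page's own script.
        var variables = new Dictionary<string, int>
        {
            { "v2j0j0", 5 },  { "o5r8", 13 },  // 5 ^ 13 = 8
            { "r8d4x4", 6 },  { "y5i9", 6 },   // 6 ^ 6  = 0
            { "b2r8e5", 10 }, { "u1p6", 2 },   // 10 ^ 2 = 8
        };
        string js = "(v2j0j0^o5r8)+(r8d4x4^y5i9)+(b2r8e5^u1p6)+(r8d4x4^y5i9)";
        Console.WriteLine(Decode(js, variables)); // prints 8080
    }
}
```

Decoding by hand like this only works if the variable definitions can also be extracted from the page, which is why letting a browser engine run the script is usually simpler.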
If we use HtmlAgilityPack, we see the JavaScript code but not its generated output. So is there any way to read the tag it creates once the page is rendered?
Recommended answer
Thanks for pointing that out. I got the same results when I implemented it, but then I saw another comment suggesting the IE engine, so I built a small application that does the job. I added an IE (WebBrowser) control, navigated it to the website, and read the content. Here is the code:
// Requires: using System.Linq;
private void webBrowser1_DocumentCompleted(object sender, System.Windows.Forms.WebBrowserDocumentCompletedEventArgs e)
{
    // The document has finished loading, so output written by inline scripts is part of the DOM.
    System.Windows.Forms.HtmlElementCollection elementsforViewPost =
        webBrowser1.Document.GetElementsByTagName("font");
    foreach (System.Windows.Forms.HtmlElement current2 in elementsforViewPost)
    {
        // Skip empty tags, text that is not a proxy address, and addresses already collected.
        if (current2.InnerText != null && CheckForValidProxyAddress(current2.InnerText) &&
            !ObtainedProxies.Any(index => index.ProxyAddress == current2.InnerText.Trim()))
        {
            Proxy data = new Proxy();
            data.IsRetired = false;
            data.IsActive = true;
            data.DomainsVisited = 0;
            data.ProxyAddress = current2.InnerText.Trim();
            ObtainedProxies.Add(data);
        }
    }
}
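The handler above assumes a WebBrowser control dropped onto a form in the designer. For readers without such a form, the following is a rough console-style sketch of the same idea: host the control on an STA thread, navigate, and read the rendered `<font>` tags once loading completes. The URL and the `RenderedPageReader` class name are placeholders, not taken from the original answer.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

public static class RenderedPageReader
{
    public static void Main()
    {
        // WebBrowser wraps an ActiveX control, so it needs an STA thread
        // with a running message loop.
        var thread = new Thread(() =>
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (sender, e) =>
            {
                // DocumentCompleted also fires for frames; wait for the top-level document.
                if (e.Url != browser.Url) return;

                foreach (HtmlElement font in browser.Document.GetElementsByTagName("font"))
                {
                    Console.WriteLine(font.InnerText);
                }
                Application.ExitThread(); // stop the message loop started below
            };
            browser.Navigate("http://example.com/proxy-list"); // placeholder URL
            Application.Run();
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
    }
}
```

If the page fills in content with scripts that run after loading finishes, the tags may not exist yet when `DocumentCompleted` fires, so a short delay or a retry might be needed in that case.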
To check that the received text is a valid proxy address, here is what I used. I found it on some page a long time ago by googling:
// Requires: using System.Text.RegularExpressions;
private bool CheckForValidProxyAddress(string address)
{
    // Earlier pattern, kept for reference:
    //string pattern = @"^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}$:([0-9][0-9][0-9][0-9])";
    // Dotted-quad IPv4 address followed by ":" and a port of one to five digits.
    // (The original quantifier {0,4} also accepted an address with no port digits.)
    string pattern = @"\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b\:[0-9]{1,5}";
    Regex check = new Regex(pattern);
    // No address provided, so it cannot be valid.
    if (string.IsNullOrEmpty(address))
    {
        return false;
    }
    // IsMatch searches for the pattern anywhere in the string, starting at index 0.
    return check.IsMatch(address, 0);
}