Problem description

I am trying to get information from a website (HTML) into MATLAB. I can download the page's HTML into a string using:

urlread('http://www.websiteNameHere.com...');

This gives me one very long string variable containing the entire contents of the HTML file. From this variable, I want to extract the values (characters) inside elements with specific classes. For example, the page has many lines of markup, and the classes of interest appear in the following form:

...
<h4 class="price">
 <span class="priceSort">$39,991</span>
</h4>
<div class="mileage">
 <span class="milesSort">19,570 mi.</span>
</div>
...
<h4 class="price">
 <span class="priceSort">$49,999</span>
</h4>
<div class="mileage">
 <span class="milesSort">9,000 mi.</span>
</div>
...

I need to get the text between <span class="priceSort"> and </span>, i.e. $39,991 and $49,999 in the example above. What is the best way to go about this? If each value had its own matching opening and closing tags (such as <price> and </price>), I would have no problem...

I also need to know the most robust method, since I would like to be able to find <span class="milesSort"> and other information of this sort too. Thanks!

Solution

Simple solution using strsplit

s = urlread('http://www.websiteNameHere.com...');

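x = 'class="priceSort">';  % text that immediately precedes each price
y = 'class="milesSort">';  % text that immediately precedes each mileage
z = '</span>';             % text that immediately follows each value

s2 = strsplit(s, x);  % split the page at every occurrence of x
s3 = strsplit(s, y);  % split the page at every occurrence of y

result1 = cell(numel(s2)-1, 1);  % one cell per price found
result2 = cell(numel(s3)-1, 1);  % one cell per mileage found

% The first piece of each split is the text before the first match,
% so it holds no value and both loops start at index 2
% (change ind=2:numel(s2) to ind=1:numel(s2) to see why)

% extract the prices
for ind = 2:numel(s2)
    m = strsplit(s2{ind}, z);  % cut the piece at each closing tag
    result1{ind-1} = m{1};     % keep the text before the first </span>
end

% extract the mileages
for ind = 2:numel(s3)
    m = strsplit(s3{ind}, z);  % cut the piece at each closing tag
    result2{ind-1} = m{1};     % keep the text before the first </span>
end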
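
The results are cell arrays of text such as '$39,991' and '19,570 mi.'. If numbers are needed instead, they can be converted afterwards. A minimal sketch, assuming every value is a whole number as in the question; the sample cell arrays stand in for result1 and result2, and the names prices and miles are only illustrative:

```matlab
% Sample values standing in for result1 and result2 from the loops above
result1 = {'$39,991'; '$49,999'};
result2 = {'19,570 mi.'; '9,000 mi.'};

% Assumption: whole numbers only. \D matches any non-digit character,
% so the currency sign, thousands separators and units are all removed.
prices = str2double(regexprep(result1, '\D', ''));  % [39991; 49999]
miles  = str2double(regexprep(result2, '\D', ''));  % [19570; 9000]
```

Values with a decimal part would need a different pattern, since this one also strips the decimal point.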

Hope this helps
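
Since the question also asks about other classes such as milesSort, the same extraction could be written with regexp and a capture group instead of strsplit. This is only a sketch of an alternative, not the approach used above. It assumes the markup is exactly <span class="NAME"> with double quotes and no other attributes; the sample string stands in for the output of urlread, and makePattern is a hypothetical helper introduced here for illustration:

```matlab
% Sample HTML standing in for the string returned by urlread
s = ['<h4 class="price"><span class="priceSort">$39,991</span></h4>' ...
     '<div class="mileage"><span class="milesSort">19,570 mi.</span></div>' ...
     '<h4 class="price"><span class="priceSort">$49,999</span></h4>' ...
     '<div class="mileage"><span class="milesSort">9,000 mi.</span></div>'];

% Hypothetical helper: build the pattern for a given class name.
% (.*?) lazily captures everything up to the next closing </span>.
makePattern = @(cls) ['<span class="' cls '">(.*?)</span>'];

% With 'tokens', regexp returns one 1x1 cell per match;
% cellfun unwraps them into a flat cell array of text
tok    = regexp(s, makePattern('priceSort'), 'tokens');
prices = cellfun(@(c) c{1}, tok, 'UniformOutput', false);

tok   = regexp(s, makePattern('milesSort'), 'tokens');
miles = cellfun(@(c) c{1}, tok, 'UniformOutput', false);
```

Adding another class then only takes one more call with a different name. Like the strsplit version, this relies on the page keeping the same markup, so it is pattern matching rather than real HTML parsing.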

07-31 23:09