I'm trying to build an iOS app that simply extracts part of a web page.

I already have code that connects to the URL and stores the HTML in an NSString.

I tried the following, but the result is just an empty string:

    NSScanner* newScanner = [NSScanner scannerWithString:htmlData];
    // Create a new scanner and give it the html data to parse.

    while (![newScanner isAtEnd])
    {
        [newScanner scanUpToString:@"<body>" intoString:NULL];
        // Scan until the <body> tag is found

        [newScanner scanUpToString:@"</body>" intoString:&bodyText];
        // Everything up to the end tag will get placed into the memory address of the result string

    }

I also tried another approach:

    NSScanner* newScanner = [NSScanner scannerWithString:htmlData];
    // Create a new scanner and give it the html data to parse.

    while (![newScanner isAtEnd])
    {
        [newScanner scanUpToString:@"<body" intoString:NULL];
        // Scan until the start of the opening <body tag is found

        [newScanner scanUpToString:@">" intoString:NULL];
        // Go to end of opening <body> tag

        [newScanner scanUpToString:@"</body>" intoString:&bodyText];
        // Everything up to the end tag will get placed into the memory address of the result string

    }

The second approach returns a string that begins with ><script... and so on.

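To be honest, I don't have a good URL to test against, and I think this would be easier with some help on stripping the tags inside the body (for example <p></p>).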

Any help would be greatly appreciated.

Best answer

I don't know why your first approach doesn't work. I'm assuming you defined bodyText before that snippet. This code works fine for me:

    - (void)viewDidLoad {
        [super viewDidLoad];
        NSString *htmlData = @"This is some stuff before <body> this is the body </body> with some more stuff";
        NSScanner *newScanner = [NSScanner scannerWithString:htmlData];
        NSString *bodyText = nil;
        while (![newScanner isAtEnd]) {
            // Advance to the opening tag, then consume it so it is not part of the result.
            [newScanner scanUpToString:@"<body>" intoString:NULL];
            [newScanner scanString:@"<body>" intoString:NULL];
            // Capture everything up to (but not including) the closing tag.
            [newScanner scanUpToString:@"</body>" intoString:&bodyText];
        }
        NSLog(@"%@", bodyText);
        // Logged: 2015-01-28 15:58:00.360 ScanningOfHTMLProblem[1373:661934] this is the body
    }

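Note that I added a call to scanString:intoString: to skip past the first "<body>". scanUpToString:intoString: stops just before the string it is looking for, so without that call the opening tag would still be ahead of the scanner and would end up at the start of the result.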
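
One possible cause of the empty result on a real page (an assumption, since the actual HTML isn't shown) is that the opening tag carries attributes, such as <body class="home">, so the literal string "<body>" never matches. The same scanUpToString/scanString pairing also explains the leading ">" in your second approach: the scanner stopped in front of the ">" and it was never consumed. Here is a minimal sketch of both ideas, plus a simple tag stripper for the <p></p> case you mentioned. The helper names BodyContentsOfHTML and TextByStrippingTags are made up for this example, and this is not a real HTML parser: it will be confused by a ">" inside an attribute value, by comments, and by script contents.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the markup between the opening <body ...> tag
// and </body>, or nil when no opening body tag is found. If </body> is
// missing, everything after the opening tag is returned.
static NSString *BodyContentsOfHTML(NSString *html) {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    scanner.charactersToBeSkipped = nil; // keep whitespace exactly as it appears

    // Move to the opening tag, then consume "<body" itself.
    [scanner scanUpToString:@"<body" intoString:NULL];
    if (![scanner scanString:@"<body" intoString:NULL]) {
        return nil;
    }
    // Skip any attributes, then consume the ">" that ends the opening tag.
    [scanner scanUpToString:@">" intoString:NULL];
    if (![scanner scanString:@">" intoString:NULL]) {
        return nil;
    }
    NSString *body = nil;
    [scanner scanUpToString:@"</body>" intoString:&body];
    return body ?: @""; // an empty <body></body> yields an empty string
}

// Hypothetical helper: drops everything between "<" and ">" and keeps the rest.
static NSString *TextByStrippingTags(NSString *markup) {
    NSScanner *scanner = [NSScanner scannerWithString:markup];
    scanner.charactersToBeSkipped = nil;
    NSMutableString *text = [NSMutableString string];
    while (![scanner isAtEnd]) {
        NSString *chunk = nil;
        // Collect the text that precedes the next tag, if there is any.
        if ([scanner scanUpToString:@"<" intoString:&chunk]) {
            [text appendString:chunk];
        }
        // Skip the tag: "<", its contents, and the closing ">".
        if ([scanner scanString:@"<" intoString:NULL]) {
            [scanner scanUpToString:@">" intoString:NULL];
            [scanner scanString:@">" intoString:NULL];
        }
    }
    return text;
}

int main(void) {
    @autoreleasepool {
        NSString *html = @"<html><head><title>t</title></head>"
                         @"<body class=\"home\"><p>Hello <b>world</b></p></body></html>";
        NSString *body = BodyContentsOfHTML(html);
        NSLog(@"%@", body);                      // <p>Hello <b>world</b></p>
        NSLog(@"%@", TextByStrippingTags(body)); // Hello world
    }
    return 0;
}
```

Setting charactersToBeSkipped to nil is a deliberate choice here: by default NSScanner skips whitespace and newlines before each scan, which would silently drop the spaces between words that sit next to tags. For anything beyond simple pages, a proper HTML parser is likely the safer route.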
