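I want to validate a long list of URL strings, but some of them contain umlauts and other accented characters.
Is there a way to configure the Apache Commons UrlValidator to accept these characters?
This test fails (note the ã):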

@Test
public void urlValidatorShouldPassWithUmlaut()
{
    // Given
    org.apache.commons.validator.routines.UrlValidator validator;
    validator = new UrlValidator( new String[] { "http", "https" }, UrlValidator.ALLOW_ALL_SCHEMES );

    // When
    String url = "http://dbpedia.org/resource/São_Paulo";

    // Then
    assertThat( validator.isValid( url ), is( true ) );
}

This test passes (with ã replaced by a):
@Test
public void urlValidatorShouldPassWithoutUmlaut()
{
    // Given
    org.apache.commons.validator.routines.UrlValidator validator;
    validator = new UrlValidator( new String[] { "http", "https" }, UrlValidator.ALLOW_ALL_SCHEMES );

    // When
    String url = "http://dbpedia.org/resource/Sao_Paulo";

    // Then
    assertThat( validator.isValid( url ), is( true ) );
}

Software version:
<dependency>
    <groupId>commons-validator</groupId>
    <artifactId>commons-validator</artifactId>
    <version>1.4.0</version>
</dependency>

Update:
validator.isValid( IDN.toASCII(url) ) fails as well, because IDN.toASCII(url) does something I do not yet understand. For example, it converts http://dbpedia.org/resource/São_Paulo to http://dbpedia.xn--org/resource/so_paulo-w1b, which UrlValidator still considers invalid.
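
A likely explanation for the output above: IDN.toASCII is meant for host names, and it processes its input as dot-separated domain labels. Given a whole URL, it sees "org/resource/São_Paulo" as one label and Punycode-encodes it, which matches the "xn--org/resource/so_paulo-w1b" result. The sketch below applies it to the host only. The class and method names are made up for illustration, and "bücher.de" is just a sample host that does not appear in the question.

```java
import java.net.IDN;

class IdnHostDemo {
    // Converts a host name (and only a host name) to its ASCII/Punycode form.
    public static String hostToAscii(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // A plain ASCII host passes through unchanged.
        System.out.println(hostToAscii("dbpedia.org"));     // dbpedia.org
        // "b\u00FCcher.de" is "bücher.de"; only the non-ASCII label is rewritten.
        System.out.println(hostToAscii("b\u00FCcher.de"));  // xn--bcher-kva.de
    }
}
```

The path of the URL is not a domain name, so it needs percent-encoding rather than IDN conversion.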

Best answer

The part containing the umlaut must be encoded before the URL is validated:

import org.apache.commons.validator.routines.UrlValidator;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UmlautUrlTest {
    public static void main(String[] args) {
        String url = "http://dbpedia.org/resource/";
        String umlautPart = "São_Paulo";
        String[] schemes = {"http", "https"};
        UrlValidator validator = new UrlValidator(schemes, UrlValidator.ALLOW_ALL_SCHEMES);
        try {
            // Percent-encode only the non-ASCII path segment, not the whole URL.
            String encodedPart = URLEncoder.encode(umlautPart, "UTF-8");
            System.out.println(validator.isValid(url + encodedPart));
            System.out.println(encodedPart);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is supported on every JVM, so this should not happen.
            e.printStackTrace();
        }
    }
}

Output:
true
S%C3%A3o_Paulo
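
As a quick check that the encoded segment still names the same resource, the following sketch round-trips it through URLEncoder and URLDecoder. It also shows one caveat worth knowing: URLEncoder implements HTML form encoding, so a space becomes "+" rather than "%20". The class and helper names are hypothetical, and the input with a space is an added sample, not taken from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class EncodeRoundTrip {
    // Percent-encodes one path segment as UTF-8 (form-encoding rules).
    public static String encode(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Reverses encode(): turns percent escapes back into characters.
    public static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "S\u00E3o_Paulo"; // "São_Paulo"
        String encoded = encode(original);
        System.out.println(encoded);                           // S%C3%A3o_Paulo
        System.out.println(decode(encoded).equals(original));  // true
        // Form encoding turns a space into "+", not "%20".
        System.out.println(encode("S\u00E3o Paulo"));          // S%C3%A3o+Paulo
    }
}
```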

Edit:
You can use this function to encode the whole URL before validating it.
public static String encodeUrl(String url) {
    // Separate the scheme (e.g. "http") from the rest of the URL.
    String[] parts = url.split("://", 2);
    String protocol = parts[0];
    String rest = parts[1];

    // Everything before the first "/" is the host; what follows is the path.
    int slash = rest.indexOf('/');
    String host = (slash < 0) ? rest : rest.substring(0, slash);
    String path = (slash < 0) ? "" : rest.substring(slash + 1);

    StringBuilder result = new StringBuilder(protocol).append("://");
    try {
        // Encode each host label separately so the dots between them survive.
        String[] labels = host.split("\\.");
        for (int i = 0; i < labels.length; i++) {
            if (i > 0) {
                result.append('.');
            }
            result.append(URLEncoder.encode(labels[i], "UTF-8"));
        }
        // Encode each path segment separately so the slashes survive.
        if (slash >= 0) {
            for (String segment : path.split("/", -1)) {
                result.append('/').append(URLEncoder.encode(segment, "UTF-8"));
            }
        }
    } catch (UnsupportedEncodingException e) {
        // UTF-8 is supported on every JVM, so this should not happen.
        throw new IllegalStateException(e);
    }
    return result.toString();
}

Use it in the test: validator.isValid(encodeUrl(url))
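
An alternative worth considering, sketched here under the assumption that the scheme, host, and path are already available as separate strings: let java.net.URI do the escaping instead of splitting the URL by hand. The multi-argument URI constructor quotes illegal characters such as spaces, and toASCIIString() percent-encodes non-ASCII characters as UTF-8. The host goes through IDN.toASCII first. AsciiUrlBuilder and toAsciiUrl are illustrative names, not part of any library, and whether UrlValidator accepts every result has not been verified here.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

class AsciiUrlBuilder {
    // Builds an ASCII-only URL: Punycode for the host, percent-encoding for the path.
    public static String toAsciiUrl(String scheme, String host, String path) {
        try {
            URI uri = new URI(scheme, IDN.toASCII(host), path, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URL from parts", e);
        }
    }

    public static void main(String[] args) {
        // "S\u00E3o_Paulo" is "São_Paulo".
        System.out.println(toAsciiUrl("http", "dbpedia.org", "/resource/S\u00E3o_Paulo"));
        // http://dbpedia.org/resource/S%C3%A3o_Paulo
        System.out.println(toAsciiUrl("http", "dbpedia.org", "/resource/S\u00E3o Paulo"));
        // http://dbpedia.org/resource/S%C3%A3o%20Paulo
    }
}
```

Unlike URLEncoder, this encodes a space in the path as "%20", which is the form expected in a URL path.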

For "java - Apache Commons UrlValidator: configure to allow umlauts", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18164336/
