Problem description
I have a multi-line JSON file whose records contain special characters encoded as hexadecimal escapes. Here is an example of a single JSON record:
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
This record is supposed to be {"value":"ıarines Bintıç Ramuçlar"}: the '"' character is replaced with the corresponding hexadecimal escape \x22, and other special Unicode characters are replaced with one or two hexadecimal escapes (for instance, \xC3\xA7 encodes ç).
I need to convert such strings into regular Unicode strings in Scala, so that printing one produces {"value":"ıarines Bintıç Ramuçlar"} without the hexadecimal escapes.
In Python (Python 2 here) I can easily decode these records with one line of code:
>>> a = "{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"
>>> a.decode("utf-8")
u'{"value":"\u0131arines Bint\u0131\xe7 Ramu\xe7lar"}'
>>> print a.decode("utf-8")
{"value":"ıarines Bintıç Ramuçlar"}
But in Scala I can't find a way to decode it. I tried to convert it like this, without success:
scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(new String(a.getBytes(), "UTF-8"))
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
I also tried URLDecoder, which I found in a solution to a similar problem (but with URLs):
scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(java.net.URLDecoder.decode(a.replace("\\x", "%"), "UTF-8"))
{"value":"ıarines Bintıç Ramuçlar"}
It produced the desired result for this example, but it does not seem safe for generic text fields, since it is designed to work with URLs and requires replacing every \x with % in the string.
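A minimal sketch of why the URLDecoder route is fragile for arbitrary text. The object and helper names (UrlDecoderPitfalls, viaUrlDecoder) are made up for illustration; only java.net.URLDecoder.decode and scala.util.Try are real APIs.

```scala
import java.net.URLDecoder
import scala.util.Try

object UrlDecoderPitfalls {
  // Hypothetical helper: the workaround from the question, wrapped in a function.
  def viaUrlDecoder(s: String): String =
    URLDecoder.decode(s.replace("\\x", "%"), "UTF-8")

  // '+' means "space" in form encoding, so a literal plus sign is silently lost.
  val plusLost: String = viaUrlDecoder("1+1=2") // "1 1=2"

  // A literal '%' that is not followed by two hex digits makes decode throw.
  val percentFails: Boolean = Try(viaUrlDecoder("100% sure")).isFailure // true
}
```

So any record that contains a plain + or % would be corrupted or rejected, even when it has no \x escapes at all.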
Is there a better way to solve this in Scala?
I am new to Scala and will be thankful for any help.
UPDATE: I have written a custom solution using javax.xml.bind.DatatypeConverter.parseHexBinary. It works for now, but it seems cumbersome and not at all elegant. I think there should be a simpler way to do this.
Here is the code:
import javax.xml.bind.DatatypeConverter
import scala.annotation.tailrec
import scala.util.matching.Regex

def decodeHexChars(string: String): String = {
  val regexHex: Regex = """\A\\[xX]([0-9a-fA-F]{1,2})(.*)""".r

  def purgeBuffer(buffer: String, acc: List[Char]): List[Char] = {
    if (buffer.isEmpty) acc
    else new String(DatatypeConverter.parseHexBinary(buffer)).reverse.toList ::: acc
  }

  @tailrec
  def traverse(s: String, acc: List[Char], buffer: String): String = s match {
    case "" =>
      val accUpdated = purgeBuffer(buffer, acc)
      accUpdated.foldRight("")((str, b) => b + str)
    case regexHex(chars, suffix) =>
      traverse(suffix, acc, buffer + chars)
    case _ =>
      val accUpdated = purgeBuffer(buffer, acc)
      traverse(s.tail, s.head :: accUpdated, "")
  }

  traverse(string, Nil, "")
}
Recommended answer
Each \x?? encodes one byte, just as \x22 encodes " and \x5C encodes \. But in UTF-8 some characters are encoded using multiple bytes, so you need to transform \xC4\xB1 into the ı symbol, and so on.
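A minimal sketch of that point, using only java.lang.String and StandardCharsets. The object name Utf8BytesDemo and its fields are made up for illustration.

```scala
import java.nio.charset.StandardCharsets

object Utf8BytesDemo {
  // \x22 is a single byte, and that byte alone is the '"' character.
  val quote: String = new String(Array(0x22.toByte), StandardCharsets.UTF_8)

  // \xC4\xB1 stands for the two bytes 0xC4 0xB1, which together form U+0131 (ı).
  val dotlessI: String =
    new String(Array(0xC4.toByte, 0xB1.toByte), StandardCharsets.UTF_8)

  // Decoding the same two bytes one at a time does not give ı,
  // because neither byte is a valid UTF-8 sequence on its own.
  val bytewise: String =
    Array(0xC4.toByte, 0xB1.toByte)
      .map(b => new String(Array(b), StandardCharsets.UTF_8))
      .mkString
}
```

This is why a whole run of consecutive \x?? escapes has to be collected into one byte array before decoding, rather than replacing each escape individually.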
replaceAllIn is really nice, but it might eat your backslashes, because \ and $ have a special meaning in the replacement string. So, if you don't use group references (like $1) in the replacement, quoteReplacement is the recommended way to escape the \ and $ symbols.
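A small sketch of the backslash-eating behaviour, assuming a throwaway pattern "X" and the made-up object name QuoteReplacementDemo; replaceAllIn and Regex.quoteReplacement are the real scala.util.matching APIs.

```scala
import scala.util.matching.Regex

object QuoteReplacementDemo {
  private val pattern: Regex = "X".r

  // Unquoted: the backslash is read as an escape character and disappears.
  val eaten: String = pattern.replaceAllIn("aXb", _ => """\n""") // "anb"

  // Quoted: the replacement is inserted literally, backslash included.
  val kept: String =
    pattern.replaceAllIn("aXb", _ => Regex.quoteReplacement("""\n""")) // a\nb
}
```

This matters here because a decoded escape such as \x5C yields a backslash, which would otherwise be misread as part of the replacement syntax.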
/** "22" -> 34, "AA" -> -86 */
def hex2byte(hex: String): Byte = Integer.parseInt(hex, 16).toByte

/** Decode a run of escapes such as \x22 or \xC4\xB1\xC3\xA7 using the given encoding. */
def decodeHexadecimals(str: String, encoding: String = "UTF-8"): String =
  new String(str.split("""\\x""").tail.map(hex2byte), encoding)

/** Replace every run of \xNN escapes in str with the characters it encodes.
  * Both upper-case and lower-case hex digits are accepted. */
def replaceHexadecimals(str: String, encoding: String = "UTF-8"): String =
  """(\\x[\dA-Fa-f]{2})+""".r.replaceAllIn(str, m =>
    scala.util.matching.Regex.quoteReplacement(
      decodeHexadecimals(m.group(0), encoding)))
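For a quick check against the record from the question, here is a self-contained variant of the same idea. It is a sketch and not the answer's exact code: the object name HexEscapes and the method decode are made up, and each run is cut into four-character \xNN chunks with grouped(4) instead of split.

```scala
import java.nio.charset.StandardCharsets
import scala.util.matching.Regex

object HexEscapes {
  // One or more consecutive \xNN escapes, so multi-byte sequences stay together.
  private val run: Regex = """(?:\\x[0-9A-Fa-f]{2})+""".r

  def decode(s: String): String =
    run.replaceAllIn(s, m => {
      // Each chunk looks like \xNN; drop the "\x" prefix and parse the two hex digits.
      val bytes = m.matched.grouped(4)
        .map(chunk => Integer.parseInt(chunk.drop(2), 16).toByte)
        .toArray
      // Quote the result so a decoded '\' or '$' is inserted literally.
      Regex.quoteReplacement(new String(bytes, StandardCharsets.UTF_8))
    })
}
```

Calling HexEscapes.decode on the sample record should print {"value":"ıarines Bintıç Ramuçlar"}, and an input such as a\x5Cb keeps its backslash thanks to quoteReplacement.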
P.S. Does anyone know the difference between java.util.regex.Matcher.quoteReplacement and scala.util.matching.Regex.quoteReplacement?