This article describes how to clean \u2764\ufe0f \u2026 data out of a file using Python.
Question
I am trying to clean Twitter data in Python with a regex, but I can't remove \u2764\ufe0f \u2026. The Twitter data is in the datas.txt file.
I have tried three ways:
First
import re
with open('datas.txt', 'r') as f:
    mylist = [line for line in f]
emoji_pattern = re.compile(r'\\\\u\w+')
for i in mylist:
    print(emoji_pattern.sub(r'', i))
Second
import re
f = open('datas.txt', 'r')
data = f.read()
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002702-\U000027B0"
    u"\U000024C2-\U0001F251"
    u"\U0001f926-\U0001f937"
    u'\U00010000-\U0010ffff'
    u"\u200d"
    u"\u2640-\u2642"
    u"\u2600-\u2B55"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\u3030"
    u"\ufe0f"
    "]+", flags=re.UNICODE)
emoji_pattern.sub(r'', data)
Third
f= open("datas.txt", "r", encoding="UTF-8")
datas = f.read()
data = datas.encode('ascii', 'ignore').decode("utf-8")
print(data)
But it still does not work.
Recommended answer
Your text file does not contain the non-ASCII characters themselves. It contains their Unicode code points written out as escape sequences, the same way Python writes Unicode literals in source code. There are two things you can do with that:
- Delete all \uXXXX or \UXXXXXXXX sequences from your data. This will remove all Unicode code points written in Python literal format, which, in principle (although not necessarily), will be non-ASCII characters. That can be done for example like this:
import re
with open('datas.txt', 'r') as f:
    mylist = [line for line in f]
# Matches a literal backslash followed by u + 4 hex digits or U + 8 hex digits
unicode_literal = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')
for i in mylist:
    print(unicode_literal.sub(r'', i))
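- Interpret the Unicode code points as their intended values. That is, you will get a string containing the non-ASCII data corresponding to the code points represented in the text file. You can do that like this: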
# Note file is read in byte mode
with open('datas.txt', 'rb') as f:
    mylist = [line for line in f]
for i in mylist:
    # Decode each line (bytes), not the list itself
    print(i.decode('unicode-escape'))
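The two options above can be compared side by side on a single string. The following is a minimal sketch, not part of the original answer: the helper names `strip_escapes` and `decode_escapes` and the `sample` line are hypothetical, and `sample` only stands in for one line of datas.txt. It also suggests why the third attempt in the question had no effect: the file holds only ASCII characters (a backslash, a `u`, and hex digits), so `encode('ascii', 'ignore')` has nothing to drop until the escapes have been decoded.

```python
import re

# Matches escapes written out as plain text: \uXXXX or \UXXXXXXXX
UNICODE_ESCAPE = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')


def strip_escapes(text):
    """Option 1: delete every literal \\uXXXX / \\UXXXXXXXX sequence."""
    return UNICODE_ESCAPE.sub('', text)


def decode_escapes(raw):
    """Option 2: turn escape sequences in raw bytes into real characters."""
    return raw.decode('unicode-escape')


# Hypothetical sample, standing in for one line of datas.txt.
# The raw string keeps the backslashes as literal characters.
sample = r'I love this \u2764\ufe0f so much \u2026'

# Option 1: the escape sequences are removed as text.
stripped = strip_escapes(sample)

# Option 2: the escape sequences become the characters they stand for.
decoded = decode_escapes(sample.encode('ascii'))

# After decoding, the third attempt from the question does work:
# the non-ASCII characters now exist and can be ignored on encoding.
cleaned = decoded.encode('ascii', 'ignore').decode('ascii')

print(repr(stripped))
print(stripped == cleaned)
```

Note that the `unicode-escape` codec also interprets other backslash sequences such as `\n`, and treats any non-ASCII bytes as Latin-1, so option 2 is only a safe fit when the file is plain ASCII apart from the escapes.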