Cleaning \u2764\ufe0f \u2026 data from a file using Python

Problem description

I am trying to clean Twitter data in Python with a regex, but I can't remove \u2764\ufe0f \u2026. The Twitter data is in the file datas.txt; this is the data:

I have tried three ways:

First

import re

with open ('datas.txt', 'r') as f:
     mylist = [line for line in f]
emoji_pattern = re.compile(r'\\\\u\w+')
for i in mylist:
    print(emoji_pattern.sub(r'', i))


Second

import re
f = open('datas.txt', 'r')
data = f.read()
emoji_pattern = re.compile("["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                u"\U0001f926-\U0001f937"
                u'\U00010000-\U0010ffff'
                u"\u200d"
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                u"\u3030"
                u"\ufe0f"
    "]+", flags=re.UNICODE)
emoji_pattern.sub(r'', data)


Third

f= open("datas.txt", "r", encoding="UTF-8")
datas = f.read()
data = datas.encode('ascii', 'ignore').decode("utf-8")
print(data)

But it still doesn't work.

Recommended answer

Your text file contains non-ASCII Unicode codepoints written out as escape sequences, the same way Python writes Unicode characters in string literals in source code. There are two things you can do with that:


  • Delete all \uXXXX or \UXXXXXXXX sequences from your data. This removes every Unicode codepoint written in Python literal format, which in principle (although not necessarily) will be the non-ASCII characters. That can be done, for example, like this:

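import re

# Matches a literal \u followed by 4 hex digits, or \U followed by 8 hex digits
unicode_literal = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')

with open('datas.txt', 'r') as f:
    for line in f:
        # Remove every escape sequence; end='' avoids printing a second newline
        print(unicode_literal.sub('', line), end='')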




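  • Interpret the Unicode codepoints as their intended values. That is, you get a string containing the non-ASCII data that corresponds to the codepoints written in the text file. You can do it like this: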

    # Note: the file is read in binary mode
    with open('datas.txt', 'rb') as f:
        mylist = [line for line in f]
    for i in mylist:
        # Decode each bytes line so the escapes become real characters
        print(i.decode('unicode-escape'))
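
To try both options without touching datas.txt, here is a minimal sketch that runs them on an in-memory string. The sample text and the helper names strip_escapes and decode_escapes are made up for illustration; the actual contents of the file are not shown in the question. It also shows why the second and third attempts in the question change nothing: the text is already pure ASCII, so there are no real emoji characters or non-ASCII bytes to remove.

```python
import re

# Hypothetical sample line: the escapes are literal backslash-u text, not real emoji.
sample = r'I love it \u2764\ufe0f so much\u2026 \U0001F600'

# Same pattern as in the answer: \u + 4 hex digits, or \U + 8 hex digits.
UNICODE_LITERAL = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')


def strip_escapes(text: str) -> str:
    """Option 1: delete the escape sequences entirely."""
    return UNICODE_LITERAL.sub('', text)


def decode_escapes(raw: bytes) -> str:
    """Option 2: turn the escape sequences into the characters they name."""
    return raw.decode('unicode-escape')


# str.isascii() needs Python 3.7+; the sample contains only ASCII characters.
print(sample.isascii())

# Option 1 leaves only the surrounding plain text.
print(repr(strip_escapes(sample)))

# Option 2 yields real characters; compare instead of printing, since some
# consoles cannot display emoji.
decoded = decode_escapes(sample.encode('ascii'))
print(decoded == 'I love it \u2764\ufe0f so much\u2026 \U0001F600')
```

Which option fits depends on whether you want the emoji gone or restored; after decoding, the character-class pattern from the second attempt would have actual characters to match.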
    

