This article covers how to parse a CSV file that uses separate left and right quote characters with pandas.

Problem description

I am trying to read a file in pandas that is structured as follows:

<first>$$><$$<second>$$><$$<first>$$>
<foo>$$><$$<bar>$$><$$<baz>$$>

Using pd.read_csv('myflie.csv', encoding='utf8', sep='$$><$$', decimal=',') fails to produce a meaningful result: all data is read into a single column, and the quote characters are not stripped.

Recommended answer

You need to escape each $ with \, because a separator longer than one character is interpreted as a regular expression, in which $ means end of string:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is gone from recent pandas releases


temp = u"""<first>$$><$$<second>$$><$$<first>$$>
<foo>$$><$$<bar>$$><$$<baz>$$>"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                 encoding='utf8',
                 sep=r'\$\$><\$\$',  # raw string; each $ is escaped for the regex
                 decimal=',',
                 header=None,
                 engine='python')  # regex separators require the python engine

print (df)
         0         1           2
0  <first>  <second>  <first>$$>
1    <foo>     <bar>    <baz>$$>
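
As an alternative to escaping the separator by hand, the pattern can be built with re.escape from the standard library. The following is a minimal sketch, assuming the same two sample lines as above; the names raw_sep, pattern, and data are illustrative and not part of the original answer.

```python
import re
from io import StringIO

import pandas as pd

# The literal separator as it appears in the file.
raw_sep = '$$><$$'

# re.escape backslash-escapes regex metacharacters such as $,
# so the separator is matched literally.
pattern = re.escape(raw_sep)

data = (
    "<first>$$><$$<second>$$><$$<first>$$>\n"
    "<foo>$$><$$<bar>$$><$$<baz>$$>"
)

# A multi-character separator is treated as a regex, which needs the python engine.
df = pd.read_csv(StringIO(data), sep=pattern, header=None, engine='python')
print(df)
```

The last column still carries the trailing $$> at this point, exactly as in the hand-escaped version.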

To then remove the trailing $$> from the last column, use str.replace (the final $ in the pattern anchors the match to the end of the string):

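# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df.iloc[:, -1] = df.iloc[:, -1].str.replace(r'\$\$>$', '', regex=True)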
print (df)
         0         1        2
0  <first>  <second>  <first>
1    <foo>     <bar>    <baz>
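
To illustrate why the end-of-string anchor matters, here is a small sketch comparing an anchored and an unanchored pattern. The sample values are made up for the demonstration and do not come from the original file.

```python
import pandas as pd

# Hypothetical values: the first one contains $$> in the middle as well as at the end.
s = pd.Series(['x$$>y$$>', '<baz>$$>'])

# Anchored: only the $$> at the very end of each string is removed.
anchored = s.str.replace(r'\$\$>$', '', regex=True)

# Unanchored: every occurrence of $$> is removed, including the one mid-string.
unanchored = s.str.replace(r'\$\$>', '', regex=True)

print(anchored.tolist())
print(unanchored.tolist())
```

With the anchor, data that happens to contain the terminator sequence inside a field is left untouched.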

And to remove the quote characters:

df = df.replace(['^<', '>$'], ['', ''], regex=True)
print (df)
       0       1      2
0  first  second  first
1    foo     bar    baz

Or do all the replacements in one step:

df = df.replace(['^<', '>$', r'>\$\$'], ['', '', ''], regex=True)
print (df)
       0       1      2
0  first  second  first
1    foo     bar    baz
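
The steps above can also be wrapped in one helper. This is a rough sketch rather than part of the original answer: the function name read_bracketed and its sep parameter are hypothetical, and it assumes every field is wrapped in < and > and every line ends with $$>.

```python
from io import StringIO

import pandas as pd


def read_bracketed(source, sep=r'\$\$><\$\$'):
    """Read a file whose fields look like <value> and are joined by $$><$$.

    Hypothetical helper; `source` is a path or file-like object.
    """
    # dtype=str keeps every column as text so the .str accessor is available.
    df = pd.read_csv(source, sep=sep, header=None, engine='python', dtype=str)

    def clean(col):
        # 1) drop the line terminator $$> (only present in the last column)
        # 2) drop the leading < and trailing > around each value
        return (col.str.replace(r'\$\$>$', '', regex=True)
                   .str.replace(r'^<|>$', '', regex=True))

    return df.apply(clean)


sample = (
    "<first>$$><$$<second>$$><$$<first>$$>\n"
    "<foo>$$><$$<bar>$$><$$<baz>$$>"
)
result = read_bracketed(StringIO(sample))
print(result)
```

Cleaning column by column with the .str accessor keeps the order of the two substitutions explicit, which may be easier to reason about than a single list-based replace.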


11-01 08:23