Problem description
I receive large CSV files, delimited with a comma, | or ^, containing millions of records.
Some of the fields contain non-printable characters such as CR|LF, which are then interpreted as the end of a line, so a single record ends up split across lines. This is on Windows 10.
I need to write Python that goes through the file and removes the CR|LF inside fields. However, I can't simply remove every CR|LF, because then all the lines would be merged into one.
I have gone through several posts here on how to remove non-printable characters. My thought was to load the file into a pandas DataFrame, then check every field for CR|LF and remove it. That seems a bit complicated. If you have quick code for doing this, it would be a great help.
Thanks.
Sample file:
record1, 111. texta, textb CR|LF
record2, 111. teCR|LF
xta, textb CR|LF
record3, 111. texta, textb CR|LF
The sample output file should be:
record1, 111. texta, textb CR|LF
record2, 111. texta, textb CR|LF
record3, 111. texta, textb CR|LF
CR = carriage return = 0x0D
LF = line feed = 0x0A
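Before repairing anything, it can help to see which physical lines are affected. The following is a minimal sketch, assuming three fields per record and a comma delimiter as in the sample above; `find_short_lines` is a hypothetical helper name, not part of any library.

```python
import io

def find_short_lines(lines, delim=",", nf=3):
    """Return (line_number, text) for each physical line with fewer than nf fields."""
    short = []
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")  # drop only the line terminator
        if len(text.split(delim)) < nf:
            short.append((lineno, text))
    return short

# The sample file from the question, with the stray break inside record2.
sample = io.StringIO(
    "record1, 111. texta, textb\n"
    "record2, 111. te\n"
    "xta, textb\n"
    "record3, 111. texta, textb\n"
)
print(find_short_lines(sample))
# [(2, 'record2, 111. te'), (3, 'xta, textb')]
```

Both halves of the broken record are reported, since neither has the expected number of fields on its own.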
Recommended answer
Run this script (e.g. name it fix_csv.py) on your file to sanitize it:
#!/usr/bin/env python3
import sys

if len(sys.argv) < 3:
    sys.stderr.write('Please give the input filename and an output filename.\n')
    sys.exit(1)

# set the correct number of fields
nf = 3
# set the delimiter
delim = ','
inpf = sys.argv[1]
outf = sys.argv[2]
# In text mode Python translates '\n' to the platform line ending on write,
# so use '\n' here rather than os.linesep (which would give '\r\r\n' on Windows).
newline = '\n'

with open(inpf, 'r') as inf, open(outf, 'w') as of:
    cache = []  # fields collected so far for a record split across lines
    for line in inf:
        line = line.strip()
        ls = line.split(delim)
        if len(ls) < nf or cache:
            if not cache:
                # first fragment of a broken record
                cache = cache + ls
            else:
                # continuation: glue the first piece onto the last cached field
                cache[-1] += ls[0]
                cache = cache + ls[1:]
            if len(cache) == nf:
                of.write(delim.join(cache) + newline)
                cache = []
        else:
            of.write(line + newline)
Call it like this:
./fix_csv.py input.dat output.dat
Output:
record1, 111. texta, textb
record2, 111. texta, textb
record3, 111. texta, textb
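The same merging idea can be wrapped in a function, which makes it easy to try on a handful of lines before running it over millions of records. This is only a sketch and not the script above: `merge_broken_records` is a hypothetical name, and it assumes that every record has exactly `nf` fields, that the delimiter never appears inside a field, and that a stray break never falls inside the last field (a line that already has `nf` fields is emitted as a complete record).

```python
def merge_broken_records(lines, delim=",", nf=3):
    """Yield repaired records, joining physical lines until nf fields are collected."""
    cache = []  # fields gathered so far for a record split across lines
    for line in lines:
        fields = line.strip().split(delim)
        if not cache and len(fields) >= nf:
            # complete record on one line: pass it through
            yield delim.join(fields)
            continue
        if cache:
            # continuation: glue the first piece onto the last cached field
            cache[-1] += fields[0]
            cache.extend(fields[1:])
        else:
            cache = fields
        if len(cache) >= nf:
            yield delim.join(cache)
            cache = []
    if cache:
        # incomplete trailing record: emit it rather than drop it silently
        yield delim.join(cache)

sample = [
    "record1, 111. texta, textb\r\n",
    "record2, 111. te\r\n",
    "xta, textb\r\n",
    "record3, 111. texta, textb\r\n",
]
for record in merge_broken_records(sample):
    print(record)
# record1, 111. texta, textb
# record2, 111. texta, textb
# record3, 111. texta, textb
```

Because the function is a generator, it can be fed an open file object directly and never holds more than one record in memory, which matters for files with millions of rows.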