Problem description
I have a tab-separated file with 1 billion lines like these (imagine 200 columns instead of 3):
    abc -0.123 0.6524 0.325
    foo -0.9808 0.874 -0.2341
    bar 0.23123 -0.123124 -0.1232
I want to create a dictionary where the string in the first column is the key and the rest are the values. I've been doing it like this but it's computationally expensive:
    import io

    dictionary = {}
    with io.open('bigfile', 'r') as fin:
        for line in fin:
            kv = line.strip().split()
            k, v = kv[0], kv[1:]
            dictionary[k] = list(map(float, v))
How else can I get the desired dictionary? Actually, a numpy array would be more appropriate than a list of floats for the value.
Solution
You can use pandas to load the df, then construct a new df as desired and call to_dict:
    In [99]:
    import io
    import pandas as pd

    t = """abc -0.123 0.6524 0.325
    foo -0.9808 0.874 -0.2341
    bar 0.23123 -0.123124 -0.1232"""
    df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None)
    # the keys (column 0) become the column labels; the values are transposed
    # so that each key's row of numbers ends up in that key's column
    df = pd.DataFrame(columns=df[0], data=df.iloc[:, 1:].values.T)
    df.to_dict()
    Out[99]:
    {'abc': {0: -0.123, 1: 0.6524, 2: 0.325},
     'foo': {0: -0.9808, 1: 0.874, 2: -0.2341},
     'bar': {0: 0.23123, 1: -0.123124, 2: -0.1232}}
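The question asks for numpy arrays rather than lists as values. A minimal sketch of one way to get them, assuming the data fits in memory and every column after the first is numeric; the sample string and the names `sample`, `keys`, and `values` are made up for illustration, and the sample is deliberately not square (2 rows, 3 value columns):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical tab-separated sample: 2 rows, 3 value columns.
sample = "abc\t-0.123\t0.6524\t0.325\nfoo\t-0.9808\t0.874\t-0.2341\n"

df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# Column 0 holds the keys; the remaining columns form one 2-D float array.
keys = df[0].tolist()
values = df.iloc[:, 1:].to_numpy(dtype=np.float64)

# Iterating over a 2-D array yields its rows, so each key is paired
# with the 1-D array of numbers from its own line.
dictionary = dict(zip(keys, values))
```

Pairing keys with rows of a single array avoids creating a separate Python list of floats per line, which is where most of the time goes in the loop from the question.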
EDIT
A more dynamic method, which avoids building the df twice:
    In [121]:
    t = """abc -0.123 0.6524 0.325
    foo -0.9808 0.874 -0.2341
    bar 0.23123 -0.123124 -0.1232"""
    # determine the number of columns; this is used in usecols
    col_len = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, nrows=1).shape[1]
    # read the first column; its strings become the keys
    keys = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
    # read only the value columns and label each row with its key
    df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols=list(range(1, col_len)))
    df.index = keys
    # transpose so the keys become columns, then collect each column as a list
    df.T.to_dict('list')
    Out[121]:
    {'abc': [-0.123, 0.6524, 0.325],
     'foo': [-0.9808, 0.874, -0.2341],
     'bar': [0.23123, -0.123124, -0.1232]}
Further update
Actually you don't need the extra reads: passing index_col=0 makes the first column the row index, so the number of columns never has to be determined:
    In [128]:
    t = """abc -0.123 0.6524 0.325
    foo -0.9808 0.874 -0.2341
    bar 0.23123 -0.123124 -0.1232"""
    df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, index_col=0)
    # transpose so the keys become columns, then collect each column as a list
    df.T.to_dict('list')
    Out[128]:
    {'abc': [-0.123, 0.6524, 0.325],
     'foo': [-0.9808, 0.874, -0.2341],
     'bar': [0.23123, -0.123124, -0.1232]}
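With 1 billion lines of 200 float64 values, the numbers alone come to roughly 1.6 TB, so a single read_csv call is unlikely to fit in memory. One possible way to bound the parser's working set is to read the file in chunks. This is only a sketch: `load_in_chunks` and its `chunksize` default are hypothetical names and values, the input is assumed to be tab-separated with no header row, and the resulting dictionary still has to fit in memory.

```python
import io

import pandas as pd


def load_in_chunks(source, chunksize=100_000):
    """Build {key: 1-D float array} from a tab-separated source, chunk by chunk.

    Hypothetical helper for illustration; `source` is a path or file-like object.
    """
    result = {}
    # With chunksize set, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(source, sep="\t", header=None, index_col=0,
                         chunksize=chunksize)
    for chunk in reader:
        rows = chunk.to_numpy(dtype="float64")
        # chunk.index holds the first-column strings for this chunk.
        for key, row in zip(chunk.index, rows):
            result[key] = row
    return result


# Usage on a small in-memory sample (made-up data, chunks of 2 rows).
sample = ("abc\t-0.123\t0.6524\t0.325\n"
          "foo\t-0.9808\t0.874\t-0.2341\n"
          "bar\t0.23123\t-0.123124\t-0.1232\n")
dictionary = load_in_chunks(io.StringIO(sample), chunksize=2)
```

For a real file, the same function could be called with the file path instead of the StringIO object; the chunk size trades parser memory against per-chunk overhead and would need tuning.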