This article describes how to split a DataFrame column into multiple columns.

Problem description

After much prodding I am starting to migrate my R scripts to Python. Most of my work in R involved data frames, and I am using the DataFrame object from the pandas package. In my script I need to read in a csv file and import the data into a DataFrame object. Next I need to convert the hex values in the column labelled DATA into bitwise data, and then create 16 new columns, one for each bit.

My example input data in file test.txt looks as follows,

My python script test.py is as follows,

import glob
import pandas as pd
import numpy as np

fname = 'test.txt'
df = pd.read_csv(fname, comment="#")

# .copy() makes dfs an independent frame, so adding columns to it
# later does not trigger pandas' SettingWithCopyWarning
dfs = df[df.TEST == 'READ'].copy()

# function to convert the hexstring into a binary string
def hex2bin(hstr):
    return bin(int(hstr,16))[2:]

# convert the hexstring in column DATA to a binary string in column BINDATA
dfs['BINDATA'] = dfs['DATA'].apply(hex2bin)

# get rid of the column DATA
del dfs['DATA']

When I run this script, and inspect the object dfs, I get the following,

So now I am not sure how to split the column named BINDATA into 16 new columns (they could be named B0, B1, B2, ..., B15). Any help will be appreciated.

Thanks & Regards,

Derric.

Solution

I don't know if it can be done more simply (without the for loop), but this does the trick:

for i in range(16):
    dfs['B'+str(i)] = dfs['BINDATA'].str[i]

The str attribute of the Series gives access to some vectorized string methods which act upon each element (see docs: http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods). In this case we just index the string to access the different characters.

This gives me:

In [20]: dfs
Out[20]:
            BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0  1011111111101101  1  0  1  1  1  1  1  1  1  1   1   0   1   1   0   1
1  1011101101111101  1  0  1  1  1  0  1  1  0  1   1   1   1   1   0   1
2  1111111111110111  1  1  1  1  1  1  1  1  1  1   1   1   0   1   1   1
3  1110011111111111  1  1  1  0  0  1  1  1  1  1   1   1   1   1   1   1
4  1111101111111000  1  1  1  1  1  0  1  1  1  1   1   1   1   0   0   0
5  1101111001110101  1  1  0  1  1  1  1  0  0  1   1   1   0   1   0   1
6  1101111111111110  1  1  0  1  1  1  1  1  1  1   1   1   1   1   1   0

If you want them as ints instead of strings, you can add .astype(int) in the for loop.
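One thing to watch for: bin() drops leading zeros, so any value below 0x8000 produces a string shorter than 16 characters, and .str[i] then returns NaN for the missing positions. Here is a minimal sketch of the loop with zero-padding and the int conversion added. The sample values and the helper name hex2bin16 are made up for illustration, since the original test.txt is not shown.

```python
import pandas as pd

# Hypothetical sample data; the original test.txt is not shown.
dfs = pd.DataFrame({"DATA": ["BFED", "0F0F", "0001"]})

def hex2bin16(hstr):
    """Convert a hex string to a 16-character binary string, keeping leading zeros."""
    # "016b" means: binary digits, zero-padded to a width of 16
    return format(int(hstr, 16), "016b")

dfs["BINDATA"] = dfs["DATA"].apply(hex2bin16)

for i in range(16):
    # .str[i] picks the i-th character of every string;
    # .astype(int) turns the "0"/"1" characters into integers
    dfs["B" + str(i)] = dfs["BINDATA"].str[i].astype(int)

print(dfs.loc[:, "B0":"B15"])
```

With this padding, B0 is always the most significant bit and B15 the least significant one, whatever the magnitude of the value.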


EDIT: Another way to do it (a one-liner, but you have to change the column names in a second step):

In [34]: splitted = dfs['BINDATA'].apply(lambda x: pd.Series(list(x)))

In [35]: splitted.columns = ['B'+str(x) for x in splitted.columns]

In [36]: dfs.join(splitted)
Out[36]:
            BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0  1011111111101101  1  0  1  1  1  1  1  1  1  1   1   0   1   1   0   1
1  1011101101111101  1  0  1  1  1  0  1  1  0  1   1   1   1   1   0   1
2  1111111111110111  1  1  1  1  1  1  1  1  1  1   1   1   0   1   1   1
3  1110011111111111  1  1  1  0  0  1  1  1  1  1   1   1   1   1   1   1
4  1111101111111000  1  1  1  1  1  0  1  1  1  1   1   1   1   0   0   0
5  1101111001110101  1  1  0  1  1  1  1  0  0  1   1   1   0   1   0   1
6  1101111111111110  1  1  0  1  1  1  1  1  1  1   1   1   1   1   1   0
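If the goal is integer bit columns, the binary string can probably be skipped altogether. The following is only a sketch of a different technique (NumPy bit shifts with broadcasting), not what the answer above does; the sample values are hypothetical and the names values, shifts, bits and bit_df are chosen just for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the original test.txt is not shown.
dfs = pd.DataFrame({"DATA": ["BFED", "0F0F", "0001"]})

# Parse every hex string into an integer
values = np.array([int(h, 16) for h in dfs["DATA"]], dtype=np.int64)

# Shift amounts 15, 14, ..., 0, so that column B0 holds the most significant bit
shifts = np.arange(15, -1, -1)

# Broadcasting: (n, 1) >> (16,) gives an (n, 16) array; "& 1" keeps the lowest bit
bits = (values[:, None] >> shifts) & 1

cols = ["B" + str(i) for i in range(16)]
bit_df = pd.DataFrame(bits, index=dfs.index, columns=cols)

# Reusing dfs.index above keeps the rows aligned in the join
result = dfs.join(bit_df)
print(result)
```

This keeps the same column convention as the string-based versions (B0 is the leftmost bit), and leading zeros are handled automatically because the bits are computed from the number rather than read from a string.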
