冷凝多列相同的值

冷凝多列相同的值

我正在尝试集成多个数据源,但我发现精简地合并我的列有点棘手。我有一个看起来像这样的数据框:

df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
     [ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
                  [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
                 columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

print df

       Address1      City1      State1      Zip1      Address2      City2/
0  Address Data  City Data  State Data  Zip Data           NaN        NaN
1           NaN        NaN         NaN       NaN  Address Data  City Data
2           NaN        NaN         NaN       NaN           NaN        NaN

       State2      Zip2      Address3      City3      State3      Zip3
0         NaN       NaN           NaN        NaN         NaN       NaN
1  State Data  Zip Data           NaN        NaN         NaN       NaN
2         NaN       NaN  Address Data  City Data  State Data  Zip Data


我试图结合到这样的事情:

#run code here to combine multiple versions of the columns

print df

   Address       City       State       Zip
0  Address Data  City data  State Data  Zip Data
1  Address Data  City data  State Data  Zip Data
2  Address Data  City data  State Data  Zip Data


我希望可以使用这样的行一次设置多个列:

df.loc[df['State1'].notnull(),['Address','city','state','State','Zip']] = df.loc[df['State1'].notnull(),['Address1','City1','State1','Zip1']].values


但是,.loc不允许您一次创建多个列。任何有关如何合并数据的想法将不胜感激!

根据以下解决方案进行编辑

该解决方案的问题在于我的数据实际上可能看起来像这样:

df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', 'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
 [ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
              [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
             columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])


在这种情况下,当我只想要原始3时,下面的解决方案给了我4行。 3具有数据,请使用节3“。每行都必须保持唯一,因为我没有合并行的许多其他属性。谢谢!

最佳答案

我认为解决方案完全不同,因此我决定创建新答案:

import pandas as pd
import numpy as np

#random dataframe
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(3,9)))
df1.columns = ['a1','b1','c1','a2','b2','c2','a3','b3','c3']

df1.loc[[1,2],['a1','b1','c1']] = np.nan
print (df1)
    a1   b1   c1  a2  b2  c2  a3  b3  c3
0  5.0  8.0  9.0   5   0   0   1   7   6
1  NaN  NaN  NaN   5   2   4   2   4   7
2  NaN  NaN  NaN   7   0   6   9   9   7

#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)

#append 1 to d, because index starts from 1 and compare with h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')
#remove index and columns names (pandas 0.18.0+)
df = df.rename_axis(None).rename_axis(None, axis=1)

print (df)
     a    b    c
0  5.0  8.0  9.0
1  5.0  2.0  4.0
2  9.0  9.0  7.0


编辑:我尝试修改您的示例:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([['Address Data1','City data1','State Data1', 'Zip Data1', 'Address Data2','City data2','State Data2', 'Zip Data2',np.nan,np.nan,np.nan,np.nan],
 [ np.nan,np.nan,np.nan,np.nan,'Address Data3','City data3','State Data3', 'Zip Data3',np.nan,np.nan,np.nan,np.nan],
              [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data4','City data4','State Data4', 'Zip Data4']],
             columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

print (df1)
        Address1       City1       State1       Zip1       Address2  \
0  Address Data1  City data1  State Data1  Zip Data1  Address Data2
1            NaN         NaN          NaN        NaN  Address Data3
2            NaN         NaN          NaN        NaN            NaN

        City2       State2       Zip2       Address3       City3       State3  \
0  City data2  State Data2  Zip Data2            NaN         NaN          NaN
1  City data3  State Data3  Zip Data3            NaN         NaN          NaN
2         NaN          NaN        NaN  Address Data4  City data4  State Data4

        Zip3
0        NaN
1        NaN
2  Zip Data4




#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)

#append 1 to d, because index starts from 1 and compare with h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')

df = df.rename_axis(None).rename_axis(None, axis=1)
print (df)
         Address        City        State        Zip
0  Address Data1  City data1  State Data1  Zip Data1
1  Address Data3  City data3  State Data3  Zip Data3
2  Address Data4  City data4  State Data4  Zip Data4

关于python - Pandas 冷凝多列相同的值(value),我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/37438564/

10-12 19:24