Problem description
I've organized my data using pandas, and my procedure looks like this:
import pandas as pd
import numpy as np
df1 = pd.read_table(r'E:\빅데이터 캠퍼스\골목상권 프로파일링 - 서울 열린데이터 광장 3.초기-16년5월분1\17.상권-추정매출\201301-201605\tbsm_trdar_selng.txt\tbsm_trdar_selng_utf8.txt' , sep='|' ,header=None
,dtype = { '0' : pd.np.int})
df1 = df1.replace('201301', int(201301))
df2 = df1[[0 ,1, 2, 3 ,4, 11,12 ,82 ]]
df2_rename = df2.columns = ['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO' ]
print(df2.head(40))
df3_groupby = df2.groupby(['STDR_YM_CD', 'TRDAR_CD' ])
df4_agg = df3_groupby.agg(np.sum)
print(df4_agg.head(30))
When I print df2, I can see the values 11947 and 11948 in the TRDAR_CD column.
After that, I used the groupby function, and the 11948 values disappeared from the TRDAR_CD column.
Could this problem be caused by the following warning message? 'sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.'
Please help me.
The output of print(df2.info()) is:
RangeIndex: 1089023 entries, 0 to 1089022
Data columns (total 8 columns):
STDR_YM_CD          1089023 non-null object
TRDAR_CD            1089023 non-null int64
TRDAR_CD_NM         1085428 non-null object
SVC_INDUTY_CD       1089023 non-null object
SVC_INDUTY_CD_NM    1089023 non-null object
THSMON_SELNG_AMT    1089023 non-null int64
THSMON_SELNG_CO     1089023 non-null int64
STOR_CO             1089023 non-null int64
dtypes: int64(4), object(4)
memory usage: 66.5+ MB
None
Recommended answer
The first and second columns of the output are the levels of a MultiIndex. By default, if the first level has duplicate values, pandas 'sparsifies' the higher levels of the index to make the console output a bit easier on the eyes.
You can show the repeated values in the first level of the MultiIndex by setting display.multi_sparse to False.
Sample:
df = pd.DataFrame({'A':[1,1,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
df.set_index(['A','B'], inplace=True)
print (df)
     C
A B
1 4  7
  5  8
3 6  9
#temporary set multi_sparse to False
#http://pandas.pydata.org/pandas-docs/stable/options.html#getting-and-setting-options
with pd.option_context('display.multi_sparse', False):
    print (df)
     C
A B
1 4  7
1 5  8
3 6  9
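To confirm that sparsified display hides nothing, the index can be inspected directly instead of being read from the printed table. This is a minimal sketch that reuses the sample frame above; the variable names index_pairs and first_level are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]}).set_index(['A', 'B'])

# Every row keeps its full (A, B) key, even where the printed output
# leaves the repeated first-level value blank.
index_pairs = df.index.tolist()
print(index_pairs)

# Moving the index levels back into ordinary columns shows the
# repeated values explicitly as well.
first_level = df.reset_index()['A'].tolist()
print(first_level)
```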
Edit, following the update to the question:
I think the problem is that the value 11948 is stored as a string, so it is omitted.
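One way to check for this kind of type mismatch is sketched below. The DtypeWarning in the question points at column 0 (STDR_YM_CD), which df2.info() reports as object. The sketch assumes, without confirmation from the real file, that the column holds 201301 both as an integer and as the string '201301'. The three rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: the same month code stored as int and as str (assumption).
df = pd.DataFrame({
    'STDR_YM_CD': [201301, '201301', 201301],
    'TRDAR_CD': [11947, 11948, 11949],
    'STOR_CO': [73, 116, 45],
})

# Count the Python types present in the key column.
type_counts = df['STDR_YM_CD'].map(type).value_counts()
print(type_counts)

# The integer 201301 and the string '201301' are different keys,
# so they count as two distinct values.
unique_before = df['STDR_YM_CD'].nunique()

# Convert everything to integers so that equal codes compare equal.
df['STDR_YM_CD'] = pd.to_numeric(df['STDR_YM_CD'])
unique_after = df['STDR_YM_CD'].nunique()

result = df.groupby(['STDR_YM_CD', 'TRDAR_CD']).sum()
print(result)
```

If the key column really is mixed, rows for the same month end up under two separate first-level keys, which can make a value such as 11948 look missing near the top of the grouped output.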
Edit, after looking at the file:
You can simplify your solution by adding the usecols parameter to read_csv and then aggregating with GroupBy.sum:
import pandas as pd
import numpy as np
df2 = pd.read_table(r'tbsm_trdar_selng_utf8.txt' ,
                    sep='|' ,
                    header=None ,
                    usecols=[0 ,1, 2, 3 ,4, 11,12 ,82],
                    names=['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO'],
                    dtype = { '0' : int})
df4_agg = df2.groupby(['STDR_YM_CD', 'TRDAR_CD' ]).sum()
print(df4_agg.head(10))
                     THSMON_SELNG_AMT  THSMON_SELNG_CO  STOR_CO
STDR_YM_CD TRDAR_CD
201301     11947           1966588856            74798       73
           11948           3404215104            89064      116
           11949           1078973946            42005       45
           11950           1759827974            93245       71
           11953            779024380            21042       84
           11954           2367130386            94033      128
           11956            511840921            23340       33
           11957            329738651            15531       50
           11958           1255880439            42774      118
           11962           1837895919            66692       68
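The same pattern can be tried without the original file. The sketch below reads a small, made-up pipe-separated sample from memory. The rows, the choice of three columns, and the low_memory=False setting (the option suggested by the warning in the question) are assumptions for illustration, not part of the original answer.

```python
import io

import pandas as pd

# Made-up sample with five pipe-separated fields per row.
raw = io.StringIO(
    "201301|11947|A|x|10\n"
    "201301|11947|A|y|5\n"
    "201301|11948|B|x|7\n"
)

sample = pd.read_csv(raw,
                     sep='|',
                     header=None,
                     usecols=[0, 1, 4],   # keep only the needed positions
                     names=['STDR_YM_CD', 'TRDAR_CD', 'STOR_CO'],
                     low_memory=False)    # parse the file in one pass

# Sum the remaining numeric column for each (month, area) pair.
agg = sample.groupby(['STDR_YM_CD', 'TRDAR_CD']).sum()
print(agg)
```

Passing names with the same length as usecols labels the selected columns at read time, so no separate renaming step is needed afterwards.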