Problem description
I am trying to open a .dta file as a DataFrame, but an error appears: "ValueError: Value labels for column ... are not unique. The repeated labels are:" followed by the labels which appear twice in a column.
I know that labeling multiple codes with exactly the same value label in Stata is not clever (not my fault :)). After some research I know that pandas will not accept repeated value labels (this IS clever).
But I can't figure out a (good) solution. Is there:
a. a smooth way to open the data with pandas and just rename the duplicates (like "label" to "label(2)") in the process?
Here is what the data looks like (value labels in brackets):
| multilabel
1 | 11 (oneone or twotwo)
2 | 22 (oneone or twotwo)
3 | 33 (other-label-which-is-unique)
My code so far:
import pandas as pd
#followed by any option that delivers this solution:
dataframe = pd.read_stata('file.dta')
or
b. a fast and easy way to tell Stata: just rename all repeated value labels to "label(2)" instead of "label"? And yes, the code so far is also rather boring:
use "file.dta"
*followed by a loop which finds repeated labels and changes them
save "file.dta", replace
And yes, there are too many repeated value labels to go through them one by one.
And here are the Stata commands to produce a minimal example:
set obs 1
generate var1 = 1 in 1
set obs 2
replace var1 = 2 in 2
set obs 3
replace var1 = 3 in 3
generate var2 = 11 in 1
replace var2 = 22 in 2
replace var2 = 33 in 3
rename var2 multilabel
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat
I am happy about every suggestion!
Recommended answer
If you have a variable with repeated labels, then
decode multilabel, gen(valuelabel)
label values multilabel
puts the value labels in a string variable and then undoes the association between the values of multilabel and the previously attached value labels. I don't know what else you need to do, and thus why you would do anything else. You now have the same information as before. I don't know whether pandas will ignore the definition of value labels.
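On the pandas side, one possible route is sketched below. It has not been run against the asker's file. pd.read_stata accepts convert_categoricals=False, which returns the raw numeric codes and so sidesteps the error, and StataReader.value_labels() returns the label definitions as nested dicts (a reasonably recent pandas is assumed). The names make_labels_unique, read_stata_with_unique_labels and label_set_for_column are hypothetical helpers invented for this illustration. The mapping from column to value-label set is passed in by hand, because the set name need not match the column name (here the set is labelrepeat and the column is multilabel).

```python
def make_labels_unique(mapping):
    """Return a copy of {code: label} in which every label is unique.

    The first code (in sorted order) keeps the plain label; later codes
    with the same label get " (2)", " (3)", ... appended.
    """
    used = set()
    unique = {}
    for code in sorted(mapping):
        label = mapping[code]
        candidate, n = label, 1
        while candidate in used:  # also guards against clashes with existing labels
            n += 1
            candidate = f"{label} ({n})"
        used.add(candidate)
        unique[code] = candidate
    return unique


def read_stata_with_unique_labels(path, label_set_for_column):
    """Read a .dta file and apply de-duplicated value labels.

    label_set_for_column is supplied by the caller (an assumption of this
    sketch), e.g. {"multilabel": "labelrepeat"}.
    """
    # Imported here so that make_labels_unique can be used without pandas.
    import pandas as pd
    from pandas.io.stata import StataReader

    # Raw numeric codes: no categoricals are built, so no uniqueness error.
    df = pd.read_stata(path, convert_categoricals=False)
    with StataReader(path) as reader:
        value_labels = reader.value_labels()  # {set name: {code: label}}

    for column, label_set in label_set_for_column.items():
        labels = make_labels_unique(value_labels[label_set])
        # Codes without a label (such as 33 in the minimal example) stay numeric.
        df[column] = df[column].map(lambda code: labels.get(code, code))
    return df
```

With the minimal example from the question, the call would look like read_stata_with_unique_labels("file.dta", {"multilabel": "labelrepeat"}); the suffix format " (2)" is a choice made here and can be changed to "(2)" if that is preferred.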
For completeness, here's a way to find out which variables have value labels that aren't in one-to-one correspondence with numeric values.
* your sandbox, simplified and extended
clear
set obs 3
generate var1 = _n
generate multilabel = 11 * _n
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat
label define var1 1 "frog" 2 "toad" 3 "newt"
label val var1 var1
* my code
local bad
ds *, has(vallabel)
quietly foreach v in `r(varlist)' {
    tempvar decoded diff
    decode `v', gen(`decoded')
    bysort `decoded' (`v') : gen `diff' = `v'[1] != `v'[_N] & !missing(`decoded')
    count if `diff'
    if r(N) > 0 local bad `bad' `v'
    drop `decoded' `diff'
}
di "`bad'"
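A rough Python counterpart, for readers who would rather do the check from pandas: the sketch below takes the nested dict that StataReader.value_labels() returns and reports which value-label sets use one label for more than one code. The function name repeated_labels is hypothetical. Unlike the Stata loop above, it inspects the label definitions only, not the values actually present in the data, so it can flag a set whose duplicated codes never occur.

```python
from collections import Counter


def repeated_labels(value_labels):
    """Map each value-label set name to the labels it uses more than once.

    value_labels has the shape {set name: {code: label}}; sets without
    repeated labels are left out of the result.
    """
    result = {}
    for name, mapping in value_labels.items():
        counts = Counter(mapping.values())
        dupes = sorted(label for label, n in counts.items() if n > 1)
        if dupes:
            result[name] = dupes
    return result


# Label definitions from the sandbox above, written out by hand.
example = {
    "labelrepeat": {11: "oneone or twotwo", 22: "oneone or twotwo"},
    "var1": {1: "frog", 2: "toad", 3: "newt"},
}
print(repeated_labels(example))  # {'labelrepeat': ['oneone or twotwo']}
```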