Hello,

I'm importing large text files of data using csv. I would like to add some more auto-sensing abilities. I'm considering sampling the data file and doing some fuzzy-logic scoring on the attributes (columns in a database/CSV file, e.g. height, weight, income) to determine the most efficient 'type' to convert each attribute column into for further processing and efficient storage...

Example rows from sampled file data:

    [['8', '2.33', 'A', 'BB', 'hello there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same type of data continues through a column. For example, a string, int8, int16, float etc.

1. What is the most efficient way in Python to test whether a string can be converted into a given numeric type, or left alone if it's really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor

This is untested, but here is an outline to do what you want.

First convert rows to columns:

    columns = zip(*rows)

Okay, that was a lot of typing. Now, you should run down the columns, testing with the most restrictive type and working to less restrictive types. You will also need to keep in mind the potential for commas in your numbers--so you will need to write your own converters, determining for yourself what literals map to what values. Only you can decide what you really want here. Here is a minimal idea of how I would do it:

    def make_int(astr):
        # An empty field is treated as zero.
        if not astr:
            return 0
        else:
            # Strip thousands separators before converting.
            return int(astr.replace(',', ''))

    def make_float(astr):
        if not astr:
            return 0.0
        else:
            return float(astr.replace(',', ''))

    # Identity converter: anything can stay a string.
    make_str = lambda s: s

Now you can put the converters in a list, remembering to order them.

    converters = [make_int, make_float, make_str]

Now, go down the columns checking, moving to the next, less restrictive, converter when a particular converter fails. We assume that the make_str identity operator will never fail.
We could leave it out and have a flag, etc., for efficiency, but that is left as an exercise.

    new_columns = []
    for column in columns:
        for converter in converters:
            try:
                # The whole column must convert, or we fall through
                # to the next, less restrictive converter.
                new_column = [converter(v) for v in column]
                break
            except ValueError:
                continue
        new_columns.append(new_column)

For no reason at all, convert back to rows:

    new_rows = zip(*new_columns)

You must decide for yourself how to deal with ambiguities. For example, will '1.0' be a float or an int? The above assumes you want all values in a column to have the same type. Reordering the loops can give mixed types in columns, but would not fulfill your stated requirements. Some things are not as efficient as they might be (for example, eliminating the clumsy make_str). But adding tests to improve efficiency would cloud the logic.

James

On 19/05/2007 10:04 AM, James Stroud wrote:
> py_genetic wrote:
>> [...]
>
> This is untested, but here is an outline to do what you want.
> [...]
[apologies in advance if this appears more than once]

This approach is quite reasonable, IF:

(1) the types involved follow a simple "ladder" hierarchy [ints pass the float test, floats pass the str test]
(2) the supplier of the data has ensured that all values in a column are actually instances of the intended type.

Constraint (1) falls apart if you need dates. Consider 31/12/99, 31/12/1999, 311299 [int?], 31121999 [int?], 31DEC99, ... and that's before you allow for dates in three different orders (dmy, mdy, ymd).

Constraint (2) just falls apart -- with user-supplied data, there seem to be no rules but Rafferty's and no laws but Murphy's.

The approach that I've adopted is to test the values in a column for all types, and choose the non-text type that has the highest success rate (provided the rate is greater than some threshold e.g. 90%, otherwise it's text).

For large files, taking a 1/N sample can save a lot of time with little chance of misdiagnosis.

Example: file of 1,079,000 records, with 15 columns, ultimately diagnosed as being 8 x text, 3 x int, 1 x float, 2 x date (dmy order), and [no kidding] 1 x date (ymd order). Using N==101 took about 15 seconds [Python 2.5.1, Win XP Pro SP2, 3.2GHz dual-core]; N==1 takes about 900 seconds. The "converter" function for dates is written in C.

Cheers,
John
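For anyone who wants to try the success-rate idea John describes, here is a minimal sketch, assuming Python 3. It is not John's code (his date converter was written in C): the names `diagnose_column`, `CANDIDATES`, `threshold` and `step`, and the choice of just two date layouts, are assumptions made for illustration.

```python
from datetime import datetime


def to_int(s):
    # Strip thousands separators, as in James's make_int.
    return int(s.replace(",", ""))


def to_float(s):
    return float(s.replace(",", ""))


def make_date_parser(fmt):
    # Hypothetical helper: builds a converter for one fixed date layout.
    def parse(s):
        return datetime.strptime(s, fmt).date()
    return parse


# Candidate non-text types, most restrictive first, so that a tie in
# success rate goes to the narrower type (int before float).
CANDIDATES = [
    ("int", to_int),
    ("float", to_float),
    ("date_dmy", make_date_parser("%d/%m/%Y")),
    ("date_ymd", make_date_parser("%Y/%m/%d")),
]


def success_rate(values, converter):
    """Fraction of values that the converter accepts."""
    if not values:
        return 0.0
    ok = 0
    for v in values:
        try:
            converter(v)
            ok += 1
        except ValueError:
            pass
    return ok / len(values)


def diagnose_column(values, threshold=0.9, step=1):
    """Return the name of the best-fitting type for one column.

    Only every `step`-th value is tested (the 1/N sample). The column
    is called text unless some candidate's rate exceeds `threshold`.
    """
    sample = values[::step]
    best_name, best_rate = "text", 0.0
    for name, converter in CANDIDATES:
        rate = success_rate(sample, converter)
        if rate > best_rate:  # strict: ties keep the earlier candidate
            best_name, best_rate = name, rate
    return best_name if best_rate > threshold else "text"
```

With the rows transposed as in James's outline (`columns = list(zip(*rows))`), calling `diagnose_column(column, step=101)` on each column gives a per-column verdict from a 1/101 sample. A column such as `['8', '12', '100,000,000,000']` comes out as `int`, and one stray bad value in twenty does not demote a column to text. Note that `float()` also accepts strings like `'nan'` and `'inf'`, so real data may need stricter converters.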