How R formats POSIXct with fractional seconds

Problem description

I believe that R incorrectly formats POSIXct values with fractional seconds. I submitted this via R-bugs as an enhancement request and got brushed off with "we think the current behavior is correct -- bug deleted." While I am very appreciative of the work the developers have done and continue to do, I wanted to get other people's take on this particular issue, and perhaps advice on how to make the point more effectively.

Here is an example:

 > tt <- as.POSIXct('2011-10-11 07:49:36.3')
 > strftime(tt,'%Y-%m-%d %H:%M:%OS1')
 [1] "2011-10-11 07:49:36.2"

That is, tt is created as a POSIXct time with a fractional part of .3 seconds. When it is printed with one decimal digit, the value shown is .2. I work a lot with timestamps of millisecond precision, and it causes me a lot of headaches that times are often printed one notch lower than the actual value.

Here is what is happening: POSIXct is a floating-point number of seconds since the epoch. All integer values are represented exactly, but in base-2 floating point the closest representable value to a time ending in .3 is very slightly smaller than the value entered. The stated behavior of strftime() for format %OSn is to round down (truncate) to the requested number of decimal digits, so the displayed result is .2. For other fractional parts the floating-point value is slightly above the value entered, and the display gives the expected result:
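The effect can be reproduced without any formatting call. The following is a minimal sketch, assuming the timestamp is built in UTC by adding 0.3 to a whole-second POSIXct value (the variable names are illustrative and not from the original post):

```r
# Build the example time as seconds since the epoch.
# UTC is used so the result does not depend on the local time zone.
tt <- as.POSIXct("2011-10-11 07:49:36", tz = "UTC") + 0.3
secs <- as.numeric(tt)

# Printing extra digits reveals a fractional part just under .3.
sprintf("%.9f", secs)

# Isolate the stored fractional part.
frac <- secs - floor(secs)
frac < 0.3                              # TRUE

# Truncating to one decimal gives .2; rounding to nearest gives .3.
truncated <- floor(frac * 10) / 10      # 0.2
rounded   <- round(frac, 1)             # 0.3
```

The truncated value corresponds to what %OS1 displays in the question, while the rounded value is what the question expects to see.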

 > tt <- as.POSIXct('2011-10-11 07:49:36.4')
 > strftime(tt,'%Y-%m-%d %H:%M:%OS1')
 [1] "2011-10-11 07:49:36.4"

The developers' argument is that for time types we should always round down to the requested precision. For example, if the time is 11:59:59.8, then printing it with format %H:%M should give "11:59", not "12:00", and %H:%M:%S should give "11:59:59", not "12:00:00". I agree with this for integer numbers of seconds and for the format flag %S, but I think the behavior should be different for format flags that are designed for fractional parts of seconds. I would like to see %OSn use round-to-nearest behavior, even for n = 0, while %S keeps rounding down, so that printing 11:59:59.8 with format %H:%M:%OS0 would give "12:00:00". This would not affect anything for integer numbers of seconds, because those are always represented exactly, but it would handle round-off errors in fractional seconds more naturally.
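The two behaviors being contrasted can be sketched with base R's trunc() and round() methods for date-times. This is only an illustration of the difference for whole seconds, not a change to how %OSn itself works; the variable names are made up for the example:

```r
# 11:59:59.8, built in UTC so the result does not depend on the local time zone.
tm <- as.POSIXct("2011-10-11 11:59:59", tz = "UTC") + 0.8

# Round down to the whole second: the behavior the developers argue for.
down <- format(trunc(tm, "secs"), "%H:%M:%S")

# Round to the nearest second: the behavior the question proposes for %OS0.
nearest <- format(round(tm, "secs"), "%H:%M:%S")

down      # "11:59:59"
nearest   # "12:00:00"
```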

This matches how fractional parts are printed in C, for example: an integer cast truncates (rounds down for positive values), whereas the printf floating-point formats round to the nearest value at the requested precision:

 #include <stdio.h>

 int main(void) {
     double x = 9.97;
     printf("%d\n", (int) x);   //  9     (the cast truncates)
     printf("%.0f\n", x);       //  10    (rounds to nearest)
     printf("%.1f\n", x);       //  10.0
     printf("%.2f\n", x);       //  9.97
     return 0;
 }

I did a quick survey of how fractional seconds are handled in other languages and environments, and there really doesn't seem to be a consensus. Most constructs are designed for integer numbers of seconds, and the fractional parts are an afterthought. It seems to me that in this case the R developers made a choice that is not completely unreasonable but is in fact not the best one, and is not consistent with the conventions elsewhere for displaying floating-point numbers.

What are people's thoughts? Is the R behavior correct? Is it the way you yourself would design it?

Recommended answer

One underlying problem is that the POSIXct representation is less precise than the POSIXlt representation, and a POSIXct value is converted to POSIXlt before formatting. Below we see that if the string is converted directly to POSIXlt, it prints correctly.

> as.POSIXct('2011-10-11 07:49:36.3')
[1] "2011-10-11 07:49:36.2 CDT"
> as.POSIXlt('2011-10-11 07:49:36.3')
[1] "2011-10-11 07:49:36.3"

We can also see this by comparing the fractional part stored in each representation with the usual double-precision representation of 0.3.

> t1 <- as.POSIXct('2011-10-11 07:49:36.3')
> as.numeric(t1 - round(unclass(t1))) - 0.3
[1] -4.768372e-08

> t2 <- as.POSIXlt('2011-10-11 07:49:36.3')
> as.numeric(t2$sec - round(unclass(t2$sec))) - 0.3
[1] -2.831069e-15

Interestingly, it looks like both representations are actually less than the usual representation of 0.3, but the second one is either close enough or is truncated in a way different from what I'm imagining here. Given that, I'm not going to worry about floating-point representation difficulties; they may still happen, but if we're careful about which representation we use, they will hopefully be minimized.
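The size of the two errors is consistent with the spacing of double-precision numbers at each magnitude. The sketch below assumes IEEE 754 doubles with a 52-bit fraction; ulp() is a hypothetical helper written for this illustration, not a base R function, and the epoch value is only an approximate stand-in for the example time:

```r
# Spacing between adjacent doubles near x (for positive, normal x).
# Hypothetical helper, assuming 52 fraction bits (IEEE 754 double).
ulp <- function(x) 2^(floor(log2(x)) - 52)

epoch_secs <- 1318319376.3   # roughly the example time as seconds since 1970

ulp(epoch_secs)   # about 2.4e-07: the resolution available to POSIXct here
ulp(36.3)         # about 7.1e-15: the resolution of the POSIXlt sec field
```

Each error reported above is a fraction of the corresponding spacing, which would explain why the POSIXlt seconds field sits so much closer to 0.3 than the POSIXct value does.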

Robert's desire for rounded output is then simply an output problem, and could be addressed in any number of ways. My suggestion would be something like this:

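myformat.POSIXct <- function(x, digits=0) {
  # Round the seconds-since-epoch value first so minute/hour/day rollover is handled
  x2 <- round(unclass(x), digits)
  # Restore the POSIXct class (and any other attributes) removed by unclass()
  attributes(x2) <- attributes(x)
  x <- as.POSIXlt(x2)
  # Round again in the more precise POSIXlt seconds field
  x$sec <- round(x$sec, digits)
  format.POSIXlt(x, paste("%Y-%m-%d %H:%M:%OS",digits,sep=""))
}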

This starts with a POSIXct input and first rounds it to the desired number of digits; it then converts to POSIXlt and rounds again. The first rounding makes sure that all of the units increase appropriately when we are on a minute/hour/day boundary; the second rounding cleans up the seconds after converting to the more precise representation.

> options(digits.secs=1)
> t1 <- as.POSIXct('2011-10-11 07:49:36.3')
> format(t1)
[1] "2011-10-11 07:49:36.2"
> myformat.POSIXct(t1,1)
[1] "2011-10-11 07:49:36.3"

> t2 <- as.POSIXct('2011-10-11 23:59:59.999')
> format(t2)
[1] "2011-10-11 23:59:59.9"
> myformat.POSIXct(t2,0)
[1] "2011-10-12 00:00:00"
> myformat.POSIXct(t2,1)
[1] "2011-10-12 00:00:00.0"

A final aside: did you know the standard allows for up to two leap seconds?

> as.POSIXlt('2011-10-11 23:59:60.9')
[1] "2011-10-11 23:59:60.9"

OK, one more thing. The behavior actually changed in May due to a bug filed by the OP (Bug 14579); before that, it did round fractional seconds. Unfortunately, that meant that sometimes it could round up to a second that wasn't possible; in the bug report, it went up to 60 when it should have rolled over to the next minute. One reason the decision was made to truncate instead of round is that the value is printed from the POSIXlt representation, where each unit is stored separately. Thus rolling over to the next minute/hour/etc. is more difficult than a straightforward rounding operation. To round easily, it's necessary to round in the POSIXct representation and then convert back, as I suggest.
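The rollover problem can be reproduced directly. The following is a minimal sketch, assuming a UTC timestamp just before midnight; the variable names are illustrative and not taken from the bug report:

```r
# 23:59:59.999 in UTC (UTC keeps the example independent of the local time zone).
ct <- as.POSIXct("2011-10-11 23:59:59", tz = "UTC") + 0.999

# Rounding only the seconds field of the broken-down time yields an impossible 60.
lt <- as.POSIXlt(ct)
lt$sec <- round(lt$sec, 0)
c(min = lt$min, sec = lt$sec)        # min stays 59, sec becomes 60

# Rounding the seconds-since-epoch value first lets the conversion carry the overflow.
ct_rounded <- as.POSIXct(round(as.numeric(ct), 0), origin = "1970-01-01", tz = "UTC")
lt2 <- as.POSIXlt(ct_rounded)
c(mday = lt2$mday, hour = lt2$hour, min = lt2$min, sec = lt2$sec)   # 12, 0, 0, 0
```

Because the epoch value is a single number, rounding it carries into minutes, hours and days automatically, whereas the separate POSIXlt fields would each need to be adjusted by hand.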
