本文介绍了如何在BigQuery Standard SQL中进行可重复采样?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
在此博客一位Google Cloud员工介绍了如何对BigQuery中的机器学习进行数据集的可重复采样.这对于创建(和复制)数据的训练/验证/测试分区非常重要.
In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.
但是该博客使用旧版SQL,该旧版SQL Google现在已弃用,转而支持标准SQL.
However the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.
如何使用标准SQL重新编写如下所示的博客示例代码?
How would you re-write the blog's sampling code shown below, but using Standard SQL?
#legacySQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
[bigquery-samples:airline_ontime_data.flights]
WHERE
ABS(HASH(date)) % 10 < 8
推荐答案
标准SQL会这样重写查询:
Standard SQL would re-write the query thus:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8
具体是这些更改:
- 句点(不是冒号),用于将Google Cloud项目与表名分开.
- 反引号(而不是方括号)以转义表名称中的连字符.
-
MOD
函数(不是%
). -
FARM_FINGERPRINT
(不是HASH
).实际上,这是与旧版SQLHASH
不同的哈希函数,该函数实际上随着时间的推移并不一致就像博客所暗示的那样.
- a period (not colon) to separate the Google Cloud project from table name.
- backticks (not square brackets) to escape hyphen in the table name.
MOD
function (not%
).FARM_FINGERPRINT
(notHASH
). This is actually a different hashing function than Legacy SQL'sHASH
, which wasn't in fact consistent over time as the blog had implied.
这篇关于如何在BigQuery Standard SQL中进行可重复采样?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!