Problem description
I know that work is being done to improve the JOIN feature in BigQuery, and I don't mean to rant here, but it will be hard to analyze 'terabyte' data sets 'as advertised' if joins cannot be used properly.
OK, back to the problem. I have two tables: one is 600 MB and the other is 50 MB. I tried to join them and got an error about which table must appear on the left. I did some research and found that BigQuery apparently considers a table big if it is larger than 7 MB?
So, based on some advice I found online, I selected the relevant data from the smaller table and saved it in a new table; the new data set is 12 MB, with 600K rows over two columns. Then I tried my query again, but I am still getting the same error:
Query Failed : Error: Large table cdrs_test.geoIP_Left must appear as the leftmost table in a join query
geoIP_Left is 12 MB; the other table is 600 MB.
Can this be fixed somehow, or am I stuck? If so, does anyone know of another service that supports joins and that I can use to analyze big data sets?
EDIT: This is the actual query:
SELECT COUNT(results.cc_card) AS count,
sum(results.sessiontime) AS time,
geoIP_Left.place AS place
FROM cdrs_test.cdrs_2010_5 AS results
JOIN cdrs_test.geoIP_Left AS geoIP_table
ON results.cc_card = geoIP_table.vcard
WHERE results.sessiontime > 0 AND results.countryName Contains 'India'
GROUP BY place;
You could express this query using a subselect that references only the columns from cdrs_test.geoIP_Left that you are interested in. See the second example here.
SELECT
  COUNT(results.cc_card) AS count,
  SUM(results.sessiontime) AS time,
  geoIP_table.place AS place
FROM
  cdrs_test.cdrs_2010_5 AS results
JOIN
  (SELECT place, vcard FROM cdrs_test.geoIP_Left) AS geoIP_table
ON
  results.cc_card = geoIP_table.vcard
WHERE
  results.sessiontime > 0 AND results.countryName CONTAINS 'India'
GROUP BY
  place;
You can also simply run multiple queries: you can save query results explicitly as a named table and use that table in a later query.
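To make the two-step idea concrete, here is a minimal local sketch. It uses Python's built-in sqlite3 module as a stand-in for BigQuery, so it only illustrates the shape of the workflow, not BigQuery's own mechanism for saving results. The table geoip_small, the extra column, the output aliases, and the sample rows are all made up for illustration, and LIKE stands in for CONTAINS, which SQLite does not have.

```python
import sqlite3


def demo_connection():
    """Build an in-memory database with tiny, made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cdrs_2010_5 (cc_card TEXT, sessiontime INTEGER, countryName TEXT);
        CREATE TABLE geoIP_Left (place TEXT, vcard TEXT, extra TEXT);
        INSERT INTO cdrs_2010_5 VALUES
            ('A', 10, 'India'), ('A', 5, 'India Mobile'),
            ('B', 7, 'India'), ('B', 0, 'India'), ('C', 3, 'Nepal');
        INSERT INTO geoIP_Left VALUES
            ('Delhi', 'A', 'x'), ('Mumbai', 'B', 'y'), ('Kathmandu', 'C', 'z');
        """
    )
    return conn


def two_step_join(conn):
    """Run the join as two queries, saving the intermediate result as a table."""
    cur = conn.cursor()
    # Step 1: save only the columns the join needs as a named table
    # (geoip_small is a hypothetical name).
    cur.execute("CREATE TABLE geoip_small AS SELECT place, vcard FROM geoIP_Left")
    # Step 2: join the large table against the saved, smaller table.
    cur.execute(
        """
        SELECT COUNT(r.cc_card) AS call_count,
               SUM(r.sessiontime) AS total_time,
               g.place AS place
        FROM cdrs_2010_5 AS r
        JOIN geoip_small AS g ON r.cc_card = g.vcard
        WHERE r.sessiontime > 0 AND r.countryName LIKE '%India%'
        GROUP BY g.place
        ORDER BY place
        """
    )
    return cur.fetchall()


if __name__ == "__main__":
    for row in two_step_join(demo_connection()):
        print(row)
```

In BigQuery itself, step 1 would be a query whose results you save as a named table, and step 2 a second query that joins against that saved table.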
Finally, another option would be to pre-join the data, using (for example) a MapReduce transformation pipeline, or an ETL tool provided by one of the BigQuery ETL partners.
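As a rough illustration of what pre-joining means, the sketch below attaches place to each call record before the data is loaded, so the later query needs no JOIN at all. It is plain Python rather than an actual MapReduce pipeline or ETL product; the function name pre_join and the dict-based record layout are assumptions, and it assumes each vcard appears at most once in the small table.

```python
def pre_join(cdr_rows, geoip_rows):
    """Yield call records enriched with `place`, matched on card number.

    Hypothetical helper: records are plain dicts keyed by column name.
    """
    # Build an in-memory lookup from the small table: vcard -> place.
    # Assumes vcard is unique; a later duplicate would overwrite an earlier one.
    place_by_card = {row["vcard"]: row["place"] for row in geoip_rows}

    for cdr in cdr_rows:
        place = place_by_card.get(cdr["cc_card"])
        if place is not None:
            # Inner-join semantics: records without a match are dropped.
            yield {**cdr, "place": place}


if __name__ == "__main__":
    cdrs = [
        {"cc_card": "A", "sessiontime": 10, "countryName": "India"},
        {"cc_card": "Z", "sessiontime": 4, "countryName": "India"},
    ]
    geoip = [{"vcard": "A", "place": "Delhi"}]
    for record in pre_join(cdrs, geoip):
        print(record)
```

The enriched records can then be loaded as a single table and aggregated with a plain GROUP BY on place.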