BigQuery: When is GHTorrent refreshed, and how can I get the latest information?

Problem description

The ghtorrent-bq data is great for getting a snapshot of GitHub, but it is not clear when it is updated or how I could get more up-to-date data.

Solution

(related to https://stackoverflow.com/a/42930963/132438)

GHTorrent only provides a periodic snapshot of its data on BigQuery, while GitHub Archive updates daily (or even hourly - let me check that).
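One way to see for yourself how fresh GitHub Archive is would be to ask for the newest event timestamp it holds. This is only a sketch: it assumes the daily tables are exposed as `githubarchive.day.YYYYMMDD` and that `created_at` is the event timestamp, and the `2017*` prefix is just an example year to adjust. GHTorrent snapshots, by contrast, carry their date in the dataset name (for example `ght_2017_01_19`).

```sql
#standardSQL
-- Newest event currently available in the GitHub Archive daily tables.
-- Assumption: tables follow the githubarchive.day.YYYYMMDD naming scheme;
-- change the 2017 prefix to the year you care about.
SELECT MAX(created_at) AS latest_event
FROM `githubarchive.day.2017*`
```

If the result is only hours old, the archive is being loaded more often than daily.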

It would be great to have a more frequent snapshot of GHTorrent (maybe https://twitter.com/gousiosg can help), but in the meantime you can merge both datasets (look for the GHTorrent snapshot data, and then add the latest stars from GitHub Archive):

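#standardSQL
-- Count distinct stargazers of angular/angular across both sources.
-- DISTINCT removes users that appear in both the snapshot and the archive.
SELECT COUNT(DISTINCT login) c
FROM (
  SELECT login
  FROM (
    -- Users recorded as watchers in the GHTorrent snapshot ght_2017_01_19
    SELECT c.login
    FROM `ghtorrent-bq.ght_2017_01_19.watchers` a
    JOIN `ghtorrent-bq.ght_2017_01_19.projects` b
    ON a.repo_id = b.id
    JOIN `ghtorrent-bq.ght_2017_01_19.users` c
    ON a.user_id = c.id
    WHERE b.url = 'https://api.github.com/repos/angular/angular'
  )
  UNION ALL (
    -- Stars (WatchEvent) recorded by GitHub Archive in the 2017 monthly tables
    SELECT actor.login
    FROM `githubarchive.month.2017*`
    WHERE repo.name = 'angular/angular'
    AND type = 'WatchEvent'
  )
)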
