Problem description
I have click stream data such as referring URLs, top landing pages, top exit pages, and metrics such as page views, number of visits, and bounces, all in Google Analytics. There is no database yet where all this information might be stored. I am required to build a data warehouse from scratch (which I believe is known as a web-house) from this data. So I need to extract data from Google Analytics and load it into a warehouse on a daily, automated basis. My questions are:
1) Is it possible? The data grows every day (some in terms of metrics or measures such as visits, and some in terms of new referring sites), so how would the process of loading the warehouse work?
2) What ETL tool would help me achieve this? I believe Pentaho has a way to pull data out of Google Analytics; has anyone used it? How does that process go? Any references or links would be appreciated in addition to answers.
Recommended answer
As always, knowing the structure of the underlying transaction data (the atomic components used to build a DW) is the first and biggest step.
There are essentially two options, based on how you retrieve the data. One of these, already mentioned in a prior answer to this question, is to access your GA data via the GA API. This gives you data in a form close to what appears in the GA reports, rather than transactional data. The advantage of using this as your data source is that your "ETL" is very simple: parsing the data out of the XML container is about all that's needed.
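To give a feel for how little parsing the API route needs, here is a minimal Python sketch. It assumes the response is an Atom feed in which each entry carries dimension and metric elements; the sample feed, its values, the `dxp` namespace URI, and the function name `feed_to_rows` are illustrative assumptions, not taken from a real response, so check them against what your API version actually returns.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like an Atom feed; a real response may differ.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dxp="http://schemas.google.com/analytics/2009">
  <entry>
    <dxp:dimension name="ga:source" value="lindenlab.com"/>
    <dxp:metric name="ga:visits" value="12"/>
    <dxp:metric name="ga:bounces" value="5"/>
  </entry>
  <entry>
    <dxp:dimension name="ga:source" value="example.org"/>
    <dxp:metric name="ga:visits" value="3"/>
    <dxp:metric name="ga:bounces" value="1"/>
  </entry>
</feed>"""

# Namespace prefixes used in the lookups below (assumed, see lead-in).
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "dxp": "http://schemas.google.com/analytics/2009",
}


def feed_to_rows(xml_text):
    """Turn each feed entry into one flat dict, ready to load into a table."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", NS):
        row = {}
        for dim in entry.findall("dxp:dimension", NS):
            row[dim.get("name")] = dim.get("value")
        for met in entry.findall("dxp:metric", NS):
            # Metrics in this sample are whole numbers.
            row[met.get("name")] = int(met.get("value"))
        rows.append(row)
    return rows


rows = feed_to_rows(SAMPLE_FEED)
```

Each dict in `rows` is already aggregated (one row per dimension value), which is why this route is closer to report data than to transactions.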
The second option involves grabbing the data much closer to the source.
It's nothing complicated, but a few lines of background are perhaps helpful here.
The GA Web Dashboard is created by parsing/filtering a GA transaction log (the container that holds the GA data corresponding to one Profile in one Account).
Each line in this log represents a single transaction and is delivered to the GA server in the form of an HTTP Request from the client.
Appended to that Request (which is nominally for a single-pixel GIF) is a single string that contains all of the data returned from that _trackPageview function call, plus data from the client DOM, the GA cookies set for this client, and the contents of the browser's location bar (http://www....).
Though this Request comes from the client, it is invoked by the GA script (which resides on the client) immediately after execution of GA's primary data-collecting function (_trackPageview).
So working directly with this transaction data is probably the most natural way to build a data warehouse; another advantage is that you avoid the additional overhead of an intermediate API.
The individual lines of the GA log are not normally available to GA users. Still, it's simple to get them. These two steps should suffice:
Modify the GA tracking code on each page of your Site so that it sends a copy of each GIF Request (one line in the GA logfile) to your own server. Specifically, immediately before the call to _trackPageview(), add this line:
pageTracker._setLocalRemoteServerMode();
Next, put a single-pixel GIF image in your document root and name it "__utm.gif".
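If you don't have a single-pixel image at hand, one way to create it is sketched below in Python. The base64 string is a commonly circulated 1x1 transparent GIF, and the helper name `write_tracking_pixel` and the idea of passing the document root as an argument are assumptions for illustration; any valid one-pixel GIF named "__utm.gif" should serve the same purpose.

```python
import base64
from pathlib import Path

# A commonly used 1x1 transparent GIF, base64-encoded.
PIXEL_B64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"


def write_tracking_pixel(doc_root):
    """Write the pixel as __utm.gif inside doc_root and return its path."""
    target = Path(doc_root) / "__utm.gif"
    target.write_bytes(base64.b64decode(PIXEL_B64))
    return target
```

Calling `write_tracking_pixel("/path/to/your/document/root")` once is enough; the file itself never changes, only the query strings of the requests for it.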
So now your server activity log will contain these individual transaction lines, again built from a string appended to an HTTP Request for the GA tracking pixel as well as from other data in the Request (e.g., the User Agent string). The former string is just a concatenation of key-value pairs, in which each key begins with the letters "utm" (probably for "Urchin tracker"). Not every utm parameter appears in every GIF Request; several of them, for instance, are used only for e-commerce transactions. It depends on the transaction.
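As a rough illustration of pulling those lines out of the server log, the Python sketch below assumes the server writes Apache's "combined" log format. The sample line, the IP address, the timestamp, and the helper name `extract_gif_request` are made up for the example; adjust the pattern if your server logs differently.

```python
import re
from urllib.parse import urlsplit

# Assumes Apache "combined" log format; other formats need another pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical log line, shortened to two utm parameters for readability.
SAMPLE_LINE = (
    '203.0.113.7 - - [19/May/2010:08:00:51 +0000] '
    '"GET /__utm.gif?utmwv=1&utmn=1669045322 HTTP/1.1" 200 35 '
    '"http://lindenlab.hrmdirect.com/employment/openings.php" "Mozilla/5.0"'
)


def extract_gif_request(line):
    """Return the useful parts of a __utm.gif hit, or None for other lines."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    target = urlsplit(match.group("target"))
    if not target.path.endswith("/__utm.gif"):
        return None
    return {
        "ip": match.group("ip"),
        "time": match.group("time"),
        "user_agent": match.group("agent"),
        "query": target.query,  # the utm key-value string
    }


hit = extract_gif_request(SAMPLE_LINE)
```

Running this over each day's log file and keeping only the non-None results gives you the raw transaction rows for that day's load.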
Here's an actual GIF Request (the account ID has been sanitized; otherwise it's intact):
As you can see, this string consists of a set of key-value pairs, each separated by an "&". Two trivial steps make it much easier to read: (i) splitting the string on the ampersand; and (ii) replacing each GIF parameter (key) with a short descriptive phrase:
gatc_version 1
GIF_req_unique_id 1669045322
language_encoding UTF-8
screen_resolution 1280x800
screen_color_depth 24-bit
browser_language en-us
java_enabled 1
flash_version 10.0%20r45
campaign_session_new 1
page_title Position%20Listings%20%7C%20Linden%20Lab
host_name lindenlab.hrmdirect.com
referral_url http://lindenlab.com/employment
page_request /employment/openings.php?sort=da
account_string UA-XXXXXX-X
cookies __utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
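The two steps that produce a listing like this can be sketched in Python as follows. The readable names are the ones used in the listing; the utm* keys they are mapped from are an assumption based on Google's documented GIF request parameters, and the sample query string is a hypothetical one assembled from a few of the listed values, not the original request.

```python
from urllib.parse import unquote

# Assumed mapping from GIF request parameter to the readable names above.
READABLE_NAMES = {
    "utmwv": "gatc_version",
    "utmn": "GIF_req_unique_id",
    "utmcs": "language_encoding",
    "utmsr": "screen_resolution",
    "utmsc": "screen_color_depth",
    "utmul": "browser_language",
    "utmje": "java_enabled",
    "utmfl": "flash_version",
    "utmcn": "campaign_session_new",
    "utmdt": "page_title",
    "utmhn": "host_name",
    "utmr": "referral_url",
    "utmp": "page_request",
    "utmac": "account_string",
    "utmcc": "cookies",
}

# Hypothetical query string built from values in the listing.
SAMPLE_QUERY = (
    "utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800"
    "&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmac=UA-XXXXXX-X"
)


def parse_gif_query(query, decode=False):
    """Split on '&', then rename each key; unknown keys are kept as-is."""
    record = {}
    for pair in query.split("&"):            # step (i)
        if not pair:
            continue
        key, _, value = pair.partition("=")
        name = READABLE_NAMES.get(key, key)  # step (ii)
        record[name] = unquote(value) if decode else value
    return record


raw_record = parse_gif_query(SAMPLE_QUERY)
decoded_record = parse_gif_query(SAMPLE_QUERY, decode=True)
```

With `decode=True` the percent-escapes are resolved as well, so the page title comes out as plain text; leaving them encoded reproduces the listing as shown.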
The cookies are also simple to parse (see Google's concise description of them); for instance:
__utma is the unique-visitor cookie,
__utmb, __utmc are session cookies, and
__utmz is the referral type.
The GA cookies store the majority of the data that record each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the Site, a subsequent visit the next day, etc.). So, for instance, the __utma cookie consists of groups of integers, each group separated by a "."; the last group is the visit count for that user (a "1" in this case).
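A short Python sketch of that cookie parsing, using the cookies value from the request above. The function names are hypothetical, and the layout assumed for __utmz (four dot-separated fields followed by "|"-separated campaign pairs) is inferred from this one sample rather than from a specification.

```python
from urllib.parse import unquote

# The cookies value from the sample GIF request.
COOKIES = (
    "__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B"
    "__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B"
    "__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7C"
    "utmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B"
)


def split_cookies(raw):
    """Decode the percent-escapes and return {cookie_name: cookie_value}."""
    cookies = {}
    for part in unquote(raw).split(";+"):
        if part:
            name, _, value = part.partition("=")
            cookies[name] = value
    return cookies


def visit_count(utma):
    """The last dot-separated group of __utma is the visit count."""
    return int(utma.rsplit(".", 1)[1])


def campaign_fields(utmz):
    """Assumed layout: four dotted fields, then key=value pairs joined by '|'."""
    tail = utmz.split(".", 4)[4]
    return dict(item.split("=", 1) for item in tail.split("|"))


cookies = split_cookies(COOKIES)
visits = visit_count(cookies["__utma"])
campaign = campaign_fields(cookies["__utmz"])
```

Fields such as the visit count and the campaign source are natural candidates for columns in the warehouse's fact and dimension tables.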