I am working with the very popular weather data example from Hadoop: The Definitive Guide.
A single data line looks like this:

0184010010999992015010100004+70933-008667FM-12+000999999V0200401N01701000301CN000100199-00631-00741098801ADDAA106002031AY171021AY231021GF109991999999999999999999MA1999999098681MD1110121+9999MW1731OD139902601999REMSYN07001001 11/01 90417 11063 21074 39868 49880 51012 60021 77373 333 91126=

The data format is:
[1-10]   # USAF weather station identifier
[11-15]  # WBAN weather station identifier
[16-23]  # observation date

and so on...
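
The positional layout above can be sliced directly with substring calls, which is roughly the per-line work a custom SerDe would do when deserializing. Here is a minimal Java sketch, assuming the 1-based inclusive positions exactly as listed above. The class and method names (WeatherLineSlicer, field, parse) and the column names are hypothetical, and only the three fields whose positions are stated are covered.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hive: slices a fixed-width weather line
// using the 1-based inclusive positions listed in the question.
class WeatherLineSlicer {

    // Returns the characters at 1-based inclusive positions [start, end].
    static String field(String line, int start, int end) {
        if (start < 1 || end < start || line.length() < end) {
            throw new IllegalArgumentException(
                "line too short or bad range: " + start + "-" + end);
        }
        return line.substring(start - 1, end);
    }

    // Extracts only the three fields whose positions the question states.
    // Column names are illustrative assumptions.
    static Map<String, String> parse(String line) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("usaf", field(line, 1, 10));      // [1-10]
        row.put("wban", field(line, 11, 15));     // [11-15]
        row.put("obs_date", field(line, 16, 23)); // [16-23]
        return row;
    }

    public static void main(String[] args) {
        // Leading part of the sample line from the question.
        String line = "0184010010999992015010100004+70933-008667FM-12";
        System.out.println(parse(line));
    }
}
```

For the sample line, the three slices come out as 0184010010, 99999 and 20150101. Keeping the offsets in one place like this makes it easier to adjust them if the stated layout turns out to differ from the actual record format.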

The dataset is referenced from
URL: http://ce.sysu.edu.cn/hope/UploadFiles/Education/2011/10/201110221516245419.pdf
page 39

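Now I have two options:

1.) Use RegexSerDe, which would split the line into tokens by character position, for example 0-5 for the station ID, 6-12 for the date, and so on. Could you help me write the regular expression?

2.) Go with a custom SerDe, where I can tokenize each line myself and load the data into Hive. I have implemented the SerDe interface and tried to write a custom SerDe, but I get this exception: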
org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: com.alind.project.hivedatamanager.core.WeatherDataSerDe
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:314)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:146)
    at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
    at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:536)
    at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot validate serde: com.alind.project.hivedatamanager.core.WeatherDataSerDe
    at org.apache.hadoop.hive.ql.exec.DDLTask.validateSerDe(DDLTask.java:3722)
    at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3857)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:295)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1604)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1364)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1177)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:999)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
    ... 12 more
Caused by: java.lang.ClassNotFoundException: Class com.alind.project.hivedatamanager.core.WeatherDataSerDe not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
    at org.apache.hadoop.hive.ql.exec.DDLTask.validateSerDe(DDLTask.java:3716)

I have already added the jar:
ADD JAR /path-to/MyCustomSerde.jar;

I am somewhat stuck on both options, so please help me get one of them working.
I cannot even find good documentation to read!

Best answer

Have you looked at the example on page 440 of the book?

CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
);

There is an explanation there as well.
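
The same idea could be adapted to the fixed-width weather records by using counted groups with no separators between them. RegexSerDe uses Java regular expressions, so a candidate pattern can be checked in plain Java before it goes into a table definition. The sketch below assumes the field widths given in the question (10, 5 and 8 characters). The pattern and the class name WeatherRegexCheck are illustrative and not taken from the book.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker for a candidate "input.regex" value.
class WeatherRegexCheck {

    // One capturing group per column, widths assumed from the question:
    // 10 chars (station id), 5 chars (WBAN), 8 chars (date).
    // The trailing .* consumes the rest of the line, because the pattern
    // is applied here as a whole-line match.
    static final Pattern LINE = Pattern.compile("(.{10})(.{5})(.{8}).*");

    // Returns the captured columns, or null when the line does not match.
    static String[] columns(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1); // groups are 1-based
        }
        return cols;
    }

    public static void main(String[] args) {
        // Leading part of the sample line from the question.
        String line = "0184010010999992015010100004+70933-008667FM-12";
        String[] cols = columns(line);
        System.out.println(cols == null ? "no match" : String.join(" | ", cols));
    }
}
```

If the pattern behaves as expected, the same string would go into "input.regex", with one STRING column declared per capturing group. Further fields can be added as extra counted groups before the final .* once their widths are known.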
Hope this helps :)

Source: "regex - How to write a Hive regex or custom SerDe to parse the weather data format?" on Stack Overflow: https://stackoverflow.com/questions/28416160/
