This post covers how to handle external libraries in a Hadoop Hive UDF; it should be a useful reference for anyone hitting the same problem.

Problem description

I'm trying to write a UDF for Hadoop Hive that parses user agents. The following code works fine on my local machine, but on Hadoop it fails with an error.

Code:

import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.*;
import com.decibel.uasparser.OnlineUpdater;
import com.decibel.uasparser.UASparser;
import com.decibel.uasparser.UserAgentInfo;

public class MyUDF extends UDF {
    public String evaluate(String i) {
        UASparser parser = new UASparser();
        String key = "";
        OnlineUpdater update = new OnlineUpdater(parser, key);
        UserAgentInfo info = parser.parse(i);
        return info.getDeviceType();
    }
}

Facts that come to mind which I should mention:

- I'm compiling with Eclipse's "export runnable JAR file" option, with "extract required libraries into generated JAR" selected.
- I'm uploading this "fat jar" file with Hue.
- The minimal working example I managed to run:

public String evaluate(String i) {
    return "hello" + i;
}

I guess the problem lies somewhere around the library I'm using (downloaded from https://udger.com), but I have no idea where.

Any suggestions?

Thanks, Michal

Solution

It could be a few things. The best thing is to check the logs, but here's a list of a few quick things you can check in a minute.

The jar does not contain all dependencies. I'm not sure how Eclipse builds a runnable jar, but it may not include everything. You can run

jar tf your-udf-jar.jar

to see what was included. You should see classes from com.decibel.uasparser. If not, you have to build the jar with the appropriate dependencies (usually you do that using Maven) -- see the pom sketch below.
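As a minimal sketch of that Maven build, the Shade plugin can bundle the uasparser classes into a fat jar; the plugin version and the Hive version shown are illustrative placeholders, not taken from the original post:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.4.1</version>
      <executions>
        <execution>
          <!-- bundle all compile-scope dependencies (uasparser etc.) at package time -->
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

<dependencies>
  <!-- "provided": compile against Hive, but don't bundle it into the fat jar -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
    <scope>provided</scope>
  </dependency>
</dependencies>

After mvn package, running jar tf on the shaded jar should list the com.decibel.uasparser classes. Marking hive-exec as provided also keeps you compiling against the same Hive version that runs on the cluster, which matters for the Hive-version point below.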
Different version of the JVM. If you compile with JDK 8 and the cluster runs JDK 7, it would also fail.

Hive version. Sometimes the Hive APIs change slightly, enough to be incompatible. It's probably not the case here, but make sure to compile the UDF against the same versions of Hadoop and Hive that you have in the cluster.

You should always check whether info is null after the call to parse().

It looks like the library uses a key, meaning it actually fetches data from an online service (udger.com), so it may not work without a real key. Even more important, the library updates itself online, contacting the service for each record. Looking at the code, this means it will create one update thread per record. You should change the code to do that only once, in the constructor, like this:

public class MyUDF extends UDF {
    UASparser parser = new UASparser();

    public MyUDF() {
        super();
        String key = "PUT YOUR KEY HERE";
        // update only once, when the UDF is instantiated
        OnlineUpdater update = new OnlineUpdater(parser, key);
    }

    public String evaluate(String i) {
        UserAgentInfo info = parser.parse(i);
        if (info != null) {
            return info.getDeviceType();
        } else {
            // return null for unparseable input; otherwise a single bad
            // record would stop the whole job with an exception
            return null;
        }
    }
}

But to know for sure, you have to look at the logs: the YARN logs, and also the Hive logs on the machine you submit the job from (probably in /var/log/hive, but it depends on your installation).
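For example, assuming YARN log aggregation is enabled, you can pull all container logs for the failed query's application (the application id below is a made-up placeholder):

yarn logs -applicationId application_1428487296152_25597

The stack trace in the failing task's log usually points straight at the problem, e.g. a ClassNotFoundException for a missing dependency or a NullPointerException from an unparseable user agent.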
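Once the jar is rebuilt, a quick smoke test from the Hive CLI can confirm that the UDF loads and runs; the jar path, function name, and table here are hypothetical:

-- register the fat jar and the UDF class for this session
ADD JAR /tmp/myudf-with-dependencies.jar;
CREATE TEMPORARY FUNCTION parse_device AS 'MyUDF';

-- try a handful of rows before running it over the full table
SELECT parse_device(user_agent) FROM weblogs LIMIT 10;

If this small query fails the same way, the relevant error will be near the top of those same logs.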