This article describes how to access the Maxmind Geo API in Hadoop through the distributed cache; the accepted answer below may be a useful reference for anyone hitting the same problem.

Problem Description

I am writing a MapReduce job to analyze web logs. My code is intended to map IP addresses to geographic locations, and I am using the Maxmind Geo API (https://github.com/maxmind/geoip-api-java) for that purpose. My code has a LookupService method that needs a database file with IP-to-location mappings. I am trying to pass this database file using the distributed cache.
I tried doing this in two different ways.

Case 1: run the job passing the file from HDFS, but it always throws an error saying "FILE NOT FOUND":

```
sudo -u hdfs hadoop jar \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
/user/hdfs/GeoLiteCity.dat
```

or

```
sudo -u hdfs hadoop jar \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat
```

Driver class code:

```java
Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.addCacheFile(new Path(args[2]).toUri());
```

Mapper class code:

```java
public void setup(Context context) throws IOException {
    URI[] uriList = context.getCacheFiles();
    Path database_path = new Path(uriList[0].toString());
    LookupService cl = new LookupService(database_path.toString(),
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}
```

Case 2: run the job passing the file from the local file system through the -files option. Error: a NullPointerException on the line `LookupService cl = new LookupService(database_path)`:

```
sudo -u hdfs hadoop jar \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
com.prithvi.mapreduce.logprocessing.ipgeo.GeoLocationDatasetDriver \
-files /tmp/jobs/GeoLiteCity.dat /user/hdfs/input /user/hdfs/out_put \
GeoLiteCity.dat
```

Driver code:

```java
Configuration conf = getConf();
Job job = Job.getInstance(conf);
String dbfile = args[2];
conf.set("maxmind.geo.database.file", dbfile);
```

Mapper code:

```java
public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    String database_path = conf.get("maxmind.geo.database.file");
    LookupService cl = new LookupService(database_path,
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}
```

I need this database file on all my task trackers to accomplish the job. Can anyone please suggest the right way to do so?

Solution

Try doing this. From the driver, specify where the file lives in HDFS using the Job object:

```java
job.addCacheFile(new URI("hdfs://localhost:8020/GeoLite2-City.mmdb#GeoLite2-City.mmdb"));
```

where # introduces an alias name (symbolic link) to be created by Hadoop.

After that you can access the file from the Mapper in the setup() method:

```java
@Override
protected void setup(Context context) {
    File file = new File("GeoLite2-City.mmdb");
}
```

Here is an example:

Driver code: http://goo.gl/COqysa
Mapper code: http://goo.gl/0SbQQP
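The goo.gl example links above may no longer resolve, so here is a minimal self-contained sketch of the same pattern, adapted back to the GeoLiteCity.dat / LookupService API used in the question. The class names, the job name, the argument order, and the log format (IP address as the first whitespace-separated field) are assumptions for illustration; the cache-file mechanics (addCacheFile with a # alias in the driver, opening the symlink by name in setup()) follow the accepted answer.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.maxmind.geoip.Location;
import com.maxmind.geoip.LookupService;

public class GeoLocationDatasetDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "ip-to-geo");
        job.setJarByClass(GeoLocationDatasetDriver.class);

        // args[2] points at the .dat file in HDFS; the "#GeoLiteCity.dat"
        // fragment tells Hadoop to create a symlink with that name in each
        // task's working directory.
        job.addCacheFile(new URI(args[2] + "#GeoLiteCity.dat"));

        job.setMapperClass(GeoLocationMapper.class);
        job.setNumReduceTasks(0); // map-only job for this sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class GeoLocationMapper extends Mapper<Object, Text, Text, Text> {
        private LookupService lookup;

        @Override
        protected void setup(Context context) throws IOException {
            // Open the symlink created in the task's working directory;
            // no HDFS path is needed here.
            lookup = new LookupService("GeoLiteCity.dat",
                    LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical log format: the IP is the first field of each line.
            String ip = value.toString().split("\\s+")[0];
            Location loc = lookup.getLocation(ip);
            if (loc != null) {
                context.write(new Text(ip), new Text(loc.countryName + "\t" + loc.city));
            }
        }

        @Override
        protected void cleanup(Context context) {
            lookup.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner lets GenericOptionsParser handle generic flags such as -files.
        System.exit(ToolRunner.run(new Configuration(), new GeoLocationDatasetDriver(), args));
    }
}
```

The fragment alias is what keeps the mapper independent of where the file actually lives: the task only ever opens GeoLiteCity.dat in its local working directory, whether the file arrived from HDFS via addCacheFile or from the local filesystem via the -files option.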