Problem Description
I am using a Dockerized image and Jupyter Notebook along with the SparkR kernel. When I create a SparkR notebook, it uses an install of Microsoft R (3.3.2) instead of the vanilla CRAN R install (3.2.3).
The Docker image I'm using installs some custom R libraries and Python packages, but I don't explicitly install Microsoft R. Regardless of whether I can remove Microsoft R or have it side by side, how can I get my SparkR kernel to use a custom installation of R?
Thanks in advance.
Recommended Answer
Docker-related issues aside, the settings for Jupyter kernels are configured in files named kernel.json, which reside in specific directories (one per kernel) and can be located with the command jupyter kernelspec list; for example, here is the situation on my (Linux) machine:
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
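Each of these directories holds the corresponding kernel.json, which you can inspect (or edit) directly; for instance, using the ir path from the listing above:

$ cat /usr/local/share/jupyter/kernels/ir/kernel.json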
Again, as an example, here are the contents of the kernel.json for my R kernel (ir):
{
  "argv": ["/usr/lib64/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
  "display_name": "R 3.3.2",
  "language": "R"
}
And here is the respective file for my pyspark2 kernel:
{
  "display_name": "PySpark (Spark 2.0)",
  "language": "python",
  "argv": [
    "/opt/intel/intelpython27/bin/python2",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
    "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
  }
}
As you can see, in both cases the first element of argv is the executable for the respective language - in my case, GNU R for my ir kernel and Intel Python 2.7 for my pyspark2 kernel. Changing this first element so that it points to your GNU R executable should resolve your issue.
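For the SparkR case specifically, here is a minimal sketch of such a kernel spec. The paths below are placeholders, not taken from the question: it assumes your custom R executable is /usr/local/bin/R, that the IRkernel package is installed in that R, and that Spark is unpacked at /opt/spark (so its bundled SparkR package lives in /opt/spark/R/lib):

$ mkdir -p /usr/local/share/jupyter/kernels/sparkr    # hypothetical kernel directory
$ cat > /usr/local/share/jupyter/kernels/sparkr/kernel.json <<'EOF'
{
  "display_name": "SparkR (custom R)",
  "language": "R",
  "argv": ["/usr/local/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/opt/spark",
    "R_LIBS": "/opt/spark/R/lib"
  }
}
EOF

After saving this file, jupyter kernelspec list should show the new sparkr entry, and a notebook started with that kernel can then load SparkR with library(SparkR) from the path given in R_LIBS.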