This article explains how to load data from Google Cloud into a Jupyter Notebook VM; the answer below may be a useful reference if you are facing the same problem.

Problem description

I am trying to load a bunch of csv files stored on my Google Cloud Storage into my Jupyter notebook. I use Python 3, and gsutil does not work.

Let's assume I have 6 .csv files in '\bucket1\1'. Does anybody know what I should do?

Recommended answer

You are running a Jupyter Notebook on a Google Cloud VM instance, and you want to load 6 .csv files (that you currently have on Cloud Storage) into it.

Install the dependencies:

pip install google-cloud-storage
pip install pandas

Run the following script on your Notebook:

from google.cloud import storage
import pandas as pd

bucket_name = "my-bucket-name"

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

# When you have your files in a subfolder of the bucket.
my_prefix = "csv/" # the name of the subfolder
blobs = bucket.list_blobs(prefix=my_prefix, delimiter='/')

for blob in blobs:
    if blob.name != my_prefix: # ignoring the subfolder itself
        file_name = blob.name.replace(my_prefix, "")
        blob.download_to_filename(file_name) # download the file to the machine
        df = pd.read_csv(file_name) # load the data
        print(df)

# When you have your files on the first level of your bucket

blobs = bucket.list_blobs()

for blob in blobs:
    file_name = blob.name
    blob.download_to_filename(file_name) # download the file to the machine
    df = pd.read_csv(file_name) # load the data
    print(df)

Notes:

  • Pandas is a good dependency used when dealing with data analysis in Python, so it will make it easier for you to work with the csv files.

  • The code contains two alternatives: one if you have the objects inside a subfolder and another if you have them on the first level of the bucket; use the one that applies to your case.

  • The code cycles through all the objects in the bucket, so you might get errors if you have some other kind of objects in there; one way to guard against that is sketched below.
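
For example, here is a minimal sketch that only downloads objects whose names end in ".csv" and skips everything else; the bucket name is the same placeholder used above, and collecting the DataFrames in a dictionary is just one possible choice:

from google.cloud import storage
import pandas as pd

bucket_name = "my-bucket-name"

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

dataframes = {}
for blob in bucket.list_blobs():
    if not blob.name.endswith(".csv"): # skip folders, images or any other kind of object
        continue
    file_name = blob.name.split("/")[-1] # drop any folder prefix from the object name
    blob.download_to_filename(file_name) # download the file to the machine
    dataframes[file_name] = pd.read_csv(file_name) # load the data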

  • In case you already have the files on the machine where you are running the Notebook, you can skip the Google Cloud Storage lines and just pass the root/relative path of each file to the "read_csv" method, as in the example below.
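
For instance, assuming the files were already copied into a local folder (the path below is hypothetical), loading one of them reduces to:

import pandas as pd

df = pd.read_csv("data/file1.csv") # hypothetical local path, adjust to where your files actually are
print(df.head())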

  • For more information about listing Cloud Storage objects go here, and for downloading Cloud Storage objects go here.

This concludes the article on how to load data from Google Cloud into a Jupyter Notebook VM; we hope the answer above helps you solve the problem.
