Problem Description
Update: this question is related to Google Colab's "Notebook settings: Hardware accelerator: GPU". It was written before the "TPU" option was added.
Reading multiple excited announcements about Google Colaboratory providing a free Tesla K80 GPU, I tried to run the fast.ai lesson on it, only for it to never complete - it quickly ran out of memory. I started investigating why.
The bottom line is that the "free Tesla K80" is not "free" for everyone - for some, only a small slice of it is "free".
I connect to Google Colab from the West Coast of Canada and I get only 0.5GB of what is supposed to be 24GB of GPU RAM. Other users get access to 11GB of GPU RAM.
Clearly, 0.5GB of GPU RAM is insufficient for most ML/DL work.
If you're not sure what you get, here is a little debug function I scraped together (it only works with the GPU setting of the notebook):
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn't guaranteed
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()
Executing it in a jupyter notebook before running any other code gives me:
Gen RAM Free: 11.6 GB | Proc size: 666.0 MB
GPU RAM Free: 566MB | Used: 10873MB | Util 95% | Total 11439MB
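In other words, about 10873MB of the 11439MB card is already allocated before a single line of user code has run, leaving 566MB - roughly 5% - for the notebook.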
The lucky users who get access to the full card will see:
Gen RAM Free: 11.6 GB | Proc size: 666.0 MB
GPU RAM Free: 11439MB | Used: 0MB | Util 0% | Total 11439MB
Do you see any flaw in my calculation of the GPU RAM availability, borrowed from GPUtil?
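For an independent cross-check that doesn't go through GPUtil at all, you could also query the driver directly - a minimal sketch, assuming the nvidia-smi symlink created above is in place:
# ask the NVIDIA driver for the same numbers, bypassing GPUtil's parsing
!nvidia-smi --query-gpu=memory.free,memory.used,memory.total --format=csv
If the MB figures there match what printm() reports, the problem lies in the allocation itself rather than in the measurement.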
Can you confirm that you get similar results if you run this code in a Google Colab notebook?
If my calculations are correct, is there any way to get more of that GPU RAM on the free box?
Update: I'm not sure why some of us get only 1/20th of what other users get. For example, the person who helped me debug this is from India and he gets the whole thing!
Note: please don't send any more suggestions on how to kill potentially stuck/runaway/parallel notebooks that might be consuming parts of the GPU. No matter how you slice it, if you are in the same boat as me and were to run the debug code, you'd see that you still get a total of 5% of the GPU RAM (as of this update, still).
Recommended Answer
So, to prevent another dozen answers suggesting !kill -9 -1 (a suggestion that is invalid in the context of this thread), let's close this thread:
The answer is simple:
As of this writing, Google simply gives only 5% of the GPU to some of us, and 100% to the others. Period.
dec-2019 update: The problem still exists - this question's upvotes keep coming.
mar-2019 update: A year later, a Google employee, @AmiF, commented on the state of things, stating that the problem doesn't exist and that anybody who seems to have it simply needs to reset their runtime to recover the memory. Yet the upvotes continue, which tells me the problem still exists, despite @AmiF's suggestion to the contrary.
dec-2018 update: I have a theory that Google may keep a blacklist of certain accounts, or perhaps browser fingerprints, when its robots detect non-standard behavior. It could be a total coincidence, but for quite some time I had an issue with Google Re-captcha on any website that happened to require it, where I'd have to work through dozens of puzzles before being allowed through, often taking me 10+ minutes. This lasted for many months. All of a sudden, as of this month, I get no puzzles at all, and any Google Re-captcha gets resolved with a single mouse click, as it used to almost a year ago.
And why am I telling this story? Well, because at the same time I was given 100% of the GPU RAM on Colab. That's why I suspect that if you are on a theoretical Google blacklist, you aren't trusted to be given a lot of resources for free. I wonder whether any of you find the same correlation between limited GPU access and the Re-captcha nightmare. As I said, it could also be a total coincidence.