This article looks at a performance problem with App Engine memcache / ndb.get_multi() and how to work around it; hopefully it is a useful reference for anyone hitting the same issue.

Problem description



I'm seeing very poor performance when fetching multiple keys from Memcache using ndb.get_multi() in App Engine (Python).

I am fetching ~500 small objects, all of which are in memcache. If I do this using ndb.get_multi(keys), it takes 1500ms or more. Here is typical output from App Stats:

[two App Stats screenshots]

As you can see, all the data is served from memcache. Most of the time is reported as being outside of RPC calls. However, my code is about as minimal as you can get, so if the time is spent on CPU it must be somewhere inside ndb:

# Get set of keys for items. This runs very quickly.
item_keys = memcache.get(items_memcache_key)
# Get ~500 small items from memcache. This is very slow (~1500ms).
items = ndb.get_multi(item_keys)

The first memcache.get you see in App Stats is the single fetch to get a set of keys. The second memcache.get is the ndb.get_multi call.

The items I am fetching are super-simple:

class Item(ndb.Model):
    name = ndb.StringProperty(indexed=False)
    image_url = ndb.StringProperty(indexed=False)
    image_width = ndb.IntegerProperty(indexed=False)
    image_height = ndb.IntegerProperty(indexed=False)

Is this some kind of known ndb performance issue? Something to do with deserialization cost? Or is it a memcache issue?

I found that if instead of fetching 500 objects, I instead aggregate all the data into a single blob, my function runs in 20ms instead of >1500ms:

# Get set of keys for items. This runs very quickly.
item_keys = memcache.get(items_memcache_key)
# Get individual item data.
# If we get all the data from memcache as a single blob it is very fast (~20ms).
item_data = memcache.get(items_data_key)
if not item_data:
    items = ndb.get_multi(item_keys)
    flat_data = json.dumps([{'name': item.name} for item in items])
    memcache.add(items_data_key, flat_data)
else:
    # Cache hit: decode the aggregated blob (plain dicts, not ndb entities).
    items = json.loads(item_data)

This is interesting, but isn't really a solution for me since the set of items I need to fetch isn't static.

Is the performance I'm seeing typical/expected? All these measurements are on the default App Engine production config (F1 instance, shared memcache). Is it deserialization cost? Or due to fetching multiple keys from memcache maybe?

I don't think the issue is instance ramp-up time. I profiled the code line by line using time.clock() calls and I see roughly similar numbers (3x faster than what I see in AppStats, but still very slow). Here's a typical profile:

# Fetch keys: 20 ms
# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms
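
For reference, a minimal sketch of that kind of time.clock() profiling is below. The timed helper and the logging call are mine, not from the original post; items_memcache_key is assumed to be defined as in the snippets above.

import logging
import time

from google.appengine.api import memcache
from google.appengine.ext import ndb

def timed(label, fn):
    # Run fn() and log its time.clock() duration in ms (Python 2 runtime).
    start = time.clock()
    result = fn()
    logging.info('%s: %.1f ms', label, (time.clock() - start) * 1000.0)
    return result

item_keys = timed('Fetch keys', lambda: memcache.get(items_memcache_key))
items = timed('ndb.get_multi', lambda: ndb.get_multi(item_keys))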


Update: Out of interest I also profiled this with all the App Engine performance settings increased to maximum (F4 instance, 2400MHz, dedicated memcache). The performance wasn't much better. On the faster instance the App Stats timings now match my time.clock() profile (so 500ms to fetch 500 small objects instead of 1500ms). However, it still seems extremely slow.

Solution

I investigated this in a bit of detail, and the problem is ndb and Python, not memcache. The reason things are so incredibly slow is partly deserialization (explains about 30% of the time), and the rest seems to be overhead in ndb's task queue implementation.

This means that, if you really want to, you can avoid ndb and instead fetch and deserialize from memcache directly. In my test case with 500 small entities, this gives a massive 2.5x speedup (650ms vs 1600ms on an F1 instance in production, or 200ms vs 500ms on an F4 instance). This gist shows how to do it: https://gist.github.com/mcummins/600fa8852b4741fb2bb1
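
The gist's contents aren't reproduced here, but the core idea is roughly the sketch below. It leans on undocumented ndb internals of that SDK era, so treat it as an illustration rather than a recipe: the 'NDB9:' memcache key prefix, the serialized-EntityProto payload, and the _kind_map/_from_pb helpers are implementation details that could differ, and fast_get_multi is my name, not the gist's.

from google.appengine.api import memcache
from google.appengine.datastore import entity_pb
from google.appengine.ext import ndb

def fast_get_multi(keys):
    # Read entities straight out of ndb's memcache entries, skipping
    # ndb's tasklet/future machinery.
    # Assumption: ndb caches each entity under 'NDB9:' + urlsafe key,
    # as an EntityProto serialized without its key (internal detail).
    mc_keys = ['NDB9:' + k.urlsafe() for k in keys]
    blobs = memcache.get_multi(mc_keys)
    entities, misses = [], []
    for k, mc_key in zip(keys, mc_keys):
        blob = blobs.get(mc_key)
        if not isinstance(blob, str):
            misses.append(k)  # absent, or ndb's in-progress lock marker
            continue
        pb = entity_pb.EntityProto()
        pb.MergePartialFromString(blob)
        cls = ndb.Model._kind_map[k.kind()]  # model class for this kind
        entity = cls._from_pb(pb)
        entity._key = k  # re-attach the key the cached proto lacks
        entities.append(entity)
    # Fall back to the normal (slow but correct) path for any misses.
    entities.extend(e for e in ndb.get_multi(misses) if e is not None)
    return entities

With this, items = fast_get_multi(item_keys) replaces the ndb.get_multi call, and the fallback covers memcache misses, which is the "partly reimplementing ndb" trade-off discussed below.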

Here is the appstats output for the manual memcache fetch and deserialization:

[App Stats screenshot: manual memcache fetch and deserialization]

Now compare this to fetching exactly the same entities using ndb.get_multi(keys):

[App Stats screenshot: ndb.get_multi fetch]

Almost 3x difference!!

Profiling each step is shown below. Note the timings don't match appstats because they're running on an F1 instance, so real time is 3x clock time.

Manual version:

# memcache.get_multi: 50.0 ms
# Deserialization:  140.0 ms
# Number of keys is 521, fetch time per key is 0.364683301344 ms

vs ndb version:

# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms

So ndb takes 1ms per entity fetched, even if the entity has one single property and is in memcache. That's on an F4 instance. On an F1 instance it takes 3ms. This is a serious practical limitation: if you want to maintain reasonable latency, you can't fetch more than ~100 entities of any kind when handling a user request on an F1 instance.

Clearly ndb is doing something really expensive and (at least in this case) unnecessary. I think it has something to do with its task queue and all the futures it sets up. Whether it is worth going around ndb and doing things manually depends on your app. If you have some memcache misses then you will have to go do the datastore fetches. So you essentially end up partly reimplementing ndb. However, since ndb seems to have such massive overhead, this may be worth doing. At least it seems so based on my use case of a lot of get_multi calls for small objects, with a high expected memcache hit rate.

It also seems to suggest that if Google were to implement some key bits of ndb and/or deserialization as C modules, Python App Engine could be massively faster.

That wraps up this look at the App Engine memcache / ndb.get_multi performance problem; hopefully the answer above helps.
