Is there a way to determine why an Azure App Service restarted?

Problem description

I have a bunch of websites running on a single instance of Azure App Service, and they're all set to Always On. They all suddenly restarted at the same time, causing everything to go slow for a few minutes as everything hit a cold request.

I would expect this if the service had moved me to a new host, but that didn't happen -- I'm still on the same hostname.

CPU and memory usage were normal at the time of the restart, and I didn't initiate any deployments or anything like that. I don't see an obvious reason for the restart.

Is there any logging anywhere that I can see to figure out why they all restarted? Or is this just a normal thing that App Service does from time to time?

Recommended answer

So, it seems the answer to this is "no, you can't really know why, you can just infer that it did."

I mean, you can add some Application Insights logging like

    private void Application_End()
    {
        log.Warn($"The application is shutting down because of '{HostingEnvironment.ShutdownReason}'.");

        TelemetryConfiguration.Active.TelemetryChannel.Flush();

        // Server Channel flush is async, wait a little while and hope for the best
        Thread.Sleep(TimeSpan.FromSeconds(2));
    }

and you will end up with "The application is shutting down because of 'ConfigurationChange'." or "The application is shutting down because of 'HostingEnvironment'.", but it doesn't really tell you what's going on at the host level.
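If those log entries are flowing into Application Insights, you can pull the shutdown messages back out with a quick Analytics query. This is a sketch; the `contains` filter has to match whatever message text your logger actually writes:

```
traces
| where message contains "The application is shutting down"
| project timestamp, message, cloud_RoleName
| order by timestamp desc
```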

What I needed to accept is that App Service is going to restart things from time to time, and ask myself why I cared. App Service is supposed to be smart enough to wait for the application pool to be warmed up before sending requests to it (like overlapped recycling). Yet, my apps would sit there CPU-crunching for 1-2 minutes after a recycle.

It took me a while to figure out, but the culprit was that all of my apps have a rewrite rule to redirect from HTTP to HTTPS. This does not work with the Application Initialization module: it sends a request to the root, all it gets is a 301 redirect from the URL Rewrite module, the ASP.NET pipeline isn't hit at all, and the hard work isn't actually done. App Service/IIS then thinks the worker process is ready and sends traffic to it. But the first "real" request actually follows the 301 redirect to the HTTPS URL, and bam! that user hits the pain of a cold start.

I added a rewrite rule described here to exempt the Application Initialization module from needing HTTPS, so when it hits the root of the site, it will actually trigger the page load and thus the whole pipeline:

<rewrite>
  <rules>
    <clear />
    <rule name="Do not force HTTPS for application initialization" enabled="true" stopProcessing="true">
      <match url="(.*)" />
      <conditions>
        <add input="{HTTP_HOST}" pattern="localhost" />
        <add input="{HTTP_USER_AGENT}" pattern="Initialization" />
      </conditions>
      <action type="Rewrite" url="{URL}" />
    </rule>
    <rule name="Force HTTPS" enabled="true" stopProcessing="true">
      <match url="(.*)" ignoreCase="false" />
      <conditions>
        <add input="{HTTPS}" pattern="off" />
      </conditions>
      <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" appendQueryString="true" redirectType="Permanent" />
    </rule>
  </rules>
</rewrite>

It's one of many entries in a diary of moving old apps into Azure -- turns out there are a lot of things you can get away with when something's running on a traditional VM that seldom restarts, but it'll need some TLC to work out the kinks when migrating to our brave new world in the cloud....

--

UPDATE 10/27/2017: Since this writing, Azure has added a new tool under "Diagnose and solve problems". Click "Web App Restarted", and it'll tell you the reason, usually because of storage latency or infrastructure upgrades. The above still stands though, in that when moving to Azure App Service, the best way forward is you really just have to coax your app into being comfortable with random restarts.

--

UPDATE 2/11/2018: After migrating several legacy systems to a single instance of a medium App Service Plan (with plenty of CPU and memory overhead), I was having a vexing problem where my deployments from staging slots would go seamlessly, but whenever I'd get booted to a new host because of Azure infrastructure maintenance, everything would go haywire with downtime of 2-3 minutes. I was driving myself nuts trying to figure out why this was happening, because App Service is supposed to wait until it receives a successful response from your app before booting you to the new host.

I was so frustrated by this that I was ready to classify App Service as enterprise garbage and go back to IaaS virtual machines.

It turned out to be multiple issues, and I suspect others will come across them while porting their own beastly legacy ASP.NET apps to App Service, so I thought I'd run through them all here.

The first thing to check is that you're actually doing real work in your Application_Start. For example, I'm using NHibernate, which while good at many things is quite a pig at loading its configuration, so I make sure to actually create the SessionFactory during Application_Start to make sure that the hard work is done.
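As a sketch of the idea (the names here are illustrative, not from my actual app), the point is to pay the expensive cost eagerly during warmup rather than on the first real request:

```csharp
protected void Application_Start()
{
    // Eagerly build the NHibernate SessionFactory so the expensive
    // configuration parsing and mapping compilation happen during the
    // warmup request, not on the first user-facing request.
    // SessionFactoryHolder is a stand-in for whatever singleton your
    // app uses to hold the factory.
    var configuration = new NHibernate.Cfg.Configuration().Configure();
    SessionFactoryHolder.Factory = configuration.BuildSessionFactory();
}
```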

The second thing to check, as mentioned above, is that you don't have a rewrite rule for SSL that interferes with App Service's warmup check. You can exclude the warmup checks from your rewrite rule as mentioned above. Or, in the time since I originally wrote that workaround, App Service has added an HTTPS Only flag that allows you to do the HTTPS redirect at the load balancer instead of within your web.config file. Since it's handled at a layer of indirection above your application code, you don't have to think about it, so I would recommend the HTTPS Only flag as the way to go.
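Besides the portal toggle, the flag can also be set from the Azure CLI (substitute your own resource group and app name):

```
az webapp update --resource-group YOUR-RESOURCE-GROUP --name YOUR-APP-NAME --set httpsOnly=true
```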

The third thing to consider is whether or not you're using the App Service Local Cache option. In brief, this is an option where App Service copies your app's files to the local storage of the instance it's running on rather than serving them off of a network share, and it's a great option to enable if your app doesn't care about losing changes written to the local filesystem. It speeds up I/O performance (which is important because, remember, App Service runs on potatoes) and eliminates restarts caused by maintenance on the network share. But there is a specific subtlety regarding App Service's infrastructure upgrades that is poorly documented and that you need to be aware of. Specifically, the Local Cache is initiated in the background in a separate app domain after the first request, and you're switched over to that app domain when the local cache is ready. That means App Service will hit a warmup request against your site, get a successful response, and point traffic to that instance, but (whoops!) now Local Cache is grinding I/O in the background, and if you have a lot of sites on that instance, you've ground to a halt because App Service I/O is horrendous. If you don't know this is happening, it looks spooky in the logs, because it's as if your app is starting up twice on the same instance (because it is). The solution is to follow this Jet blog post and create an application initialization warmup page that monitors the environment variable that tells you when the Local Cache is ready. This way, you can force App Service to delay booting you to the new instance until the Local Cache is fully prepped. Here's one that I use to make sure I can talk to the database, too:

// Requires System, System.Net, System.Web, NHibernate, and Newtonsoft.Json.
// The ISession property is expected to be injected by your IoC container.
public class WarmupHandler : IHttpHandler
{
    public bool IsReusable
    {
        get
        {
            return false;
        }
    }

    public ISession Session
    {
        get;
        set;
    }

    public void ProcessRequest(HttpContext context)
    {
        if (context == null)
        {
            throw new ArgumentNullException("context");
        }

        var request = context.Request;
        var response = context.Response;

        var localCacheVariable = Environment.GetEnvironmentVariable("WEBSITE_LOCAL_CACHE_OPTION");
        var localCacheReadyVariable = Environment.GetEnvironmentVariable("WEBSITE_LOCALCACHE_READY");
        var databaseReady = true;

        try
        {
            using (var transaction = this.Session.BeginTransaction())
            {
                var query = this.Session.QueryOver<User>()
                    .Take(1)
                    .SingleOrDefault<User>();
                transaction.Commit();
            }
        }
        catch
        {
            databaseReady = false;
        }

        var result = new
        {
            databaseReady,
            machineName = Environment.MachineName,
            localCacheEnabled = "Always".Equals(localCacheVariable, StringComparison.OrdinalIgnoreCase),
            localCacheReady = "True".Equals(localCacheReadyVariable, StringComparison.OrdinalIgnoreCase),
        };

        response.ContentType = "application/json";

        var warm = result.databaseReady && (!result.localCacheEnabled || result.localCacheReady);

        response.StatusCode = warm ? (int)HttpStatusCode.OK : (int)HttpStatusCode.ServiceUnavailable;

        var serializer = new JsonSerializer();
        serializer.Serialize(response.Output, result);
    }
}

Also remember to map a route and add the application initialization section to your web.config:

<applicationInitialization doAppInitAfterRestart="true">
  <add initializationPage="/warmup" />
</applicationInitialization>
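How you map the route depends on your setup; for a plain IHttpHandler like the one above, a web.config handler entry along these lines works (the type name is whatever namespace and assembly your WarmupHandler actually lives in):

```xml
<system.webServer>
  <handlers>
    <add name="WarmupHandler" verb="GET" path="warmup" type="YourApp.WarmupHandler, YourApp" />
  </handlers>
</system.webServer>
```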

The fourth thing to consider is that sometimes App Service will restart your app for seemingly garbage reasons. It seems that setting the fcnMode property to Disabled can help; it prevents the runtime from restarting your app if someone diddles with configuration files or code on the server. If you're using staging slots and doing deployments that way, this shouldn't bother you. But if you expect to be able to FTP in and diddle with a file and see that change reflected in production, then don't use this option:

<httpRuntime fcnMode="Disabled" targetFramework="4.5" />

The fifth thing to consider, and this was primarily my problem all along, is whether or not you are using staging slots with the AlwaysOn option enabled. The AlwaysOn option works by pinging your site every minute or so to make sure it's warm so that IIS doesn't spin it down. Inexplicably, this isn't a sticky setting, so you may have turned on AlwaysOn on both your production and staging slots so you don't have to mess with it every time. This causes a problem with App Service infrastructure upgrades when they boot you to a new host. Here's what happens: let's say you have 7 sites hosted on an instance, each with its own staging slot, everything with AlwaysOn enabled. App Service does the warmup and application initialization to your 7 production slots and dutifully waits for them to respond successfully before redirecting traffic over. But it doesn't do this for the staging slots. So it directs traffic over to the new instance, but then AlwaysOn kicks in 1-2 minutes later on the staging slots, so now you have 7 more sites starting up at the same time. Remember, App Service runs on potatoes, so all this additional I/O happening at the same time is going to destroy the performance of your production slots and will be perceived as downtime.

The solution is to keep AlwaysOn off on your staging slots so you don't get nailed by this simultaneous I/O frenzy after an infrastructure update. If you are using a swap script via PowerShell, maintaining this "Off in staging, On in production" is surprisingly verbose to do:

Login-AzureRmAccount -SubscriptionId {{ YOUR_SUBSCRIPTION_ID }}

$resourceGroupName = "YOUR-RESOURCE-GROUP"
$appName = "YOUR-APP-NAME"
$slotName = "YOUR-SLOT-NAME-FOR-EXAMPLE-STAGING"

$props = @{ siteConfig = @{ alwaysOn = $true; } }

Set-AzureRmResource `
    -PropertyObject $props `
    -ResourceType "microsoft.web/sites/slots" `
    -ResourceGroupName $resourceGroupName `
    -ResourceName "$appName/$slotName" `
    -ApiVersion 2015-08-01 `
    -Force

Swap-AzureRmWebAppSlot `
    -SourceSlotName $slotName `
    -ResourceGroupName $resourceGroupName `
    -Name $appName

$props = @{ siteConfig = @{ alwaysOn = $false; } }

Set-AzureRmResource `
    -PropertyObject $props `
    -ResourceType "microsoft.web/sites/slots" `
    -ResourceGroupName $resourceGroupName `
    -ResourceName "$appName/$slotName" `
    -ApiVersion 2015-08-01 `
    -Force

This script sets the staging slot to have AlwaysOn turned on, does the swap so that staging is now production, then sets the staging slot to have AlwaysOn turned off, so it doesn't blow things up after an infrastructure upgrade.

Once you get this working, it is indeed nice to have a PaaS that handles security updates and hardware failures for you. But it's a little bit more difficult to achieve in practice than the marketing materials might suggest. Hope this helps someone.

--

UPDATE 07/17/2020: In the blurb above, I talk about needing to diddle with "AlwaysOn" if you're using staging slots, because it would swap along with the slots, and having it on all slots can cause performance issues. At some point (when exactly isn't clear to me), they seem to have fixed this so that "AlwaysOn" isn't swapped. My script actually still does the diddling with AlwaysOn, but in effect it ends up being a no-op now. So the advice to keep AlwaysOn off for your staging slots still stands, but you shouldn't have to do this little juggle in a script anymore.
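With AlwaysOn no longer swapping, the deployment script from above can collapse down to just the swap itself:

```powershell
Swap-AzureRmWebAppSlot `
    -SourceSlotName $slotName `
    -ResourceGroupName $resourceGroupName `
    -Name $appName
```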
