http://www.cnblogs.com/me-sa/archive/2012/01/10/erlang0030.html

Supervisors are used to build an hierarchical process structure called a supervision tree, a nice way to structure a fault tolerant application.
                                                                                                                                                                                     --Erlang/OTP Doc

Supervisor的基本思想就是通过建立层级结构实现错误隔离和管理,具体方法是通过重启的方式保持子进程一直活着.如果supervisor是进程树的一部分,它会被它的supervisor自动终止,当它的supervisor让它shutdown的时候,它会按照子进程启动顺序的逆序终止其所有的子进程,最后终止掉自己.重启的目的是让系统回归到一个稳定的状态,回归稳定状态后再出现异常可以进行重试,如果初始化都不稳定,后续的监控-重启策略意义不大.换句话说,Application初始化的阶段要有可靠性的保障,初始化阶段可能读取配置文件或者从数据库加载恢复数据,哪怕执行时间长一点都等待同步执行完.如果application依赖非本地数据库或外部服务就可以采取更快的异步启动,因为这种服务在正常使用过程中也经常出状况,早一点还是晚一点启动没有什么关系.

[Erlang 0025]理解Erlang/OTP - Application以log4erl项目为学习了Erlang/OTP application,我们说到application在start的方法中启动了log4erl的顶层监控树.今天我们继续跟进,看log4erl的监控树是怎么构建起来的,并做实验看supervisor如何通过重启恢复服务的.使用application:start(log4erl).启动起来之后的进程树:

理解Erlang/OTP Supervisor-LMLPHP

下面是log4erl_sup文件的start_link方法,supervisor:start_link方法的执行是同步的,直到所有的子进程都启动了才会返回. supervisor:start_link会使用回调函数init/1.

理解Erlang/OTP Supervisor-LMLPHP
start_link(Default_logger) ->
R = supervisor:start_link({local, ?MODULE}, ?MODULE, []),
%log4erl:start_link(Default_logger),
add_logger(Default_logger),
?LOG2("Result in supervisor is ~p~n",[R]),
R. %%回调的方法init/1
init([]) ->
{ok,
{
{one_for_one,3,10},
[]
}
}.
理解Erlang/OTP Supervisor-LMLPHP
  log4erl的顶层监控树的初始化相当简单仅仅定义了重启策略(RestartStrategy)和最大重启频率(maximum restart frequency):{one_for_one,3,10}.
 {one_for_one,3,10}表达的语义是{How, Max, Within}:在多长时间内(Within)重启了几次(Max),如何重启(HOW 重启策略);设计最大重启频率是为了避免反复重启进入死循环,一旦超出了此阈值,supervisor进程会结束掉自己以及它所有的子进程,并通过进程树传递退出消息,更上层的supervisor就会采取适当的措施,要么重启终止的supervisor要么自己也终止掉.可能比较纠结这几个值怎么配置,多数资料上都会告诉你"如何配置完全取决于你的应用程序".这个还是有经验值的,生成环境的经验值是一小时内重启4次,也可以参考一些和你应用类似的开源项目看看它们是怎么配置的.如果填写的是{one_for_one,0,1}就是不允许重启,下面的示例中可以看到YAWS项目采用了这样的策略.
 
下面几个开源项目顶层supervisor的init方法:
理解Erlang/OTP Supervisor-LMLPHP
%%rabbit_sup.erl  来自大名鼎鼎的rabbitmq
init([]) ->
{ok, {{one_for_all, 0, 1}, []}}. %%yaws_sup.erl Yaws项目 - Yet Another Web Server
init([]) -> ChildSpecs = child_specs(),
%% 0, 1 means that we never want supervisor restarts
{ok,{{one_for_all, 0, 1}, ChildSpecs}}. %%ejabberd_sup ejabberd项目
init([]) ->
Hooks =
{ejabberd_hooks,
{ejabberd_hooks, start_link, []},
%%......................... 省略代码
{ok, {{one_for_one, 10, 1},
[Hooks,
GlobalRouter,
Cluster,
..................
Listener]}}.
理解Erlang/OTP Supervisor-LMLPHP
 
重启策略
 
 one_for_one : 把子进程当成各自独立的,一个进程出现问题其它进程不会受到崩溃的进程的影响.该子进程死掉,只有这个进程会被重启
 one_for_all : 如果子进程终止,所有其它子进程也都会被终止,然后所有进程都会被重启.
 rest_for_one:如果一个子进程终止,在这个进程启动之后启动的进程都会被终止掉.然后终止掉的进程和连带关闭的进程都会被重启.
 simple_one_for_one 是one_for_one的简化版 ,所有子进程都动态添加同一种进程的实例

one-for-one维护了一个按照启动顺序排序的子进程列表,而simple_one_for_one 由于所有的子进程都是同样的(相同的MFA),使用的是字典来维护子进程信息;

在log4erl_sup.erl的start_link中启动了顶层supervisor之后,添加了一个默认的logger: add_logger(Default_logger),
理解Erlang/OTP Supervisor-LMLPHP
add_logger(Name) when is_atom(Name) ->
N = atom_to_list(Name),
add_logger(N);
add_logger(Name) when is_list(Name) ->
C1 = {Name,
{log_manager, start_link ,[Name]},
permanent,
10000,
worker,
[log_manager]}, ?LOG2("Adding ~p to ~p~n",[C1, ?MODULE]),
supervisor:start_child(?MODULE, C1).
理解Erlang/OTP Supervisor-LMLPHP
添加的logger是log4erl_sup的子进程,子进程启动和监控的方式通过child specification来指定.
 C1 =  {Name,{log_manager, start_link ,[Name]},permanent,10000,worker,[log_manager]}
 
C1的六个数据项分别为: {ID, StartEntery, Restart, Shutdown, Type, Modules}:
ID :supervisor 用来在内部区分specification的,所以只要子进程规格说明之间不重复就可以.
Start : 启动参数{M,F,A}
Restart : 这个进程遇到错误之后是否重启
                permanent:遇到任何错误导致进程终止就会重启
                temporary:进程永远都不会被重启
                transient: 只有进程异常终止的时候会被重启
Shutdown 进程如何被干掉,这里是使用整型值2000的意思是,进程在被强制干掉之前有2000毫秒的时间料理后事自行终止.
              实际过程是supervisor给子进程发送一个exit(Pid,shutdown)然后等待exit信号返回,在指定时间没有返回则将子进程使用exit(Child,kill)
             这里的参数还有 brutal_kill 意思是进程马上就会被干掉
             infinity :当一个子进程是supervisor那么就要用infinity,意思是给supervisor足够的时间进行重启.
Type 这里只有两个值:supervisor worker ; 只要没有实现supervisor behavior的进程都是worker;
                    可以通过supervisor的层级结构来精细化对进程的控制.这个值主要作用是告知监控进程它的子进程是supervisor还是worker
Modules 是进程依赖的模块,这个信息只有在代码热更新的时候才会被用到:标注了哪些模块需要按照什么顺序进行更新;通常这里只需要列出进程依赖的主模块. 如果子进程是supervisor gen_server gen_fsm Module名是回调模块的名称,这时Modules的值是只有一个元素的列表,元素就是回调模块的名称;如果子进程是gen_event Modules的值是 dynamic;关于dynamic参数余锋有一篇专门的分析:Erlang supervisor规格的dynamic行为分析 http://blog.yufeng.info/archives/1455
 
 
实际应用中log4erl中的logger会根据业务逻辑添加多个,我们也不是直接通过application:start(log4erl).而是调用  log4erl:conf(log4erl.conf)这个方法简单的封装了内部逻辑,实际调用的是 log4erl_conf:conf(File).我们这里定义一个简单的log4erl.conf文件,使用log4erl:conf(log4erl.conf).启动之后,我们看看它的进程树是什么样的:
对应的进程树是这样的,进程之间的红线表示link关系:

理解Erlang/OTP Supervisor-LMLPHP

我们沿着调用关系,逐步跟进代码:

理解Erlang/OTP Supervisor-LMLPHP
%==== File : log4erl_conf =======

%%log4erl_conf:conf(File).
conf(File) ->
application:start(log4erl), %%启动log4erl
Tree = parse(leex(File)), %%解析配置文件
traverse(Tree). %%遍历配置项构造监控树 %%跟进遍历的逻辑,对于每一条配置执行的是element/1方法
traverse([]) ->
ok;
traverse([H|Tree]) ->
element(H),
traverse(Tree). %%对于我们自定义的logger走的是{logger, Logger, Appenders}逻辑
element({cutoff_level, CutoffLevel}) ->
log_filter_codegen:set_cutoff_level(CutoffLevel);
element({default_logger, Appenders}) ->
appenders(Appenders);
element({logger, Logger, Appenders}) ->
log4erl:add_logger(Logger),
appenders(Logger, Appenders). %==== File : log4erl =======
%%继续跟进我们走到log4erl:add_logger/1
add_logger(Logger) ->
try_msg({add_logger, Logger}). %%try_msg 是的添加了异常捕获的通用方法
try_msg(Msg) ->
try
handle_call(Msg)
catch
exit:{noproc, _M} ->
io:format("log4erl has not been initialized yet. To do so, please run~n"),
io:format("> application:start(log4erl).~n"),
{error, log4erl_not_started};
E:M ->
?LOG2("Error message received by log4erl is ~p:~p~n",[E, M]),
{E, M}
end. %%handle_call的代码片段
handle_call({add_logger, Logger}) ->
log_manager:add_logger(Logger); %==== File : log_manager =======
%%逻辑转到log_manager的add_logger(Logger)
%%最终调用的是log4erl_sup:add_logger(Logger).这个我们上面已经分析过了
add_logger(Logger) ->
log4erl_sup:add_logger(Logger). %%element方法在添加loger之后会添加appender
appenders([]) ->
ok;
appenders([H|Apps]) ->
appender(H),
appenders(Apps). appenders(_, []) ->
ok;
appenders(Logger, [H|Apps]) ->
appender(Logger, H),
appenders(Logger, Apps). appender({appender, App, Name, Conf}) ->
log4erl:add_appender({App, Name}, {conf, Conf}). appender(Logger, {appender, App, Name, Conf}) ->
log4erl:add_appender(Logger, {App, Name}, {conf, Conf}). %==== File : log4erl =======
%% Appender = {Appender, Name}
add_appender(Logger, Appender, Conf) ->
try_msg({add_appender, Logger, Appender, Conf}). handle_call({add_appender, Logger, Appender, Conf}) ->
log_manager:add_appender(Logger, Appender, Conf); %==== File : log_manager =======
add_appender(Logger, {Appender, Name} , Conf) ->
?LOG2("add_appender ~p with name ~p to ~p with Conf ~p ~n",[Appender, Name, Logger, Conf]),
log4erl_sup:add_guard(Logger, Appender, Name, Conf). %==== File : log4erl_sup =======
add_guard(Logger, Appender, Name, Conf) ->
C = {Name,
{logger_guard, start_link ,[Logger, Appender, Name, Conf]},
permanent,
10000,
worker,
[logger_guard]},
?LOG2("Adding ~p to ~p~n",[C, ?MODULE]),
supervisor:start_child(?MODULE, C). %==== File : logger_guard =======
start_link(Logger, Appender, Name, Conf) ->
%?LOG2("starting guard for logger ~p~n",[Logger]),
{ok, Pid} = gen_server:start_link(?MODULE, [Appender, Name], []),
case add_sup_handler(Pid, Logger, Conf) of
{error, E} ->
gen_server:call(Pid, stop),
{error, E};
_R ->
{ok, Pid}
end. add_sup_handler(G_pid, Logger, Conf) ->
?LOG("add_sup()~n"),
gen_server:call(G_pid, {add_sup_handler, Logger, Conf}). handle_call({add_sup_handler, Logger, Conf}, _From, [{appender, Appender, Name}] = State) ->
?LOG2("Adding handler ~p with name ~p for ~p From ~p~n",[Appender, Name, Logger, _From]),
try
Res = gen_event:add_sup_handler(Logger, {Appender, Name}, Conf),
{reply, Res, State}
catch
E:R ->
{reply, {error, {E,R}}, State}
end;
理解Erlang/OTP Supervisor-LMLPHP

gen_event:add_sup_handler会建立EventManager与Event Handler之间的link的关系,所以我们修改一下,注释掉这段,看看监控树是什么样子:

add_sup_handler(G_pid, Logger, Conf) ->

%    ?LOG("add_sup()~n"),
%    gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).
  ok.

注释掉之后可以看到logger和guard之间的link关系就不存在了.

理解Erlang/OTP Supervisor-LMLPHP

 
kill进程的实验
 
后面我们会用各种情况杀掉进程,看这个进程树对异常的处理情况;我们的实验步骤:
1.发送退出消息Reason:some_reason给default_logger
2.发送退出消息Reason:kill 给default_logger
3.发送退出消息Reason:some_reason给logger_guard
4.发送退出消息Reason:some_reason给log4erl_sup
5.发送退出消息Reason:kill 给log4erl_sup
理解Erlang/OTP Supervisor-LMLPHP
3> whereis(default_logger).
<0.45.0>
4> exit(whereis(default_logger),some_reason).
true
5> whereis(default_logger).
<0.45.0>
6> exit(whereis(default_logger),some_reason). %%由于gen_event默认process_flag(trap_exit, true),所以some_reason的退出消息并没有把它干掉
true
7> whereis(default_logger).
<0.45.0>
8> exit(whereis(default_logger),kill). %%向进程发送强制退出消息,
true =SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === %首先能够看到log4erl报出的子进程终止的报告
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.45.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}] =PROGRESS REPORT==== 10-Jan-2012::10:35:21 === %log4erl_sup重建default_logger,新进程pid是<0.69.0>
supervisor: {local,log4erl_sup}
started: [{pid,<0.69.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}] =SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === %default_logger退出消息转变成为killed继续广播给link的进程,对应的logger_guard终止
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.46.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf, [{dir,"./log"},{level,debug},{file,default_log},{type,size},
{max,1000000},{suffix,log}, {rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}] =PROGRESS REPORT==== 10-Jan-2012::10:35:21 === %logger_guard 重建
supervisor: {local,log4erl_sup}
started: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug}, {file,default_log},{type,size},
{max,1000000}, {suffix,log},{rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]
9> whereis(default_logger).
<0.69.0>
10> is_process_alive(pid(0,70,0)). %这是新启动的logger_guard进程
true
11> exit(pid(0,70,0),some_reason). %向进程发送一个退出消息
true =SUPERVISOR REPORT==== 10-Jan-2012::11:07:51 ===
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: some_reason
Offender: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}] 12>
=PROGRESS REPORT==== 10-Jan-2012::11:07:51 ===
supervisor: {local,log4erl_sup}
started: [{pid,<0.76.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},{child_type,worker}]
12> is_process_alive(pid(0,70,0)).
false
13> whereis(default_logger). %退出消息广播对default_logger没有影响
<0.69.0>
14> whereis(log4erl_sup).
<0.44.0>
15> exit(whereis(log4erl_sup),some_reason). % Supervisor 初始化的时候也会设置 process_flag(trap_exit, true),
true
16> whereis(log4erl_sup).
<0.44.0>
17> exit(whereis(log4erl_sup),kill). %杀掉log4erl_sup 应用程序停止
true =CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.69.0>
registered_name: default_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.76.0>,killed}]
links: [#Port<0.1891>,#Port<0.1885>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 720
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.47.0>
registered_name: mail_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.48.0>,killed}]
links: [#Port<0.546>]
dictionary: []
trap_exit: true
status: running
heap_size: 377
stack_size: 24
reductions: 411
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:25 ===
crasher:
initial call: application_master:init/4
pid: <0.42.0>
registered_name: []
exception exit: killed
in function application_master:terminate/2
ancestors: [<0.41.0>]
messages: []
links: [<0.6.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 1555
neighbours:
18>
=INFO REPORT==== 10-Jan-2012::13:26:25 ===
application: log4erl
exited: killed
type: temporary
18>
理解Erlang/OTP Supervisor-LMLPHP

最后再贴一次log4erl项目的地址: http://code.google.com/p/log4erl/,建议下载下来代码自己动手做一下上面的实验.

04-30 07:09