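The symptom the customer asked about:
While Slony-I was running, a FATAL message appeared in the log:
FATAL storeListen: unknown node ID
After this error, the subsequent log entries showed slon running normally again.
The customer's question: how should this error message be interpreted, and is the behavior by design?
In other words, is this a bug or not?
Whether it is by design cannot be known from the outside; only the vendor can answer that. My approach was to analyze the source code first: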
/* ----------
 * SlonWatchdog
 * ----------
 */
static void
SlonWatchdog(void)
{
    …
    slon_log(SLON_INFO, "slon: watchdog process started\n");
    slon_log(SLON_CONFIG, "slon: watchdog ready - pid = %d\n", slon_watchdog_pid);
    slon_worker_pid = fork();
    if (slon_worker_pid == 0)
    {
        SlonMain();
        exit(-1);
    }
    …
    if (install_signal_handler(SIGUSR1, sighandler) == SIG_ERR)
    {
        slon_log(SLON_FATAL, "slon: SIGUSR1 signal handler setup failed -(%d) %s\n", errno, strerror(errno));
        slon_exit(-1);
    }
    …
    slon_log(SLON_CONFIG, "slon: worker process created - pid = %d\n",
             slon_worker_pid);
    while (!shutdown)
    {
        while ((pid = wait(&child_status)) != slon_worker_pid)
        {
            …
        }
        …
        slon_log(SLON_CONFIG, "slon: child terminated %s: %d; pid: %d, current worker pid: %d\n",
                 termination_reason, return_code, pid, slon_worker_pid);
        switch (watchdog_status)
        {
            …
            case SLON_WATCHDOG_NORMAL:
            case SLON_WATCHDOG_RETRY:
                watchdog_status = SLON_WATCHDOG_RETRY;
                if (child_status != 0)
                {
                    slon_log(SLON_CONFIG, "slon: restart of worker in 10 seconds\n");
                    (void) sleep(10);
                }
                else
                {
                    slon_log(SLON_CONFIG, "slon: restart of worker\n");
                }
                if (watchdog_status == SLON_WATCHDOG_RETRY)
                {
                    slon_worker_pid = fork();
                    if (slon_worker_pid == 0)
                    {
                        worker_restarted = 1;
                        SlonMain();
                        exit(-1);
                    }
                    …
                    watchdog_status = SLON_WATCHDOG_NORMAL;
                    continue;
                }
                break;
            default:
                shutdown = 1;
                break;
        } /* switch */
    } /* while */
    …
}
/* ----------
 * SlonMain
 * ----------
 */
static void
SlonMain(void)
{
    …
    for (i = 0, n = PQntuples(res); i < n; i++)
    {
        …
        rtcfg_storePath(pa_server, pa_conninfo, pa_connretry);
    }
    PQclear(res);
    …
}
/* ----------
 * rtcfg_storePath
 * ----------
 */
void
rtcfg_storePath(int pa_server, char *pa_conninfo, int pa_connretry)
{
    …
    /*
     * Store the (new) conninfo to the node
     */
    slon_log(SLON_CONFIG, "storePath: pa_server=%d pa_client=%d pa_conninfo=\"%s\" pa_connretry=%d\n",
             pa_server, rtcfg_nodeid, pa_conninfo, pa_connretry);
    …
    /*
     * Eventually start communicating with that node
     */
    rtcfg_startStopNodeThread(node);
}
/* ----------
 * rtcfg_startStopNodeThread
 * ----------
 */
static void
rtcfg_startStopNodeThread(SlonNode * node)
{
    …
    if (sched_get_status() == SCHED_STATUS_OK && node->no_active)
    {
        /*
         * Make sure the node worker exists
         */
        switch (node->worker_status)
        {
            case SLON_TSTAT_NONE:
                if (pthread_create(&(node->worker_thread), NULL, remoteWorkerThread_main, (void *) node) < 0)
                {
                    …
                }
                node->worker_status = SLON_TSTAT_RUNNING;
                break;
            …
        }
    }
    …
}
/* ----------
 * slon_remoteWorkerThread
 *
 * Listen for events on the local database connection. This means, events
 * generated by the local node only.
 * ----------
 */
void *
remoteWorkerThread_main(void *cdata)
{
    …
    while (true)
    {
        …
        else /* not SYNC */
        {
            …
            else if (strcmp(event->ev_type, "STORE_LISTEN") == 0)
            {
                …
                if (li_receiver == rtcfg_nodeid)
                    rtcfg_storeListen(li_origin, li_provider);
                …
            }
            …
        }
        …
    }
    …
}
/* ----------
 * rtcfg_storeListen
 * ----------
 */
void
rtcfg_storeListen(int li_origin, int li_provider)
{
    …
    node = rtcfg_findNode(li_provider);
    if (!node)
    {
        slon_log(SLON_FATAL, "storeListen: unknown node ID %d\n", li_provider);
        slon_retry();
        return;
    }
    …
}
#define slon_retry() \
do { \
    pthread_mutex_lock(&slon_watchdog_lock); \
    if (slon_watchdog_pid >= 0) { \
        slon_log(SLON_DEBUG2, "slon_retry() from pid=%d\n", slon_pid); \
        (void) kill(slon_watchdog_pid, SIGUSR1); \
        slon_watchdog_pid = -1; \
    } \
    pthread_mutex_unlock(&slon_watchdog_lock); \
    pthread_exit(NULL); \
} while (0)
/* ----------
 * sighandler
 * ----------
 */
static void
sighandler(int signo)
{
    switch (signo)
    {
        …
        case SIGUSR1:
            watchdog_status = SLON_WATCHDOG_RETRY;
            slon_terminate_worker();
            break;
        …
    }
}
/* ----------
 * slon_terminate_worker
 * ----------
 */
void
slon_terminate_worker()
{
    (void) kill(slon_worker_pid, SIGKILL);
}
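The above is a condensed excerpt of the code.
In it:
The SlonWatchdog function creates a child process with fork.
In that child process, a thread is created through the call chain SlonMain --> rtcfg_storePath --> rtcfg_startStopNodeThread,
and that thread starts by running remoteWorkerThread_main.
When remoteWorkerThread_main handles a STORE_LISTEN event, it calls rtcfg_storeListen.
If the node lookup there fails (rtcfg_findNode returns no node for li_provider), the FATAL message is logged and slon_retry sends SIGUSR1 to the main process, the one running SlonWatchdog.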
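The worker-side path can be illustrated with a small single-process sketch. This is not Slony-I code: every `demo_*` name and the node table are hypothetical, the mutex and `pthread_exit` of the real `slon_retry` macro are left out, and the process plays the watchdog role itself so that the SIGUSR1 deliveries can be counted.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical node table; the real slon keeps its own runtime config. */
static const int demo_known_nodes[] = {1, 2};

static volatile sig_atomic_t demo_usr1_count = 0;  /* SIGUSR1 deliveries seen */
static pid_t demo_watchdog_pid = -1;               /* -1 means "already signalled" */

static void demo_count_usr1(int signo)
{
    (void) signo;
    demo_usr1_count++;
}

/* Returns 1 if the node ID is in the table, 0 otherwise. */
static int demo_find_node(int id)
{
    size_t i;

    for (i = 0; i < sizeof(demo_known_nodes) / sizeof(demo_known_nodes[0]); i++)
        if (demo_known_nodes[i] == id)
            return 1;
    return 0;
}

/* Same shape as slon_retry(): signal the watchdog once, then forget its pid. */
static void demo_retry(void)
{
    if (demo_watchdog_pid >= 0)
    {
        (void) kill(demo_watchdog_pid, SIGUSR1);
        demo_watchdog_pid = -1;
    }
}

/* Returns 0 on success, -1 when the provider node is unknown. */
int demo_store_listen(int li_provider)
{
    if (!demo_find_node(li_provider))
    {
        fprintf(stderr, "FATAL storeListen: unknown node ID %d\n", li_provider);
        demo_retry();
        return -1;
    }
    return 0;
}

/* Runs one known and two unknown lookups; returns how many SIGUSR1 arrived. */
int demo_count_retry_signals(void)
{
    demo_usr1_count = 0;
    signal(SIGUSR1, demo_count_usr1);
    demo_watchdog_pid = getpid();
    assert(demo_store_listen(1) == 0);   /* known node: no signal */
    assert(demo_store_listen(3) == -1);  /* unknown node: one signal */
    assert(demo_store_listen(4) == -1);  /* guard suppresses a second signal */
    return (int) demo_usr1_count;
}
```

Calling `demo_count_retry_signals()` returns 1: only the first failed lookup reaches the watchdog, because the pid is reset to -1 right after the signal is sent.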
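On the other side:
SlonWatchdog in the main process has already installed sighandler as the handler for SIGUSR1.
When SIGUSR1 arrives, sighandler kills the child process described above (slon_terminate_worker sends it SIGKILL).
The main process also sits in wait(), so it has logic ready for the moment the child is killed or dies on its own:
inside the while loop it calls fork again, the new child runs SlonMain, and the cycle starts over.
If rtcfg_storeListen fails again in the new worker, the worker dies again and the main process forks once more;
if it succeeds, the worker simply keeps running and the main process goes back to waiting, which matches the log returning to normal after the FATAL message.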