一、现状描述
11月25日晚上8点40分接到现场电话,说我们公司所有员工的im不能正常登陆,im数据库服务器(mysql数据库)的公司表中对应的我们公司名称被删除,需要处理。接到这个异常后,第一想到的是下午下班前,收到一封ogg的警告邮件说ogg进程终止,然后我登陆数据库查看的时候,发现ogg已经工作正常,所以也就没有太多关注。既然已经出现了这个问题,那么先解决问题,再分析原因。因为这些都是ogg从oracle端同步过来的一些数据,所以直接从oracle那边初始化一份过来,然后重设同步程序就可以了。

二、错误分析
1、出现这个问题,第一想到的就是binlog,因为公司表的数据是从我们oracle那边同步过去的,而且oracle那边没有任何关于这个表的删除操作,所以我定位这个表的delete操作

[mysql@ezgclient mysqllog]$ mysqlbinlog mysqlbin.000150>/tmp/11_25.txt[mysql@ezgclient mysqllog]$ vi /tmp/11_25.txt……# at 1396789# at 1396875#111125 16:15:31 server id 2 end_log_pos 1396875 Table_map: `a`.`abc` mapped to number 5304#111125 16:15:31 server id 2 end_log_pos 1397000 Delete_rows: table id 5304 flags: STMT_END_FBINLOG 'o07PThMCAAAAVgAAAItQFQAAALgUAAAAAAAABnNydGFpcgAPdGFiX3V1bV9jb21wYW55AA4DD/b2CAwMCA8PDw8PCBAAARQAFgCAAIAAQABAAEAA+D8=o07PThkCAAAAfQAAAAhRFQAQALgUAAAAAAEADv//AP7iAAAAFAC6vNbdysC1vL/GvLzT0M/euavLvoAAAAAAOLkeQYAAAAAAAAAAAAEpAAAAAAAAANkqumA8EgAAO8kaakoSAAABAAAAAAAAAA9CVTAwMDAwMDAzNTgwNzg=……

2、因为binlog_format采用的是row模式,所以需要进一步解析binglog

[mysql@ezgclient mysqllog]$ mysqlbinlog -v -v mysqlbin.000150>/tmp/11_25.txt[mysql@ezgclient mysqllog]$ vi /tmp/11_25.txt#111125 16:15:31 server id 2 end_log_pos 1396875 Table_map: `a`.`abc` mapped to number 5304#111125 16:15:31 server id 2 end_log_pos 1397000 Delete_rows: table id 5304 flags: STMT_END_FBINLOG 'o07PThMCAAAAVgAAAItQFQAAALgUAAAAAAAABnNydGFpcgAPdGFiX3V1bV9jb21wYW55AA4DD/b2CAwMCA8PDw8PCBAAARQAFgCAAIAAQABAAEAA+D8=o07PThkCAAAAfQAAAAhRFQAQALgUAAAAAAEADv//AP7iAAAAFAC6vNbdysC1vL/GvLzT0M/euavLvoAAAAAAOLkeQYAAAAAAAAAAAAEpAAAAAAAAANkqumA8EgAAO8kaakoSAAABAAAAAAAAAA9CVTAwMDAwMDAzNTgwNzg='/*!*/;### DELETE FROM a.abc### WHERE### @1=226 /* INT meta=0 nullable=0 is_null=0 */### @2='*****有限公司' /* VARSTRING(256) meta=256 nullable=0 is_null=0 */### @3=951656001 /* DECIMAL(20,0) meta=5120 nullable=0 is_null=0 */### @4=000000001 /* DECIMAL(22,0) meta=5632 nullable=1 is_null=0 */### @5=41 /* LONGINT meta=0 nullable=1 is_null=0 */### @6=2005-05-30 15:11:29 /* DATETIME meta=0 nullable=1 is_null=0 */### @7=2011-08-17 02:02:19 /* DATETIME meta=0 nullable=1 is_null=0 */### @8=1 /* LONGINT meta=0 nullable=1 is_null=0 */### @9='BU0000000358078' /* VARSTRING(128) meta=128 nullable=1 is_null=0 */### @10=NULL /* VARSTRING(128) meta=128 nullable=1 is_null=1 */### @11=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */### @12=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */### @13=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */### @14=NULL /* VARSTRING(128) meta=0 nullable=1 is_null=1 */# at 1397000#111125 16:15:31 server id 2 end_log_pos 1397027 Xid = 79238866COMMIT/*!*/;DELIMITER ;# End of log fileROLLBACK /* added by mysqlbinlog */;

通过上面的分析和这里的日志情况显示,很明显有人误删除了这条记录,导致我们公司所有员工不能登录im(登录在线的,不会使用到这条记录,这个也就是导致了我们到晚上八点多才发现这个异常)

3、检查error日志
检查这个日志,发现一个很明显的bug,这个是导致数据库重启,以及那个时间因为数据库重启导致ogg进程异常收到邮件

111125 16:15:35 InnoDB: Assertion failure in thread 1095162176 in file row/row0mysql.c line 1534InnoDB: Failing assertion: index->type & DICT_CLUSTEREDInnoDB: We intentionally generate a memory trap.InnoDB: Submit a detailed bug report to http://bugs.mysql.com.InnoDB: If you get repeated assertion failures or crashes, evenInnoDB: immediately after the mysqld startup, there may beInnoDB: corruption in the InnoDB tablespace. Please refer toInnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.htmlInnoDB: about forcing recovery.111125 16:15:35 - mysqld got signal 6 ;This could be because you hit a bug. It is also possible that this binaryor one of the libraries it was linked against is corrupt, improperly built,or misconfigured. This error can also be caused by malfunctioning hardware.We will try our best to scrape up some info that will hopefully help diagnosethe problem, but since we have already crashed, something is definitely wrongand this may fail.key_buffer_size=8388608read_buffer_size=1048576max_used_connections=30max_threads=1000threads_connected=14It is possible that mysqld could use up tokey_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 9234379 Kbytes of memoryHope that's ok; if not, decrease some variables in the equation.thd: 0xb8bf170Attempting backtrace. You can use the following information to find outwhere mysqld died. If you see no messages after this, something wentterribly wrong...stack_bottom = 0x4146d100 thread_stack 0x30000/opt/mysql/product/5.1/bin/mysqld(my_print_stacktrace+0x2e)[0x8a74ce]/opt/mysql/product/5.1/bin/mysqld(handle_segfault+0x322)[0x5dc992]/lib64/libpthread.so.0[0x357980eb10]/lib64/libc.so.6(gsignal+0x35)[0x3578c30265]/lib64/libc.so.6(abort+0x110)[0x3578c31d10]/opt/mysql/product/5.1/bin/mysqld(row_unlock_for_mysql+0x2f2)[0x7f4a52]/opt/mysql/product/5.1/bin/mysqld(row_search_for_mysql+0x22e1)[0x802591]/opt/mysql/product/5.1/bin/mysqld(_ZN11ha_innobase10index_readEPhPKhj16ha_rkey_function+0x192)[0x7724d2]/opt/mysql/product/5.1/bin/mysqld(_ZN7handler16read_range_firstEPK12st_key_rangeS2_bb+0xbe)[0x6caa9e]/opt/mysql/product/5.1/bin/mysqld(_ZN7handler22read_multi_range_firstEPP18st_key_multi_rangeS1_jbP17st_handler_buffer+0xce)[0x6c85be]/opt/mysql/product/5.1/bin/mysqld(_ZN18QUICK_RANGE_SELECT8get_nextEv+0x127)[0x6aa557]/opt/mysql/product/5.1/bin/mysqld[0x6c415d]/opt/mysql/product/5.1/bin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP11st_sql_listyyb+0x86c)[0x66fc5c]/opt/mysql/product/5.1/bin/mysqld(_Z21mysql_execute_commandP3THD+0x38bf)[0x5f03af]/opt/mysql/product/5.1/bin/mysqld(_Z11mysql_parseP3THDPKcjPS2_+0x357)[0x5f25e7]/opt/mysql/product/5.1/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0xe63)[0x5f3453]/opt/mysql/product/5.1/bin/mysqld(_Z10do_commandP3THD+0xe6)[0x5f3d16]/opt/mysql/product/5.1/bin/mysqld(handle_one_connection+0x236)[0x5e66d6]/lib64/libpthread.so.0[0x357980673d]/lib64/libc.so.6(clone+0x6d)[0x3578cd3d1d]Trying to get some variables.Some pointers may be invalid and cause the dump to abort...thd->query at 0x2aab4890fcd0 is an invalid pointerthd->thread_id=62259thd->killed=NOT_KILLEDThe manual page at http://dev.mysql.com/doc/mysql/en/crashing.html containsinformation that should help you find out what is causing the crash.111125 16:15:35 mysqld_safe Number of processes running now: 0111125 16:15:35 mysqld_safe mysqld restartedInnoDB: Log scan progressed past the checkpoint lsn 0 694228728111125 16:15:36 InnoDB: Database was not shut down normally!InnoDB: Starting crash recovery.InnoDB: Reading tablespace information from the .ibd files...InnoDB: Restoring possible half-written data pages from the doublewriteInnoDB: buffer...InnoDB: Doing recovery: scanned up to log sequence number 0 694229872111125 16:15:36 InnoDB: Starting an apply batch of log records to the database...InnoDB: Progress in percents: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99InnoDB: Apply batch completedInnoDB: Last MySQL binlog file position 0 1397027, file name /opt/mysql/mysqldata/mysqllog/mysqlbin.000149111125 16:15:37 InnoDB: Started; log sequence number 0 694229872111125 16:15:37 [Note] Recovering after a crash using /opt/mysql/mysqldata/mysqllog/mysqlbin111125 16:15:37 [Note] Starting crash recovery...111125 16:15:37 [Note] Crash recovery finished.

而且还有个巧合就是查看binlog发现,DELETE FROM a.abc where ……之后,数据库就因为这个bug自动重启了。

4、网友解释

# /opt/mysql/product/5.1/bin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP11st_sql_listyyb+0x86c)[0x66fc5c]# /opt/mysql/product/5.1/bin/mysqld(_Z21mysql_execute_commandP3THD+0x38bf)[0x5f03af]---从这信息看是整理簇索引,导致表空间出现损坏---分析的情况,你的系统应该正在做一个DELETE操作,而且应该无索引可走,删除的数据量也比较大---可能是大量数据被缓存在innodb_buffer_pool_size中,并且其内部有创建自适应的hash索引,因删除数据而不得不重新创建,---以及你的服务器当时IO出现瓶颈,导致一时无法响应Innodb master thread,而出现问题,并且InnoDB引擎在此方面出现过BUG---解决版本是5.1.37之后,所以建议使用:5.1.40版本,较稳定
12-09 10:03
查看更多