Problem description
My program recently hit a weird segfault at runtime. I'd like to know whether anybody has met this error before and how it can be fixed. Here is more info:
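Basic info:
- CentOS 5.2, kernel version 2.6.18
- g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50)
- CPU: Intel x86 family
- libstdc++.so.6.0.8
- My program starts multiple threads to process data. The segfault occurs in one of those threads.
- Although it is a multi-threaded program, the segfault seems to occur on a local std::string object. I'll show this in the code snippet below.
- The program is compiled with -g, -Wall and -fPIC, without -O2 or any other optimization option.
Core dump info: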
Core was generated by `./myprog'.
Program terminated with signal 11, Segmentation fault.
#0 0x06f6d919 in __gnu_cxx::__exchange_and_add(int volatile*, int) () from /usr/lib/libstdc++.so.6
(gdb) bt
#0 0x06f6d919 in __gnu_cxx::__exchange_and_add(int volatile*, int) () from /usr/lib/libstdc++.so.6
#1 0x06f507c3 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/libstdc++.so.6
#2 0x06f50834 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator=(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/libstdc++.so.6
#3 0x081402fc in Q_gdw::ProcessData (this=0xb2f79f60) at ../../../myprog/src/Q_gdw/Q_gdw.cpp:798
#4 0x08117d3a in DataParser::Parse (this=0x8222720) at ../../../myprog/src/DataParser.cpp:367
#5 0x08119160 in DataParser::run (this=0x8222720) at ../../../myprog/src/DataParser.cpp:338
#6 0x080852ed in Utility::__dispatch (arg=0x8222720) at ../../../common/thread/Thread.cpp:603
#7 0x0052c832 in start_thread () from /lib/libpthread.so.0
#8 0x00ca845e in clone () from /lib/libc.so.6
Please note that the segfault starts inside basic_string::operator=().
The related code: (I've shown more code than might be needed; please ignore the coding style for now.)
int Q_gdw::ProcessData()
{
char tmpTime[10+1] = {0};
char A01Time[12+1] = {0};
std::string tmpTimeStamp;
// Get the timestamp from TP
if((m_BackFrameBuff[11] & 0x80) >> 7)
{
for (i = 0; i < 12; i++)
{
A01Time[i] = (char)A15Result[i];
}
tmpTimeStamp = FormatTimeStamp(A01Time, 12); // Segfault occurs on this line
And here is the prototype of the FormatTimeStamp method:
std::string FormatTimeStamp(const char *time, int len)
String assignments like this should be a very common operation, so I don't understand why a segfault could occur here.
What I have investigated:
I've searched the web for answers. I looked here. The reply says to recompile the program with the _GLIBCXX_FULLY_DYNAMIC_STRING macro defined. I tried that, but the crash still happens.
I also looked here. It also says to recompile the program with _GLIBCXX_FULLY_DYNAMIC_STRING, but the author seems to be dealing with a different problem from mine, so I don't think his solution applies to me.
Updated on 08/15/2011
Here is the original code of FormatTimeStamp. I understand the code doesn't look very nice (too many magic numbers, for instance), but let's focus on the crash first.
string Q_gdw::FormatTimeStamp(const char *time, int len)
{
string timeStamp;
string tmpstring;
if (time) // It is guaranteed that "time" is correctly zero-terminated, so don't worry about any overflow here.
tmpstring = time;
// Get the current time point.
int year, month, day, hour, minute, second;
#ifndef _WIN32
struct timeval timeVal;
struct tm *p;
gettimeofday(&timeVal, NULL);
p = localtime(&(timeVal.tv_sec));
year = p->tm_year + 1900;
month = p->tm_mon + 1;
day = p->tm_mday;
hour = p->tm_hour;
minute = p->tm_min;
second = p->tm_sec;
#else
SYSTEMTIME sys;
GetLocalTime(&sys);
year = sys.wYear;
month = sys.wMonth;
day = sys.wDay;
hour = sys.wHour;
minute = sys.wMinute;
second = sys.wSecond;
#endif
if (0 == len)
{
// The "time" doesn't specify any time so we just use the current time
char tmpTime[30];
memset(tmpTime, 0, 30);
sprintf(tmpTime, "%d-%d-%d %d:%d:%d.000", year, month, day, hour, minute, second);
timeStamp = tmpTime;
}
else if (6 == len)
{
// The "time" specifies "day-month-year" with each being 2-digit.
// For example: "150811" means "August 15th, 2011".
timeStamp = "20";
timeStamp = timeStamp + tmpstring.substr(4, 2) + "-" + tmpstring.substr(2, 2) + "-" +
tmpstring.substr(0, 2);
}
else if (8 == len)
{
// The "time" specifies "minute-hour-day-month" with each being 2-digit.
// For example: "51151508" means "August 15th, 15:51".
// As the year is not specified, the current year will be used.
string strYear;
stringstream sstream;
sstream << year;
sstream >> strYear;
sstream.clear();
timeStamp = strYear + "-" + tmpstring.substr(6, 2) + "-" + tmpstring.substr(4, 2) + " " +
tmpstring.substr(2, 2) + ":" + tmpstring.substr(0, 2) + ":00.000";
}
else if (10 == len)
{
// The "time" specifies "minute-hour-day-month-year" with each being 2-digit.
// For example: "5115150811" means "August 15th, 2011, 15:51".
timeStamp = "20";
timeStamp = timeStamp + tmpstring.substr(8, 2) + "-" + tmpstring.substr(6, 2) + "-" + tmpstring.substr(4, 2) + " " +
tmpstring.substr(2, 2) + ":" + tmpstring.substr(0, 2) + ":00.000";
}
else if (12 == len)
{
// The "time" specifies "second-minute-hour-day-month-year" with each being 2-digit.
// For example: "305115150811" means "August 15th, 2011, 15:51:30".
timeStamp = "20";
timeStamp = timeStamp + tmpstring.substr(10, 2) + "-" + tmpstring.substr(8, 2) + "-" + tmpstring.substr(6, 2) + " " +
tmpstring.substr(4, 2) + ":" + tmpstring.substr(2, 2) + ":" + tmpstring.substr(0, 2) + ".000";
}
return timeStamp;
}
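A side note on the 12-digit branch above: it relies on tmpstring holding at least 12 characters, and substr() throws std::out_of_range when its start position lies past the end of the string. Below is a minimal sketch of that branch with an explicit length check. The name format_timestamp12 is hypothetical, and returning an empty string for malformed input is an assumption, not what the original program does.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the original program): converts
// "ssmmhhDDMMYY" into "20YY-MM-DD hh:mm:ss.000".
// Assumption: malformed input yields an empty string instead of throwing.
std::string format_timestamp12(const std::string& raw)
{
    if (raw.size() != 12)
        return std::string();

    return "20" + raw.substr(10, 2) + "-" + raw.substr(8, 2) + "-" + raw.substr(6, 2) + " " +
           raw.substr(4, 2) + ":" + raw.substr(2, 2) + ":" + raw.substr(0, 2) + ".000";
}

// Usage example with the sample value from the comment in FormatTimeStamp().
void format_timestamp12_example()
{
    assert(format_timestamp12("305115150811") == "2011-08-15 15:51:30.000");
    assert(format_timestamp12("3051").empty());  // too short: rejected, no exception
}
```

The check does not explain the crash discussed here; it only makes the precondition of the substr() calls explicit.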
Updated on 08/19/2011
This problem has finally been tracked down and fixed. In fact, the FormatTimeStamp() function has nothing to do with the root cause. The segfault is caused by a write overflow of a local char buffer.
The problem can be reproduced with the following simpler program (please ignore the bad naming of some variables for now):
(Compiled with "g++ -Wall -g main.cpp")
#include <cstdio>   // sprintf
#include <cstring>  // memset
#include <string>
#include <iostream>
void overflow_it(char * A15, char * A15Result)
{
int m;
int t = 0,i = 0;
char temp[3];
for (m = 0; m < 6; m++)
{
t = ((*A15 & 0xf0) >> 4) *10 ;
t += *A15 & 0x0f;
A15 ++;
std::cout << "m = " << m << "; t = " << t << "; i = " << i << std::endl;
memset(temp, 0, sizeof(temp));
sprintf((char *)temp, "%02d", t); // The buggy code: temp is not big enough when t is a 3-digit integer.
A15Result[i++] = temp[0];
A15Result[i++] = temp[1];
}
}
int main(int argc, char * argv[])
{
std::string str;
{
char tpTime[6] = {0};
char A15Result[12] = {0};
// Initialize tpTime
for(int i = 0; i < 6; i++)
tpTime[i] = char(154); // 154 would result in a 3-digit t in overflow_it().
overflow_it(tpTime, A15Result);
str.assign(A15Result);
}
std::cout << "str says: " << str << std::endl;
return 0;
}
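One possible way to keep the sprintf() call in overflow_it() from writing past temp is to use snprintf(), which never writes more than the given size and reports how many characters the full text would have needed. The sketch below is only an illustration: format_two_digits is a hypothetical helper, not the fix that was actually applied to the program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Writes t into out using "%02d", never writing more than sizeof(out) bytes.
// Returns true only if the complete text fit; otherwise out holds a truncated,
// still zero-terminated prefix.
bool format_two_digits(int t, char (&out)[3])
{
    const int needed = std::snprintf(out, sizeof(out), "%02d", t);
    return needed >= 0 && static_cast<std::size_t>(needed) < sizeof(out);
}

// Usage example.
void format_two_digits_example()
{
    char temp[3];
    assert(format_two_digits(7, temp));     // "07" fits
    assert(temp[0] == '0' && temp[1] == '7' && temp[2] == '\0');
    assert(!format_two_digits(100, temp));  // "100" needs 4 bytes: reported, not overflowed
    assert(temp[2] == '\0');                // the buffer is still terminated
}
```

Taking the buffer as a reference to an array of 3 chars lets the compiler check the size at the call site, so the function cannot be handed a smaller buffer by mistake.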
Here are two facts we should remember before going on: 1). My machine is an Intel x86 machine, so it uses little-endian byte order. Therefore, for a variable "m" of type int whose value is, say, 10, the memory layout might look like this:
Starting addr:0xbf89bebc: m(byte#1): 10
0xbf89bebd: m(byte#2): 0
0xbf89bebe: m(byte#3): 0
0xbf89bebf: m(byte#4): 0
2). The program above runs in the main thread. Inside the overflow_it() function, the layout of the variables on the thread's stack looks like this (only the important variables are shown):
0xbfc609e9 : temp[0]
0xbfc609ea : temp[1]
0xbfc609eb : temp[2]
0xbfc609ec : m(byte#1) <-- Note that m follows temp immediately. m(byte#1) happens to be the byte temp[3].
0xbfc609ed : m(byte#2)
0xbfc609ee : m(byte#3)
0xbfc609ef : m(byte#4)
0xbfc609f0 : t
...(3 bytes)
0xbfc609f4 : i
...(3 bytes)
...(etc. etc. etc...)
0xbfc60a26 : A15Result <-- Data would be written to this buffer in overflow_it()
...(11 bytes)
0xbfc60a32 : tpTime
...(5 bytes)
0xbfc60a38 : str <-- Note the str takes up 4 bytes. Its starting address is **16 bytes** behind A15Result.
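The first fact (little-endian byte order) can be checked without a debugger by copying the bytes of an int into a byte array. This is a small sketch; bytes_of and is_little_endian are names made up for the illustration.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Returns the bytes of value in the order they are stored in memory
// (lowest address first).
std::vector<unsigned char> bytes_of(int value)
{
    std::vector<unsigned char> bytes(sizeof(int));
    std::memcpy(&bytes[0], &value, sizeof(int));
    return bytes;
}

// True if the least significant byte is stored at the lowest address.
bool is_little_endian()
{
    return bytes_of(1)[0] == 1;
}

// Usage example: where does the value 10 end up?
void bytes_of_example()
{
    const std::vector<unsigned char> b = bytes_of(10);
    assert(b.size() == sizeof(int));
    // On x86 (little endian) the 10 sits in the first byte, as in the layout above.
    assert(is_little_endian() ? b.front() == 10 : b.back() == 10);
}
```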
My analysis:
1). m is a counter in overflow_it() whose value is incremented by 1 on each pass of the for loop and is never supposed to exceed 6. Its value therefore fits entirely in m(byte#1) (remember, it's little endian), which happens to be the byte temp[3].
2). In the buggy line: when t is a 3-digit integer, such as 109, the sprintf() call results in a buffer overflow, because serializing the number 109 to the string "109" actually requires 4 bytes: '1', '0', '9' and a terminating '\0'. Because temp[] is only 3 bytes long, the final '\0' is written to temp[3], which is exactly m(byte#1), the byte that stores m's value. As a result, m is reset to 0 every time.
3). The programmer's expectation, however, is that the for loop in overflow_it() executes only 6 times, with m being incremented by 1 each time. Because m keeps being reset to 0, the loop actually runs far more than 6 times.
4). Now look at the variable i in overflow_it(): on every pass of the for loop, i is incremented by 2 and A15Result[i] is written. If you compile and run this program, you'll see that i finally adds up to 24, which means overflow_it() writes to the bytes from A15Result[0] to A15Result[23]. Note that the object str is only 16 bytes behind A15Result[0], so overflow_it() has "swept through" str and destroyed its memory layout.
5). I think the correct use of std::string, being a non-POD data structure, depends on the instantiated std::string object having a correct internal state. In this program, however, str's internal layout has been changed by force from the outside. This should be why the assign() call finally causes a segfault.
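Point 2) can be replayed without relying on undefined behaviour by modelling the relevant part of the stack frame as one flat byte buffer: the first 3 bytes play the role of temp, the following sizeof(int) bytes the role of m. This is only a model built on the layout shown earlier; counter_after_format is a hypothetical name.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Models "char temp[3]" immediately followed by "int m" inside one buffer,
// formats t into the temp part, and returns the value m holds afterwards.
int counter_after_format(int m_before, int t)
{
    unsigned char frame[3 + sizeof(int)];
    std::memset(frame, 0, sizeof(frame));
    std::memcpy(frame + 3, &m_before, sizeof(int));  // place m right behind temp

    // Bounded by the size of the whole model buffer, so this write is well-defined
    // even when the text spills out of the 3 bytes that stand for temp.
    std::snprintf(reinterpret_cast<char*>(frame), sizeof(frame), "%02d", t);

    int m_after;
    std::memcpy(&m_after, frame + 3, sizeof(int));
    return m_after;
}

// Usage example.
void counter_after_format_example()
{
    assert(counter_after_format(5, 99) == 5);    // "99" plus '\0' stays inside temp
    const int m = counter_after_format(5, 100);  // the '\0' lands on the first byte of m
    assert(m == 0 || m == 5);                    // 0 on little-endian x86; big endian would keep 5
}
```

On a little-endian machine the second call returns 0, which is the "m is reset to 0" effect described above.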
Updated on 08/26/2011
In my previous update on 08/19/2011, I said that the segfault was caused by a method call on a local std::string object whose memory layout had been broken and which had thus become a "destroyed" object. That is not always the whole story. Consider the C++ program below:
//C++
#include <iostream>
#include <string>
class A {
public:
void Hello(const std::string& name) {
std::cout << "hello " << name;
}
};
int main(int argc, char** argv)
{
A* pa = NULL; //!!
pa->Hello("world");
return 0;
}
The Hello() call would succeed. It would succeed even if you assigned an obviously bad pointer to pa. (Strictly speaking, calling a member function through a null or invalid pointer is undefined behaviour in C++; it merely tends to work with typical compilers such as GCC.) The reason is that, in the usual C++ object model, the non-virtual methods of a class don't reside in the memory layout of the object. The C++ compiler turns the A::Hello() method into something like, say, A_Hello_xxx(A * const this, ...), which could be a global function. Thus, as long as you don't dereference the "this" pointer, things can go pretty well.
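The lowering described above can be written out by hand as an ordinary function, which makes the point without relying on undefined behaviour. This is a sketch only: A_Hello is a hypothetical name, real compilers use mangled names, and the function returns the text instead of printing it so that the result can be checked.

```cpp
#include <cassert>
#include <string>

class A {};  // stand-in for the class in the example; it has no data members

// Hand-written equivalent of what the compiler conceptually generates for
// A::Hello(): a plain function that receives the object pointer as a parameter.
std::string A_Hello(A* const self, const std::string& name)
{
    (void)self;  // never dereferenced, so its value does not matter here
    return "hello " + name;
}

// Usage example.
void a_hello_example()
{
    A* pa = 0;
    // Well-defined: passing a null pointer to a plain function is fine
    // as long as the function does not dereference it.
    assert(A_Hello(pa, "world") == "hello world");
}
```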
This fact shows that a "bad" object by itself is NOT the root cause of the SIGSEGV segfault. The assign() method is not virtual in std::string, so merely calling it on the "bad" std::string object wouldn't cause the segfault. There must be some other reason that finally caused it.
I noticed that the segfault comes from the __gnu_cxx::__exchange_and_add() function, so I looked into its source code on this web page:
static inline _Atomic_word
__exchange_and_add(volatile _Atomic_word* __mem, int __val)
{ return __sync_fetch_and_add(__mem, __val); }
__exchange_and_add() ends up calling __sync_fetch_and_add(). According to this web page, __sync_fetch_and_add() is a GCC builtin function that behaves like this:
type __sync_fetch_and_add (type *ptr, type value, ...)
{
tmp = *ptr;
*ptr op= value; // Here the "op=" means "+=" as this function is "_and_add".
return tmp;
}
There it is! The passed-in ptr pointer is dereferenced here. In the 08/19/2011 program, ptr is computed inside assign() from the internals of the "bad" std::string object reached through its "this" pointer, so it holds whatever the overflow wrote there. It is the dereference at this point that actually caused the SIGSEGV segmentation fault.
We can test this with the following program:
#include <bits/atomicity.h>
int main(int argc, char * argv[])
{
__sync_fetch_and_add((_Atomic_word *)0, 10); // Would result in a segfault.
return 0;
}
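For contrast with the crashing call above, here is the same builtin applied to a valid counter, where it returns the old value and updates the memory in place. The sketch mimics dropping one reference from a reference count, which is assumed here to be roughly what a reference-counted string implementation does; release_reference is a hypothetical name.

```cpp
#include <cassert>

// Decrements the counter atomically and returns the value it had before.
// __sync_fetch_and_add is a GCC builtin (Clang supports it as well).
int release_reference(volatile int* refcount)
{
    return __sync_fetch_and_add(refcount, -1);
}

// Usage example.
void release_reference_example()
{
    volatile int refcount = 2;
    assert(release_reference(&refcount) == 2);  // old value is returned
    assert(refcount == 1);                      // memory was updated in place
}
```

The call is only safe because refcount points at real storage; with a garbage address the very same read-modify-write faults, as the test program above shows.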
Recommended answer
There are two possibilities:
- some code before line 798 has corrupted the local tmpTimeStamp object
- the return value from FormatTimeStamp() was somehow bad.
The _GLIBCXX_FULLY_DYNAMIC_STRING macro is most likely a red herring and has nothing to do with the problem.
If you install the debuginfo package for libstdc++ (I don't know what it's called on CentOS), you'll be able to "see into" that code, and might be able to tell whether the left-hand side (LHS) or the right-hand side (RHS) of the assignment operator caused the problem.
If that's not possible, you'll have to debug this at the assembly level. Going into frame #2 and doing x/4x $ebp should give you the previous ebp, the caller address (0x081402fc), the LHS (which should match &tmpTimeStamp in frame #3), and the RHS. Go from there, and good luck!