I'd like to make about 2 TB available via NFS and CIFS. I'm looking for a two (or more) server solution for high availability, with the ability to load balance across the servers if possible. Any suggestions for clustering or high-availability solutions?
This is for business use, and we plan to grow to 5-10 TB over the next few years. Our facility runs 24 hours a day, six days a week. We could live with 15-30 minutes of downtime, but we want to minimize data loss. I also want to minimize 3 AM phone calls.
We're currently running one server with ZFS on Solaris, and we were looking at AVS for the HA part, but minor issues with Solaris (the CIFS implementation doesn't work with Vista, etc.) have been holding us up.
We have started to look at:
- DRBD on top of GFS (GFS for the distributed locking capability)
- Lustre (needs client-side software; no native CIFS support?)
- Windows DFS (the docs say it only replicates after a file is closed?)
We're looking for a "black box" that just serves up the data.
Currently we snapshot the data in ZFS and send the snapshots over the wire to a remote data center for offsite backup.
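Roughly, that snapshot shipping looks like the lines below; the dataset tank/data and the host backuphost are placeholders, not our actual names.
zfs snapshot tank/data@snap1
zfs send tank/data@snap1 | ssh backuphost zfs receive -F backup/data
# later runs only send the delta between two snapshots
zfs snapshot tank/data@snap2
zfs send -i tank/data@snap1 tank/data@snap2 | ssh backuphost zfs receive backup/data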
Our original plan was to have a second machine and rsync every 10-15 minutes. The problem with a failure is that the production processes running at that moment would lose up to 15 minutes of data and be left "somewhere in the middle". It's almost easier to start over from the beginning than to figure out where to pick back up in the middle. That's what drove us to look at HA solutions.
Best answer
I recently deployed HA NFS using DRBD as the backend; in my situation I'm running active/standby mode, but I've also tested it successfully in primary/primary mode with OCFS2. Unfortunately there isn't much documentation on how best to achieve this, and most of what exists is barely useful at best. If you do go down the drbd route, I highly recommend joining the drbd mailing list and reading all of the documentation. Here is my ha/drbd setup, along with the scripts I wrote to handle failover:
DRBD8 is required - this is provided by drbd8-utils and drbd8-source. Once those are installed (I believe they come from backports), you can use module-assistant to build and install it: m-a a-i drbd8. Either depmod -a or reboot at that point; if you go the depmod -a route, you'll need to modprobe drbd.
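Put together, that install sequence is roughly (Debian-style packaging assumed, as the package names imply):
apt-get install drbd8-utils drbd8-source   # drbd8-source may need backports
m-a a-i drbd8                              # build and install the kernel module via module-assistant
depmod -a                                  # or simply reboot instead
modprobe drbd                              # only needed if you chose depmod -a over a reboot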
You'll need a backend partition to use for drbd. Do not make this partition LVM, or you'll hit all sorts of problems. Do not put LVM on top of the drbd device either, or you'll hit all sorts of problems.
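A quick sanity check that the partition isn't already claimed by LVM (assuming /dev/sda3, the disk used in the configs below):
pvs 2>/dev/null | grep -q sda3 && echo "WARNING: /dev/sda3 is an LVM physical volume"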
hanfs1's /etc/drbd.conf:
global {
usage-count no;
}
common {
protocol C;
disk { on-io-error detach; }
}
resource export {
syncer {
rate 125M;
}
on hanfs2 {
address 172.20.1.218:7789;
device /dev/drbd1;
disk /dev/sda3;
meta-disk internal;
}
on hanfs1 {
address 172.20.1.219:7789;
device /dev/drbd1;
disk /dev/sda3;
meta-disk internal;
}
}
hanfs2's /etc/drbd.conf:
global {
usage-count no;
}
common {
protocol C;
disk { on-io-error detach; }
}
resource export {
syncer {
rate 125M;
}
on hanfs2 {
address 172.20.1.218:7789;
device /dev/drbd1;
disk /dev/sda3;
meta-disk internal;
}
on hanfs1 {
address 172.20.1.219:7789;
device /dev/drbd1;
disk /dev/sda3;
meta-disk internal;
}
}
Once that's configured, we next need to get drbd up and running.
drbdadm create-md export
drbdadm attach export
drbdadm connect export
We must now perform an initial synchronization of data - obviously, if this is a brand new drbd cluster, it doesn't matter which node you choose.
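The usual way to kick that initial sync off on DRBD 8, run on whichever node you picked and using the export resource defined above, is:
drbdadm -- --overwrite-data-of-peer primary export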
Once done, you'll need to mkfs.yourchoiceoffilesystem on your drbd device - the device in our config above is /dev/drbd1. http://www.drbd.org/users-guide/p-work.html is a useful document to read while working with drbd.
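For example, with ext3 (the filesystem choice is yours; ext3 is only an example):
mkfs.ext3 /dev/drbd1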
Heartbeat
Install heartbeat2. (Pretty simple, apt-get install heartbeat2).
/etc/ha.d/ha.cf on each machine should consist of:
hanfs1:
logfacility local0
keepalive 2
warntime 10
deadtime 30
initdead 120
ucast eth1 172.20.1.218
auto_failback no
node hanfs1
node hanfs2
hanfs2:
logfacility local0
keepalive 2
warntime 10
deadtime 30
initdead 120
ucast eth1 172.20.1.219
auto_failback no
node hanfs1
node hanfs2
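Heartbeat also won't start without an /etc/ha.d/authkeys file (mode 600, identical on both nodes). A minimal sketch, with a placeholder secret you must replace:
auth 1
1 sha1 replace-with-a-shared-secret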
/etc/ha.d/haresources should be identical on both ha boxes:
hanfs1 IPaddr::172.20.1.230/24/eth1 hanfs1 HeartBeatWrapper
I wrote a wrapper script to deal with the idiosyncrasies caused by nfs and drbd in a failover scenario. This script should live in /etc/ha.d/resource.d/ on each machine.
#!/bin/bash
# heartbeat fails hard.
# so this is a wrapper
# to get around that stupidity
# I'm just wrapping the heartbeat scripts, except for in the case of umount
# as they work, mostly
if [[ -e /tmp/heartbeatwrapper ]]; then
runningpid=$(cat /tmp/heartbeatwrapper)
if [[ -z $(ps --no-heading -p $runningpid) ]]; then
echo "PID found, but process seems dead. Continuing."
else
echo "PID found, process is alive, exiting."
exit 7
fi
fi
echo $$ > /tmp/heartbeatwrapper
if [[ x$1 == "xstop" ]]; then
/etc/init.d/nfs-kernel-server stop #>/dev/null 2>&1
# NFS init script isn't LSB compatible, exit codes are 0 no matter what happens.
# Thanks guys, you really make my day with this bullshit.
# Because of the above, we just have to hope that nfs actually catches the signal
# to exit, and manages to shut down its connections.
# If it doesn't, we'll kill it later, then term any other nfs stuff afterwards.
# I found this to be an interesting insight into just how badly NFS is written.
sleep 1
#we don't want to shutdown nfs first!
#The lock files might go away, which would be bad.
#The above seems to not matter much, the only thing I've determined
#is that if you have anything mounted synchronously, it's going to break
#no matter what I do. Basically, sync == screwed; in NFSv3 terms.
#End result of failing over while a client that's synchronous is that
#the client hangs waiting for its nfs server to come back - thing doesn't
#even bother to time out, or attempt a reconnect.
#async works as expected - it insta-reconnects as soon as a connection seems
#to be unstable, and continues to write data. In all tests, md5sums have
#remained the same with/without failover during transfer.
#So, we first unmount /export - this prevents drbd from having a shit-fit
#when we attempt to turn this node secondary.
#That's a lie too, to some degree. LVM is entirely to blame for why DRBD
#was refusing to unmount. Don't get me wrong, having /export mounted doesn't
#help either, but still.
#fix a usecase where one or other are unmounted already, which causes us to terminate early.
if [[ "$(grep -o /varlibnfs/rpc_pipefs /etc/mtab)" ]]; then
for ((test=1; test <= 10; test++)); do
umount /export/varlibnfs/rpc_pipefs >/dev/null 2>&1
if [[ -z $(grep -o /varlibnfs/rpc_pipefs /etc/mtab) ]]; then
break
fi
if [[ $? -ne 0 ]]; then
#try again, harder this time
umount -l /var/lib/nfs/rpc_pipefs >/dev/null 2>&1
if [[ -z $(grep -o /varlibnfs/rpc_pipefs /etc/mtab) ]]; then
break
fi
fi
done
if [[ $test -gt 10 ]]; then
rm -f /tmp/heartbeatwrapper
echo "Problem unmounting rpc_pipefs"
exit 1
fi
fi
if [[ "$(grep -o /dev/drbd1 /etc/mtab)" ]]; then
for ((test=1; test <= 10; test++)); do
umount /export >/dev/null 2>&1
if [[ -z $(grep -o /dev/drbd1 /etc/mtab) ]]; then
break
fi
if [[ $? -ne 0 ]]; then
#try again, harder this time
umount -l /export >/dev/null 2>&1
if [[ -z $(grep -o /dev/drbd1 /etc/mtab) ]]; then
break
fi
fi
done
if [[ $test -gt 10 ]]; then
rm -f /tmp/heartbeatwrapper
echo "Problem unmount /export"
exit 1
fi
fi
#now, it's important that we shut down nfs. it can't write to /export anymore, so that's fine.
#if we leave it running at this point, then drbd will screwup when trying to go to secondary.
#See contradictory comment above for why this doesn't matter anymore. These comments are left in
#entirely to remind me of the pain this caused me to resolve. A bit like why churches have Jesus
#nailed onto a cross instead of chilling in a hammock.
pidof nfsd | xargs kill -9 >/dev/null 2>&1
sleep 1
if [[ -n $(ps aux | grep nfs | grep -v grep) ]]; then
echo "nfs still running, trying to kill again"
pidof nfsd | xargs kill -9 >/dev/null 2>&1
fi
sleep 1
/etc/init.d/nfs-kernel-server stop #>/dev/null 2>&1
sleep 1
#next we need to tear down drbd - easy with the heartbeat scripts
#it takes input as resourcename start|stop|status
#First, we'll check to see if it's stopped
/etc/ha.d/resource.d/drbddisk export status >/dev/null 2>&1
if [[ $? -eq 2 ]]; then
echo "resource is already stopped for some reason..."
else
for ((i=1; i <= 10; i++)); do
/etc/ha.d/resource.d/drbddisk export stop >/dev/null 2>&1
if [[ $(egrep -o "st:[A-Za-z/]*" /proc/drbd | cut -d: -f2) == "Secondary/Secondary" ]] || [[ $(egrep -o "st:[A-Za-z/]*" /proc/drbd | cut -d: -f2) == "Secondary/Unknown" ]]; then
echo "Successfully stopped DRBD"
break
else
echo "Failed to stop drbd for some reason"
cat /proc/drbd
if [[ $i -eq 10 ]]; then
exit 50
fi
fi
done
fi
rm -f /tmp/heartbeatwrapper
exit 0
elif [[ x$1 == "xstart" ]]; then
#start up drbd first
/etc/ha.d/resource.d/drbddisk export start >/dev/null 2>&1
if [[ $? -ne 0 ]]; then
echo "Something seems to have broken. Let's check possibilities..."
testvar=$(egrep -o "st:[A-Za-z/]*" /proc/drbd | cut -d: -f2)
if [[ $testvar == "Primary/Unknown" ]] || [[ $testvar == "Primary/Secondary" ]]
then
echo "All is fine, we are already the Primary for some reason"
elif [[ $testvar == "Secondary/Unknown" ]] || [[ $testvar == "Secondary/Secondary" ]]
then
echo "Trying to assume Primary again"
/etc/ha.d/resource.d/drbddisk export start >/dev/null 2>&1
if [[ $? -ne 0 ]]; then
echo "I give up, something's seriously broken here, and I can't help you to fix it."
rm -f /tmp/heartbeatwrapper
exit 127
fi
fi
fi
sleep 1
#now we remount our partitions
for ((test=1; test <= 10; test++)); do
mount /dev/drbd1 /export >/tmp/mountoutput
if [[ -n $(grep -o export /etc/mtab) ]]; then
break
fi
done
if [[ $test -gt 10 ]]; then
rm -f /tmp/heartbeatwrapper
exit 125
fi
#I'm really unsure at this point of the side-effects of not having rpc_pipefs mounted.
#The issue here, is that it cannot be mounted without nfs running, and we don't really want to start
#nfs up at this point, lest it ruin everything.
#For now, I'm leaving mine unmounted, it doesn't seem to cause any problems.
#Now we start up nfs.
/etc/init.d/nfs-kernel-server start >/dev/null 2>&1
if [[ $? -ne 0 ]]; then
echo "There's not really that much that I can do to debug nfs issues."
echo "probably your configuration is broken. I'm terminating here."
rm -f /tmp/heartbeatwrapper
exit 129
fi
#And that's it, done.
rm -f /tmp/heartbeatwrapper
exit 0
elif [[ x$1 == "xstatus" ]]; then
#Lets check to make sure nothing is broken.
#DRBD first
/etc/ha.d/resource.d/drbddisk export status >/dev/null 2>&1
if [[ $? -ne 0 ]]; then
echo "stopped"
rm -f /tmp/heartbeatwrapper
exit 3
fi
#mounted?
grep -q drbd /etc/mtab >/dev/null 2>&1
if [[ $? -ne 0 ]]; then
echo "stopped"
rm -f /tmp/heartbeatwrapper
exit 3
fi
#nfs running?
/etc/init.d/nfs-kernel-server status >/dev/null 2>&1
if [[ $? -ne 0 ]]; then
echo "stopped"
rm -f /tmp/heartbeatwrapper
exit 3
fi
echo "running"
rm -f /tmp/heartbeatwrapper
exit 0
fi
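Assuming you saved the wrapper as HeartBeatWrapper (the resource name used in haresources above), make sure it's executable on both nodes:
chmod 755 /etc/ha.d/resource.d/HeartBeatWrapper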
With all of the above done, it's then just a case of configuring /etc/exports:
/export 172.20.1.0/255.255.255.0(rw,sync,fsid=1,no_root_squash)
Then it's just a case of starting up heartbeat on both machines and issuing hb_takeover on one of them. You can test that it's working by making sure the machine you issued the takeover on is primary - check /proc/drbd, make sure the device is mounted correctly, and make sure you can access nfs.
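A rough check sequence for that (the hb_takeover path below is the usual Debian location; it may live elsewhere on your distro):
/etc/init.d/heartbeat start                 # on both nodes
/usr/share/heartbeat/hb_takeover            # on the node you want active
cat /proc/drbd                              # this node should now report Primary
grep /export /etc/mtab                      # the drbd device should be mounted on /export
showmount -e localhost                      # the nfs export should be visible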
-
Best of luck with this - setting it up from scratch was an extremely painful experience for me.