Problem Description
Are there any specific problems with running Microsoft's BCP utility (on CentOS 7, https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-migrate-bcp?view=sql-server-2017) on multiple threads? Googling could not find much, but am looking at a problem that seems to be related to just that.
Copying a set of large TSV files from HDFS to a remote MSSQL Server with some code of the form
bcpexport() {
    filename=$1
    TO_SERVER_ODBCDSN=$2
    DB=$3
    TABLE=$4
    USER=$5
    PASSWORD=$6
    RECOMMEDED_IMPORT_MODE=$7
    DELIMITER=$8

    echo -e "\nRemoving header from TSV file $filename"
    echo -e "Current head:\n"
    head -n 1 "$filename"
    # strip the header row in place
    echo "$(tail -n +2 "$filename")" > "$filename"
    echo "First line of file is now..."
    head -n 1 "$filename"
    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause script to hang; workaround implemented below, throws error if timeout reached
    timeout 30 sleep 5

    echo -e "\nReplacing null literal values with empty chars"
    NULL_WITH_TAB="null\t" # WARN: assumes the first field is the primary key, so never null
    TAB="\t"
    sed -i -e "s/$NULL_WITH_TAB/$TAB/g" "$filename"
    # $'\t' so that grep sees a literal tab character
    echo -e "Lines containing null (expect zero): $(grep -c $'\tnull\t' "$filename")"
    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause script to hang; workaround implemented below
    timeout 30 sleep 5

    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U "$USER" -P "$PASSWORD" \
        -d "$DB" \
        $RECOMMEDED_IMPORT_MODE \
        -t "\t" \
        -e $(unknown).bcperror.log
}
export -f bcpexport
parallel -q -j 7 bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMEDED_IMPORT_MODE $DELIMITER \
::: $DATAFILES/$TARGET_GLOB
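As an aside on the dispatch mechanics: the concurrency cap that `parallel -j 7` provides can also be sketched in plain bash with a job pool. This is only an illustration of the throttling behavior, not the original script; `do_copy`, the file names, and `MAX_JOBS` are hypothetical stand-ins (a real run would call `bcpexport` with its full argument list).

```shell
#!/usr/bin/env bash
# Minimal job-pool sketch: never more than MAX_JOBS copies in flight,
# mirroring what `parallel -j 7` does for bcpexport.
do_copy() {            # hypothetical stand-in for bcpexport
    echo "copied $1"
}
MAX_JOBS=3
for f in a.tsv b.tsv c.tsv d.tsv e.tsv; do
    # block while the pool is full; wait -n reaps one finished job (bash 4.3+)
    while (( $(jobs -rp | wc -l) >= MAX_JOBS )); do
        wait -n
    done
    do_copy "$f" &
done
wait    # drain the remaining jobs
```

This prints one "copied <file>" line per input (order may vary). Lowering `MAX_JOBS` here is the moral equivalent of lowering `-j` in the real invocation.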
where $DATAFILES/$TARGET_GLOB constructs a glob that lists a set of files in a directory.
When running this code for a set of TSV files, I find that sometimes some (but not all) of the parallel BCP threads fail, i.e. some files successfully copy to MSSQL Server:
5397376 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 154902 Average : (34843.8 rows per sec.)
while others output the error message
BCP copy in failed
Usually I see this pattern: a few successful BCP copy-in operations in the first few threads returned, then a bunch of failing threads return their output until we run out of files (GNU Parallel returns a thread's output only once the whole thread is done, so the output appears sequential).
Notice in the code there is the -e option to produce an error file for each BCP copy-in operation (see https://docs.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017#e). When examining the files after observing these failing behaviors, all are blank, with no error messages.
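One quick way to confirm the -e logs really are all blank is to scan for any non-empty ones. A minimal sketch; the `*.bcperror.log` naming and the scratch files are assumptions for illustration, not the actual log paths:

```shell
demo=$(mktemp -d)            # scratch dir standing in for the working directory
cd "$demo"
: > empty1.bcperror.log      # blank log, as observed after the failures
echo "sample error" > full.bcperror.log   # what a real failure would leave behind
# prints only logs that contain content; no output means every log is blank
find . -name '*.bcperror.log' -size +0c -print   # → ./full.bcperror.log
```

If this prints nothing against the real logs, bcp exited without writing any row-level errors, which points away from bad data in the files themselves.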
Have only seen this with the number of threads >= 10 (and only for certain sets of data, so presumably it has something to do with the total number of files and the file sizes, and yet...); no errors seen so far when using ~7 threads, which further makes me suspect this has something to do with multi-threading.
Monitoring system resources (via free -mh) shows that generally ~13GB of RAM is always available.
It may be helpful to note that the data bcp is trying to copy in may be ~500000-1000000 records long, with an upper limit of ~100 columns per record.
Does anyone have any idea what could be going on here? Note, am pretty new to using BCP as well as GNU Parallel and multi-threading.
Recommended Answer
TLDR: Adding more threads that concurrently bcp copy in files of data seems to have the effect of overwhelming the endpoint MSSQL Server with write instructions, causing the bcp threads to fail (maybe timing out?). The point at which the number of threads becomes too many seems to depend on the size of the files being copied in by bcp (i.e. both the number of records in each file and the width of each record (i.e. the number of columns)).
Long version (more reasons for my theory):
1. When running a larger number of bcp threads and looking at the processes started on the machine (via https://clustershell.readthedocs.io/en/latest/tools/clush.html):
ps -aux | grep bcp
we see a bunch of sleeping processes (note the S state; see https://askubuntu.com/a/360253/760862), as shown below (newlines added for readability):
/opt/mssql-tools/bin/bcp TABLENAME in /path/to/tsv/1_16_0.tsv -D -S MyMSSQLServer -U myusername -P -d myDB -c -t \t -e /path/to/logfile
These threads appear to sleep for a very long time. Further debugging into why they are sleeping suggests that they may in fact be doing their intended job (which would further imply that the problem may be coming from BCP itself; see https://stackoverflow.com/a/52748660/8236733). Per https://unix.stackexchange.com/a/47259/260742 and https://unix.stackexchange.com/a/36200/260742, the S state is an interruptible sleep: the process is waiting for some event to complete
(e.g. writing to the MSSQL Server endpoint destination given to bcp in the ODBCDSN).
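The state column can also be read directly from ps without grepping the full listing. A small sketch that inspects the current shell so it runs anywhere; against the real workload the same format would be applied to the bcp processes (e.g. with `-C bcp` instead of `-p $$`):

```shell
# Print pid, state, and command name for a given process.
# An 'S' in the state column is an interruptible sleep, i.e. the
# process is waiting on an event such as I/O completing.
ps -o pid=,stat=,comm= -p $$
```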
2. When running different sets of files with varying records per file (e.g. ranges of 500000 - 1000000 rows/file) and record widths (~10 - 100 columns/row), I found that in cases with either very large data widths or amounts, running a fixed set of bcp threads would fail.
E.g. for a set of ~33 TSVs with ~500000 rows each, each row being ~100 columns wide, a set of 30 threads would write the first few OK, but then all the rest would start returning failure messages. Incorporating a bit from @jamie's answer, the fact that the failure messages returned are "BCP copy in failed" errors does not necessarily mean it has to do with the content of the data in question, since there was no actual content being written into the -e error log files from my process.
Meanwhile, a set of ~33 TSVs with ~500000 rows each, each row being ~100 wide, still using 30 bcp threads, would complete quickly and without error (and would also be faster when reducing the number of threads or the file set). The only difference here being the overall size of the data being bcp copied in to the MSSQL Server.
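To reproduce this kind of size sweep, dummy TSVs of a chosen shape can be generated without touching the real HDFS data. A rough sketch; the row/column counts and the output filename are arbitrary, not from the original experiments:

```shell
cd "$(mktemp -d)"    # scratch dir; real runs would target $DATAFILES
rows=1000 cols=10    # arbitrary shape for one point of the sweep
# Emit a rows x cols tab-separated file of synthetic values.
awk -v rows="$rows" -v cols="$cols" 'BEGIN {
    for (r = 1; r <= rows; r++)
        for (c = 1; c <= cols; c++)
            printf "v%d_%d%s", r, c, (c < cols ? "\t" : "\n")
}' > synthetic.tsv
wc -l < synthetic.tsv    # prints the row count (1000)
```

Varying `rows` and `cols` across runs gives a controlled way to find the file size at which a fixed thread count starts failing.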
Throughout all of this, free -mh still showed that the machine running the threads had ~15GB of free RAM remaining in each case (which is again why I suspect that the problem has to do with the remote MSSQL Server endpoint rather than with the code or the local machine itself).
3. When running some of the tests from (2), I found that manually killing the parallel process (via CTRL+C) and then trying to remotely truncate the testing table being written to, using /opt/mssql-tools/bin/sqlcmd -Q "truncate table mytable" from the local machine, would take a very long time (as opposed to manually logging into the MSSQL Server and executing truncate table mytable in the DB). Again, this makes me think that this has something to do with the MSSQL Server having too many connections and just being overwhelmed.
** Anyone with any MSSQL Mgmt Studio experience reading this (I have basically none): if you see anything here that makes you think my theory is incorrect, please let me know your thoughts.