问题描述
我有一个名为data_file的数据:伦敦巴黎纽约意大利...另外50件物品
I have a file named data_file with data:londonparisnewyorkitaly...50 more items
具有一个包含75个文件的目录,例如dfile1,dfie2 ... afle75,我正在其中搜索data_file中的条目.
Have a directory with over 75 files, say dfile1, dfie2...afle75 in which i am performing search for entries in data_file.
files=$(find . -type f)
for f in $files; do
while read -r line; do
found=$(grep $line $f)
if [ ! -z "$found" ]; then
perform task here
fi
done < data_file
done
由于循环逐个文件地运行,因此需要花费大量时间才能完成.如何加快速度,可以同时对多个文件运行for循环吗?
As the loop runs for each file one by one, it takes lots of time to finish. How can I speed it up, can i run the for loop for multiple files at same time?
推荐答案
下面的示例是一个完整的并行执行方法,该方法处理:
The following example is a full blown parallel execution method, that deals with:
- 执行时间(将在一定的执行时间后发出警告,并在经过更多时间后停止任务)
- 异步记录(保持记录执行任务时的情况)
- 并行性(允许指定同时执行的任务数)
- 与IO相关的僵尸任务(不会阻止执行)
- 可以杀死大孩子们的小孩子们吗?
- 很多东西
在您的示例中,您的(强化)代码如下所示:
In your example, your (hardened) code would look like:
# Load the ExecTasks function described below (must be in the same directory as this one)
source ./exectasks.sh
directoryToProcess="/my/dir/to/find/stuff/into"
tasklist=""
# Prepare task list separated by semicolumn
while IFS= read -r -d $'\0' file; do
if grep "$line" "$file" > /dev/null 2>&1; then
tasklist="$tasklist""my_task;"
done < <(find "$directoryToProcess" -type f -print0)
# Run tasks
ExecTasks "$tasklist" "trivial-task-id" false 1800 3600 18000 36000 true 1 1800 true false false 8
在这里,我们使用了一个复杂的函数ExecTasks来处理任务的并行排队,并让您可以控制正在发生的事情,而不必担心由于某些挂起的任务而阻塞脚本.
Here we used a complex function ExecTasks that will deal with parallel queueing the tasks, and let you keep control of what's going on without fear to block the script because of some hanged task.
ExecTasks参数的快速说明:
Quick explanation of ExecTasks arguments:
"$tasklist" = variable containing task list
"some name" trivial task id (in order to identify in logs)
boolean: read tasks from file (you may have passed a task list from a file if there are too many to fit in a variable
1800 = maximum number of seconds a task may be executed before a warning is raised
3600 = maximum number of seconds a task may be executed before an error is raised and the tasks is stopped
18000 = maximum number of seconds the whole tasks may be executed before a warning is raised
36000 = maximum number of seconds the whole tasks may be executed before an error is raised and all the tasks are stopped
boolean: account execution time since beginning of tasks execution (true) or since script begin
1 = number of seconds between each state check (accepts float like .1)
1800 = Number of seconds between each "i am alive" log just to know everything works as expected
boolean: show spinner (true) or not (false)
boolean: log errors when reaching max times (false) or do not log them (true)
boolean: do not log any errors at all (false) or do log them (true)
And finally
8 = number of simultaneous tasks to launch (8 in our case)
这是exectasks.sh的源代码(您也可以将粘贴直接复制到脚本头中,而不是 source ./exectasks.sh
):
Here's the source to exectasks.sh (which you can also copy paste directly into your script header instead of source ./exectasks.sh
):
function Logger {
# Dummy log function, replace with whatever you need
echo "$2: $1"
}
# Nice cli spinner so we now execution is ongoing
_OFUNCTIONS_SPINNER="|/-\\"
function Spinner {
printf " [%c] \b\b\b\b\b\b" "$_OFUNCTIONS_SPINNER"
_OFUNCTIONS_SPINNER=${_OFUNCTIONS_SPINNER#?}${_OFUNCTIONS_SPINNER%%???}
return 0
}
# Portable child (and grandchild) kill function tester under Linux, BSD and MacOS X
function KillChilds {
local pid="${1}" # Parent pid to kill childs
local self="${2:-false}" # Should parent be killed too ?
# Paranoid checks, we can safely assume that $pid should not be 0 nor 1
if [ $(IsInteger "$pid") -eq 0 ] || [ "$pid" == "" ] || [ "$pid" == "0" ] || [ "$pid" == "1" ]; then
Logger "Bogus pid given [$pid]." "CRITICAL"
return 1
fi
if kill -0 "$pid" > /dev/null 2>&1; then
if children="$(pgrep -P "$pid")"; then
if [[ "$pid" == *"$children"* ]]; then
Logger "Bogus pgrep implementation." "CRITICAL"
children="${children/$pid/}"
fi
for child in $children; do
Logger "Launching KillChilds \"$child\" true" "DEBUG" #__WITH_PARANOIA_DEBUG
KillChilds "$child" true
done
fi
fi
# Try to kill nicely, if not, wait 15 seconds to let Trap actions happen before killing
if [ "$self" == true ]; then
# We need to check for pid again because it may have disappeared after recursive function call
if kill -0 "$pid" > /dev/null 2>&1; then
kill -s TERM "$pid"
Logger "Sent SIGTERM to process [$pid]." "DEBUG"
if [ $? -ne 0 ]; then
sleep 15
Logger "Sending SIGTERM to process [$pid] failed." "DEBUG"
kill -9 "$pid"
if [ $? -ne 0 ]; then
Logger "Sending SIGKILL to process [$pid] failed." "DEBUG"
return 1
fi # Simplify the return 0 logic here
else
return 0
fi
else
return 0
fi
else
return 0
fi
}
function ExecTasks {
# Mandatory arguments
local mainInput="${1}" # Contains list of pids / commands separated by semicolons or filepath to list of pids / commands
# Optional arguments
local id="${2:-base}" # Optional ID in order to identify global variables from this run (only bash variable names, no '-'). Global variables are WAIT_FOR_TASK_COMPLETION_$id and HARD_MAX_EXEC_TIME_REACHED_$id
local readFromFile="${3:-false}" # Is mainInput / auxInput a semicolon separated list (true) or a filepath (false)
local softPerProcessTime="${4:-0}" # Max time (in seconds) a pid or command can run before a warning is logged, unless set to 0
local hardPerProcessTime="${5:-0}" # Max time (in seconds) a pid or command can run before the given command / pid is stopped, unless set to 0
local softMaxTime="${6:-0}" # Max time (in seconds) for the whole function to run before a warning is logged, unless set to 0
local hardMaxTime="${7:-0}" # Max time (in seconds) for the whole function to run before all pids / commands given are stopped, unless set to 0
local counting="${8:-true}" # Should softMaxTime and hardMaxTime be accounted since function begin (true) or since script begin (false)
local sleepTime="${9:-.5}" # Seconds between each state check. The shorter the value, the snappier ExecTasks will be, but as a tradeoff, more cpu power will be used (good values are between .05 and 1)
local keepLogging="${10:-1800}" # Every keepLogging seconds, an alive message is logged. Setting this value to zero disables any alive logging
local spinner="${11:-true}" # Show spinner (true) or do not show anything (false) while running
local noTimeErrorLog="${12:-false}" # Log errors when reaching soft / hard execution times (false) or do not log errors on those triggers (true)
local noErrorLogsAtAll="${13:-false}" # Do not log any errros at all (useful for recursive ExecTasks checks)
# Parallelism specific arguments
local numberOfProcesses="${14:-0}" # Number of simulanteous commands to run, given as mainInput. Set to 0 by default (WaitForTaskCompletion mode). Setting this value enables ParallelExec mode.
local auxInput="${15}" # Contains list of commands separated by semicolons or filepath fo list of commands. Exit code of those commands decide whether main commands will be executed or not
local maxPostponeRetries="${16:-3}" # If a conditional command fails, how many times shall we try to postpone the associated main command. Set this to 0 to disable postponing
local minTimeBetweenRetries="${17:-300}" # Time (in seconds) between postponed command retries
local validExitCodes="${18:-0}" # Semi colon separated list of valid main command exit codes which will not trigger errors
local i
# Expand validExitCodes into array
IFS=';' read -r -a validExitCodes <<< "$validExitCodes"
# ParallelExec specific variables
local auxItemCount=0 # Number of conditional commands
local commandsArray=() # Array containing commands
local commandsConditionArray=() # Array containing conditional commands
local currentCommand # Variable containing currently processed command
local currentCommandCondition # Variable containing currently processed conditional command
local commandsArrayPid=() # Array containing commands indexed by pids
local commandsArrayOutput=() # Array containing command results indexed by pids
local postponedRetryCount=0 # Number of current postponed commands retries
local postponedItemCount=0 # Number of commands that have been postponed (keep at least one in order to check once)
local postponedCounter=0
local isPostponedCommand=false # Is the current command from a postponed file ?
local postponedExecTime=0 # How much time has passed since last postponed condition was checked
local needsPostponing # Does currentCommand need to be postponed
local temp
# Common variables
local pid # Current pid working on
local pidState # State of the process
local mainItemCount=0 # number of given items (pids or commands)
local readFromFile # Should we read pids / commands from a file (true)
local counter=0
local log_ttime=0 # local time instance for comparaison
local seconds_begin=$SECONDS # Seconds since the beginning of the script
local exec_time=0 # Seconds since the beginning of this function
local retval=0 # return value of monitored pid process
local subRetval=0 # return value of condition commands
local errorcount=0 # Number of pids that finished with errors
local pidsArray # Array of currently running pids
local newPidsArray # New array of currently running pids for next iteration
local pidsTimeArray # Array containing execution begin time of pids
local executeCommand # Boolean to check if currentCommand can be executed given a condition
local functionMode
local softAlert=false # Does a soft alert need to be triggered, if yes, send an alert once
local failedPidsList # List containing failed pids with exit code separated by semicolons (eg : 2355:1;4534:2;2354:3)
local randomOutputName # Random filename for command outputs
local currentRunningPids # String of pids running, used for debugging purposes only
# fnver 2019081401
# Initialise global variable
eval "WAIT_FOR_TASK_COMPLETION_$id=\"\""
eval "HARD_MAX_EXEC_TIME_REACHED_$id=false"
# Init function variables depending on mode
if [ $numberOfProcesses -gt 0 ]; then
functionMode=ParallelExec
else
functionMode=WaitForTaskCompletion
fi
if [ $readFromFile == false ]; then
if [ $functionMode == "WaitForTaskCompletion" ]; then
IFS=';' read -r -a pidsArray <<< "$mainInput"
mainItemCount="${#pidsArray[@]}"
else
IFS=';' read -r -a commandsArray <<< "$mainInput"
mainItemCount="${#commandsArray[@]}"
IFS=';' read -r -a commandsConditionArray <<< "$auxInput"
auxItemCount="${#commandsConditionArray[@]}"
fi
else
if [ -f "$mainInput" ]; then
mainItemCount=$(wc -l < "$mainInput")
readFromFile=true
else
Logger "Cannot read main file [$mainInput]." "WARN"
fi
if [ "$auxInput" != "" ]; then
if [ -f "$auxInput" ]; then
auxItemCount=$(wc -l < "$auxInput")
else
Logger "Cannot read aux file [$auxInput]." "WARN"
fi
fi
fi
if [ $functionMode == "WaitForTaskCompletion" ]; then
# Force first while loop condition to be true because we don't deal with counters but pids in WaitForTaskCompletion mode
counter=$mainItemCount
fi
# soft / hard execution time checks that needs to be a subfunction since it is called both from main loop and from parallelExec sub loop
function _ExecTasksTimeCheck {
if [ $spinner == true ]; then
Spinner
fi
if [ $counting == true ]; then
exec_time=$((SECONDS - seconds_begin))
else
exec_time=$SECONDS
fi
if [ $keepLogging -ne 0 ]; then
# This log solely exists for readability purposes before having next set of logs
if [ ${#pidsArray[@]} -eq $numberOfProcesses ] && [ $log_ttime -eq 0 ]; then
log_ttime=$exec_time
Logger "There are $((mainItemCount-counter+postponedItemCount)) / $mainItemCount tasks in the queue of which $postponedItemCount are postponed. Currently, ${#pidsArray[@]} tasks running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
fi
if [ $(((exec_time + 1) % keepLogging)) -eq 0 ]; then
if [ $log_ttime -ne $exec_time ]; then # Fix when sleep time lower than 1 second
log_ttime=$exec_time
if [ $functionMode == "WaitForTaskCompletion" ]; then
Logger "Current tasks still running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
elif [ $functionMode == "ParallelExec" ]; then
Logger "There are $((mainItemCount-counter+postponedItemCount)) / $mainItemCount tasks in the queue of which $postponedItemCount are postponed. Currently, ${#pidsArray[@]} tasks running with pids [$(joinString , ${pidsArray[@]})]." "NOTICE"
fi
fi
fi
fi
if [ $exec_time -gt $softMaxTime ]; then
if [ "$softAlert" != true ] && [ $softMaxTime -ne 0 ] && [ $noTimeErrorLog != true ]; then
Logger "Max soft execution time [$softMaxTime] exceeded for task [$id] with pids [$(joinString , ${pidsArray[@]})]." "WARN"
softAlert=true
SendAlert true
fi
fi
if [ $exec_time -gt $hardMaxTime ] && [ $hardMaxTime -ne 0 ]; then
if [ $noTimeErrorLog != true ]; then
Logger "Max hard execution time [$hardMaxTime] exceeded for task [$id] with pids [$(joinString , ${pidsArray[@]})]. Stopping task execution." "ERROR"
fi
for pid in "${pidsArray[@]}"; do
KillChilds $pid true
if [ $? -eq 0 ]; then
Logger "Task with pid [$pid] stopped successfully." "NOTICE"
else
if [ $noErrorLogsAtAll != true ]; then
Logger "Could not stop task with pid [$pid]." "ERROR"
fi
fi
errorcount=$((errorcount+1))
done
if [ $noTimeErrorLog != true ]; then
SendAlert true
fi
eval "HARD_MAX_EXEC_TIME_REACHED_$id=true"
if [ $functionMode == "WaitForTaskCompletion" ]; then
return $errorcount
else
return 129
fi
fi
}
function _ExecTasksPidsCheck {
newPidsArray=()
if [ "$currentRunningPids" != "$(joinString " " ${pidsArray[@]})" ]; then
Logger "ExecTask running for pids [$(joinString " " ${pidsArray[@]})]." "DEBUG"
currentRunningPids="$(joinString " " ${pidsArray[@]})"
fi
for pid in "${pidsArray[@]}"; do
if [ $(IsInteger $pid) -eq 1 ]; then
if kill -0 $pid > /dev/null 2>&1; then
# Handle uninterruptible sleep state or zombies by ommiting them from running process array (How to kill that is already dead ? :)
pidState="$(eval $PROCESS_STATE_CMD)"
if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
# Check if pid hasn't run more than soft/hard perProcessTime
pidsTimeArray[$pid]=$((SECONDS - seconds_begin))
if [ ${pidsTimeArray[$pid]} -gt $softPerProcessTime ]; then
if [ "$softAlert" != true ] && [ $softPerProcessTime -ne 0 ] && [ $noTimeErrorLog != true ]; then
Logger "Max soft execution time [$softPerProcessTime] exceeded for pid [$pid]." "WARN"
if [ "${commandsArrayPid[$pid]}]" != "" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]]." "WARN"
fi
softAlert=true
SendAlert true
fi
fi
if [ ${pidsTimeArray[$pid]} -gt $hardPerProcessTime ] && [ $hardPerProcessTime -ne 0 ]; then
if [ $noTimeErrorLog != true ] && [ $noErrorLogsAtAll != true ]; then
Logger "Max hard execution time [$hardPerProcessTime] exceeded for pid [$pid]. Stopping command execution." "ERROR"
if [ "${commandsArrayPid[$pid]}]" != "" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]]." "WARN"
fi
fi
KillChilds $pid true
if [ $? -eq 0 ]; then
Logger "Command with pid [$pid] stopped successfully." "NOTICE"
else
if [ $noErrorLogsAtAll != true ]; then
Logger "Could not stop command with pid [$pid]." "ERROR"
fi
fi
errorcount=$((errorcount+1))
if [ $noTimeErrorLog != true ]; then
SendAlert true
fi
fi
newPidsArray+=($pid)
fi
else
# pid is dead, get its exit code from wait command
wait $pid
retval=$?
# Check for valid exit codes
if [ $(ArrayContains $retval "${validExitCodes[@]}") -eq 0 ]; then
if [ $noErrorLogsAtAll != true ]; then
Logger "${FUNCNAME[0]} called by [$id] finished monitoring pid [$pid] with exitcode [$retval]." "ERROR"
if [ "$functionMode" == "ParallelExec" ]; then
Logger "Command was [${commandsArrayPid[$pid]}]." "ERROR"
fi
if [ -f "${commandsArrayOutput[$pid]}" ]; then
Logger "Truncated output:\n$(head -c16384 "${commandsArrayOutput[$pid]}")" "ERROR"
fi
fi
errorcount=$((errorcount+1))
# Welcome to variable variable bash hell
if [ "$failedPidsList" == "" ]; then
failedPidsList="$pid:$retval"
else
failedPidsList="$failedPidsList;$pid:$retval"
fi
else
Logger "${FUNCNAME[0]} called by [$id] finished monitoring pid [$pid] with exitcode [$retval]." "DEBUG"
fi
fi
fi
done
# hasPids can be false on last iteration in ParallelExec mode
pidsArray=("${newPidsArray[@]}")
# Trivial wait time for bash to not eat up all CPU
sleep $sleepTime
}
while [ ${#pidsArray[@]} -gt 0 ] || [ $counter -lt $mainItemCount ] || [ $postponedItemCount -ne 0 ]; do
_ExecTasksTimeCheck
retval=$?
if [ $retval -ne 0 ]; then
return $retval;
fi
# The following execution bloc is only needed in ParallelExec mode since WaitForTaskCompletion does not execute commands, but only monitors them
if [ $functionMode == "ParallelExec" ]; then
while [ ${#pidsArray[@]} -lt $numberOfProcesses ] && ([ $counter -lt $mainItemCount ] || [ $postponedItemCount -ne 0 ]); do
_ExecTasksTimeCheck
retval=$?
if [ $retval -ne 0 ]; then
return $retval;
fi
executeCommand=false
isPostponedCommand=false
currentCommand=""
currentCommandCondition=""
needsPostponing=false
if [ $readFromFile == true ]; then
# awk identifies first line as 1 instead of 0 so we need to increase counter
currentCommand=$(awk 'NR == num_line {print; exit}' num_line=$((counter+1)) "$mainInput")
if [ $auxItemCount -ne 0 ]; then
currentCommandCondition=$(awk 'NR == num_line {print; exit}' num_line=$((counter+1)) "$auxInput")
fi
# Check if we need to fetch postponed commands
if [ "$currentCommand" == "" ]; then
currentCommand=$(awk 'NR == num_line {print; exit}' num_line=$((postponedCounter+1)) "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedMain.$id.$SCRIPT_PID.$TSTAMP")
currentCommandCondition=$(awk 'NR == num_line {print; exit}' num_line=$((postponedCounter+1)) "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedAux.$id.$SCRIPT_PID.$TSTAMP")
isPostponedCommand=true
fi
else
currentCommand="${commandsArray[$counter]}"
if [ $auxItemCount -ne 0 ]; then
currentCommandCondition="${commandsConditionArray[$counter]}"
fi
if [ "$currentCommand" == "" ]; then
currentCommand="${postponedCommandsArray[$postponedCounter]}"
currentCommandCondition="${postponedCommandsConditionArray[$postponedCounter]}"
isPostponedCommand=true
fi
fi
# Check if we execute postponed commands, or if we delay them
if [ $isPostponedCommand == true ]; then
# Get first value before '@'
postponedExecTime="${currentCommand%%@*}"
postponedExecTime=$((SECONDS-postponedExecTime))
# Get everything after first '@'
temp="${currentCommand#*@}"
# Get first value before '@'
postponedRetryCount="${temp%%@*}"
# Replace currentCommand with actual filtered currentCommand
currentCommand="${temp#*@}"
# Since we read a postponed command, we may decrase postponedItemCounter
postponedItemCount=$((postponedItemCount-1))
#Since we read one line, we need to increase the counter
postponedCounter=$((postponedCounter+1))
else
postponedRetryCount=0
postponedExecTime=0
fi
if ([ $postponedRetryCount -lt $maxPostponeRetries ] && [ $postponedExecTime -ge $minTimeBetweenRetries ]) || [ $isPostponedCommand == false ]; then
if [ "$currentCommandCondition" != "" ]; then
Logger "Checking condition [$currentCommandCondition] for command [$currentCommand]." "DEBUG"
eval "$currentCommandCondition" &
ExecTasks $! "subConditionCheck" false 0 0 1800 3600 true $SLEEP_TIME $KEEP_LOGGING true true true
subRetval=$?
if [ $subRetval -ne 0 ]; then
# is postponing enabled ?
if [ $maxPostponeRetries -gt 0 ]; then
Logger "Condition [$currentCommandCondition] not met for command [$currentCommand]. Exit code [$subRetval]. Postponing command." "NOTICE"
postponedRetryCount=$((postponedRetryCount+1))
if [ $postponedRetryCount -ge $maxPostponeRetries ]; then
Logger "Max retries reached for postponed command [$currentCommand]. Skipping command." "NOTICE"
else
needsPostponing=true
fi
postponedExecTime=0
else
Logger "Condition [$currentCommandCondition] not met for command [$currentCommand]. Exit code [$subRetval]. Ignoring command." "NOTICE"
fi
else
executeCommand=true
fi
else
executeCommand=true
fi
else
needsPostponing=true
fi
if [ $needsPostponing == true ]; then
postponedItemCount=$((postponedItemCount+1))
if [ $readFromFile == true ]; then
echo "$((SECONDS-postponedExecTime))@$postponedRetryCount@$currentCommand" >> "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedMain.$id.$SCRIPT_PID.$TSTAMP"
echo "$currentCommandCondition" >> "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}-postponedAux.$id.$SCRIPT_PID.$TSTAMP"
else
postponedCommandsArray+=("$((SECONDS-postponedExecTime))@$postponedRetryCount@$currentCommand")
postponedCommandsConditionArray+=("$currentCommandCondition")
fi
fi
if [ $executeCommand == true ]; then
Logger "Running command [$currentCommand]." "DEBUG"
randomOutputName=$(date '+%Y%m%dT%H%M%S').$(PoorMansRandomGenerator 5)
eval "$currentCommand" >> "$RUN_DIR/$PROGRAM.${FUNCNAME[0]}.$id.$pid.$randomOutputName.$SCRIPT_PID.$TSTAMP" 2>&1 &
pid=$!
pidsArray+=($pid)
commandsArrayPid[$pid]="$currentCommand"
commandsArrayOutput[$pid]="$RUN_DIR/$PROGRAM.${FUNCNAME[0]}.$id.$pid.$randomOutputName.$SCRIPT_PID.$TSTAMP"
# Initialize pid execution time array
pidsTimeArray[$pid]=0
else
Logger "Skipping command [$currentCommand]." "DEBUG"
fi
if [ $isPostponedCommand == false ]; then
counter=$((counter+1))
fi
_ExecTasksPidsCheck
done
fi
_ExecTasksPidsCheck
done
# Return exit code if only one process was monitored, else return number of errors
# As we cannot return multiple values, a global variable WAIT_FOR_TASK_COMPLETION contains all pids with their return value
eval "WAIT_FOR_TASK_COMPLETION_$id=\"$failedPidsList\""
if [ $mainItemCount -eq 1 ]; then
return $retval
else
return $errorcount
fi
}
希望你玩得开心.
这篇关于重击多个文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!