I have a bash script that cuts out a section of a logfile between two timestamps, but because of the size of the files it takes quite a while to run.
If I rewrote the script in Perl, could I achieve a significant speed increase, or would I have to move to something like C to accomplish this?
#!/bin/bash
if [ $# -ne 3 ]; then
echo "USAGE $0 <logfile(s)> <from date (epoch)> <to date (epoch)>"
exit 1
fi
LOGFILES=$1
FROM=$2
TO=$3
rm -f /tmp/getlogs??????
TEMP=`mktemp /tmp/getlogsXXXXXX`
## LOGS NEED TO BE LISTED CHRONOLOGICALLY
ls -lnt $LOGFILES|awk '{print $8}' > $TEMP
LOGFILES=`tac $TEMP`
cp /dev/null $TEMP
findEntry() {
RETURN=0
dt=$1
fil=$2
ln1=$3
ln2=$4
t1=`tail -n+$ln1 $fil|head -n1|cut -c1-15`
dt1=`date -d "$t1" +%s`
t2=`tail -n+$ln2 $fil|head -n1|cut -c1-15`
dt2=`date -d "$t2" +%s`
if [ $dt -ge $dt2 ]; then
mid=$ln2
else
mid=$(( (($ln2-$ln1)*($dt-$dt1)/($dt2-$dt1))+$ln1 ))
fi
t3=`tail -n+$mid $fil|head -n1|cut -c1-15`
dt3=`date -d "$t3" +%s`
# finished
if [ $dt -eq $dt3 ]; then
# FOUND IT (scroll back to the first match)
while [ $dt -eq $dt3 ]; do
mid=$(( $mid-1 ))
t3=`tail -n+$mid $fil|head -n1|cut -c1-15`
dt3=`date -d "$t3" +%s`
done
RETURN=$(( $mid+1 ))
return
fi
if [ $(( $mid-1 )) -eq $ln1 ] || [ $(( $ln2-1)) -eq $mid ]; then
# FOUND NEAR IT
RETURN=$mid
return
fi
# not finished yet
if [ $dt -lt $dt3 ]; then
# too high
findEntry $dt $fil $ln1 $mid
else
if [ $dt -ge $dt3 ]; then
# too low
findEntry $dt $fil $mid $ln2
fi
fi
}
# Check timestamps on logfiles
LOGS=""
for LOG in $LOGFILES; do
filetime=`ls -ln $LOG|awk '{print $6,$7}'`
timestamp=`date -d "$filetime" +%s`
if [ $timestamp -ge $FROM ]; then
LOGS="$LOGS $LOG"
fi
done
# Check first and last dates in LOGS to refine further
for LOG in $LOGS; do
if [ ${LOG%.gz} != $LOG ]; then
gunzip -c $LOG > $TEMP
else
cp $LOG $TEMP
fi
t=`head -n1 $TEMP|cut -c1-15`
FIRST=`date -d "$t" +%s`
t=`tail -n1 $TEMP|cut -c1-15`
LAST=`date -d "$t" +%s`
if [ $TO -lt $FIRST ] || [ $FROM -gt $LAST ]; then
# This file is entirely out of range
cp /dev/null $TEMP
else
if [ $FROM -le $FIRST ]; then
if [ $TO -ge $LAST ]; then
# Entire file is within range
cat $TEMP
else
# Last part of file is out of range
STARTLINENUMBER=1
ENDLINENUMBER=`wc -l<$TEMP`
findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
head -n$RETURN $TEMP
fi
else
if [ $TO -ge $LAST ]; then
# First part of file is out of range
STARTLINENUMBER=1
ENDLINENUMBER=`wc -l<$TEMP`
findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
tail -n+$RETURN $TEMP
else
# range is entirely within this logfile
STARTLINENUMBER=1
ENDLINENUMBER=`wc -l<$TEMP`
findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
n1=$RETURN
findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
n2=$RETURN
tail -n+$n1 $TEMP|head -n$(( $n2-$n1 ))
fi
fi
fi
done
rm -f /tmp/getlogs??????
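The `mid=$(( ... ))` line inside `findEntry` is doing interpolation search: it guesses how far into the current line window the target timestamp falls, assuming timestamps grow roughly linearly across the file. A minimal sketch of just that arithmetic, with made-up endpoint values:

```shell
#!/bin/sh
# Interpolation-search guess, same formula as findEntry; all values invented.
ln1=1    ; ln2=1000    # line-number window being searched
dt1=1000 ; dt2=2000    # epoch timestamps found on those two lines
dt=1250                # epoch timestamp we are looking for
mid=$(( ((ln2 - ln1) * (dt - dt1) / (dt2 - dt1)) + ln1 ))
echo "$mid"            # guesses a line about a quarter of the way in
```

Because the guess is proportional rather than a plain midpoint, it converges in very few probes when timestamps are evenly spread, which is why the script is usable at all despite forking several processes per probe.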
Updated script based on Brent's comment: This one is untested.
#!/usr/bin/perl
use strict;
use warnings;
my %months = (
jan => 1, feb => 2, mar => 3, apr => 4,
may => 5, jun => 6, jul => 7, aug => 8,
sep => 9, oct => 10, nov => 11, dec => 12,
);
while ( my $line = <> ) {
my $ts = substr $line, 0, 15;
next if parse_date($ts) lt '0201100543';
last if parse_date($ts) gt '0715123456';
print $line;
}
sub parse_date {
my ($month, $day, $time) = split ' ', $_[0];
my ($hour, $min, $sec) = split /:/, $time;
return sprintf(
'%2.2d%2.2d%2.2d%2.2d%2.2d',
$months{lc $month}, $day,
$hour, $min, $sec,
);
}
__END__
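The string comparisons (`lt`/`gt`) above are only correct because parse_date emits a fixed-width, zero-padded MMDDhhmmss key, so lexical order matches chronological order. The same normalization can be sketched in awk over an inline sample; the sample lines and the two cutoff keys below are invented for illustration:

```shell
#!/bin/sh
# Month-name -> zero-padded MMDDhhmmss key, as parse_date does, in awk.
result=$(awk '
BEGIN {
    n = split("jan feb mar apr may jun jul aug sep oct nov dec", m, " ")
    for (i = 1; i <= n; i++) month[m[i]] = i
}
{
    split($3, t, ":")
    key = sprintf("%02d%02d%02d%02d%02d", month[tolower($1)], $2, t[1], t[2], t[3])
    # fixed width + zero padding is what makes string comparison correct
    if (key >= "0201100543" && key <= "0715123456") print
}' <<'EOF'
Jan 15 08:00:00 host too early
Mar 03 12:30:45 host in range
Aug 01 00:00:00 host too late
EOF
)
echo "$result"
```

Note this key omits the year, just as the Perl version does, so a log window spanning New Year would need an extra field.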
Previous answer for reference: What is the format of the file? Here is a short script which assumes the first column is a timestamp and prints only lines that have timestamps in a certain range. It also assumes that the timestamps are sorted. On my system, it took about a second to filter 900,000 lines out of a million:
#!/usr/bin/perl
use strict;
use warnings;
while ( <> ) {
my ($ts) = split;
next if $ts < 1247672719;
last if $ts > 1252172093;
print;
}
__END__
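The speed difference here comes less from Perl versus bash per se than from process counts: every probe in the bash script forks a `tail|head|cut` pipeline plus a `date` call, while the Perl loop is one process streaming the file once. A rough single-process equivalent of the numeric filter above, run on a throwaway synthetic log (the path and epoch bounds are invented):

```shell
#!/bin/sh
# One streaming pass, no per-line forks: the shape of the Perl answer above.
tmp=$(mktemp)
seq 1247672710 1247672730 > "$tmp"     # fake log: one epoch stamp per line
kept=$(awk '$1 >= 1247672719 && $1 <= 1252172093' "$tmp" | wc -l)
rm -f "$tmp"
echo "$kept"
```

Any single-process tool (awk, sed, Perl) will show a similar win over the fork-heavy bash approach; moving to C would mostly buy back only the remaining interpreter overhead.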
This concludes the article "Is Perl faster than bash?"; we hope the answers above are helpful.