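This is a work in progress, and I am looking for advice from people who know more than I do (computers are my hobby, not my profession).
This script organizes TV show directories: it renames each file to the convention s01e01.title of episode.ext and creates a symbolic link with the original name.
I enjoy writing this, and I don't want others to spend too much time on it. I think my biggest stumbling blocks right now are:
Fixing the 'a href' that shows up in the grep+cut output
Using awk to grab the correct text block from the wiki (according to season)
(Also, if anything looks inefficient, please let me know, I am still learning.)
I have searched these forums left and right and keep making progress. I have gone through a great many questions similar to mine. (These forums are 100% of the reason I have gotten this far.)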

## Find show name and season (directories nested: /show/season)
show1=$(cd .. ; pwd)
show="${show1##*/}"
season=("${PWD##*/}")

IFS=$'\n'

## Download list of episodes for given season
wget -q -O- --header\="Accept-Encoding: gzip" https://en.wikipedia.org/wiki/List_of_$show\_episodes | gunzip > tmp.html

## Working on first awk/sed command to grab textblock of only specific season
## grep command works great, except when episode is hyperlinked ('a href' tag gets cut)
if [ "$season" == 'Season 1' ]; then
        listing=( $(awk '/\(season_1\)/,/rellink/' tmp.html | grep "summary.*[\"]<" | cut -d'"' -f6) )
        unset IFS
elif [ "$season" == 'Season 2' ]; then
        listing=( $(awk '/\(season_2\)/,/rellink/' tmp.html | grep "summary.*[\"]<" | cut -d'"' -f6) )
        unset IFS
#..........................continued 20 times or so
fi

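I have tweaked the code above so many times that the second half below will have to be finished afterwards, but it previously worked about 90% of the time. The only problem was that if an episode title is hyperlinked on the Wikipedia page, the file gets named s01e05.ahref=.mkv (because of the cut).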
## Parse filename for season/episode descriptor
## Rename file with season/episode and name from wikipedia database
for file in *
do
    name=$(ls "$file" | grep -o "S[0-9][0-9]E[0-9][0-9]")
    episode=$(ls "$file" | grep -o "E[0-9][0-9]")
        if [ "$episode" == 'E01' ]; then
                mv "$file" "$name.${listing[0]}.mkv"
                ln -s "$name.${listing[0]}.mkv" "$file"
                echo "Renamed '$file' and created a symbolic link."
        #..........................continued
        fi
done

Best answer

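I agree with the comments that bash is not the right tool for parsing web pages or HTML. But if you have already started and want to finish it in bash, it is not impossible. Looking at your code, I like your use of bash substitution and globbing, but I was a little unsure how the pieces fit together, so I wrote a simple version of my own that you can hopefully plug in or build on.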

#!/bin/bash

show="Archer"
url="http://en.wikipedia.org/wiki/List_of_${show}_episodes"

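while read -r line; do
  # A season heading resets the episode counter and advances the season counter.
  [[ $line =~ "<h3><span class=\"mw-headline\" id=\"Season" ]] && episode= && ((season+=1))
  # A summary cell holds the quoted episode title; capture what is inside the quotes.
  if [[ $line =~ "<td class=\"summary\" style=\"text-align: left;\">\""(.*)"\"" ]]; then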

    title="${BASH_REMATCH[1]}"
    [[ "$title" =~ "title=\""(.*)"\"" ]] && title="${BASH_REMATCH[1]}"
    title="${title%%\"*}"
    title="$(echo ${title/($show)/})"

    echo "Season [$season] Episode [$((episode+=1))] Title [$title]"
  fi
done < <(wget -qO- "$url")

Sample output (also tested with Scrubs and Simpsons, with correct results):
Season [1] Episode [1] Title [Mole Hunt]
Season [1] Episode [2] Title [Training Day]
Season [1] Episode [3] Title [Diversity Hire]
Season [1] Episode [4] Title [Killing Utne]
Season [1] Episode [5] Title [Honeypot]
Season [1] Episode [6] Title [Skorpio]
Season [1] Episode [7] Title [Skytanic]
Season [1] Episode [8] Title [The Rock]
Season [1] Episode [9] Title [Job Offer]
Season [1] Episode [10] Title [Dial M for Mother]
Season [2] Episode [1] Title [Swiss Miss]
Season [2] Episode [2] Title [A Going Concern]
Season [2] Episode [3] Title [Blood Test]
Season [2] Episode [4] Title [Pipeline Fever]
Season [2] Episode [5] Title [The Double Deuce]
Season [2] Episode [6] Title [Tragical History]
Season [2] Episode [7] Title [Movie Star]
Season [2] Episode [8] Title [Stage Two]
Season [2] Episode [9] Title [Placebo Effect]
Season [2] Episode [10] Title [El Secuestro]
Season [2] Episode [11] Title [Jeu Monégasque]
Season [2] Episode [12] Title [White Nights]
Season [2] Episode [13] Title [Double Trouble]
Season [3] Episode [1] Title [Heart of Archness: Part I]
Season [3] Episode [2] Title [Heart of Archness: Part II]
Season [3] Episode [3] Title [Heart of Archness: Part III]
Season [3] Episode [4] Title [The Man from Jupiter]
Season [3] Episode [5] Title [El Contador]
Season [3] Episode [6] Title [The Limited]
Season [3] Episode [7] Title [Drift Problem]
Season [3] Episode [8] Title [Lo Scandalo]
Season [3] Episode [9] Title [Bloody Ferlin]
Season [3] Episode [10] Title [Crossing Over]
Season [3] Episode [11] Title [Skin Game]
Season [3] Episode [12] Title [Space Race]
Season [3] Episode [13] Title [Space Race]
Season [4] Episode [1] Title [Fugue and Riffs]
Season [4] Episode [2] Title [The Wind Cries Mary]
Season [4] Episode [3] Title [Legs]
Season [4] Episode [4] Title [Midnight Ron]
Season [4] Episode [5] Title [Viscous Coupling]
Season [4] Episode [6] Title [Once Bitten]
Season [4] Episode [7] Title [Live and Let Dine]
Season [4] Episode [8] Title [Coyote Lovely]
Season [4] Episode [9] Title [The Honeymooners]
Season [4] Episode [10] Title [Un Chien Tangerine]
Season [4] Episode [11] Title [The Papal Chase]
Season [4] Episode [12] Title [Sea Tunt: Part I]
Season [4] Episode [13] Title [Sea Tunt: Part II]
Season [5] Episode [1] Title [White Elephant]
Season [5] Episode [2] Title [Archer Vice: A Kiss While Dying]
Season [5] Episode [3] Title [Archer Vice: A Debt of Honor]
Season [5] Episode [4] Title [Archer Vice: House Call]
Season [5] Episode [5] Title [Archer Vice: Southbound and Down]
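
If the goal is the s01e01.title naming convention from the question, lines in the format above could be turned into file name stems with one more regex match. This is only a minimal sketch: `line_re` and `episode_stem` are hypothetical names that are not part of the script above, and the file extension is left for the caller to append.

```shell
#!/bin/bash
# Hypothetical helper: turns one "Season [n] Episode [m] Title [t]" line
# into a stem such as s01e05.Honeypot.
line_re='^Season \[([0-9]+)\] Episode \[([0-9]+)\] Title \[(.*)\]$'

episode_stem() {
  [[ $1 =~ $line_re ]] || return 1
  # 10# forces base 10 so that a value like 08 is not read as invalid octal.
  printf 's%02de%02d.%s\n' \
    "$((10#${BASH_REMATCH[1]}))" "$((10#${BASH_REMATCH[2]}))" "${BASH_REMATCH[3]}"
}

episode_stem 'Season [1] Episode [5] Title [Honeypot]'   # prints: s01e05.Honeypot
```

Keeping the regex in a variable avoids quoting the brackets inside `[[ ... ]]`.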

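Explanation:
I find BASH_REMATCH useful in a lot of situations, for example when you have to match a substring and don't want to work out some crazy regular expression.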
BASH_REMATCH
  An array variable whose members are assigned by the =~ binary operator to the [[ conditional  command.   The  element
  with  index  0  is the portion of the string matching the entire regular expression.  The element with index n is the
  portion of the string matching the nth parenthesized subexpression.  This variable is read-only.
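
To make the index 0 versus index n distinction concrete, here is a small sketch that pulls the season and episode numbers out of a file name. The helper name `parse_tag` and the sample file name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: prints the SxxEyy tag plus the season and episode numbers.
parse_tag() {
  local file=$1
  if [[ $file =~ S([0-9][0-9])E([0-9][0-9]) ]]; then
    # [0] is the whole match, [1] and [2] are the parenthesized groups.
    # 10# forces base 10 so that 08 and 09 are not rejected as octal.
    echo "${BASH_REMATCH[0]} $((10#${BASH_REMATCH[1]})) $((10#${BASH_REMATCH[2]}))"
  else
    return 1
  fi
}

parse_tag "Archer.S01E05.720p.mkv"   # prints: S01E05 1 5
```

The numeric episode value could then serve as an array index (minus one) instead of a long chain of `if` tests, assuming the titles are stored in episode order.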

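Beyond that, as you pointed out, the main problem is that the title format can vary. So I run a second BASH_REMATCH match on title for the a href case (when the link has a title attribute), and I strip trailing text in the odd cases (when the episode has not aired yet). There may be other cases, but this worked for all 3 shows I tested.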
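
As a variation on the same idea, the hyperlinked case could also be handled by stripping every tag from the cell instead of reading the title attribute. This is a sketch under assumptions, not the method used in the script above: `extract_title`, `summary_re` and `tag_re` are hypothetical names, and the sample lines assume one summary cell per line in the layout that script matches, which may differ on the live page.

```shell
#!/bin/bash
# Hypothetical helper: prints the episode title from one summary-cell line.
summary_re='<td class="summary"[^>]*>(.*)</td>'
tag_re='<[^>]*>'

extract_title() {
  local line=$1 cell
  [[ $line =~ $summary_re ]] || return 1
  cell=${BASH_REMATCH[1]}
  # Remove one tag per pass, so a hyperlinked title leaves only the link text.
  while [[ $cell =~ $tag_re ]]; do
    cell=${cell/"${BASH_REMATCH[0]}"/}
  done
  # Keep only the text between the first pair of double quotes.
  cell=${cell#*\"}
  cell=${cell%%\"*}
  printf '%s\n' "$cell"
}

extract_title '<td class="summary" style="text-align: left;">"Mole Hunt"</td>'
extract_title '<td class="summary" style="text-align: left;">"<a href="/wiki/Skytanic" title="Skytanic">Skytanic</a>"</td>'
```

With these two sample lines the calls print Mole Hunt and Skytanic. Taking the text between the first pair of quotes also drops anything that trails the closing quote.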

Regarding "regex - Workaround for when piping a few lines that differ from the grep pattern to cut?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21372175/
