Regular Expressions: Processing a Batch of lrc Listening Files

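I think I could start a "roommate requests" series of posts ^_^

Because of work, my roommate often has to process text files, and editing them by hand gets tedious. So once again I get a chance to show off some code!

Seeing code I wrote in ten-odd minutes being used in real work, and saving someone a lot of time, makes me ridiculously smug!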

Requirements

This time the task is to process a batch of lrc files:

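  • Convert the time format from [minutes:seconds] to plain seconds
  • Put both a start time and an end time in front of each sentence
  • Write the output as txt files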
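
To make the first requirement concrete, here is a minimal sketch of the minutes-to-seconds conversion. The helper name `to_seconds` is made up for this illustration. It uses `Decimal` so that trailing zeros such as `76.60` survive the conversion.

```python
from decimal import Decimal


def to_seconds(stamp):
    """Convert an 'mm:ss.xx' string to a string of seconds (hypothetical helper)."""
    minutes, seconds = stamp.split(":")
    return str(60 * Decimal(minutes) + Decimal(seconds))


print(to_seconds("00:09.32"))  # 9.32
print(to_seconds("01:16.60"))  # 76.60
```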

Text to process

[00:00.12]section 2.
[00:02.68]you will part of a radio programme about the opening of a new local sports shop.
[00:09.32]first you have some time to look at questions 11 to 16.
[00:39.64]now listen carefully and answer questions 11 to 16.
[00:48.24]now we go to Jane who is going to tell us about what's happening in town this weekend.
[00:52.24]right,thanks Andrew,
[00:53.92]and now on to what's new,
[00:56.48]and do we really need yet another sports shop in Bradcaster?
[01:01.24]well,most of you probably know Sports World-
[01:04.44]the branch of a Danish sports goods company that opened a few years ago-
[01:09.04]it's attracted a lot of custom,
[01:11.36]and so the company has now decided to open another branch in the area.
[01:16.60]it's going to be in the shopping centre to the west of Bradcaster,
[01:20.44]so that will be good news for all of you who've found the original shop in the north of the town hard to get to.
[01:27.12]i was invited to a special preview
[01:29.60]and i can promise you,this is the ultimate in sports retailing.
......

Target format

0.12      2.68   section 2.
2.68      9.32   you will part of a radio programme about the opening of a new local sports shop.
9.32     39.64   first you have some time to look at questions 11 to 16.
39.64    48.24   now listen carefully and answer questions 11 to 16.
48.24    52.24   now we go to Jane who is going to tell us about what's happening in town this weekend.
52.24    53.92   right,thanks Andrew,
53.92    56.48   and now on to what's new,
56.48    61.24   and do we really need yet another sports shop in Bradcaster?
61.24    64.44   well,most of you probably know Sports World-
64.44    69.04   the branch of a Danish sports goods company that opened a few years ago-
69.04    71.36   it's attracted a lot of custom,
71.36    76.60   and so the company has now decided to open another branch in the area.
76.60    80.44   it's going to be in the shopping centre to the west of Bradcaster,
80.44    87.12   so that will be good news for all of you who've found the original shop in the north of the town hard to get to.
87.12    89.60   i was invited to a special preview
89.60    94.88   and i can promise you,this is the ultimate in sports retailing.
......

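Approach

A problem like this is naturally solved with regular expressions. Comparing the target format with the input format, we see that each output line depends on two adjacent input lines (the end time of a sentence is the start time of the next one), whereas regular expressions usually work on one line at a time.

I turned the problem into one of iterating over a list, followed by a match-and-replace. It's all in the code, so just read the code.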
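
The list idea can be sketched roughly as follows: collect every timestamp and its text with one `findall` over the whole file content, then pair each entry with its successor using `zip`. This is a simplified illustration, not the script itself. The names `LINE_RE` and `pair_lines` and the sample string are assumptions made for the example.

```python
import re
from decimal import Decimal

# Matches "[mm:ss.xx]text" on each line (MULTILINE makes ^ and $ work per line).
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}\.\d{2})\](.*)$", re.MULTILINE)


def pair_lines(content):
    """Return 'start end text' rows; the last line has no end time and is dropped."""
    entries = [(str(60 * Decimal(m) + Decimal(s)), text)
               for m, s, text in LINE_RE.findall(content)]
    # zip(entries, entries[1:]) yields each entry together with the next one.
    return ["{:<6s}  {:>6s}   {}".format(start, end, text)
            for (start, text), (end, _) in zip(entries, entries[1:])]


sample = "[00:00.12]section 2.\n[00:02.68]you will\n[01:01.24]well\n"
for row in pair_lines(sample):
    print(row)
```

Capturing the text together with its timestamp means that lines without a timestamp are simply skipped instead of shifting the pairing.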

Solution (Python version)

#!/usr/bin/env python
# encoding: utf-8
import re
import os
import sys
from decimal import Decimal as D

def filter_filetype(path, filetype):
    '''Return the names of the regular files in path that end with filetype.'''
    filetype = filetype.lower()
    filenames = [filename for filename in os.listdir(path)
                 if os.path.isfile(os.path.join(path, filename))]  # regular files only
    return [filename for filename in filenames if filename.lower().endswith(filetype)]

def min_sec_str2sec_str(min_sec_str):
    '''Convert "minute:second" to seconds, e.g. "01:16.60" -> "76.60".'''
    (m, s) = min_sec_str.split(":")
    sec = 60 * D(m) + D(s)
    return str(sec)

def format_file_time(filename):
    # Matches a timestamp such as [01:16.60] and captures "01:16.60"
    thetime = re.compile(r"\[(\d{2}:.{5})\]")
    with open(filename, "r") as input_file:
        content = input_file.read()
    # match is a list: ['00:00.12', ...]
    match = thetime.findall(content)
    sec_strs = [min_sec_str2sec_str(min_sec_str) for min_sec_str in match]
    # Pair each start time with the next one. The last timestamp has no
    # successor, so the last line gets no prefix and is dropped.
    new_format = ["{:<6s}  {:>6s}   ".format(start, end)
                  for start, end in zip(sec_strs, sec_strs[1:])]
    # Strip only the final extension, so dots elsewhere in the path are safe
    out_filename = os.path.splitext(filename)[0] + "_output.txt"
    with open(filename, "r") as input_file, open(out_filename, "w") as output_file:
        # The i-th line gets the i-th prefix; zip stops before the last line
        for line, prefix in zip(input_file, new_format):
            output_file.write(thetime.sub(prefix, line))

if __name__ == "__main__":
    path = sys.argv[1]
    filetype = sys.argv[2]
    for filename in filter_filetype(path, filetype):
        # Join with the directory: listdir returns bare file names
        full_path = os.path.join(path, filename)
        print(full_path)
        format_file_time(full_path)

Usage: python3 two_line_time.py PATH lrc (PATH is the directory containing the lrc files to process. Please use an absolute path!)
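
A note on the absolute-path advice: if an output name is derived by cutting the path at its first `.`, a relative path such as `./lrc/a.lrc` yields an empty stem, and a directory name containing a dot causes similar trouble. The comparison below is only a small sketch. `output_name` is a hypothetical helper, and `os.path.splitext` is the standard-library call that splits off just the final extension.

```python
import os


def output_name(path):
    """Build '<name>_output.txt' next to the input file (hypothetical helper)."""
    root, _ext = os.path.splitext(path)
    return root + "_output.txt"


# Cutting at the first dot loses the whole relative path:
print("./lrc/a.lrc".split(".")[0] + "_output.txt")  # _output.txt
# splitext removes only the trailing ".lrc":
print(output_name("./lrc/a.lrc"))                   # ./lrc/a_output.txt
```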

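My roommate says this code could be quite useful: most of the lrc files you can download online are in this format, and they need this conversion before the software can read them.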

Tested and working on both OS X and Ubuntu.

If you have a use for it, help yourself~



