Mixed encoding content converter

The title may be vague and cryptic, so let me explain.

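Yesterday I used youtube-dl to download JJ Lin's song 江南 from Kuwo Music. The downloaded file was in APE format, which I had never seen before. VLC threw a pile of errors when I opened it, so I listened to it with mpv instead: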

$ mpv 江南-93157.ape
Playing: 江南-93157.ape
 (+) Audio --aid=1 (ape)
File tags:
 Artist: ÁÖ¿¡½Ü
 Album: ½­ÄÏ
 Genre: R&B
 Title: ½­ÄÏ
 Track: 2
AO: [pulse] 44100Hz stereo 2ch s16
A: 00:00:01 / 00:04:27 (0%)

Everything worked, except that the metadata came out as mojibake. (I later found out that the tags are GB2312.) The ffprobe output looks much the same:

$ ffprobe ~/Music/songs/江南-93157.ape
ffprobe version 2.7.1 Copyright (c) 2007-2015 the FFmpeg developers
  built with gcc 5.1.0 (GCC)
  configuration: --prefix=/usr --disable-debug --disable-static --disable-stripping --enable-avisynth --enable-avresample --enable-fontconfig --enable-gnutls --enable-gpl --enable-libass --enable-libbluray --enable-libfreetype --enable-libfribidi --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-libschroedinger --enable-libspeex --enable-libssh --enable-libtheora --enable-libv4l2 --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-shared --enable-version3 --enable-x11grab
  libavutil      54. 27.100 / 54. 27.100
  libavcodec     56. 41.100 / 56. 41.100
  libavformat    56. 36.100 / 56. 36.100
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 16.101 /  5. 16.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.100 /  1.  2.100
  libpostproc    53.  3.100 / 53.  3.100
Input #0, ape, from '江南-93157.ape':
  Metadata:
    title           : ½­ÄÏ
    artist          : ÁÖ¿¡½Ü
    album           : ½­ÄÏ
    encoded_by      : Exact Audio Copy   (Secure mode)
    track           : 2
    genre           : R&B
    date            : 2004
  Duration: 00:04:27.95, start: 0.000000, bitrate: 907 kb/s
    Stream #0:0: Audio: ape (APE  / 0x20455041), 44100 Hz, stereo, s16p

My guess was that ffmpeg was to blame. First, let's confirm which format the metadata is in:

$ file 江南-93157.ape
江南-93157.ape: Audio file with ID3 version 2.3.0, contains: Monkey's Audio compressed format version 3970 with fast compression, stereo, sample rate 44100

OK, so it is ID3v2, version 2.3. An ID3v2 header starts with "ID3", so I searched the ffmpeg source for that string:

$ grep -r '"ID3"'
libavformat/asfdec_o.c:            else if (!strcmp(name, "ID3")) // handle ID3 tag
libavformat/asfdec_f.c:        } else if (!strcmp(key, "ID3")) { // handle ID3 tag
libavformat/id3v2.h: * Default magic bytes for ID3v2 header: "ID3"
libavformat/id3v2.h:#define ID3v2_DEFAULT_MAGIC "ID3"

asfdec_f.c and asfdec_o.c don't sound like the files that decode APE, but the ID3 handling should be much the same. Opening asfdec_o.c:

            else if (!strcmp(name, "ID3")) // handle ID3 tag
                get_id3_tag(s, val_len);

Tracing it step by step, the call chain turns out to be:

  1. get_id3_tag() in libavformat/asfdec_o.c
  2. ff_id3v2_read() in libavformat/id3v2.c (steps 3 to 5 are in this file too)
  3. id3v2_read_internal()
  4. id3v2_parse()
  5. read_ttag()

Inside read_ttag(), the first thing it does is call decode_str():

    if (decode_str(s, pb, encoding, &dst, &taglen) < 0) {
        av_log(s, AV_LOG_ERROR, "Error reading frame %s, skipped\n", key);
        return;
    }

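The encoding argument passed to decode_str() is read from the byte just before the string. In the file I downloaded that byte is 0, which means ISO-8859-1. The strings are clearly not ISO-8859-1, though, so they show up as mojibake.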

    case ID3v2_ENCODING_ISO8859:
        while (left && ch) {
            ch = avio_r8(pb);
            PUT_UTF8(ch, tmp, avio_w8(dynbuf, tmp);)
            left--;
        }
        break;

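The key here is PUT_UTF8, which converts UTF-32 to UTF-8 (see libavutil/common.h). Since each byte read from the tag is treated as one code point, in theory all it takes is to convert the UTF-8 back to UTF-32, write each code point out as a single byte again, and then display those bytes as GB2312.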
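To make the round trip concrete, here is a minimal Python sketch of it. It is not the author's script: the function names are made up for illustration, it only imitates the ISO-8859-1 loop shown above (ignoring the early stop on a NUL byte), and it assumes the tag bytes really are GB2312.

```python
# Illustrative sketch only; decode_as_iso8859_1 and repair are hypothetical names.

# The bytes actually stored in the tag for the title 江南, encoded as GB2312.
GB2312_TITLE = "江南".encode("gb2312")


def decode_as_iso8859_1(raw: bytes) -> bytes:
    """Treat every byte as one code point and emit UTF-8, like the PUT_UTF8 loop.

    Unlike the C loop, this does not stop at a NUL byte.
    """
    return "".join(chr(b) for b in raw).encode("utf-8")


def repair(utf8_mojibake: bytes, real_encoding: str = "gb2312") -> str:
    """Undo the step above: UTF-8 -> code points -> original bytes -> real text."""
    code_points = utf8_mojibake.decode("utf-8")
    # Every code point is below U+0100, so Latin-1 maps each back to one byte.
    original_bytes = code_points.encode("latin-1")
    return original_bytes.decode(real_encoding)


mojibake = decode_as_iso8859_1(GB2312_TITLE)
print(repr(mojibake.decode("utf-8")))  # the garbled title as ffprobe prints it
print(repair(mojibake))                # the title after reversing the conversion
```

The repair only works because the ISO-8859-1 path is lossless: every byte value 0x00 to 0xFF maps to exactly one code point and back.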

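There is one catch, though. In the ffprobe output, the line containing "江南-93157.ape" is correct UTF-8, while the garbled parts are GB2312, so the whole output can't be displayed with a single encoding. My approach is to process each line separately.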
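A rough sketch of the per-line idea, using only the Python standard library. This is a simplification and not the author's converter: fix_line is a hypothetical helper, and the real encoding is hard-coded to GB2312 here instead of being detected.

```python
# Illustrative sketch only; fix_line is a hypothetical helper name.

def fix_line(line: str, real_encoding: str = "gb2312") -> str:
    """Repair one line of already-decoded UTF-8 output if it looks mis-decoded."""
    try:
        # Mojibake produced by an ISO-8859-1 decode only contains code points
        # below U+0100, so it can be turned back into the original bytes.
        raw = line.encode("latin-1")
    except UnicodeEncodeError:
        # The line holds real non-Latin-1 characters (e.g. 江南): leave it alone.
        return line
    if raw.isascii():
        # Plain ASCII is the same in every encoding involved here.
        return line
    try:
        return raw.decode(real_encoding)
    except UnicodeDecodeError:
        # Not valid in the assumed encoding: keep the line unchanged.
        return line


sample = [
    "Input #0, ape, from '江南-93157.ape':",
    "    title           : \u00bd\u00ad\u00c4\u00cf",
    "    genre           : R&B",
]
for line in sample:
    print(fix_line(line))
```

A line of genuine Latin-1 text that also happens to be valid GB2312 would be converted wrongly by this sketch, which is one reason to detect the encoding and check its confidence before converting.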

I had recently heard about chardet, a Python module that detects encodings automatically, so I gave it a try. The final code is on GitLab.

Here is the result. Done!

$ ffprobe ~/Music/songs/江南-93157.ape |& convert_encoding.py
Encoding: GB2312 with confidence 0.99

ffprobe version 2.7.1 Copyright (c) 2007-2015 the FFmpeg developers
  built with gcc 5.1.0 (GCC)
  configuration: --prefix=/usr --disable-debug --disable-static --disable-stripping --enable-avisynth --enable-avresample --enable-fontconfig --enable-gnutls --enable-gpl --enable-libass --enable-libbluray --enable-libfreetype --enable-libfribidi --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-libschroedinger --enable-libspeex --enable-libssh --enable-libtheora --enable-libv4l2 --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-shared --enable-version3 --enable-x11grab
  libavutil      54. 27.100 / 54. 27.100
  libavcodec     56. 41.100 / 56. 41.100
  libavformat    56. 36.100 / 56. 36.100
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 16.101 /  5. 16.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.100 /  1.  2.100
  libpostproc    53.  3.100 / 53.  3.100
Input #0, ape, from '江南-93157.ape':
  Metadata:
    title           : 江南
    artist          : 林俊杰
    album           : 江南
    encoded_by      : Exact Audio Copy   (Secure mode)
    track           : 2
    genre           : R&B
    date            : 2004
  Duration: 00:04:27.95, start: 0.000000, bitrate: 907 kb/s
    Stream #0:0: Audio: ape (APE  / 0x20455041), 44100 Hz, stereo, s16p
