Nutch: fixing garbled characters in content data

A note on Nutch*1.

When SegmentReader dumps content data, pages in any encoding other than UTF-8 come out garbled.

The arguments passed to SegmentReader look like this.

-dump crawl/segments/[0-9]{14} crawl/segments/[0-9]{14}/dumped_text -nofetch -noparsetext -noparse -nogenerate -noparsedata

A stopgap fix

In the constructor of the HttpResponse class, find this part:

readPlainContent(in);

String contentEncoding = getHeader(Response.CONTENT_ENCODING);
if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
  content = http.processGzipEncoded(content, url);
} else if ("deflate".equals(contentEncoding)) {
  content = http.processDeflateEncoded(content, url);
} else {
  if (Http.LOG.isTraceEnabled()) {
    Http.LOG.trace("fetched " + content.length + " bytes from " + url);
  }
}

Immediately after that block, add the following call:

analyzeEncoding(content);

That is the one line to add.
The method it calls is below. (Revised on 09/07/06 because the earlier version had a problem.)

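  // Needs: import org.mozilla.universalchardet.UniversalDetector;
  private void analyzeEncoding(byte[] contentIn) throws IOException {
    UniversalDetector detector = new UniversalDetector(null);

    // Feed the raw bytes to the detector chunk by chunk until it is confident.
    // Reading straight from the array avoids a stream mark that large pages could invalidate.
    int offset = 0;
    while (offset < contentIn.length && !detector.isDone()) {
      int len = Math.min(Http.BUFFER_SIZE, contentIn.length - offset);
      detector.handleData(contentIn, offset, len);
      offset += len;
    }
    detector.dataEnd();

    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
      encoding = "UTF-8";  // fall back to UTF-8 when nothing was detected
    }
    detector.reset();

    // Decode with the detected charset, then re-encode explicitly as UTF-8.
    // getBytes() with no argument would use the platform default charset instead.
    content = new String(contentIn, encoding).getBytes("UTF-8");
  }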

This probably solves it.
Encoding detection is done with juniversalchardet*2.
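
As a rough sketch of the conversion step on its own, here is a standard-library-only example. It leaves the detector out and passes the source charset in by hand, so the class name, the method name and the ISO-8859-1 sample are all made up for illustration and are not part of Nutch or juniversalchardet.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
  // Decode the bytes with the charset they were written in,
  // then re-encode the resulting text as UTF-8.
  static byte[] toUtf8(byte[] raw, String sourceCharset) {
    String text = new String(raw, Charset.forName(sourceCharset));
    return text.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "café" encoded as ISO-8859-1: the last byte (0xE9) is not valid UTF-8.
    byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

    // Reading the raw bytes as UTF-8 garbles the last character.
    System.out.println(new String(latin1, StandardCharsets.UTF_8));

    // Converting first keeps the text intact.
    byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
    System.out.println(new String(utf8, StandardCharsets.UTF_8));
  }
}
```

The point is the explicit charset on both sides: one for decoding, one for encoding. Leaving either to the platform default is how the garbling tends to creep back in.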

What did not work

readPlainContent(in);

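Detecting the encoding inside this method and storing the converted bytes makes the gzip handling that follows fail with an error, presumably because the bytes are still compressed at that point and converting them corrupts the gzip stream.
So doing the encoding detection after the gzip step seems to be the right order.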
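
The ordering problem can be reproduced without Nutch. The sketch below is only an illustration: the class and method names are invented, and ISO-8859-1 stands in for whatever charset a detector might guess for compressed bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipOrderSketch {
  static byte[] gzip(byte[] data) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bos.toByteArray();
  }

  static byte[] gunzip(byte[] data) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = gz.read(buf)) != -1) {
        bos.write(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bos.toByteArray();
  }

  // Treats the bytes as ISO-8859-1 text and re-encodes them as UTF-8.
  static byte[] transcode(byte[] raw) {
    return new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
  }

  // True if the bytes can still be decompressed as gzip.
  static boolean canGunzip(byte[] data) {
    try {
      gunzip(data);
      return true;
    } catch (UncheckedIOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] compressed = gzip("hello, nutch".getBytes(StandardCharsets.UTF_8));

    // Right order: decompress first, then convert the text.
    System.out.println(new String(transcode(gunzip(compressed)), StandardCharsets.UTF_8));

    // Wrong order: converting the compressed bytes breaks the gzip header.
    System.out.println("still gzip after transcoding? " + canGunzip(transcode(compressed)));
  }
}
```

The second byte of the gzip magic number (0x8B) is not ASCII, so re-encoding it as UTF-8 turns it into two bytes and the stream no longer starts with a valid header.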

By the way

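With this change the content data is no longer garbled, but the parsed data is.
Maybe that is because the parser decides the encoding by looking at the charset declared in the HTML?
Using the same detector as above there might fix it.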

But I don't use the parsed data, so I'm ignoring that.
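
If the guess about the parser is right, the effect would look roughly like the sketch below. This is not Nutch's parser, just a made-up stand-in: the stored bytes are already UTF-8, but they get decoded with the charset the page originally declared (ISO-8859-1 in this example).

```java
import java.nio.charset.StandardCharsets;

class DoubleDecodeSketch {
  // Simulates a parser that trusts the charset declared in the page
  // even though the stored bytes have already been converted to UTF-8.
  static String decodeAsDeclared(byte[] storedUtf8) {
    return new String(storedUtf8, StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    byte[] stored = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

    // Decoded with the stale declared charset: the two UTF-8 bytes of the
    // last character show up as two separate characters.
    System.out.println(decodeAsDeclared(stored));

    // Decoded as what the bytes really are now.
    System.out.println(new String(stored, StandardCharsets.UTF_8));
  }
}
```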

Or rather

Instead of doing all this, it might have been quicker to just look at the Content-Type in the meta tag.
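
A minimal sketch of that idea, assuming the page uses an ASCII-compatible encoding so the tag itself is readable before the real charset is known. The class name, the regular expression and the 2048-byte limit are my own choices for illustration, not anything Nutch provides.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSketch {
  // Matches charset=... inside a <meta> tag, with or without quotes.
  private static final Pattern META_CHARSET = Pattern.compile(
      "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-:.]+)",
      Pattern.CASE_INSENSITIVE);

  // Returns the charset name declared in a meta tag, or null if none is found.
  static String sniffCharset(byte[] content) {
    // Tag syntax and charset names are ASCII, so decoding the head of the
    // page as ISO-8859-1 is enough to search it.
    int len = Math.min(content.length, 2048);
    String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
    Matcher m = META_CHARSET.matcher(head);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    String html = "<html><head><meta http-equiv=\"Content-Type\" "
        + "content=\"text/html; charset=Shift_JIS\"></head><body></body></html>";
    System.out.println(sniffCharset(html.getBytes(StandardCharsets.US_ASCII)));
  }
}
```

A declared charset can be missing or wrong, so falling back to a detector when this returns null is probably still worthwhile.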

*1:http://lucene.apache.org/nutch/

*2:http://www.void.in/wiki/Universalchardet