
ACM SIGMOD Japan Chapter Lecture (Tutorial)

「Text Classification without Labeled Negative Documents」

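Speaker: Prof. Jeffrey Xu YU (Chinese University of Hong Kong)
Date and time: December 7, 2005 (Fri), 10:30 - 11:30
Venue: Institute of Industrial Science, The University of Tokyo, Conference Room A (Ew-501)
Organizers: Co-hosted by the Database Society of Japan and the ACM SIGMOD Japan Chapter
Chapter Chair: Hiroyuki Kitagawa (北川 博之)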

Abstract

This talk presents a new solution for the problem of building a text
classifier with a small set of labeled positive documents (P) and a
large set of unlabeled documents (U). Here, the unlabeled documents
are a mixture of positive and negative documents. In other
words, no document is labeled as negative. This makes the task of
building a reliable text classifier challenging. In general,
existing approaches proceed in two steps: i) extract the negative
documents (N) from U; and ii) build a classifier based on P and
N. However, none of the reported studies tries to further extract
any positive documents from U. In fact, intuitively, extracting
positive documents from U will increase the reliability of the
classifier. However, extracting positive documents from U is
difficult. A document in U that possesses the features exhibited in
P is not necessarily a positive document, and vice versa.
Extracting positive documents is also a delicate operation, because
the extracted positive samples may become noise.
The very large size of U and the very high diversity of its documents
also contribute to the difficulty of extracting any positive
documents. Thus, we propose a partition-based heuristic which
aims at extracting both the positive and negative documents in
U. Extensive experiments on three benchmarks were
conducted. The favorable results indicate that our proposed heuristic
significantly outperforms all of the existing approaches, especially
when the size of P is extremely small.
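
The generic two-step approach mentioned in the abstract can be sketched as
follows. This is only an illustration under stated assumptions, not the
partition-based heuristic presented in the talk: the cosine-similarity
ranking, the `fraction` cutoff, the nearest-centroid classifier, and all
function names and toy documents are hypothetical choices made here.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Sum term counts; the scale is irrelevant because we compare by cosine."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_negatives(positive, unlabeled, fraction=0.5):
    """Step i (assumed rule): take the documents in U least similar to P as N."""
    p_centroid = centroid(positive)
    ranked = sorted(unlabeled, key=lambda v: cosine(v, p_centroid))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def build_classifier(positive, negative):
    """Step ii (assumed model): nearest-centroid classifier built from P and N."""
    p_centroid, n_centroid = centroid(positive), centroid(negative)

    def classify(text):
        v = vectorize(text)
        return cosine(v, p_centroid) > cosine(v, n_centroid)

    return classify


# Toy data, invented for illustration only.
P = [vectorize(d) for d in ["database query index", "query optimizer index"]]
U = [vectorize(d) for d in [
    "database index tuning",
    "football match score",
    "goal score league match",
    "query plan database",
]]
N = extract_negatives(P, U, fraction=0.5)
classify = build_classifier(P, N)
```

Note that this baseline discards the positive documents hidden in U (here,
the two database-related ones), which is exactly the limitation the abstract
points out in existing approaches.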

Eligibility: Anyone is welcome to attend. Admission is free.