基于改进Tri-training算法的中文问句分类

Chinese Question Classification Based on Improved Tri-training Algorithm

  • Abstract: The original Tri-training algorithm draws three training sets from the labeled data by random sampling and uses them to train three classifiers. Training sets formed this way can contain very different numbers of examples per category, and the resulting class imbalance lowers classification accuracy. This paper improves Tri-training by replacing the random sampling step with classification sampling (sampling by category), builds a classification model on the improved algorithm, and runs classification experiments on the HIT (Harbin Institute of Technology) Chinese question set and on the expanded question set built for this paper. Compared with the original Tri-training algorithm on the same data sets, the improved algorithm adapts well and achieves clearly higher classification accuracy. Moderately enlarging the training set and the pool of unlabeled samples strengthens the generalization ability of the classifier and thereby improves accuracy further.
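The sampling change described in the abstract can be illustrated with a small sketch. The abstract does not give the exact procedure, so this assumes one plausible reading of classification sampling: a bootstrap drawn separately inside each category, so that every one of the three training sets keeps the class proportions of the labeled data. The function names `stratified_bootstrap` and `make_training_sets` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def stratified_bootstrap(samples, labels, rng):
    """Resample with replacement inside each class, preserving class sizes.

    Assumption: this per-class bootstrap is only one possible reading of the
    paper's "classification sampling"; the abstract does not specify it.
    """
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    resampled = []
    for y, items in by_class.items():
        # Draw len(items) examples with replacement from this class only,
        # so the class keeps exactly its original share of the training set.
        resampled.extend((rng.choice(items), y) for _ in items)

    rng.shuffle(resampled)
    return resampled


def make_training_sets(samples, labels, n_sets=3, seed=0):
    """Build the initial training sets for the three Tri-training classifiers."""
    rng = random.Random(seed)
    return [stratified_bootstrap(samples, labels, rng) for _ in range(n_sets)]
```

With a plain bootstrap over the whole labeled set, a rare category can shrink or vanish from one of the three training sets. Sampling per category keeps its count fixed, which is the imbalance problem the abstract targets.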

     
