Text Topic Classifier: Using sets of relevant and irrelevant training text files, this tool trains and runs classifiers to decide if text files are about a given topic or not.

Requires Python 3 and Scikit-learn.

Written by Kamran Karimi. This software is released under the GNU General Public License (GPL) v3.

For more information please refer to the paper "Classifying domain-specific text documents containing ambiguous keywords", Database (Oxford), https://doi.org/10.1093/database/baab062

Sample data files are included in the package. Each Python file includes sample command line arguments in its main comment section.

Files:
1) text_topic_classifier.py creates a number of classifiers using sklearn libraries. Positive (relevant) texts are assigned the class 1, while negative (irrelevant) texts are assigned the class 0. Please see the examples at the top of the file for how to generate and save classifiers.

2) single_classify.py uses one of the classifiers created by text_topic_classifier.py to assign a class (0 or 1) to text contained in a single file. The class is the returned value of the script.

3) bach_classify.py uses one of the classifiers created by text_topic_classifier.py to assign a class to each file in a set of text files. The names of the positive and negative files are saved in separate text files. Optionally, the positive and negative files themselves can also be copied to separate directories.

4) LiteratureLoader.java is sample code to show how single_classify.py can be called from other programs as part of a pipeline.
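
As a rough Python counterpart to the pipeline idea in item 4, the sketch below runs a classifier command as a child process and reads the class from its exit status. It assumes that the "returned value" mentioned in item 2 is the process exit status. The helper name run_classifier is hypothetical, and a stand-in child process is used in place of single_classify.py, whose real command line arguments are listed in that file's comment section and are not reproduced here.

```python
import subprocess
import sys


def run_classifier(command):
    """Run a classifier command and return its exit status.

    Assumption: the child process reports the class (0 or 1)
    through its exit status.
    """
    completed = subprocess.run(command)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in child process that simply exits with status 1.
    # In a real pipeline this list would hold the single_classify.py
    # command line taken from the examples at the top of that file.
    stand_in = [sys.executable, "-c", "import sys; sys.exit(1)"]
    label = run_classifier(stand_in)
    print("relevant" if label == 1 else "irrelevant")
```

Passing the command as a list avoids shell quoting problems when file names contain spaces.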

Example data are from NCBI's PubMed database, and contain papers that are relevant to echinoderms as well as papers that are not.
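
To illustrate the labeling convention used by text_topic_classifier.py (class 1 for relevant texts, class 0 for irrelevant ones), here is a minimal scikit-learn sketch. The TF-IDF features and logistic regression model are assumptions chosen for brevity, and may differ from what the script actually uses. The short texts are made up for illustration and are not taken from the sample data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy texts for illustration only.
relevant_texts = [
    "sea urchin embryo development and gene expression",
    "starfish larvae regeneration in echinoderms",
]
irrelevant_texts = [
    "stock market prices fell sharply this quarter",
    "new smartphone models feature faster processors",
]

# Relevant texts get class 1, irrelevant texts get class 0.
texts = relevant_texts + irrelevant_texts
labels = [1] * len(relevant_texts) + [0] * len(irrelevant_texts)

# Assumed model: TF-IDF word features feeding a logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify a new text; the result is an array holding 0 or 1.
predictions = model.predict(["sea urchin gene regulation"])
print(predictions)
```

With only four training texts the prediction is not meaningful. The point is the shape of the workflow: label, fit, then predict.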