Establishing meaningful text classification categories depends on the goals of the project or the question being asked. Are relevant categories known already, or do they need to be established?

Curated Categories (my provisional term until I find the appropriate equivalent in current use) are established by subject matter experts. These categories can be constructed from:

  • A controlled keyword vocabulary (the National Library of Medicine's MeSH categories, for example)
  • Disciplinary conventions or shared best practices (the Dublin Core metadata standard, for example)
  • Prior empirical studies
  • The dataset itself, using de novo iterative coding. This process is described further here.

Topic Modeling is a group of techniques for identifying groups of words (the “topics”) that best summarize the information in a collection of documents. Topic modeling is an iterative, generative process that does not require prior knowledge of the documents or of any relationships among them.


Topic Modeling Methods to Identify Features

Latent Dirichlet Allocation

LDA is a common method for generating topic-model features. LDA is an iterative model that starts from a fixed number of topics. Each topic is represented as a distribution over words, and each document is in turn represented as a distribution over topics. Although the topics themselves are unlabeled, the word distributions that define them give a sense of the different ideas contained in the documents.
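
A minimal sketch of LDA feature generation, using scikit-learn's LatentDirichletAllocation (the corpus and topic count here are illustrative assumptions, not drawn from any particular project):

```python
# Illustrative LDA sketch; the corpus and number of topics are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cell membrane regulates transport of ions",
    "ion channels control membrane potential in neurons",
    "the court ruled on the contract dispute",
    "the judge dismissed the breach of contract claim",
]

# LDA operates on raw term counts (bag of words), not tf-idf weights.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# The number of topics is fixed in advance; the model infers the distributions.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row: one document's distribution over topics

# Each topic is a distribution over words; list its top terms.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```

The rows of `doc_topics` are the topic-distribution features for each document and can be fed to a downstream classifier.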

Brown Clustering

Brown clustering is a hierarchical clustering algorithm applied to text. It groups words into clusters that are assumed to be semantically related by virtue of having appeared in similar contexts.

In natural language processing, Brown clustering (or IBM clustering) is a form of hierarchical clustering of words based on the contexts in which they occur. The method builds a class-based language model (also called a cluster n-gram model), in which the probability of a word is conditioned on the class (cluster) of the preceding word; conditioning on classes rather than on individual words addresses the data sparsity problem inherent in language modeling.

Brown clustering groups items (i.e., word types) into classes using a binary merging criterion based on the log-probability of a text under the class-based language model, i.e., a probability model that takes the clustering into account: at each step, the two classes whose merge least reduces that probability are combined.
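
A deliberately brute-force sketch of that merging criterion (production implementations, such as Percy Liang's brown-cluster, use incremental updates and a limited window of active clusters; the function names and toy corpus here are mine, not from any library):

```python
# Toy Brown clustering: greedily merge the pair of clusters whose merge
# least reduces the average mutual information (AMI) of the class bigram
# model. Every candidate merge is re-scored from scratch, so this is
# only feasible on tiny inputs.
from collections import Counter
from itertools import combinations
import math

def average_mutual_information(bigrams, assign):
    """AMI of the class bigram model induced by the word->cluster map."""
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        joint[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log((n / total) / ((left[c1] / total) * (right[c2] / total)))
        for (c1, c2), n in joint.items()
    )

def brown_clusters(tokens, k):
    """Merge clusters until k remain, keeping AMI as high as possible."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    assign = {w: w for w in set(tokens)}  # start with one cluster per word type
    clusters = set(assign.values())
    while len(clusters) > k:
        best_pair, best_ami = None, -math.inf
        for a, b in combinations(sorted(clusters), 2):
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            ami = average_mutual_information(bigrams, trial)
            if ami > best_ami:
                best_pair, best_ami = (a, b), ami
        a, b = best_pair
        assign = {w: (a if c == b else c) for w, c in assign.items()}
        clusters = set(assign.values())
    return assign

tokens = "the cat sat on the mat the dog sat on the rug".split()
print(brown_clusters(tokens, 3))  # words in similar contexts share a cluster
```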


Latent Semantic Analysis

LSA is a natural language processing technique for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to both. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix of word counts per paragraph (rows represent unique words, columns represent paragraphs) is constructed from a large body of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Words are then compared by taking the cosine of the angle between the vectors formed by any two rows (equivalently, the dot product of the two normalized vectors). Values close to 1 indicate very similar words; values close to 0 indicate very dissimilar words.
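
A small numerical sketch of exactly this procedure, using plain NumPy (the passages and words are illustrative):

```python
# Build a word-by-passage count matrix, reduce it with a truncated SVD,
# and compare words by the cosine of their reduced row vectors.
import numpy as np

passages = [
    "the doctor examined the patient",
    "the physician treated the patient",
    "the ship sailed across the ocean",
]
vocab = sorted({w for p in passages for w in p.split()})
row = {w: i for i, w in enumerate(vocab)}

# Rows = unique words, columns = passages, entries = raw counts.
A = np.zeros((len(vocab), len(passages)))
for j, p in enumerate(passages):
    for w in p.split():
        A[row[w], j] += 1

# Truncated SVD: keep only the top k singular dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
W = U[:, :k] * s[:k]  # reduced word vectors, one row per word

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "doctor" and "physician" never co-occur, but their passages share
# context words, so the truncation pulls them together.
print(cosine(W[row["doctor"]], W[row["physician"]]))  # close to 1
print(cosine(W[row["doctor"]], W[row["ocean"]]))      # close to 0
```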

In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI).

LSA can use a term-document matrix that describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents. A typical weighting of the matrix elements is tf-idf (term frequency–inverse document frequency): the weight of an element is proportional to the number of times the term appears in the document, with rare terms upweighted to reflect their relative importance.
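
A minimal sketch of this pipeline with tf-idf weighting, using scikit-learn (note that scikit-learn orients the matrix documents-by-terms, the transpose of the layout described above; the documents and the choice of two dimensions are illustrative):

```python
# Tf-idf term-document matrix followed by truncated SVD: the standard
# LSA/LSI reduction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "gene expression in yeast cells",
    "protein folding in yeast",
    "stock prices and market volatility",
    "bond yields and market risk",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # sparse document-term matrix, tf-idf weighted

svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)  # each document as a 2-dimensional concept vector
print(Z.round(2))  # the two yeast documents should land near each other
```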

Applications of LSA include:

  • Comparing documents in low-dimensional space (data clustering, document classification).
  • Finding relations between terms (synonymy and polysemy). Synonymy is the phenomenon where different words describe the same idea. Polysemy is the phenomenon where the same word has multiple meanings.
  • Given a query of terms, translating it into the low-dimensional space and finding matching documents (information retrieval); see the sketch after this list.
  • Analyzing word associations in a text corpus.

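A sketch of the retrieval application (the third item above), assuming the same scikit-learn pipeline; the corpus and query are illustrative:

```python
# LSA retrieval: fold a query into the fitted low-dimensional space and
# rank documents by cosine similarity there.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "gene expression in yeast cells",
    "protein folding in yeast",
    "stock prices and market volatility",
    "bond yields and market risk",
]
tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(tfidf.fit_transform(docs))  # documents in concept space

# Project the query with the already-fitted vectorizer and SVD (no refitting).
q = svd.transform(tfidf.transform(["yeast protein expression"]))
scores = cosine_similarity(q, Z)[0]
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.2f}  {docs[i]}")  # yeast documents should rank first
```
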
LSA overcomes two of the most problematic constraints of Boolean keyword queries: multiple words that have similar meanings (synonymy) and words that have more than one meaning (polysemy).

LSA is also used to perform automated document categorization. Document categorization is the assignment of documents to one or more predefined categories based on their similarity to the conceptual content of the categories. Several experiments have demonstrated correlations between the way LSA and humans process and categorize text.

LSA uses example documents to establish the conceptual basis for each category. During categorization, the concepts contained in the documents being categorized are compared to the concepts contained in the example items, and each document is assigned to the category (or categories) whose example documents it most closely resembles.
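
A minimal sketch of this example-based categorization, assuming a nearest-example rule in the reduced space (the labels, example texts, and test document are illustrative):

```python
# Embed labeled example documents with LSA, then assign a new document
# to the category of the example it most closely resembles.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

examples = [
    ("biology", "gene expression in yeast cells"),
    ("biology", "protein folding and enzyme activity"),
    ("finance", "stock prices and market volatility"),
    ("finance", "bond yields and interest rate risk"),
]
labels = [lab for lab, _ in examples]
texts = [txt for _, txt in examples]

tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(tfidf.fit_transform(texts))  # examples in concept space

def categorize(doc):
    v = svd.transform(tfidf.transform([doc]))
    sims = cosine_similarity(v, Z)[0]
    return labels[int(np.argmax(sims))]  # label of the nearest example

print(categorize("enzyme kinetics in cell cultures"))  # expected: biology
```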

LSA automatically adapts to new and changing terminology, and has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.). This is especially important for applications using text derived from Optical Character Recognition (OCR) and speech-to-text conversion. LSA also deals effectively with sparse, ambiguous, and contradictory data.

Text does not need to be in sentence form for LSA to be effective. It can work with lists, free-form notes, email, Web-based content, etc. As long as a collection of text contains multiple terms, LSA can be used to identify patterns in the relationships between the important terms and concepts contained in the text.



Copyright © 2019 A. Daniel Johnson. All rights reserved.