This page provides general background on theory and approaches to text classification. Links below connect to sub-pages that describe more specific topics. Follow the links for details.

 

Basic Workflow

Text classification problems can be broken down into four overlapping sub-problems. Each piece may be more or less important to the overall problem, but all need to be considered.

  1. Feature selection and engineering (described below)
  2. Topic categorization and modeling (https://adanieljohnson.github.io/default_website/Topic_categorization.html)
  3. Selecting an appropriate text classifier method (https://adanieljohnson.github.io/default_website/Choosing_classifier.html)
  4. Evaluating the outcomes, and improving the classifier (https://adanieljohnson.github.io/default_website/Improving_classifier.html)

 

Feature Selection and Engineering

This is the process of selecting, tabulating, calculating, or creating text-based features for subsequent evaluation. This includes:

  • Identifying existing, directly observable text features within the dataset
  • Engineering additional features using established NLP methods
  • Reverse engineering relevant features based on an empirically defined feature set

The features listed below are used routinely in natural language processing. Not all will be appropriate for all situations though; selecting, evaluating, comparing, and validating appropriate features for a classification task is an iterative process, not a one-time decision.

  • Count-based Vectors
    • Word Count – total number of words in the documents
    • Character Count – total number of characters in the documents
    • Punctuation Count – total number of punctuation marks in the documents
    • Upper Case Count – total number of upper case words in the documents
    • Title Word Count – total number of consecutive upper case (title) words in the documents
  • Distribution and Frequency Calculations
    • Average word density of sentences
    • Average length of the words used in the documents
    • TF-IDF
    • Reading level estimates, readability (described below)
  • Part of Speech Tagging and Frequency Distribution of Tags
    • Noun Count
    • Verb Count
    • Adjective Count
    • Adverb Count
    • Pronoun Count
  • Word Embedding (described below)

Additional features are available for longer texts such as multi-sentence paragraphs. These features require analysis of sentence structure, syntax, parts of speech patterns, etc.

  • Word choices (vocabulary range)
  • Lexical diversity (described below)
  • Stance patterns - hedging, booster, generalization words or phrasing
  • Concision - comparisons of noun phrase density versus comment clause density
  • Cohesion - use of words that are additive (also), countering (however), specify or emphasize (even, both, especially), connecting (former, latter)
  • Civility and uncertainty - language that acknowledges limits of statement, critiques ideas not people, and makes fewer generalizations

 

Some Feature Engineering Methods

Lexical Diversity

This is different from simple word frequencies The following explanation is an excerpt from “quanteda: Quantitative Analysis of Textual Data”, which describes how to use textstat_lexdiv to calculate lexical diversity. https://rdrr.io/cran/quanteda/man/textstat_lexdiv.html http://quanteda.io/reference/textstat_lexdiv.html

Consider a speaker, who uses the term “allow” multiple times throughout the speech, compared to an another speaker who uses terms “allow, concur, acquiesce, accede, and avow”" for the same word. The latter speech has more lexical diversity than the former. Lexical diversity is widely believed to be an important parameter to rate a document in terms of textual richness and effectiveness.

Lexical diversity, in simple terms, is a measurement of the breadth and variety of vocabulary used in a document. The different measures of lexical diversity are TTR, MSTTR, MATTR, C, R, CTTR, U, S, K, Maas, HD-D, MTLD, and MTLD-MA.

koRpus package in R also provides functions to estimate lexical diversity or complexity. For more about this package, go to https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783551811/2/ch02lvl1sec16/lexical-diversity

 

Readability metrics

These estimate the relative grade level or age of a typical reader who could comprehend 80+% (typically) of the text. Most readability metrics require longer blocks of text (100+ words) to achieve a stable score. They use some combination of word count, average syllables per word, and sentence length to estimate reading level.

The metrics differ in how appropriate they are for specific situations. Some were developed for general writing, others for specific school age groups, and still others for technical documents; the US Navy in particular developed a suite of metrics for evaluating their training documents.

The readability library (v0.1.1) is a collection of readability tools that utilize the syllable package for fast calculation of readability scores by grouping variables. It calculates Flesch Kincaid, Gunning Fog Index, Coleman Liau, SMOG, Automated Readability Index and an average of the 5 readability scores. These all are general scales; other scores may not be available in R packages.

 

Word Embedding

Word embedding is a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. The vectors then serve as engineered features for subsequent classifier functions.

Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.

Methods to generate the mapping include neural networks, dimensionality reduction, probabilistic models, knowledge base methods, and explicit representation in terms of the context in which words appear. The technique of representing words as vectors has roots in the 1960s with the development of the vector space model for information retrieval.

Word embeddings come in two different styles, one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of linguistic contexts in which the words occur. Most new word embedding techniques rely on a neural network architecture instead of more traditional n-gram models and unsupervised learning.

One of the main limitations of word embeddings (and word vector space models in general) is that possible meanings of a word are conflated into a single representation (a single vector in the semantic space). Sense embeddings are a solution to this problem: individual meanings of words are represented as distinct vectors in the space.

Tools like Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbour Embedding (t-SNE) are used to reduce the dimensionality of word vector spaces and visualize word embeddings and clusters.

Software for training and using word embeddings includes Tomas Mikolov’s Word2vec, Stanford University’s GloVe, fastText, Gensim, AllenNLP’s Elmo, Indra and Deeplearning4j. Three of these tools are described further below.

 

Word2vec

Word2vec is a group of shallow, two-layer neural network models that produce word embeddings. The toolkit can train vector space models to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space so words that share common contexts in the corpus are located in close proximity to one another in the space.

The use of different model parameters and different corpus sizes can greatly affect the quality of a word2vec model. Accuracy can be improved in a number of ways, including the choice of model architecture (CBOW or Skip-Gram), increasing the training data set, increasing the number of vector dimensions, and increasing the window size of words considered by the algorithm. Each of these improvements comes with the cost of increased computational complexity and therefore increased model generation time.

In models using large corpora and a high number of dimensions, the skip-gram model yields the highest overall accuracy, and consistently produces the highest accuracy on semantic relationships, as well as yielding the highest syntactic accuracy in most cases. However, the CBOW is less computationally expensive and yields similar accuracy results.

Accuracy increases overall as the number of words used increases, and as the number of dimensions increases. Altszyler et al. (Altszyler et al. 2017) studied Word2vec performance in two semantic tests for different corpus size. They found that Word2vec has a steep learning curve, outperforming another word-embedding technique (LSA) when it is trained with medium to large corpus size (more than 10 million words). However, with a small training corpus LSA showed better performance. Additionally they show that the best parameter setting depends on the task and the training corpus. Nevertheless, for skip-gram models trained in medium size corpora, with 50 dimensions, a window size of 15 and 10 negative samples seems to be a good parameter setting.

 

fastText

fastText is a Python-based library created by Facebook’s AI Research (FAIR) lab for learning of word embeddings and text classification. The model allows to create an unsupervised learning or supervised learning algorithm for obtaining vector representations for words. fastText uses a neural network for word embedding.

https://fasttext.cc/

 

GloVe

GloVe, short for Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. It is developed as an open-source project at Stanford. Pre-trained word vectors data are made available under the Public Domain Dedication and License v1.0.

GloVe uses Euclidean distance (or cosine similarity) between two word vector for measuring the linguistic or semantic similarity of the corresponding words. Sometimes, the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human’s vocabulary. For example, here are the closest words to the target word frog:

  • frog
  • frogs
  • toad
  • litoria
  • leptodactylidae
  • rana
  • lizard
  • eleutherodactylus

The similarity metrics used for nearest neighbor evaluations produce a single scalar that quantifies the relatedness of two words. This simplicity can be problematic since two given words almost always exhibit more intricate relationships than can be captured by a single number. For example, man may be regarded as similar to woman in that both words describe human beings; on the other hand, the two words are often considered opposites since they highlight a primary axis along which humans differ from one another.

In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words.

https://nlp.stanford.edu/projects/glove/

 


Altszyler, E., S. Ribeiro, M. Sigman, and D. Fernández Slezak. 2017. “The Interpretation of Dream Meaning: Resolving Ambiguity Using Latent Semantic Analysis in a Small Corpus of Text.” Consciousness and Cognition 56: 178–87.


Copyright © 2019 A. Daniel Johnson. All rights reserved.