The testing dataset used in to build this project was created using BaSH scripting and one-off text processing. This page documents progress on building an R-based repeatable workflow to simplify that process.

General Approach

Put all files to process in ONE folder.
Extract a list of file names to cycle through.
For each file, use ‘read_docx’, then ‘docx_extract_all_cmnts’.
Cycle through and append the extracted comments to ONE dataframe

library(plyr)
library(tidyverse)
library(reshape2)
library(tidyr)
library(readtext)
library(docxtractr)

#Generates a list of all files in a particular folder
files_list <-list.files("~/Dropbox/Coding_Tools/R_Environment/R_Projects/DOCX_comment_extraction/Test_Set", full.names=TRUE)

#### load data ####
# Define path of directory
# Get full path filename of every .docx-file in directory as 'filelist'
# Get filenames without paths for same sequence as 'filelist_short'
pathto <- "~/Dropbox/Coding_Tools/R_Environment/R_Projects/DOCX_comment_extraction/Test_Set/Hiding"
filelist <- list.files(path = pathto, pattern = "*.docx", full.names = TRUE)
filelist_short <- list.files(path = pathto, pattern = "*.docx", full.names = FALSE)

# Read every file with docxtractr::read_docx()
# Extract every comment from every file with docxtractr::docx_extract_all_cmnts()
document_contents_list <- lapply(filelist, read_docx)
comments_list <- lapply(document_contents_list, docx_extract_all_cmnts, include_text=TRUE)

The variable named ‘comments_list’ is a list of length 6. Each item in the list is a tibble with variable number of rows, but all with the same 6 columns:

id
author
date
initials
comment_text
word_src

I need to append each tibble to the next one for all items in the list.

# identify the filename of the docx file that was used to create the list of comments
document_contents_list[[3]]$path

## [1] "/Users/danjohnson/Dropbox/Coding_Tools/R_Environment/R_Projects/DOCX_comment_extraction/Test_Set/Hiding/EmilyConte_R_Qb0snGjbwoeyhEt_text.docx"

# join the individual lists of comments into a larger list
comments_list_joined <- bind_rows(comments_list, id=NULL)

# problem is that I lose the filenames.

Problem to solve now is the joined comments list contains no record of the source file’s name. Need some way to attach the remaining report metadata. ***

#Creates the output CSV file if needed for analysis outside R.
write.csv(comments_list_joined, file = "~/Dropbox/Coding_Tools/R_Environment/R_Projects/DOCX_comment_extraction/Test_Set/Extracted_comments.csv")


#Rename the columns in a data.frame. In this example, "name" has 2 columns. The command pipes new names to the 2 columns, using the vectorized text names.
names(name_of_file_to_change) <- c("columns", "values")

DRAFT: Extracting Comments From MS Word DOCX Files

May 17, 2019

General Approach