Overview

This page describes the target dataset, data pre-checks and import process, post-import validation, and how selected elements are recoded for analysis.

The dataset being used to develop this project is an Excel table containing 10,808 TA comments extracted from student lab reports submitted during Spring 2018. Each TA comment is stored as a separate row of the table, and is a mixed alphanumeric string of one or more words, numbers, and punctuation. Other columns record unique report, student, and TA IDs; grade assigned to the report; other standardized information about the original report from which the comment was extracted; and the hand-coded subject and structure of each comment.

Pre-Check and Import

Prior to import, review the data table in Excel and check that:

Student and TA names have been recoded using pre-assigned anonymous IDs.
Vocabulary terms used in coding columns matches comment codebook criteria described at https://adanieljohnson.github.io/default_website/codebook.html.
Data table headers match those listed below.

Column Number	Column Name	Accepted Values or Format
1	unique.record	Sp18.00001 (semester and year, then order in dataset)
2	report.id	R_0fanqHFhaY2yLT7 or similar
3	sort	remnant of coding
4	report.title (empty)	string
5	student	Std_drVZ2x1W (format is Std_nnn)
6	course	113, 114, or 214
7	ta	TA_bc8EVX09 (format is TA_nnn)
8	lab	113: allocation, betta; 114: manduca, frog; 214: enzymes (old) photosynthesis, physarum, cyanophage (new)
9	tag	first, second
10	type.TA	submission, revision
11	grade.TA	A, B, C, F
12	grading.time	time in minutes
13	Rank	remnant of coding
14	hypothesis.ok	Yes, No
15	data.ok	Yes, No
16	citation.ok	Yes, No
17	interpretation.ok	Yes, No
18	organization.ok	Yes, No
19	techflaws.ok	Yes, No
20	writing.ok	Yes, No
21	comments.incorporated	Yes, Somewhat, No
22	ta.comment	text string
23	code.subject	values in Ref. Tables
24	code.structure	values in Ref. Tables
25	code.locus	values in Ref. Tables
26	code.scope	values in Ref. Tables
27	code.tone	values in Ref. Tables
28	code.notes	values in Ref. Tables

Data were anonymized in Excel, replacing names, email addresses, and other confidential information was replaced with unique randomized alphanumeric strings.

Once all data are in correct format, export Excel file to CSV with name formatted as “coded_full_comments_dataset_SemYear.csv”

Target Dataset

The development dataset is an Excel table of 10,808 rows. One column has “TA comments” extracted from student lab reports in Spring 2018. Each TA comment is stored as a separate row of the table, and is a mixed alphanumeric string of one or more words, numbers, and punctuation. Other columns record unique report, student, and TA IDs; grade assigned to the report; other standardized information about the original report from which the comment was extracted; and the hand-coded subject and structure of each comment.

Data were anonymized in Excel prior to conversion to CSV format. Names, email addresses, and other confidential information was replaced with unique randomized alphanumeric strings.

Pre-Checks Before Import

Prior to import, review the data table in Excel and check that:

Student and TA names have been recoded using pre-assigned anonymous IDs.
Vocabulary terms used in coding columns matches comment codebook criteria described at https://adanieljohnson.github.io/default_website/codebook.html.
Data table headers match those listed below.

Column Number	Column Name	Accepted Values or Format
1	unique.record	Sp18.00001 (semester and year, then order in dataset)
2	report.id	R_0fanqHFhaY2yLT7 or similar
3	sort	remnant of coding
4	report.title (empty)	string
5	student	Std_drVZ2x1W (format is Std_nnn)
6	course	113, 114, or 214
7	ta	TA_bc8EVX09 (format is TA_nnn)
8	lab	113: allocation, betta; 114: manduca, frog; 214: enzymes (old) photosynthesis, physarum, cyanophage (new)
9	tag	first, second
10	type.TA	submission, revision
11	grade.TA	A, B, C, F
12	grading.time	time in minutes
13	Rank	remnant of coding
14	hypothesis.ok	Yes, No
15	data.ok	Yes, No
16	citation.ok	Yes, No
17	interpretation.ok	Yes, No
18	organization.ok	Yes, No
19	techflaws.ok	Yes, No
20	writing.ok	Yes, No
21	comments.incorporated	Yes, Somewhat, No
22	ta.comment	text string
23	code.subject	values in Ref. Tables
24	code.structure	values in Ref. Tables
25	code.locus	values in Ref. Tables
26	code.scope	values in Ref. Tables
27	code.tone	values in Ref. Tables
28	code.notes	values in Ref. Tables

Once all data are in correct format, export Excel file to CSV with name formatted as “coded_full_comments_dataset_SemYear.csv”

Initial Data Import

# Call in tidyverse, other required libraries.
library(tidyverse)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)

#This block reads in full CSV into a starting dataframe named "base_data".
#Subsequent analyses are performed on extracted subsets of "base_data".
base_data <- read_csv(file='data/coded_full_comments_dataset_Spring18anonVERB.csv')

If dataset imports correctly, there will be a dataframe of ~11,000 rows and 28 columns named as listed in the Data Pre-Checks table above.

Post-Import Checks

Given some data points may be entered incorrectly by instructors or students, these post-import data checks ensure data are properly coded and columns contain valid entries only. (This process can be automated using the janitor data cleaning package.)

#"Course" column should only have entries that correspond to 113, 114, 214, NA. 
course<-unique(base_data$course)

#Entries in "TAs" column should match the anonymous IDs that correspond to the TAs assigned to the relevant courses in the semester being analyzed. 
#If there are incorrect or extra anonymous IDs in this list, the mostly likely source of the error is improper de-identification of the Excel file prior to import.
ta<-unique(base_data$ta)

#"Lab" should only have entries that match the allowed limited keywords in codebook.
lab<-unique(base_data$lab)

#It is impractical to validate all randomized identifiers. Instead look at the NUMBER of unique anonymous student IDs. Quantity should be within 10% of (but not more than) the combined enrollment of the 3 target courses for the semester.
student<-unique(base_data$student)

#Check that the list of "code.subject" topics here matches allowed terms in codebook.
subject<-unique(base_data$code.subject)

#Check that the list of "code.structure" terms extracted from the dataset and displayed here matches allowed terms in codebook.
structure<-unique(base_data$code.structure)

Post-Import Processing

Comment labels used in the project codebook contain punctuation and spacing that complicates subsequent analysis. Rather than modify the project codebook or external processes, it is simpler to recode the problematic labels after import. For example, the following code extracts a subset of rows from the full dataset, reduces those to 3 columns, and renames the comment.subject column as subject. Four problematic code labels in the subject column are replaced with simpler terms.

#Read in TA comments from CSV file.
base_data <- read_csv(file='data/coded_full_comments_dataset_Spring18anon.csv')

#Select rows representing the sub-groups to compare. 
comments_subset <- filter(base_data,code.subject=="1. Basic Criteria"|code.subject=="2. Writing Quality"|code.subject=="3. Technical and Scientific"|code.subject=="4. Logic and Thinking")

#Reduce larger dataframe to 3 required columns of data (subject, structure, and text of comment), and put columns in order needed.
comments_raw <- comments_subset %>% select(23,24,22)

#Rename the columns.
names(comments_raw)[1] <- "subject"
names(comments_raw)[2] <- "structure"
names(comments_raw)[3] <- "text"

#Simplify coding terms used in "subject" column.
comments_raw[,1] <- ifelse(comments_raw[,1] == "1. Basic Criteria","1_basic", ifelse(comments_raw[,1] == "2. Writing Quality","2_writing", ifelse(comments_raw[,1] == "3. Technical and Scientific","3_technical", ifelse(comments_raw[,1] == "4. Logic and Thinking","4_logic",99))))

#Change "subject" element from character to a factor for analysis.
comments_raw$subject <- factor(comments_raw$subject)
str(comments_raw$subject)

##  Factor w/ 4 levels "1_basic","2_writing",..: 3 2 3 3 3 3 4 4 4 3 ...

table(comments_raw$subject)

## 
##     1_basic   2_writing 3_technical     4_logic 
##         211        2578        5409        1142

Most analyses will use a similar extraction and post-processing code block. In practice, it is easier to modify the post-processing code than to maintain multiple separate copies of the original CSV dataset for different analyses.

Data Import & Checking

Dan Johnson

2/22/2019