This page describes the target dataset, data pre-checks and import process, post-import validation, and how selected elements are recoded for analysis.
The dataset being used to develop this project is an Excel table containing 10,808 TA comments extracted from student lab reports submitted during Spring 2018. Each TA comment is stored as a separate row of the table, and is a mixed alphanumeric string of one or more words, numbers, and punctuation. Other columns record unique report, student, and TA IDs; grade assigned to the report; other standardized information about the original report from which the comment was extracted; and the hand-coded subject and structure of each comment.
Prior to import, review the data table in Excel and check that:
Column Number | Column Name | Accepted Values or Format |
---|---|---|
1 | unique.record | Sp18.00001 (semester and year, then order in dataset) |
2 | report.id | R_0fanqHFhaY2yLT7 or similar |
3 | sort | remnant of coding |
4 | report.title (empty) | string |
5 | student | Std_drVZ2x1W (format is Std_nnn) |
6 | course | 113, 114, or 214 |
7 | ta | TA_bc8EVX09 (format is TA_nnn) |
8 | lab | 113: allocation, betta; 114: manduca, frog; 214: enzymes (old) photosynthesis, physarum, cyanophage (new) |
9 | tag | first, second |
10 | type.TA | submission, revision |
11 | grade.TA | A, B, C, F |
12 | grading.time | time in minutes |
13 | Rank | remnant of coding |
14 | hypothesis.ok | Yes, No |
15 | data.ok | Yes, No |
16 | citation.ok | Yes, No |
17 | interpretation.ok | Yes, No |
18 | organization.ok | Yes, No |
19 | techflaws.ok | Yes, No |
20 | writing.ok | Yes, No |
21 | comments.incorporated | Yes, Somewhat, No |
22 | ta.comment | text string |
23 | code.subject | values in Ref. Tables |
24 | code.structure | values in Ref. Tables |
25 | code.locus | values in Ref. Tables |
26 | code.scope | values in Ref. Tables |
27 | code.tone | values in Ref. Tables |
28 | code.notes | values in Ref. Tables |
Data were anonymized in Excel, replacing names, email addresses, and other confidential information was replaced with unique randomized alphanumeric strings.
Once all data are in correct format, export Excel file to CSV with name formatted as “coded_full_comments_dataset_SemYear.csv”
The development dataset is an Excel table of 10,808 rows. One column has “TA comments” extracted from student lab reports in Spring 2018. Each TA comment is stored as a separate row of the table, and is a mixed alphanumeric string of one or more words, numbers, and punctuation. Other columns record unique report, student, and TA IDs; grade assigned to the report; other standardized information about the original report from which the comment was extracted; and the hand-coded subject and structure of each comment.
Data were anonymized in Excel prior to conversion to CSV format. Names, email addresses, and other confidential information was replaced with unique randomized alphanumeric strings.
Prior to import, review the data table in Excel and check that:
Column Number | Column Name | Accepted Values or Format |
---|---|---|
1 | unique.record | Sp18.00001 (semester and year, then order in dataset) |
2 | report.id | R_0fanqHFhaY2yLT7 or similar |
3 | sort | remnant of coding |
4 | report.title (empty) | string |
5 | student | Std_drVZ2x1W (format is Std_nnn) |
6 | course | 113, 114, or 214 |
7 | ta | TA_bc8EVX09 (format is TA_nnn) |
8 | lab | 113: allocation, betta; 114: manduca, frog; 214: enzymes (old) photosynthesis, physarum, cyanophage (new) |
9 | tag | first, second |
10 | type.TA | submission, revision |
11 | grade.TA | A, B, C, F |
12 | grading.time | time in minutes |
13 | Rank | remnant of coding |
14 | hypothesis.ok | Yes, No |
15 | data.ok | Yes, No |
16 | citation.ok | Yes, No |
17 | interpretation.ok | Yes, No |
18 | organization.ok | Yes, No |
19 | techflaws.ok | Yes, No |
20 | writing.ok | Yes, No |
21 | comments.incorporated | Yes, Somewhat, No |
22 | ta.comment | text string |
23 | code.subject | values in Ref. Tables |
24 | code.structure | values in Ref. Tables |
25 | code.locus | values in Ref. Tables |
26 | code.scope | values in Ref. Tables |
27 | code.tone | values in Ref. Tables |
28 | code.notes | values in Ref. Tables |
Once all data are in correct format, export Excel file to CSV with name formatted as “coded_full_comments_dataset_SemYear.csv”
# Call in tidyverse, other required libraries.
library(tidyverse)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
#This block reads in full CSV into a starting dataframe named "base_data".
#Subsequent analyses are performed on extracted subsets of "base_data".
base_data <- read_csv(file='data/coded_full_comments_dataset_Spring18anonVERB.csv')
If dataset imports correctly, there will be a dataframe of ~11,000 rows and 28 columns named as listed in the Data Pre-Checks table above.
Given some data points may be entered incorrectly by instructors or students, these post-import data checks ensure data are properly coded and columns contain valid entries only. (This process can be automated using the janitor
data cleaning package.)
#"Course" column should only have entries that correspond to 113, 114, 214, NA.
course<-unique(base_data$course)
#Entries in "TAs" column should match the anonymous IDs that correspond to the TAs assigned to the relevant courses in the semester being analyzed.
#If there are incorrect or extra anonymous IDs in this list, the mostly likely source of the error is improper de-identification of the Excel file prior to import.
ta<-unique(base_data$ta)
#"Lab" should only have entries that match the allowed limited keywords in codebook.
lab<-unique(base_data$lab)
#It is impractical to validate all randomized identifiers. Instead look at the NUMBER of unique anonymous student IDs. Quantity should be within 10% of (but not more than) the combined enrollment of the 3 target courses for the semester.
student<-unique(base_data$student)
#Check that the list of "code.subject" topics here matches allowed terms in codebook.
subject<-unique(base_data$code.subject)
#Check that the list of "code.structure" terms extracted from the dataset and displayed here matches allowed terms in codebook.
structure<-unique(base_data$code.structure)
Comment labels used in the project codebook contain punctuation and spacing that complicates subsequent analysis. Rather than modify the project codebook or external processes, it is simpler to recode the problematic labels after import. For example, the following code extracts a subset of rows from the full dataset, reduces those to 3 columns, and renames the comment.subject column as subject. Four problematic code labels in the subject column are replaced with simpler terms.
#Read in TA comments from CSV file.
base_data <- read_csv(file='data/coded_full_comments_dataset_Spring18anon.csv')
#Select rows representing the sub-groups to compare.
comments_subset <- filter(base_data,code.subject=="1. Basic Criteria"|code.subject=="2. Writing Quality"|code.subject=="3. Technical and Scientific"|code.subject=="4. Logic and Thinking")
#Reduce larger dataframe to 3 required columns of data (subject, structure, and text of comment), and put columns in order needed.
comments_raw <- comments_subset %>% select(23,24,22)
#Rename the columns.
names(comments_raw)[1] <- "subject"
names(comments_raw)[2] <- "structure"
names(comments_raw)[3] <- "text"
#Simplify coding terms used in "subject" column.
comments_raw[,1] <- ifelse(comments_raw[,1] == "1. Basic Criteria","1_basic", ifelse(comments_raw[,1] == "2. Writing Quality","2_writing", ifelse(comments_raw[,1] == "3. Technical and Scientific","3_technical", ifelse(comments_raw[,1] == "4. Logic and Thinking","4_logic",99))))
#Change "subject" element from character to a factor for analysis.
comments_raw$subject <- factor(comments_raw$subject)
str(comments_raw$subject)
## Factor w/ 4 levels "1_basic","2_writing",..: 3 2 3 3 3 3 4 4 4 3 ...
table(comments_raw$subject)
##
## 1_basic 2_writing 3_technical 4_logic
## 211 2578 5409 1142
Most analyses will use a similar extraction and post-processing code block. In practice, it is easier to modify the post-processing code than to maintain multiple separate copies of the original CSV dataset for different analyses.
Copyright © 2019 A. Daniel Johnson. All rights reserved.