Introduction

Progsnap is a specification for data on student work on programming exercises and activities.

This is the very first "official" version of the Progsnap specification. If you would like to contribute to future versions of the specification, see the home page for contact info.

Version

This is version 0.1.

Version numbers that end in “-dev” are under development, and should be expected to change.

Data set, file encoding

A progsnap data set is a collection of files representing data from a single course. In this context, “course” implies a single course, possibly with multiple activities, and possibly with multiple sections and/or instructors. Studies involving multiple courses should use separate progsnap data sets for each course.

All of the files in a data set are relative to a common base directory, which this specification will refer to as BaseDir. The files may be stored in a filesystem directory, or may be stored in a zip file (such that BaseDir is the root directory of the zip file). Implementations of tools to read progsnap data sets should support reading from both directories and zip files. Note that some files are in subdirectories in BaseDir. For example, activity files have filenames of the form activity/NNNN.txt, where NNNN is an activity number, implying the existence of a subdirectory called activity in BaseDir.

Each file is a text file using the UTF-8 character encoding.

Each file (with the exception of README.txt) is encoded as a sequence of lines. Each line has the following format:

{ "tag" : "tagname", "value" : JsonValue }

Each line is a tagged data value, where tagname is the name of the tag, and JsonValue is a single JSON value. Note that the encoding of JsonValue must not contain any literal newline characters. In general, it is guaranteed that each line of the file encodes a single JSON object with two fields, tag and value. Note that the line may or may not contain horizontal whitespace between the JSON tokens.

Each file is a collection of lines, with tags and values specified in the file type’s description (see Types of files below). Each type of line is specified with a number of occurrences:

Occurrences	Meaning
1	A line with this tag must occur exactly one time
0..1	A line with this tag is optional: there may be either 0 or 1 occurrences
0..*	There may be any number of occurrences (0 or more) of lines with this tag

Tag names starting with “x-“ are guaranteed not to conflict with any official tag name, and lines containing such tags may be used by creators of progsnap data sets to store extra information. Lines with custom tags may occur anywhere within any file in a progsnap data set. Readers of progsnap data should ignore lines with such tags, or allow custom processing of them. Note that such lines must still encode a valid JSON object with tag and value fields.

Unless explicitly stated in the file type’s description, there is no required ordering of lines in a file, and readers should be prepared to encounter the lines in an arbitrary order. One important file type that does mandate ordering of lines is the Work history file type.

Basic data types

This section describes the basic data types used in files in progsnap data sets.

Note that these are just the basic data types. In the description of the different kinds of files in progsnap data sets, compound data types will be defined which are built on (directly or indirectly) the basic data types.

Each basic data type is encoded using a JSON data type.

Basic data type	JSON data type	Notes
Int	number	Signed, 32 bit
Boolean	boolean
Real	number	Floating point, 64 bit
String	string
Timestamp	number	Floating point, 64 bits ^*
UniqueId	string	See below^†

^* Timestamp is number of milliseconds since midnight, January 1, 1970, UTC; timestamps are floating point, so precision greater than milliseconds is possible

^† Valid characters in a unique id are upper case letters A-Z, lower case letters a-z, digits 0-9, underscores (_), and hyphens (-); a unique id must have no more than 128 characters

Complex data types

This section describes the complex data types used in the various types of files in a progsnap data set.

All of the complex data types are represented as JSON objects. The order of the fields is not specified, and when encoded in JSON, the fields may appear in any order.

All implementations of readers of progsnap data should be prepared to accept fields not mentioned in this specification. For all complex data types, field names beginning with “x-“ are guaranteed not to conflict with any “official” fields, and may be used by the creator of a progsnap data set to store extra information. The value of a custom (“x-“) field can be any JSON value.

As mentioned above related to file encoding, the JSON encoding of any value (belonging to a complex data type or a basic data type) must not contain any newline characters, so that the encoded value is guaranteed not to span multiple lines of the file that contains it.

Activity

An Activity value is an object specifying the location of an activity file.

Field name	Type of value	Required?	Comment
number	Int	yes	activity number, which is distinct from activity numbers of other activities in the data set
path	String	yes	path (relative to BaseDir) of the activity file

Note that additional information about an activity, such as metadata and information about tests, is stored in the activity file (referenced by the path field of the Activity.)

The path value identifying the path to the corresponding activity file is relative to BaseDir: for example, if the activity number is 221, then the path value might be activity/0221.txt.

Test

A Test value is an automated test (such as a unit test) associated with an activity.

Because of the great diversity of programming languages and testing approaches that could be used in an activity in a progsnap data set, Test values are not guaranteed to capture the exact semantics of a test.

Field name	Type of value	Required?	Comment
number	Int	yes	test number
name	String	yes	name of the test (e.g., unit test name)
input	String	no	test input (e.g., argument values or input data)
output	String	no	expected output (e.g., result value, output data, regexp, etc.)
opaque	Boolean	no	if true, test input/output is not revealed to students
invisible	Boolean	no	if true, the existence of the test is not revealed to students

Test numbers must start at 0 and increase by 1 for each test in the corresponding activity. For example, an activity with 4 tests would have tests with numbers 0, 1, 2, and 3.

If either the opaque or invisible field is not present, the implied default value is false.

Student

A Student value is an object containing anonymized information about a student.

Field name	Type of value	Required?	Comment
id	UniqueId	yes	This student’s unique id
instructor	Boolean	yes	Whether student is actually an instructor or teaching assistant^*
gender	String	no	Student’s gender^†
experience	Int	no	Self-reported prior experience: 0=none, 1=some, 2=prior coursework
major	String	no	Student’s major field of study
finished	Boolean	no	Whether the student completed the course (false if student dropped or withdrew)
finalgrade	Real	no	Final grade in the course

^* Instructors and teaching assistants will often have accounts for testing purposes; this field is used to identify those accounts

^† Possible values include “m” and “f”, but implementations should accept other values

Position

A Position value represents a position in a text file (typically, a source file).

Field name	Type of value	Required	Comment
row	Int	yes	row (i.e., line) number (0 is first row of file)
col	Int	yes	column number within line (0 is first column of a row)

Edit

Edit events represent changes in the text of a source file.

Source file text is modeled as a sequence of lines. Each line is a sequence of characters. Lines in the file and characters within a line are indexed starting at 0. At the beginning of a student work history, each file is assumed to be empty. So, the first edit for a file is relative to an empty file. All insertion and deletion events specify the exact text to be inserted into or deleted from the source text.

Lines in source file text are terminated by a single newline (“\n”) character. Exporters for progsnap data must convert other line termination characters or sequences to single newline characters.

Field name	Type of value	Required?	Comment
ts	Timestamp	yes	Timestamp of edit event
editid	Int	yes	Unique^* id of the edit event
filename	String	yes	Filename of the edited source file
type	String	yes	Type of edit: “fulltext”, “insert”, or “delete”
start	Position	yes, except for “fulltext” events^†	Position in source file where text is being inserted or deleted
text	String	yes	The text being inserted or deleted (“insert” or “delete” events), or the text replacing the entire file contents (“fulltext” events)
snapids	Array of Int	only for “snapshot” events^‡	Array of snapshot ids (corresponding to Submission, Compilation, and/or TestResults events)

^* Each edit event must be assigned an editid that is unique within the context of the work history file in which it appears. These ids are not necessarily unique over all work history files, although they could be.

^† The start field is only required for “insert” and “delete” edits; it may be omitted for “fulltext” edits

^‡ snapids is a list of unique identifiers (encoded as a JSON array of Int values) indicating one or more snapshots. A snapshot consists of one or more edit events identifying the source file contents associated with a Submission, Compilation, and/or TestResults (one edit event per file). Because a specific version of a file can be part of multiple snapshots, an edit event can have multiple snapshot ids. If the activity involves multiple files, then each edit event belonging to a specific snapshot should include the corresponding snapshot id as one of the values of its snapids field. It is strongly recommended that the edit event or events in a snapshot be “fulltext” events, to avoid consumers having to reconstruct the full text of the file or files by applying a series of individual edits. Like edit ids, snapshot ids are guaranteed to be unique only within the work history file in which they appear.

Edit events with types “insert” or “delete” specify the insertion or deletion of text in a file. The start field indicates the position in the edited file at which the specified text should be inserted or deleted.

“fulltext” edit events should be considered to completely replace the contents of a file. As mentioned above, they are not required to have start and end fields.

Submission

Submission events indicate that the student submitted code for grading/assessment.

Field name	Type of value	Required?	Comment
ts	Timestamp	yes	Timestamp of submission event
snapid	Int	yes	Snapshot id identifying text of submitted code

The snapid value specifies the snapshot identifying the submitted source file or files.

Compilation

Compilation events indicate that a student’s submission was compiled.

Field name	Type of value	Required?	Comment
ts	Timestamp	yes	Timestamp of compilation event
snapid	Int	yes	Snapshot id identifying text of submitted code
result	String	yes	Result of compilation: “success” or “failure”

The snapid value specifies the snapshot identifying the compiled source file or files. It is guaranteed that there will be a Submission with the same snapid.

A result value of “success” means that the submission was successfully translated to executable form. Note that this does not imply that there were no warnings or other compiler diagnostics.

A result value of “failure” means that submission could not be translated to executable form, most likely because of syntactic or semantic errors in the submitted code.

TestResults

A TestResults event records the results of running tests on a compiled Submission.

Field name	Type of value	Required?	Comment
ts	Timestamp	yes	Timestamp of test results event
snapid	Int	yes	Snapshot id identifying text of submitted code
numtests	Int	yes	Total number of tests executed
numpassed	Int	yes	Number of tests passed
statuses	Array of String	yes	Array of test statuses, which are “passed”, “failed”, “timeout”, and “exception”

The snapid value specifies the snapshot identifying the source file or files compiled to produce the tested executable. It is guaranteed that there will be a Compilation with the same snapid.

Note that the value of the statuses field is a JSON array, where each element is a string. The ordering of the elements corresponds to the numbering of the Tests in the Activity: i.e., the number of a test can be used as an index into the statuses array.

The meanings of the values in the statuses array are as follows:

Status	Meaning
passed	The test passed
failed	The test failed due to incorrect output/behavior
timeout	The test failed because it exceeded the allowed runtime
exception	The test failed due to a fatal exception

Types of files

This section describes the types of files in a progsnap data set. For each file type, there is a particular naming scheme that instances of the file type must follow, and there are specific tagged values which may appear in the file. Some file types may mandate a particular ordering of lines.

README.txt

Each progsnap data set should have a file whose path (relative to BaseDir) is README.txt. This file is a free-form text file with descriptive information about the data set. If the data set contains any custom tags or fields (starting with “x-“), their meaning should be documented here. This file is also a good place to document how to interpret values whose meanings may vary by course/institution, such as the numeric range for final course grades.

Dataset file

Each progsnap data set contains a single dataset file, whose path (relative to BaseDir) is dataset.txt.

The dataset file specifies general information about the data set. It contains the following lines:

Tag name	Type of value	Occurrences	Comment
psversion	String	1	the version of the progsnap specification the data set conforms to, e.g., “0.1”
name	String	1	name of data set, e.g., “CS 101, Spring 2015, Unseen University”
contact	String	1	name of person to contact regarding the data set
email	String	1	email address of person to contact regarding the data set
courseurl	String	0..1	optional URL of web page for the course

Activities file

Each progsnap data set contains a single activities file, whose path (relative to BaseDir) is activities.txt.

The activities file specifies the activities that are included in the data set. It contains the following lines:

Tag name	Type of value	Occurrences	Comment
activity	Activity	0..*	Reference to an activity file

Note that any useful progsnap data set will contain at least one activity, since work history files are associated with activities.

Students file

A progsnap data set may optionally contain a single students file, whose path (if present, relative to BaseDir) is students.txt.

The students file specifies anonymized information about students in the course the data set represents. It contains the following lines:

Tag name	Type of value	Occurrences	Comment
student	Student	0..*	Information about a student

Activity file

A progsnap data set must contain at least one activity file, and may contain multiple activity files. An activity file has a path (relative to BaseDir) of the form activity/NNNN.txt</i>, where NNNN is an integer activity number. It is recommended (but not required) that the activity number is padded with leading zeroes as necessary so that all activity filenames in a data set have the same length.

An activity file contains the following lines:

Tag name	Type of value	Occurrences	Comment
name	String	1	the activity name, e.g., “Activity 1: Tic-Tac-Toe”
language	String	1	the programming language (e.g., “Java”, “Python”, “C++”)
url	String	0..1	URL of a web page describing the activity
assigned	Timestamp	0..1	timestamp indicating when the activity was made available to students)
due	Timestamp	0..1	timestamp indicating when the activity was due
test	Test	0..*	test cases for the activity

Work history file

A work history file represents one student’s work on one activity. Each progsnap data set will typically have many work history files. Work history files have paths (relative to BaseDir) of the form history/NNNN/XXXX.txt, where NNNN is an activity number, and XXXX is a student id (corresponding to the student’s id value.)

Each line in a work history file represents an event. One common feature of each event (other than those tagged with custom tags beginning with x-) is that the value of the line is guaranteed to have a field called “ts” whose value is a Timestamp, which records the time when the event occurred. Exporters must ensure that timestamps match the chronology of the work history, and due to the vagaries of clocks, some post-processing may be necessary to ensure that this is the case.

The lines in a work history file (with the possible exception of lines with custom tags) are ordered by nondecreasing event timestamp values, and thus, indicate the chronology of events.

The types of lines (representing events) are the following (note that this table does not indicate a required order, since events are ordered by timestamp):

Tag name	Type of value	Comment
edit	Edit	Code edit
submission	Submission	Submission for grading and/or automated testing
compilation	Compilation	Code compilation attempt
testresults	TestResults	Results of automated testing