<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/3820396d-7df4-4da7-9343-8c6590679a4b/FaviconPlagiarism.svg" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/3820396d-7df4-4da7-9343-8c6590679a4b/FaviconPlagiarism.svg" width="40px" /> An article by Will DePue for plagiarism.fyi:

Plagiarism.fyi - Beat your school's code plagiarism checker

It’s really hard to catch plagiarism in software.

For one, there’s a limited # of ways to solve a certain problem which means it’s not obvious when something is stolen.

But worse is just the sheer scale of free material to copy on the internet: simply searching Github for your assignment name is sure to result in tens of results.

To help mitigate this, every college uses a submission comparison algorithm to catch cheaters:

  1. Save every submission for assignment X over the last 5 years.

  2. Every time a student submits a new assignment X, check it against all other solutions.

  3. If any submission is detected to be similar, give it to a human to review and serve justice.

Simple, right? Not in practice. ****Computer science courses now have in the order of hundreds of students per semester: doing line by line comparisons from hundreds of documents to thousands of submitted solutions is infeasible. That’s a lot of data.

The existing solution is document fingerprinting, which is what we’ll cover in this article:

Create a unique “fingerprint” for each file and only compare the fingerprints to each other, not the entire code.

How do you fingerprint a document?

Before we jump into fingerprinting algorithms, let’s start with the basics: How do you even compare two pieces of code?

Here we have a piece of text “AABAACAADAABAABA” and want to see if it contains “AABA.”

Pattern-Searching-2-1.png

The simplest solution is just to check letter-by letter. Starting at the beginning and compare each letter in “AABA’” to the one below, then move forward one letter and try again. This looks like sliding “AABA” forward a step at a time and looking for matches.

Improvement #1: Convert to numbers

In order to compare each “string” (a fancy word for a sequence of letters) faster, we change every letter into a number representation. For example, we can convert an “A” into 1, “B” into 2, etc. and then compare. This is faster because of how computers store strings vs. numbers.

Improvement #2: Convert to a hash