
Short Answer Responses ‐ Answer Normalization

drewjhart edited this page Nov 7, 2023 · 6 revisions

Updated: 2023-11-07

Per the Overview, Short Answer Responses are processed in two phases: normalization and checks for equality. This article describes the process of normalization in detail. For information on equality checks, please see Short Answer Response ‐ Equality Checks.

The guiding principle behind how normalization works in play is that it is more disruptive for students to miss out on points they earned than it is for them to be awarded points by mistake. That is, our normalization algorithm is focused more on finding the correct answer in the student's input than on filtering out wrong answers.

Given this focus, a student's answer is taken as a series of data objects that may or may not be correct. If any of these data objects matches the correct answer, the student will be awarded points. For example, if a question has an answer of 1080 and a student enters: "1200 1080 1100", we will attempt to normalize the answer into [1200, 1080, 1100] and, after finding 1080 in that array, will award the student points.

This simplifies our logic considerably because we no longer need to care about the relationships between subobjects in the array. We don't have to try to detect whether other answers sit beside this answer, or whether those subobjects somehow invalidate it. This can create odd scenarios (for example, a student who enters "the answer isn't 1080" will be awarded points), but because we are deferring to awarding students points, I think these edge cases will produce minimal student friction.
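As a rough illustration of this "any match wins" idea, here is a minimal sketch. The helper names extractCandidates and containsCorrectAnswer are hypothetical and are not part of the codebase; the real normalization is shown in the Algorithm section.

```typescript
// Hypothetical helper for illustration: pull every run of digits out of the raw input.
const extractCandidates = (input: string): number[] =>
  (input.match(/\d+/g) ?? []).map(Number);

// Hypothetical helper: the answer counts if any extracted candidate equals the correct answer.
const containsCorrectAnswer = (input: string, correct: number): boolean =>
  extractCandidates(input).some((candidate) => candidate === correct);

console.log(containsCorrectAnswer("the answer isn't 1080", 1080)); // true
console.log(containsCorrectAnswer('about 1000', 1080)); // false
```

The surrounding words are never interpreted, which is why the negated sentence still matches.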

Algorithm

The actual algorithm is here:

export const isNumeric = (num: any) => // eslint-disable-line @typescript-eslint/no-explicit-any
  (typeof num === 'number' || (typeof num === 'string' && num.trim() !== '')) && 
  !isNaN(num as number); // eslint-disable-line no-restricted-globals

export const handleNormalizeAnswers = (currentContents: any) => { // eslint-disable-line @typescript-eslint/no-explicit-any
  // used later in the map for removing special characters
  // eslint-disable-next-line prefer-regex-literals
  const specialCharsRegex = new RegExp(
    `[!@#$%^&*()_\\+=\\[\\]{};:'"\\\\|,.<>\\/?~-]`,
    'gm'
  );
  const extractedAnswer = getAnswerFromDelta(currentContents);
  const rawArray: string[] = [];
  const normalizedAnswer: INormAnswer = {
    [AnswerType.NUMBER]: [], 
    [AnswerType.STRING]: [], 
    [AnswerType.EXPRESSION]: []
  };
  extractedAnswer.forEach((answer) => {
      // guard first so a missing entry is skipped instead of throwing on answer.value
      if (answer) {
        // replaces \n with spaces, maintain everything else
        const raw = `${answer.value.replace(/\n/g, ' ')}`;
        rawArray.push(raw);
        if (Number(answer.type) === AnswerType.EXPRESSION) {
          // 1. answer is a formula
          // removes all spaces
          normalizedAnswer[AnswerType.EXPRESSION].push(
            raw.replace(/(\r\n|\n|\r|\s|" ")/gm, ''),
          );
        } else if (isNumeric(raw) === true) {
          // 2. answer is a number, exclusively
          normalizedAnswer[AnswerType.NUMBER].push(Number(raw));
        } else {
          // 3. answer is a string
          //  we will produce a naive normalization of the string, attempting to extract numeric answers and then
          //  reducing case and removing characters

          // this extracts numeric values from a string and adds them to the normalized text array.
          // cuts special characters first so 5% and 50% don't match based on % (when numbers are removed)
          // it then removes those numbers from the string
          const specialCharRemoved = raw.replace(specialCharsRegex, '');
          const extractedNumbers = specialCharRemoved.match(/-?\d+(\.\d+)?/g)?.map(Number);
          if (extractedNumbers) {
            normalizedAnswer[AnswerType.NUMBER].push(...extractedNumbers);
          }
          const numbersRemoved = specialCharRemoved.replace(/-?\d+(\.\d+)?/g, '');

          // this attempts to extract any written numbers (ex. fifty five) after removing any special characters
          // eslint-disable-next-line prefer-regex-literals
          const detectedNumbers = nlp(
            numbersRemoved.replace(specialCharsRegex, '')
          )
            .numbers()
            .json();
          if (detectedNumbers.length > 0) {
            normalizedAnswer[AnswerType.NUMBER].push(
              ...detectedNumbers.map((num: any) => ( // eslint-disable-line @typescript-eslint/no-explicit-any
                Number(num.number.num)
               ))
            );
          }
          // 4. any remaining content is just a plain string
          //    set normalized input to lower case, strip line breaks, and trim surrounding whitespace

          if (numbersRemoved !== '') {
            normalizedAnswer[AnswerType.STRING].push(
              numbersRemoved
                .toLowerCase()
                .replace(/(\r\n|\n|\r|" ")/gm, '')
                .trim()
            );
          }
        }
      }
      return normalizedAnswer;
    }
  );
  // if a student enters multiple numeric answers, we will treat those answers as a single string
  // this prevents them from being awarded points and from matching other students' single-number answers
  if (normalizedAnswer[AnswerType.NUMBER].length > 1){
    normalizedAnswer[AnswerType.STRING].push(normalizedAnswer[AnswerType.NUMBER].toString());
    normalizedAnswer[AnswerType.NUMBER] = [];
  }
  const rawAnswer = rawArray.join('').trim();
  return { normalizedAnswer, rawAnswer };
};
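To make the "exclusively numeric" check concrete, the sketch below restates isNumeric with an unknown parameter instead of any, so it can be run on its own. The name isNumericSketch is hypothetical; the logic is intended to mirror the listing above, not replace it.

```typescript
// Stand-alone restatement of isNumeric for experimentation (the name is hypothetical).
// A value qualifies if it is a number, or a non-blank string that Number() can parse.
const isNumericSketch = (value: unknown): boolean =>
  (typeof value === 'number' || (typeof value === 'string' && value.trim() !== '')) &&
  !Number.isNaN(Number(value));

console.log(isNumericSketch('1080')); // true
console.log(isNumericSketch(' -3.50 ')); // true
console.log(isNumericSketch('')); // false
console.log(isNumericSketch('5%')); // false
console.log(isNumericSketch('1200 1080')); // false
```

Inputs that fail this check, such as '5%' or '1200 1080', fall through to the string branch.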

The main organizing principle here is that we sort answers primarily by answer type (number, expression, or string) and then by the content of the answer. Answers are compared on answer types first (to save on needlessly comparing things) and they are compared by types that are the easiest/most accurate comparison first, in the following order:

  1. Exclusively numeric answers: Given that we're a math app, it's reasonable to assume that many answers will be numeric. This is also one of the easier comparisons to make because we can rely on JavaScript's built-in comparisons to determine equality. This will catch negative numbers, trailing zeros, differing precision, etc. We use isNumeric() to detect that an answer is exclusively a number.

  2. Expressions: Quill stores expressions separately from the other content (as they are input separately), so we know with certainty when an answer is an expression. If this is the case, we do some simple cleanup (removing line breaks and all other whitespace) and write the formula to the object.

  3. Strings: Given the above, a string could be an answer with either words or numbers in it (#1 catches only answers that are exclusively numbers). Therefore, we first remove special characters and use a regular expression to find any numbers in the string. We extract those numbers and then remove them from the string so that they are not counted twice.

Secondly, we could have a student enter an answer like "eighty one", intending 81. While this scenario is likely uncommon (i.e., typing "81" may be more natural and quicker than the written-out equivalent), we are able to use nlp from the compromise library to scan the string for these kinds of answers. This isn't perfect, but it is kind of a bonus to catch anything at that point.

Finally, with the numbers and written numbers extracted, we assume that the rest of the answer is just written words. We take those words, lowercase them, and write them to the answer object.
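The string branch described above can be sketched without the compromise step. This is a simplified illustration, not the production function: splitNumbersAndWords is a hypothetical name, and the special-character pattern is the one from the listing rewritten as a regex literal.

```typescript
// Same character class as specialCharsRegex in the listing, written as a literal.
const specialChars = /[!@#$%^&*()_\+=\[\]{};:'"\\|,.<>\/?~-]/g;
// Same digit pattern the listing uses to find numbers inside a string.
const digitRuns = /-?\d+(\.\d+)?/g;

// Hypothetical helper: strip special characters, pull out digit runs,
// then lowercase and trim whatever text is left over.
const splitNumbersAndWords = (raw: string): { numbers: number[]; text: string } => {
  const cleaned = raw.replace(specialChars, '');
  const numbers = (cleaned.match(digitRuns) ?? []).map(Number);
  const text = cleaned.replace(digitRuns, '').toLowerCase().trim();
  return { numbers, text };
};

console.log(splitNumbersAndWords("The answer isn't 1080!"));
// { numbers: [ 1080 ], text: 'the answer isnt' }
```

Written-out numbers such as "eighty one" are not handled here; in the real function that part is delegated to nlp from compromise.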

Per the overview, we store the raw answer first and then store the normalized answers as an array of objects containing the value and the answer type. This allows us to compare every subobject of an answer so that we don't miss anything, while starting that comparison by answer type so that we don't over-compare.
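One step in handleNormalizeAnswers that is easy to overlook is the last one: when more than one number was extracted, the numbers are folded into a single string entry and the number list is emptied. The sketch below isolates that step. NormalizedSketch and its property names are hypothetical stand-ins for INormAnswer, whose real keys come from the AnswerType enum.

```typescript
// Hypothetical stand-in for INormAnswer; the real object is keyed by AnswerType.
interface NormalizedSketch {
  numbers: number[];
  strings: string[];
  expressions: string[];
}

// If more than one number was extracted, store them as one comma-joined string instead.
const collapseMultipleNumbers = (answer: NormalizedSketch): NormalizedSketch =>
  answer.numbers.length > 1
    ? { ...answer, numbers: [], strings: [...answer.strings, answer.numbers.toString()] }
    : answer;

const collapsed = collapseMultipleNumbers({ numbers: [3, 4], strings: [], expressions: [] });
console.log(collapsed.numbers); // []
console.log(collapsed.strings); // [ '3,4' ]
```

A single extracted number is left untouched, so it is still compared as a number.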