Analyzing Text Classification

Before reading this document, you are strongly encouraged to become familiar with the key concepts of system analysis.

In this file we describe how to analyze text classification models. We will give an example using the text-classification sst2 dataset, but other datasets can be analyzed in a similar way.

Data Preparation

Format of `Dataset` File

(1) datalab: if your dataset is already supported by datalab, you don't need to prepare a dataset file.
(2) tsv (no header row; each line holds the text and its label separated by a tab), for example:

I love this movie   positive
The movie is too long   negative
...

(3) json: a list of dictionaries with two keys, text and true_label

[
  {"text": "I love this movie", "true_label": "positive"},
  {"text": "The movie is too long", "true_label": "negative"}
  ...
]
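As a rough illustration of the two file-based formats above, the following sketch writes the same toy examples as a TSV file and as a JSON file. The helper names and the file names (`my-dataset.tsv`, `my-dataset.json`) are hypothetical and chosen only for this example; they are not part of the tool.

```python
import json
import tempfile
from pathlib import Path

# Toy examples mirroring the samples shown above.
examples = [
    ("I love this movie", "positive"),
    ("The movie is too long", "negative"),
]


def write_dataset_tsv(path, rows):
    """Write one `text<TAB>true_label` pair per line, with no header row."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label in rows:
            f.write(f"{text}\t{label}\n")


def write_dataset_json(path, rows):
    """Write a list of dictionaries with the keys `text` and `true_label`."""
    records = [{"text": text, "true_label": label} for text, label in rows]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


# Hypothetical output location, used only for this sketch.
out_dir = Path(tempfile.mkdtemp())
tsv_path = out_dir / "my-dataset.tsv"
json_path = out_dir / "my-dataset.json"

write_dataset_tsv(tsv_path, examples)
write_dataset_json(json_path, examples)
```

This sketch assumes the texts themselves contain no tab or newline characters; a text that does would need to be cleaned before being written as TSV.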

Format of `System Output` File

For this task, the system output file should use one of the following formats:

(1) text: one predicted label per line

predicted_label

(2) json: a list of dictionaries with one key, predicted_label

[
  {"predicted_label": "positive"},
  {"predicted_label": "negative"}
  ...
]
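The two system output formats can be produced from a plain list of predictions. The sketch below is only an illustration: the `predictions` list stands in for whatever your model returns, and the helper and file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the labels predicted by a model, in dataset order.
predictions = ["positive", "negative"]


def write_output_text(path, labels):
    """Write one predicted label per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label in labels:
            f.write(f"{label}\n")


def write_output_json(path, labels):
    """Write a list of dictionaries with the single key `predicted_label`."""
    records = [{"predicted_label": label} for label in labels]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)


# Hypothetical output location, used only for this sketch.
out_dir = Path(tempfile.mkdtemp())
text_path = out_dir / "my-system-output.txt"
json_path = out_dir / "my-system-output.json"

write_output_text(text_path, predictions)
write_output_json(json_path, predictions)
```

In both formats the order of the predictions presumably has to match the order of the examples in the dataset, since the output file carries no identifiers of its own.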

Suppose we have output files from several different systems, for example:

sst2-lstm.tsv
sst2-cnn.tsv

Performing Basic Analysis

The example below loads the sst2 dataset from DataLab and writes the analysis to report.json:

explainaboard --task text-classification --dataset sst2 --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt > report.json

where

--task: the task name; you can find all supported task names here
--system-outputs: the path(s) of the system output file(s); separate multiple paths with spaces, for example, system1 system2
--dataset: the dataset name
report.json: the generated analysis file in JSON format. Tip: use a JSON viewer like this one for easier reading.
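If you want to drive the command from a script, the options described above can be assembled into an argument list. The sketch below only builds and prints the command; `build_command` is a hypothetical helper, and it uses only the `--task`, `--dataset`, and `--system-outputs` options described above.

```python
import shlex


def build_command(task, dataset, system_outputs):
    """Assemble the explainaboard argument list from the options described above."""
    if not system_outputs:
        raise ValueError("at least one system output path is required")
    # Multiple system output paths are passed as separate, space-separated arguments.
    return [
        "explainaboard",
        "--task", task,
        "--dataset", dataset,
        "--system-outputs", *system_outputs,
    ]


cmd = build_command(
    task="text-classification",
    dataset="sst2",
    system_outputs=["./data/system_outputs/sst2/sst2-lstm-output.txt"],
)

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

To actually run it, one option is to pass `cmd` to `subprocess.run` with `stdout` pointed at an open `report.json` file, assuming explainaboard is installed and available on your PATH.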

Alternatively, you can load the dataset from an existing file using the --custom-dataset-paths option:

explainaboard --task text-classification --custom-dataset-paths ./data/system_outputs/sst2/sst2-dataset.tsv --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt

In this case, the dataset file is expected to be in TSV format, with each line containing:

text \t true_label
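Before running the analysis on a custom dataset, it can help to check that the dataset file and the system output file line up. The sketch below is a hypothetical pre-check, not something the tool is documented to require: it parses `text<TAB>true_label` lines, parses one-label-per-line predictions, and verifies that the counts match.

```python
def parse_dataset_lines(lines):
    """Parse `text<TAB>true_label` lines into (text, true_label) pairs."""
    rows = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines such as a trailing newline
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-separated fields, got {len(fields)}"
            )
        rows.append((fields[0], fields[1]))
    return rows


def parse_prediction_lines(lines):
    """Parse one predicted label per line, ignoring blank lines."""
    return [line.strip() for line in lines if line.strip()]


def check_alignment(rows, predictions):
    """Raise ValueError if the dataset and the predictions differ in length."""
    if len(rows) != len(predictions):
        raise ValueError(
            f"{len(rows)} dataset examples but {len(predictions)} predictions"
        )


# In-memory stand-ins for the contents of the two files.
dataset_lines = ["I love this movie\tpositive\n", "The movie is too long\tnegative\n"]
prediction_lines = ["positive\n", "negative\n"]

rows = parse_dataset_lines(dataset_lines)
predictions = parse_prediction_lines(prediction_lines)
check_alignment(rows, predictions)
```

With real files, the same functions can be given open file objects, since iterating over a text file yields its lines.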

Advanced Analysis Options

You can also perform pairwise analysis:

explainaboard --task text-classification --dataset sst2 --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt ./data/system_outputs/sst2/sst2-cnn-output.txt > report.json

where the two system output paths are separated by a space.

report.json: the generated analysis file in JSON format. Its schema is similar to that of the single-system report above, except that all performance values are computed as sys1 minus sys2.
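To make the "sys1 minus sys2" convention concrete, the sketch below computes a pairwise difference by hand on toy data. It uses plain accuracy as the performance value, which is an assumption made for illustration; it does not reproduce the tool's own metrics or the schema of report.json.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true and predicted labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy data: four examples and the predictions of two hypothetical systems.
true_labels = ["positive", "negative", "positive", "negative"]
sys1_predictions = ["positive", "negative", "positive", "positive"]  # 3 of 4 correct
sys2_predictions = ["positive", "positive", "negative", "positive"]  # 1 of 4 correct

sys1_accuracy = accuracy(true_labels, sys1_predictions)
sys2_accuracy = accuracy(true_labels, sys2_predictions)

# Pairwise value: first system minus second system.
difference = sys1_accuracy - sys2_accuracy
print(difference)
```

A positive difference means the first system scored higher; swapping the order of the two system outputs flips the sign.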
