Data Processors

Basic Processor

Abstract class that provides methods for loading train/dev/test/unlabeled examples for a given task.

class DataProcessor(labels: Optional[Sequence[Any]] = None, labels_path: Optional[str] = None)[source]

The labels of the dataset are optional.

Examples of loading the labels:

I: DataProcessor(labels = ['positive', 'negative'])

II: DataProcessor(labels_path = 'datasets/labels.txt'), where the labels file should contain label names separated by any blank characters, such as

positive neutral
negative
Parameters
  • labels (Sequence[Any], optional) – class labels of the dataset. Defaults to None.

  • labels_path (str, optional) – Defaults to None. If set and labels is None, load labels from labels_path.
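The whitespace-splitting behavior described for the labels file can be sketched as below. This is an illustrative stand-in, not the library's actual implementation, and `load_labels_from_file` is a hypothetical helper name.

```python
def load_labels_from_file(path):
    """Read label names separated by any blank characters.

    str.split() with no arguments splits on any run of whitespace
    (spaces, tabs, newlines), matching the file format described above.
    """
    with open(path) as f:
        return f.read().split()
```

A file containing `positive neutral` on one line and `negative` on the next yields `['positive', 'neutral', 'negative']`.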

get_label_id(label: Any) → int[source]

Get the id (index) of the given label.

Parameters

label – a label in the dataset

Returns

the index of the label

Return type

int

get_labels() → List[Any][source]

Get the labels of the dataset.

Returns

labels of the dataset

Return type

List[Any]

get_num_labels() → int[source]

Get the number of labels in the dataset.

Returns

number of labels in the dataset

Return type

int
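The three accessors above amount to simple list and index bookkeeping. A minimal standalone sketch (a hypothetical stand-in, not the actual OpenPrompt code) might look like:

```python
class LabelStore:
    """Hypothetical stand-in illustrating get_labels / get_label_id /
    get_num_labels over a fixed label list."""

    def __init__(self, labels):
        self.labels = list(labels)
        # Precompute label -> index for O(1) lookups
        self._label2id = {label: i for i, label in enumerate(self.labels)}

    def get_labels(self):
        return self.labels

    def get_label_id(self, label):
        return self._label2id[label]

    def get_num_labels(self):
        return len(self.labels)
```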

get_train_examples(data_dir: Optional[str] = None) → List[openprompt.data_utils.utils.InputExample][source]

Get train examples from the training file under data_dir.

Calls get_examples(data_dir, "train"); see get_examples().

get_dev_examples(data_dir: Optional[str] = None) → List[openprompt.data_utils.utils.InputExample][source]

Get dev examples from the development file under data_dir.

Calls get_examples(data_dir, "dev"); see get_examples().

get_test_examples(data_dir: Optional[str] = None) → List[openprompt.data_utils.utils.InputExample][source]

Get test examples from the test file under data_dir.

Calls get_examples(data_dir, "test"); see get_examples().

get_unlabeled_examples(data_dir: Optional[str] = None) → List[openprompt.data_utils.utils.InputExample][source]

Get unlabeled examples from the unlabeled file under data_dir.

Calls get_examples(data_dir, "unlabeled"); see get_examples().

abstract get_examples(data_dir: Optional[str] = None, split: Optional[str] = None) → List[openprompt.data_utils.utils.InputExample][source]

Get the given split of the dataset under data_dir.

data_dir is the base path of the dataset; for example, the training file could be located at data_dir/train.txt.

Parameters
  • data_dir (str) – the base path of the dataset

  • split (str) – train / dev / test / unlabeled

Returns

a list of InputExample

Return type

List[InputExample]
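A concrete processor only needs to implement get_examples. Here is a minimal sketch for a hypothetical tab-separated layout; `SimpleExample` stands in for openprompt.data_utils.utils.InputExample, and the file layout (data_dir/{split}.txt, one 'label<TAB>text' pair per line) is an assumption for illustration, not a format the library prescribes.

```python
import os
from dataclasses import dataclass

@dataclass
class SimpleExample:
    """Hypothetical stand-in for openprompt.data_utils.utils.InputExample."""
    guid: str
    text_a: str
    label: int

class TSVProcessor:
    """Hypothetical processor for a tab-separated layout: each split lives
    in data_dir/{split}.txt, one 'label<TAB>text' pair per line."""

    def __init__(self, labels):
        self.labels = list(labels)

    def get_examples(self, data_dir, split):
        examples = []
        path = os.path.join(data_dir, f"{split}.txt")
        with open(path) as f:
            for i, line in enumerate(f):
                label, text = line.rstrip("\n").split("\t", 1)
                # Map the label name to its index, as get_label_id() would
                examples.append(SimpleExample(guid=f"{split}-{i}",
                                              text_a=text,
                                              label=self.labels.index(label)))
        return examples
```

With this sketch, `TSVProcessor(["positive", "negative"]).get_examples(d, "train")` reads d/train.txt and returns one example per line.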

Text Classification Processor

AgnewsProcessor

class AgnewsProcessor[source]

AG News is a news topic classification dataset.

We use the dataset provided by LOTClass.

Examples:

import os
from openprompt.data_utils.text_classification_dataset import PROCESSORS

base_path = "datasets/TextClassification"

dataset_name = "agnews"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
trainvalid_dataset = processor.get_train_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 4
assert processor.get_labels() == ["World", "Sports", "Business", "Tech"]
assert len(trainvalid_dataset) == 120000
assert len(test_dataset) == 7600
assert test_dataset[0].text_a == "Fears for T N pension after talks"
assert test_dataset[0].text_b == "Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."
assert test_dataset[0].label == 2
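The asserts above suggest each AG News record supplies a title (text_a), a description (text_b), and a 0-based topic label. A sketch of parsing one such CSV record is shown below; the 'label,title,description' layout with 1-based class indices is an assumption based on the common AG News distribution, not taken from the library code.

```python
import csv
import io

def parse_agnews_row(csv_line):
    """Parse one AG News CSV record into (text_a, text_b, label).

    Assumes the common 'label,title,description' layout with 1-based
    class indices; verify against the downloaded files before relying
    on this, since it is an illustrative sketch.
    """
    label_1based, title, description = next(csv.reader(io.StringIO(csv_line)))
    # Shift the 1-based class index to the 0-based label ids used above
    return title, description, int(label_1based) - 1
```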
get_examples(data_dir, split)[source]

See DataProcessor.get_examples().

DBpediaProcessor

class DBpediaProcessor[source]

DBpedia is a Wikipedia topic classification dataset.

We use the dataset provided by LOTClass.

Examples:

import os
from openprompt.data_utils.text_classification_dataset import PROCESSORS

base_path = "datasets/TextClassification"

dataset_name = "dbpedia"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
trainvalid_dataset = processor.get_train_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 14
assert len(trainvalid_dataset) == 560000
assert len(test_dataset) == 70000
get_examples(data_dir, split)[source]

See DataProcessor.get_examples().

ImdbProcessor

class ImdbProcessor[source]

IMDB is a movie review sentiment classification dataset.

We use the dataset provided by LOTClass.

Examples:

import os
from openprompt.data_utils.text_classification_dataset import PROCESSORS

base_path = "datasets/TextClassification"

dataset_name = "imdb"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
trainvalid_dataset = processor.get_train_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 2
assert len(trainvalid_dataset) == 25000
assert len(test_dataset) == 25000
get_examples(data_dir, split)[source]

See DataProcessor.get_examples().

SST2Processor

class SST2Processor[source]

SST-2 is a sentiment analysis dataset. It is a modified, binary-label version (negative or somewhat negative vs. somewhat positive or positive, with neutral sentences discarded) of the original 5-label dataset first released in Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.

We use the data released in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020)

Examples:

import os
from openprompt.data_utils.lmbff_dataset import PROCESSORS

base_path = "datasets/TextClassification"

dataset_name = "SST-2"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 2
assert processor.get_labels() == ['0','1']
assert len(train_dataset) == 6920
assert len(dev_dataset) == 872
assert len(test_dataset) == 1821
assert train_dataset[0].text_a == 'a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films'
assert train_dataset[0].label == 1
get_examples(data_dir, split)[source]

See DataProcessor.get_examples().

Entity Typing Processor

FewNERDProcessor

class FewNERDProcessor[source]

Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset.

It was released together with Few-NERD: Not Only a Few-shot NER Dataset (Ning Ding et al. 2021)

Examples:

import os
from openprompt.data_utils.typing_dataset import PROCESSORS

base_path = "datasets/Typing"

dataset_name = "FewNERD"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 66
assert processor.get_labels() == [
    "person-actor", "person-director", "person-artist/author", "person-athlete", "person-politician", "person-scholar", "person-soldier", "person-other",
    "organization-showorganization", "organization-religion", "organization-company", "organization-sportsteam", "organization-education", "organization-government/governmentagency", "organization-media/newspaper", "organization-politicalparty", "organization-sportsleague", "organization-other",
    "location-GPE", "location-road/railway/highway/transit", "location-bodiesofwater", "location-park", "location-mountain", "location-island", "location-other",
    "product-software", "product-food", "product-game", "product-ship", "product-train", "product-airplane", "product-car", "product-weapon", "product-other",
    "building-theater", "building-sportsfacility", "building-airport", "building-hospital", "building-library", "building-hotel", "building-restaurant", "building-other",
    "event-sportsevent", "event-attack/battle/war/militaryconflict", "event-disaster", "event-election", "event-protest", "event-other",
    "art-music", "art-writtenart", "art-film", "art-painting", "art-broadcastprogram", "art-other",
    "other-biologything", "other-chemicalthing", "other-livingthing", "other-astronomything", "other-god", "other-law", "other-award", "other-disease", "other-medical", "other-language", "other-currency", "other-educationaldegree",
]
assert dev_dataset[0].text_a == "The final stage in the development of the Skyfox was the production of a model with tricycle landing gear to better cater for the pilot training market ."
assert dev_dataset[0].meta["entity"] == "Skyfox"
assert dev_dataset[0].label == 30

Relation Classification Processor

TACREDProcessor

class TACREDProcessor[source]

TAC Relation Extraction Dataset (TACRED) is one of the largest and most widely used datasets for relation classification. It was released together with the paper Position-aware Attention and Supervised Data Improve Slot Filling (Zhang et al. 2017). This processor is also inherited by TACREVProcessor and ReTACREDProcessor.

Examples:

import os
from openprompt.data_utils.relation_classification_dataset import PROCESSORS

base_path = "datasets/RelationClassification"

dataset_name = "TACRED"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 42
assert processor.get_labels() == ["no_relation", "org:founded", "org:subsidiaries", "per:date_of_birth", "per:cause_of_death", "per:age", "per:stateorprovince_of_birth", "per:countries_of_residence", "per:country_of_birth", "per:stateorprovinces_of_residence", "org:website", "per:cities_of_residence", "per:parents", "per:employee_of", "per:city_of_birth", "org:parents", "org:political/religious_affiliation", "per:schools_attended", "per:country_of_death", "per:children", "org:top_members/employees", "per:date_of_death", "org:members", "org:alternate_names", "per:religion", "org:member_of", "org:city_of_headquarters", "per:origin", "org:shareholders", "per:charges", "per:title", "org:number_of_employees/members", "org:dissolved", "org:country_of_headquarters", "per:alternate_names", "per:siblings", "org:stateorprovince_of_headquarters", "per:spouse", "per:other_family", "per:city_of_death", "per:stateorprovince_of_death", "org:founded_by"]
assert len(train_dataset) == 68124
assert len(dev_dataset) == 22631
assert len(test_dataset) == 15509
assert train_dataset[0].text_a == 'Tom Thabane resigned in October last year to form the All Basotho Convention -LRB- ABC -RRB- , crossing the floor with 17 members of parliament , causing constitutional monarch King Letsie III to dissolve parliament and call the snap election .'
assert train_dataset[0].meta["head"] == "All Basotho Convention"
assert train_dataset[0].meta["tail"] == "Tom Thabane"
assert train_dataset[0].label == 41

TACREVProcessor

class TACREVProcessor[source]

TACRED Revisited (TACREV) is a variant of the TACRED dataset.

It was proposed in the paper TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task (Alt et al. 2020).

This processor inherits from TACREDProcessor and can be used similarly.

Examples:

import os
from openprompt.data_utils.relation_classification_dataset import PROCESSORS

base_path = "datasets/RelationClassification"

dataset_name = "TACREV"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)
assert processor.get_num_labels() == 42
assert processor.get_labels() == ["no_relation", "org:founded", "org:subsidiaries", "per:date_of_birth", "per:cause_of_death", "per:age", "per:stateorprovince_of_birth", "per:countries_of_residence", "per:country_of_birth", "per:stateorprovinces_of_residence", "org:website", "per:cities_of_residence", "per:parents", "per:employee_of", "per:city_of_birth", "org:parents", "org:political/religious_affiliation", "per:schools_attended", "per:country_of_death", "per:children", "org:top_members/employees", "per:date_of_death", "org:members", "org:alternate_names", "per:religion", "org:member_of", "org:city_of_headquarters", "per:origin", "org:shareholders", "per:charges", "per:title", "org:number_of_employees/members", "org:dissolved", "org:country_of_headquarters", "per:alternate_names", "per:siblings", "org:stateorprovince_of_headquarters", "per:spouse", "per:other_family", "per:city_of_death", "per:stateorprovince_of_death", "org:founded_by"]
assert len(train_dataset) == 68124
assert len(dev_dataset) == 22631
assert len(test_dataset) == 15509

ReTACREDProcessor

class ReTACREDProcessor[source]

Re-TACRED is a variant of the TACRED dataset.

It was proposed in the paper Re-TACRED: Addressing Shortcomings of the TACRED Dataset (Stoica et al. 2021).

This processor inherits from TACREDProcessor and can be used similarly.

Examples:

import os
from openprompt.data_utils.relation_classification_dataset import PROCESSORS

base_path = "datasets/RelationClassification"

dataset_name = "ReTACRED"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)
assert processor.get_num_labels() == 40
assert processor.get_labels() == ["no_relation", "org:members", "per:siblings", "per:spouse", "org:country_of_branch", "per:country_of_death", "per:parents", "per:stateorprovinces_of_residence", "org:top_members/employees", "org:dissolved", "org:number_of_employees/members", "per:stateorprovince_of_death", "per:origin", "per:children", "org:political/religious_affiliation", "per:city_of_birth", "per:title", "org:shareholders", "per:employee_of", "org:member_of", "org:founded_by", "per:countries_of_residence", "per:other_family", "per:religion", "per:identity", "per:date_of_birth", "org:city_of_branch", "org:alternate_names", "org:website", "per:cause_of_death", "org:stateorprovince_of_branch", "per:schools_attended", "per:country_of_birth", "per:date_of_death", "per:city_of_death", "org:founded", "per:cities_of_residence", "per:age", "per:charges", "per:stateorprovince_of_birth"]
assert len(train_dataset) == 58465
assert len(dev_dataset) == 19584
assert len(test_dataset) == 13418

SemEvalProcessor

class SemEvalProcessor[source]

SemEval-2010 Task 8 is a traditional dataset in relation classification.

It was released together with the paper SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals (Hendrickx et al. 2010)

Examples:

import os
from openprompt.data_utils.relation_classification_dataset import PROCESSORS

base_path = "datasets/RelationClassification"

dataset_name = "SemEval"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)
assert processor.get_num_labels() == 19
assert processor.get_labels() == ["Other", "Member-Collection(e1,e2)", "Entity-Destination(e1,e2)", "Content-Container(e1,e2)", "Message-Topic(e1,e2)", "Entity-Origin(e1,e2)", "Cause-Effect(e1,e2)", "Product-Producer(e1,e2)", "Instrument-Agency(e1,e2)", "Component-Whole(e1,e2)", "Member-Collection(e2,e1)", "Entity-Destination(e2,e1)", "Content-Container(e2,e1)", "Message-Topic(e2,e1)", "Entity-Origin(e2,e1)", "Cause-Effect(e2,e1)", "Product-Producer(e2,e1)", "Instrument-Agency(e2,e1)", "Component-Whole(e2,e1)"]
assert len(train_dataset) == 6507
assert len(dev_dataset) == 1493
assert len(test_dataset) == 2717
assert dev_dataset[0].text_a == 'the system as described above has its greatest application in an arrayed configuration of antenna elements .'
assert dev_dataset[0].meta["head"] == "configuration"
assert dev_dataset[0].meta["tail"] == "elements"
assert dev_dataset[0].label == 18

Language Inference Processor

SNLIProcessor

class SNLIProcessor[source]

The Stanford Natural Language Inference (SNLI) corpus is a dataset for natural language inference. It was first released in A large annotated corpus for learning natural language inference (Bowman et al. 2015).

We use the data released in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020)

Examples:

import os
from openprompt.data_utils.lmbff_dataset import PROCESSORS

base_path = "datasets"

dataset_name = "SNLI"
dataset_path = os.path.join(base_path, dataset_name, '16-13')
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 3
assert processor.get_labels() == ['entailment', 'neutral', 'contradiction']
assert len(train_dataset) == 549367
assert len(dev_dataset) == 9842
assert len(test_dataset) == 9824
assert train_dataset[0].text_a == 'A person on a horse jumps over a broken down airplane.'
assert train_dataset[0].text_b == 'A person is training his horse for a competition.'
assert train_dataset[0].label == 1

Conditional Generation Processor

WebNLGProcessor

class WebNLGProcessor[source]

WebNLG is a data-to-text generation dataset in which sets of RDF triples are verbalized as natural-language text. It was released together with the paper The WebNLG Challenge: Generating Text from RDF Data (Gardent et al. 2017).

Examples:

import os
from openprompt.data_utils.conditional_generation_dataset import PROCESSORS

base_path = "datasets/CondGen"

dataset_name = "webnlg_2017"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
valid_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert len(train_dataset) == 18025
assert len(test_dataset) == 4928
assert test_dataset[0].text_a == " | Abilene_Regional_Airport : cityServed : Abilene,_Texas"
assert test_dataset[0].text_b == ""
assert test_dataset[0].tgt_text == "Abilene, Texas is served by the Abilene regional airport."
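The text_a value above is a flat serialization of RDF triples. The helper below sketches how (subject, predicate, object) triples could be joined into that format; it is inferred from the example output, not taken from the library's own serialization code.

```python
def linearize_triples(triples):
    """Join (subject, predicate, object) triples into the flat
    ' | subject : predicate : object' string format seen in text_a."""
    return "".join(f" | {s} : {p} : {o}" for s, p, o in triples)
```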
get_examples(data_dir: str, split: str) → List[openprompt.data_utils.utils.InputExample][source]

See DataProcessor.get_examples().