Data Processors
Basic Processor
Abstract class that provides methods for loading train/dev/test/unlabeled examples for a given task.
- class DataProcessor(labels: Optional[Sequence[Any]] = None, labels_path: Optional[str] = None)
The labels of the dataset are optional.
Here are two ways to supply the labels:
I: DataProcessor(labels = ['positive', 'negative'])
II: DataProcessor(labels_path = 'datasets/labels.txt')
The labels file should contain label names separated by any blank characters, such as: positive neutral negative
- Parameters
labels (Sequence[Any], optional) – Class labels of the dataset. Defaults to None.
labels_path (str, optional) – Defaults to None. If set and labels is None, load labels from labels_path.
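The labels file format described above (label names separated by any blank characters) can be read with a plain whitespace split. The following is a minimal sketch of that idea, not OpenPrompt's own implementation; `read_labels` is a hypothetical helper introduced only for illustration.

```python
import os
import tempfile
from typing import List


def read_labels(labels_path: str) -> List[str]:
    """Hypothetical helper: read label names separated by any whitespace."""
    with open(labels_path, encoding="utf-8") as f:
        # str.split() with no argument splits on spaces, tabs and newlines alike.
        return f.read().split()


# Usage: write a small labels file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labels.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("positive neutral\nnegative\n")
    labels = read_labels(path)
```

Because the split ignores the kind of whitespace used, labels may sit on one line or on several.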
- get_label_id(label: Any) -> int
Get the label id of the corresponding label.
- Parameters
label – label in dataset
- Returns
the index of label
- Return type
int
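Since the returned id is described as the index of the label, one simple way to picture the lookup is a dictionary built from the position of each label in the labels sequence. This is a sketch under that assumption; `build_label_mapping` is a hypothetical name, not part of the documented API.

```python
from typing import Any, Dict, Sequence


def build_label_mapping(labels: Sequence[Any]) -> Dict[Any, int]:
    """Hypothetical helper: map each label to its position in `labels`."""
    return {label: index for index, label in enumerate(labels)}


label_mapping = build_label_mapping(["positive", "negative"])
negative_id = label_mapping["negative"]
```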
- get_labels() -> List[Any]
Get the labels of the dataset.
- Returns
labels of the dataset
- Return type
List[Any]
- get_num_labels()
Get the number of labels in the dataset.
- Returns
number of labels in the dataset
- Return type
int
- get_train_examples(data_dir: Optional[str] = None) -> openprompt.data_utils.utils.InputExample
Get train examples from the training file under data_dir.
Calls get_examples(data_dir, "train"); see get_examples().
- get_dev_examples(data_dir: Optional[str] = None) -> List[openprompt.data_utils.utils.InputExample]
Get dev examples from the development file under data_dir.
Calls get_examples(data_dir, "dev"); see get_examples().
- get_test_examples(data_dir: Optional[str] = None) -> List[openprompt.data_utils.utils.InputExample]
Get test examples from the test file under data_dir.
Calls get_examples(data_dir, "test"); see get_examples().
- get_unlabeled_examples(data_dir: Optional[str] = None) -> List[openprompt.data_utils.utils.InputExample]
Get unlabeled examples from the unlabeled file under data_dir.
Calls get_examples(data_dir, "unlabeled"); see get_examples().
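The split-specific methods above all forward to get_examples with a split name. The sketch below illustrates that delegation pattern with a stand-in class; it is not the OpenPrompt DataProcessor. The class name `SketchProcessor`, the `<split>.txt` file naming, and the one-example-per-line format are all assumptions made for the illustration.

```python
import os
import tempfile
from typing import List, Optional


class SketchProcessor:
    """Stand-in for illustration only; not the OpenPrompt DataProcessor."""

    def get_train_examples(self, data_dir: Optional[str] = None) -> List[str]:
        return self.get_examples(data_dir, "train")

    def get_test_examples(self, data_dir: Optional[str] = None) -> List[str]:
        return self.get_examples(data_dir, "test")

    def get_examples(self, data_dir: Optional[str], split: str) -> List[str]:
        # Assumption: each split lives in "<data_dir>/<split>.txt", one example per line.
        path = os.path.join(data_dir or ".", f"{split}.txt")
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if line.strip()]


# Usage: create two tiny split files and load them through the wrappers.
with tempfile.TemporaryDirectory() as tmp_dir:
    for split, lines in {"train": ["a", "b"], "test": ["c"]}.items():
        with open(os.path.join(tmp_dir, f"{split}.txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    processor = SketchProcessor()
    train_examples = processor.get_train_examples(tmp_dir)
    test_examples = processor.get_test_examples(tmp_dir)
```

With this layout a dataset-specific subclass only needs to implement get_examples; the per-split wrappers come for free.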
Text Classification Processor
AgnewsProcessor
- class AgnewsProcessor
AG News is a news topic classification dataset.
We use the dataset provided by LOTClass.
Examples:
    import os
    from openprompt.data_utils.text_classification_dataset import PROCESSORS

    base_path = "datasets/TextClassification"
    dataset_name = "agnews"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    trainvalid_dataset = processor.get_train_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 4
    assert processor.get_labels() == ["World", "Sports", "Business", "Tech"]
    assert len(trainvalid_dataset) == 120000
    assert len(test_dataset) == 7600
    assert test_dataset[0].text_a == "Fears for T N pension after talks"
    assert test_dataset[0].text_b == "Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."
    assert test_dataset[0].label == 2
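The examples on this page obtain a processor with PROCESSORS[dataset_name.lower()](), that is, a dictionary lookup by lowercase dataset name followed by a call that instantiates the class. A minimal sketch of such a registry is shown below; `SKETCH_PROCESSORS`, `AgnewsSketch` and `ImdbSketch` are hypothetical stand-ins, and nothing here is taken from the OpenPrompt source.

```python
from typing import Callable, Dict


class AgnewsSketch:
    """Stand-in class for illustration only."""


class ImdbSketch:
    """Stand-in class for illustration only."""


# Keys are lowercase so that lookups can normalize user input with .lower().
SKETCH_PROCESSORS: Dict[str, Callable[[], object]] = {
    "agnews": AgnewsSketch,
    "imdb": ImdbSketch,
}

dataset_name = "AGNews"
# Look up the class by normalized name, then instantiate it.
processor = SKETCH_PROCESSORS[dataset_name.lower()]()
```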
DBpediaProcessor
- class DBpediaProcessor
DBpedia is a Wikipedia topic classification dataset.
We use the dataset provided by LOTClass.
Examples:
    import os
    from openprompt.data_utils.text_classification_dataset import PROCESSORS

    base_path = "datasets/TextClassification"
    dataset_name = "dbpedia"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    trainvalid_dataset = processor.get_train_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 14
    assert len(trainvalid_dataset) == 560000
    assert len(test_dataset) == 70000
ImdbProcessor
- class ImdbProcessor
IMDB is a movie review sentiment classification dataset.
We use the dataset provided by LOTClass.
Examples:
    import os
    from openprompt.data_utils.text_classification_dataset import PROCESSORS

    base_path = "datasets/TextClassification"
    dataset_name = "imdb"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    trainvalid_dataset = processor.get_train_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 2
    assert len(trainvalid_dataset) == 25000
    assert len(test_dataset) == 25000
SST2Processor
- class SST2Processor
SST-2 is a dataset for sentiment analysis. It is a modified version of the original five-label dataset, first released in Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, that keeps only binary labels (negative or somewhat negative vs. somewhat positive or positive, with neutral sentences discarded).
We use the data released in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020).
Examples:
    import os
    from openprompt.data_utils.lmbff_dataset import PROCESSORS

    base_path = "datasets/TextClassification"
    dataset_name = "SST-2"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    train_dataset = processor.get_train_examples(dataset_path)
    dev_dataset = processor.get_dev_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 2
    assert processor.get_labels() == ['0','1']
    assert len(train_dataset) == 6920
    assert len(dev_dataset) == 872
    assert len(test_dataset) == 1821
    assert train_dataset[0].text_a == 'a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films'
    assert train_dataset[0].label == 1
Entity Typing Processor
FewNERDProcessor
- class FewNERDProcessor
Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset.
It was released together with Few-NERD: Not Only a Few-shot NER Dataset (Ning Ding et al. 2021).
Examples:
    import os
    from openprompt.data_utils.typing_dataset import PROCESSORS

    base_path = "datasets/Typing"
    dataset_name = "FewNERD"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    train_dataset = processor.get_train_examples(dataset_path)
    dev_dataset = processor.get_dev_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 66
    assert processor.get_labels() == [
        "person-actor", "person-director", "person-artist/author", "person-athlete",
        "person-politician", "person-scholar", "person-soldier", "person-other",
        "organization-showorganization", "organization-religion", "organization-company",
        "organization-sportsteam", "organization-education",
        "organization-government/governmentagency", "organization-media/newspaper",
        "organization-politicalparty", "organization-sportsleague", "organization-other",
        "location-GPE", "location-road/railway/highway/transit", "location-bodiesofwater",
        "location-park", "location-mountain", "location-island", "location-other",
        "product-software", "product-food", "product-game", "product-ship", "product-train",
        "product-airplane", "product-car", "product-weapon", "product-other",
        "building-theater", "building-sportsfacility", "building-airport", "building-hospital",
        "building-library", "building-hotel", "building-restaurant", "building-other",
        "event-sportsevent", "event-attack/battle/war/militaryconflict", "event-disaster",
        "event-election", "event-protest", "event-other",
        "art-music", "art-writtenart", "art-film", "art-painting", "art-broadcastprogram",
        "art-other",
        "other-biologything", "other-chemicalthing", "other-livingthing",
        "other-astronomything", "other-god", "other-law", "other-award", "other-disease",
        "other-medical", "other-language", "other-currency", "other-educationaldegree",
    ]
    assert dev_dataset[0].text_a == "The final stage in the development of the Skyfox was the production of a model with tricycle landing gear to better cater for the pilot training market ."
    assert dev_dataset[0].meta["entity"] == "Skyfox"
    assert dev_dataset[0].label == 30
Relation Classification Processor
TACREDProcessor
- class TACREDProcessor
TAC Relation Extraction Dataset (TACRED) is one of the largest and most widely used datasets for relation classification. It was released together with the paper Position-aware Attention and Supervised Data Improve Slot Filling (Zhang et al. 2017). This processor is also inherited by TACREVProcessor and ReTACREDProcessor.
Examples:
    import os
    from openprompt.data_utils.relation_classification_dataset import PROCESSORS

    base_path = "datasets/RelationClassification"
    dataset_name = "TACRED"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    train_dataset = processor.get_train_examples(dataset_path)
    dev_dataset = processor.get_dev_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 42
    assert processor.get_labels() == [
        "no_relation", "org:founded", "org:subsidiaries", "per:date_of_birth",
        "per:cause_of_death", "per:age", "per:stateorprovince_of_birth",
        "per:countries_of_residence", "per:country_of_birth",
        "per:stateorprovinces_of_residence", "org:website", "per:cities_of_residence",
        "per:parents", "per:employee_of", "per:city_of_birth", "org:parents",
        "org:political/religious_affiliation", "per:schools_attended",
        "per:country_of_death", "per:children", "org:top_members/employees",
        "per:date_of_death", "org:members", "org:alternate_names", "per:religion",
        "org:member_of", "org:city_of_headquarters", "per:origin", "org:shareholders",
        "per:charges", "per:title", "org:number_of_employees/members", "org:dissolved",
        "org:country_of_headquarters", "per:alternate_names", "per:siblings",
        "org:stateorprovince_of_headquarters", "per:spouse", "per:other_family",
        "per:city_of_death", "per:stateorprovince_of_death", "org:founded_by"]
    assert len(train_dataset) == 68124
    assert len(dev_dataset) == 22631
    assert len(test_dataset) == 15509
    assert train_dataset[0].text_a == 'Tom Thabane resigned in October last year to form the All Basotho Convention -LRB- ABC -RRB- , crossing the floor with 17 members of parliament , causing constitutional monarch King Letsie III to dissolve parliament and call the snap election .'
    assert train_dataset[0].meta["head"] == "All Basotho Convention"
    assert train_dataset[0].meta["tail"] == "Tom Thabane"
    assert train_dataset[0].label == 41
TACREVProcessor
- class TACREVProcessor
TACRED Revisited (TACREV) is a variant of the TACRED dataset.
It was proposed in the paper TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task (Alt et al. 2020).
This processor inherits from TACREDProcessor and can be used in the same way.
Examples:
    import os
    from openprompt.data_utils.relation_classification_dataset import PROCESSORS

    base_path = "datasets/RelationClassification"
    dataset_name = "TACREV"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    train_dataset = processor.get_train_examples(dataset_path)
    dev_dataset = processor.get_dev_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 42
    assert processor.get_labels() == [
        "no_relation", "org:founded", "org:subsidiaries", "per:date_of_birth",
        "per:cause_of_death", "per:age", "per:stateorprovince_of_birth",
        "per:countries_of_residence", "per:country_of_birth",
        "per:stateorprovinces_of_residence", "org:website", "per:cities_of_residence",
        "per:parents", "per:employee_of", "per:city_of_birth", "org:parents",
        "org:political/religious_affiliation", "per:schools_attended",
        "per:country_of_death", "per:children", "org:top_members/employees",
        "per:date_of_death", "org:members", "org:alternate_names", "per:religion",
        "org:member_of", "org:city_of_headquarters", "per:origin", "org:shareholders",
        "per:charges", "per:title", "org:number_of_employees/members", "org:dissolved",
        "org:country_of_headquarters", "per:alternate_names", "per:siblings",
        "org:stateorprovince_of_headquarters", "per:spouse", "per:other_family",
        "per:city_of_death", "per:stateorprovince_of_death", "org:founded_by"]
    assert len(train_dataset) == 68124
    assert len(dev_dataset) == 22631
    assert len(test_dataset) == 15509
ReTACREDProcessor
- class ReTACREDProcessor
Re-TACRED is a variant of the TACRED dataset.
It was proposed in the paper Re-TACRED: Addressing Shortcomings of the TACRED Dataset (Stoica et al. 2021).
This processor inherits from TACREDProcessor and can be used in the same way.
Examples:
    import os
    from openprompt.data_utils.relation_classification_dataset import PROCESSORS

    base_path = "datasets/RelationClassification"
    dataset_name = "ReTACRED"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    train_dataset = processor.get_train_examples(dataset_path)
    dev_dataset = processor.get_dev_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 40
    assert processor.get_labels() == [
        "no_relation", "org:members", "per:siblings", "per:spouse",
        "org:country_of_branch", "per:country_of_death", "per:parents",
        "per:stateorprovinces_of_residence", "org:top_members/employees", "org:dissolved",
        "org:number_of_employees/members", "per:stateorprovince_of_death", "per:origin",
        "per:children", "org:political/religious_affiliation", "per:city_of_birth",
        "per:title", "org:shareholders", "per:employee_of", "org:member_of",
        "org:founded_by", "per:countries_of_residence", "per:other_family", "per:religion",
        "per:identity", "per:date_of_birth", "org:city_of_branch", "org:alternate_names",
        "org:website", "per:cause_of_death", "org:stateorprovince_of_branch",
        "per:schools_attended", "per:country_of_birth", "per:date_of_death",
        "per:city_of_death", "org:founded", "per:cities_of_residence", "per:age",
        "per:charges", "per:stateorprovince_of_birth"]
    assert len(train_dataset) == 58465
    assert len(dev_dataset) == 19584
    assert len(test_dataset) == 13418
SemEvalProcessor
- class SemEvalProcessor
SemEval-2010 Task 8 is a traditional dataset for relation classification.
It was released together with the paper SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals (Hendrickx et al. 2010).
Examples:
    import os
    from openprompt.data_utils.relation_classification_dataset import PROCESSORS

    base_path = "datasets/RelationClassification"
    dataset_name = "SemEval"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    train_dataset = processor.get_train_examples(dataset_path)
    dev_dataset = processor.get_dev_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 19
    assert processor.get_labels() == [
        "Other", "Member-Collection(e1,e2)", "Entity-Destination(e1,e2)",
        "Content-Container(e1,e2)", "Message-Topic(e1,e2)", "Entity-Origin(e1,e2)",
        "Cause-Effect(e1,e2)", "Product-Producer(e1,e2)", "Instrument-Agency(e1,e2)",
        "Component-Whole(e1,e2)", "Member-Collection(e2,e1)", "Entity-Destination(e2,e1)",
        "Content-Container(e2,e1)", "Message-Topic(e2,e1)", "Entity-Origin(e2,e1)",
        "Cause-Effect(e2,e1)", "Product-Producer(e2,e1)", "Instrument-Agency(e2,e1)",
        "Component-Whole(e2,e1)"]
    assert len(train_dataset) == 6507
    assert len(dev_dataset) == 1493
    assert len(test_dataset) == 2717
    assert dev_dataset[0].text_a == 'the system as described above has its greatest application in an arrayed configuration of antenna elements .'
    assert dev_dataset[0].meta["head"] == "configuration"
    assert dev_dataset[0].meta["tail"] == "elements"
    assert dev_dataset[0].label == 18
Language Inference Processor
SNLIProcessor
- class SNLIProcessor
The Stanford Natural Language Inference (SNLI) corpus is a dataset for natural language inference. It was first released in A large annotated corpus for learning natural language inference (Bowman et al. 2015).
We use the data released in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020).
Examples:
    import os
    from openprompt.data_utils.lmbff_dataset import PROCESSORS

    base_path = "datasets"
    dataset_name = "SNLI"
    dataset_path = os.path.join(base_path, dataset_name, '16-13')
    processor = PROCESSORS[dataset_name.lower()]()
    train_dataset = processor.get_train_examples(dataset_path)
    dev_dataset = processor.get_dev_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert processor.get_num_labels() == 3
    assert processor.get_labels() == ['entailment', 'neutral', 'contradiction']
    assert len(train_dataset) == 549367
    assert len(dev_dataset) == 9842
    assert len(test_dataset) == 9824
    assert train_dataset[0].text_a == 'A person on a horse jumps over a broken down airplane.'
    assert train_dataset[0].text_b == 'A person is training his horse for a competition.'
    assert train_dataset[0].label == 1
Conditional Generation Processor
WebNLGProcessor
- class WebNLGProcessor
# TODO citation
Examples:
    import os
    from openprompt.data_utils.conditional_generation_dataset import PROCESSORS

    base_path = "datasets/CondGen"
    dataset_name = "webnlg_2017"
    dataset_path = os.path.join(base_path, dataset_name)
    processor = PROCESSORS[dataset_name.lower()]()
    train_dataset = processor.get_train_examples(dataset_path)
    # Note: as written, this loads the training split a second time,
    # which is why valid_dataset has the same length as train_dataset below.
    valid_dataset = processor.get_train_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    assert len(train_dataset) == 18025
    assert len(valid_dataset) == 18025
    assert len(test_dataset) == 4928
    assert test_dataset[0].text_a == " | Abilene_Regional_Airport : cityServed : Abilene,_Texas"
    assert test_dataset[0].text_b == ""
    assert test_dataset[0].tgt_text == "Abilene, Texas is served by the Abilene regional airport."