openprompt

Contents

prompt_base

class Template(tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, placeholder_mapping: dict = {'<text_a>': 'text_a', '<text_b>': 'text_b'})[source]

Base class for all templates. Most methods are abstract, with a few exceptions that hold methods common to all templates, such as loss_ids, save, and load.

Parameters
  • tokenizer (PreTrainedTokenizer) – A tokenizer to appoint the vocabulary and the tokenization strategy.

  • placeholder_mapping (dict) – A mapping from placeholder tokens in the template (e.g. '<text_a>') to the corresponding attribute names of the input example (e.g. text_a).
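
Example (a minimal sketch; Template itself is abstract, so this uses the concrete ManualTemplate subclass and an illustrative BERT checkpoint):

from transformers import AutoTokenizer
from openprompt.prompts import ManualTemplate

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder": "text_a"} It was {"mask"}.',
)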

registered_inputflag_names = ['loss_ids', 'shortenable_ids']
get_default_loss_ids() List[int][source]

Get the loss indices for the template using the mask. E.g. when self.text is '{"placeholder": "text_a"}. {"meta": "word"} is {"mask"}.', the output is [0, 0, 0, 0, 1, 0].

Returns

A list of integers in the range [0, 1]:

  • 1 for masked tokens.

  • 0 for sequence tokens.

Return type

List[int]
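
Example (a sketch reusing the template from the sketch above; the output mirrors the docstring's example):

template.text = '{"placeholder": "text_a"}. {"meta": "word"} is {"mask"}.'
print(template.get_default_loss_ids())  # [0, 0, 0, 0, 1, 0]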

get_default_shortenable_ids() List[int][source]

Every template needs shortenable_ids, denoting which parts of the template can be truncated to fit the language model’s max_seq_length. Default: the input text is shortenable, while the template text and other special tokens are not.

e.g. when self.text is '{"placeholder": "text_a"} {"placeholder": "text_b", "shortenable": False} {"meta": "word"} is {"mask"}.', output is [1, 0, 0, 0, 0, 0, 0].

Returns

A list of integers in the range [0, 1]:

  • 1 for the input tokens.

  • 0 for the template sequence tokens.

Return type

List[int]
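
Example (a sketch mirroring the docstring's example; segments marked "shortenable": False are excluded from truncation):

template.text = ('{"placeholder": "text_a"} '
                 '{"placeholder": "text_b", "shortenable": False} '
                 '{"meta": "word"} is {"mask"}.')
print(template.get_default_shortenable_ids())  # [1, 0, 0, 0, 0, 0, 0]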

get_default_soft_token_ids() List[int][source]

This function identifies which tokens are soft tokens.

Sometimes the tokens in the template are not from the vocabulary but are a sequence of soft tokens. In that case, you need to implement this function.

Raises

NotImplementedError – If soft token ids are needed, add soft_token_ids to the registered_inputflag_names attribute of the Template class and implement this method.

incorporate_text_example(example: openprompt.data_utils.utils.InputExample, text=None)[source]
parse_text(text: str) List[Dict][source]
wrap_one_example(example: openprompt.data_utils.utils.InputExample) List[Dict][source]

Given an input example that contains input text, which can be referenced by the values of self.template.placeholder_mapping, this function processes the example into a list of dicts. Each dict functions as a group and carries the same properties, such as whether it is shortenable, whether it is the masked position, and whether it is a soft token. Since the text will be tokenized in subsequent processing, these attributes are broadcast along the tokenized sentence.

Parameters

example (InputExample) – An InputExample object, which should have attributes that are able to be filled in the template.

Returns

A list of dicts of the same length as self.text, e.g. [{"loss_ids": 0, "text": "It was"}, {"loss_ids": 1, "text": "<mask>"}, ]

Return type

List[Dict]
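
Example (a sketch reusing the template above; the exact keys in each dict depend on registered_inputflag_names):

from openprompt.data_utils import InputExample

example = InputExample(guid=0, text_a="Albert Einstein was born in Ulm.")
wrapped = template.wrap_one_example(example)
# wrapped[0] is the list of dicts to be tokenized, e.g.
# [{'text': 'Albert Einstein was born in Ulm.', 'loss_ids': 0, 'shortenable_ids': 1}, ...]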

abstract process_batch(batch)[source]

Subclasses of Template should override this method if the batch input needs processing, such as substituting embeddings.

post_processing_outputs(outputs)[source]

Post-process the outputs of the language model according to the needs of the template. Most templates don’t need post-processing. A template like SoftTemplate, which appends the soft template as a module (rather than as a sequence of input tokens) to the input, should remove the outputs at these positions to keep seq_len unchanged.

save(path: str, **kwargs) None[source]

A save method API.

Parameters

path (str) – A path to save your template.

property text
safe_on_text_set() None[source]

With this wrapper, setting text inside on_text_set() will not trigger on_text_set() again, preventing endless recursion.

abstract on_text_set()[source]

A hook to do something when the template text is set. The designer of the template should know explicitly what should be done when the template text is set.

from_file(path: str, choice: int = 0)[source]

Read the template from a local file.

Parameters
  • path (str) – The path of the local template file.

  • choice (int) – The index of the line to read; the choice-th line of the file is used as the template.
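
Example (a sketch, assuming a hypothetical file manual_template.txt with one template per line):

# manual_template.txt:
#   {"placeholder": "text_a"} It was {"mask"}.
#   {"placeholder": "text_a"} All in all, it was {"mask"}.
template = template.from_file("manual_template.txt", choice=1)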

classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]

Load a template from the template’s configuration node.

Parameters
  • config (CfgNode) – the sub-configuration of template, i.e. config[config.template] if config is a global config node.

  • kwargs – Other kwargs that might be used to initialize the template. The actual values should match the arguments of the __init__ function.
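
Example (a sketch reusing the tokenizer and ManualTemplate from the sketch above; the config fields are illustrative assumptions and must match the arguments of the concrete template's __init__):

from yacs.config import CfgNode

cfg = CfgNode({"text": '{"placeholder": "text_a"} It was {"mask"}.'})
template = ManualTemplate.from_config(cfg, tokenizer=tokenizer)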

training: bool
class Verbalizer(tokenizer: Optional[transformers.tokenization_utils.PreTrainedTokenizer] = None, classes: Optional[Sequence[str]] = None, num_classes: Optional[int] = None)[source]

Base class for all the verbalizers.

Parameters
  • tokenizer (PreTrainedTokenizer) – A tokenizer to appoint the vocabulary and the tokenization strategy.

  • classes (Sequence[str]) – A sequence of classes that need to be projected.

  • num_classes (Optional[int]) – The number of classes. If classes is given, this can be inferred from it.

property label_words

Label words are the words in the vocabulary onto which the labels are projected. E.g. if we want to establish a projection in sentiment classification: positive \(\rightarrow\) {wonderful, good}, then wonderful and good are label words.
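
Example (a minimal sketch; Verbalizer itself is abstract, so this uses the concrete ManualVerbalizer subclass, reusing the tokenizer from the sketch above):

from openprompt.prompts import ManualVerbalizer

verbalizer = ManualVerbalizer(
    tokenizer=tokenizer,
    classes=["negative", "positive"],
    label_words={"negative": ["bad"], "positive": ["good", "wonderful"]},
)
print(verbalizer.label_words)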

safe_on_label_words_set()[source]
on_label_words_set()[source]

A hook to do something when textual label words were set.

property vocab: Dict
property vocab_size: int
abstract generate_parameters(**kwargs) List[source]

The verbalizer can be seen as an extra layer on top of the original pre-trained model. In the manual verbalizer, it is a fixed one-hot vector of dimension vocab_size, with the position of the label word being 1 and 0 everywhere else. In other situations, the parameters may be a continuous vector over the vocab, with each dimension representing a weight for that token. Moreover, the parameters may be set to trainable to allow label word selection.

Therefore, this function serves as an abstract method for generating the parameters of the verbalizer, and must be implemented in any derived class.

Note that the parameters need to be registered as part of the PyTorch module; this can be achieved by wrapping a tensor using nn.Parameter().
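
Example (a sketch of registering trainable parameters in a derived class; the weight's shape and initialization are illustrative assumptions):

import torch
import torch.nn as nn
from openprompt.prompt_base import Verbalizer

class MyVerbalizer(Verbalizer):
    def generate_parameters(self):
        # A trainable weight over the vocabulary for each class; wrapping
        # the tensor in nn.Parameter registers it with the module.
        weight = torch.zeros(self.num_classes, self.tokenizer.vocab_size)
        self.vocab_weight = nn.Parameter(weight, requires_grad=True)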

register_calibrate_logits(logits: torch.Tensor)[source]

This function registers logits that need to be calibrated and detaches the original logits from the current graph.

process_outputs(outputs: torch.Tensor, batch: Union[Dict, openprompt.data_utils.utils.InputFeatures], **kwargs)[source]

By default, the verbalizer will process the logits of the PLM’s output.

Parameters
  • outputs (torch.Tensor) – The current logits generated by the pre-trained language model.

  • batch (Union[Dict, InputFeatures]) – The input features of the data.

gather_outputs(outputs: transformers.file_utils.ModelOutput)[source]

Retrieve the output useful to the verbalizer from the whole model output. By default, only the logits are retrieved.

Parameters

outputs (ModelOutput) – The whole output of the pre-trained language model.

Returns

The gathered output, of shape (batch_size, seq_len, any).

Return type

torch.Tensor

static aggregate(label_words_logits: torch.Tensor) torch.Tensor[source]

Aggregate the logits of multiple label words into the label’s logits. The basic aggregator takes the mean of each label’s word logits as the label’s logits. Can be re-implemented in an advanced verbalizer.

Parameters

label_words_logits (torch.Tensor) – The logits of the label words only.

Returns

The final logits calculated by the label words.

Return type

torch.Tensor
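
Example (a sketch of the basic mean aggregator; shapes are illustrative):

import torch

# (batch_size, num_classes, words_per_label)
label_words_logits = torch.randn(4, 2, 3)
label_logits = label_words_logits.mean(dim=-1)  # (batch_size, num_classes)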

normalize(logits: torch.Tensor) torch.Tensor[source]

Given logits over the entire vocab, calculate the probabilities over the label word set using softmax.

Parameters

logits (Tensor) – The logits of the entire vocab.

Returns

The probability distribution over the label words set.

Return type

Tensor
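
Example (a sketch of softmax normalization over the label-word logits; shapes are illustrative):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 2)         # (batch_size, num_label_words)
probs = F.softmax(logits, dim=-1)  # each row sums to 1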

training: bool
abstract project(logits: torch.Tensor, **kwargs) torch.Tensor[source]

This method receives input logits of shape [batch_size, vocab_size] and uses the parameters of this verbalizer to project the logits over the entire vocabulary into the logits of the label words.

Parameters

logits (Tensor) – The logits over the entire vocab generated by the pre-trained language model, with shape [batch_size, vocab_size]

Returns

The normalized probs (summing to 1) of each label.

Return type

Tensor
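
Example (a sketch of a one-hot style projection, assuming each label has a single single-token label word whose vocabulary ids are stored in a hypothetical label_word_ids tensor):

import torch

logits = torch.randn(4, 30522)               # (batch_size, vocab_size)
label_word_ids = torch.tensor([2919, 2204])  # hypothetical token ids
label_words_logits = logits[:, label_word_ids]  # (batch_size, num_classes)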

handle_multi_token(label_words_logits, mask)[source]

Support multiple methods to handle the multiple tokens produced by the tokenizer. We suggest using ‘first’ or ‘max’ if some parts of the tokenization are not meaningful. Can broadcast to a 3-d tensor.

Parameters

label_words_logits (torch.Tensor) – The logits of the label words, where each label word may consist of multiple tokens.

Returns

The logits after the multi-token handling, with the token dimension reduced.

Return type

torch.Tensor
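
Example (a sketch of the ‘first’ and ‘max’ strategies; shapes and the mask convention are illustrative):

import torch

# 2 label words, each split into up to 3 sub-tokens; mask marks real
# sub-tokens (1) versus padding (0).
label_words_logits = torch.randn(2, 3)
mask = torch.tensor([[1, 1, 0], [1, 0, 0]])

first = label_words_logits[..., 0]                               # 'first'
maxed = (label_words_logits - 1000 * (1 - mask)).max(-1).values  # 'max'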

classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]

Load a verbalizer from the verbalizer’s configuration node.

Parameters
  • config (CfgNode) – the sub-configuration of verbalizer, i.e. config[config.verbalizer] if config is a global config node.

  • kwargs – Other kwargs that might be used to initialize the verbalizer. The actual values should match the arguments of the __init__ function.

from_file(path: str, choice: Optional[int] = 0)[source]

Load the predefined label words from a verbalizer file. Three file formats are currently supported:

1. a .jsonl or .json file containing a single verbalizer in dict format;

2. a .jsonl or .json file containing a list of verbalizers in dict format;

3. a .txt or .csv file, in which the label words of each class are listed on one line, separated by commas. Begin a new verbalizer with an empty line. This format is recommended when you don’t know the name of each class.

The details of verbalizer format can be seen in How to Write a Verbalizer?.
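
Example (a sketch of the recommended .txt format; the file name and its contents are illustrative):

# sentiment_verbalizer.txt:
#   bad,terrible
#   good,wonderful
verbalizer = verbalizer.from_file("sentiment_verbalizer.txt", choice=0)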

Parameters
  • path (str) – The path of the local verbalizer file.

  • choice (int) – The choice of verbalizer in a file containing multiple verbalizers.

Returns

self object

Return type

Verbalizer
