openprompt¶
prompt_base¶
- class Template(tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, placeholder_mapping: dict = {'<text_a>': 'text_a', '<text_b>': 'text_b'})[source]¶
Base class for all templates. Most methods are abstract, with some exceptions holding the methods common to all templates, such as loss_ids, save, and load.
- Parameters
tokenizer (PreTrainedTokenizer) – A tokenizer to appoint the vocabulary and the tokenization strategy.
placeholder_mapping (dict) – A mapping from placeholders in the template to the fields of the original input text.
- registered_inputflag_names = ['loss_ids', 'shortenable_ids']¶
- get_default_loss_ids() List[int] [source]¶
Get the loss indices for the template using the mask. E.g. when self.text is '{"placeholder": "text_a"}. {"meta": "word"} is {"mask"}.', the output is [0, 0, 0, 0, 1, 0].
- Returns
A list of integers in the range [0, 1]:
1 for a masked token.
0 for an ordinary sequence token.
- Return type
List[int]
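A minimal sketch of the default behavior described above, assuming the template text has already been parsed into a list of dicts (this is an illustrative parsed form, not the library's exact internal representation):

```python
# Hypothetical parsed form of
# '{"placeholder": "text_a"}. {"meta": "word"} is {"mask"}.'
parsed_template = [
    {"placeholder": "text_a"},
    {"text": "."},
    {"meta": "word"},
    {"text": "is"},
    {"mask": None},
    {"text": "."},
]

# mark only the mask positions for the loss
loss_ids = [1 if "mask" in part else 0 for part in parsed_template]
print(loss_ids)  # [0, 0, 0, 0, 1, 0]
```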
- get_default_shortenable_ids() List[int] [source]¶
Every template needs shortenable_ids, denoting which parts of the template can be truncated to fit the language model's max_seq_length. Default: the input text is shortenable, while the template text and other special tokens are not. E.g. when self.text is '{"placeholder": "text_a"} {"placeholder": "text_b", "shortenable": False} {"meta": "word"} is {"mask"}.', the output is [1, 0, 0, 0, 0, 0, 0].
- Returns
A list of integers in the range [0, 1]:
1 for the input tokens.
0 for the template sequence tokens.
- Return type
List[int]
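A minimal sketch of the default rule, again over an assumed parsed form (the library's parser may split the pieces differently): placeholders are shortenable unless explicitly marked otherwise, everything else is not.

```python
parsed_template = [
    {"placeholder": "text_a"},
    {"placeholder": "text_b", "shortenable": False},
    {"meta": "word"},
    {"text": "is"},
    {"mask": None},
    {"text": "."},
]

# an explicit "shortenable" flag wins; otherwise only placeholders default to 1
shortenable_ids = [
    int(part.get("shortenable", "placeholder" in part))
    for part in parsed_template
]
print(shortenable_ids)  # [1, 0, 0, 0, 0, 0]
```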
- get_default_soft_token_ids() List[int] [source]¶
This function identifies which tokens are soft tokens.
Sometimes tokens in the template are not from the vocabulary but are a sequence of soft tokens. In this case, you need to implement this function.
- Raises
NotImplementedError – if needed, add soft_token_ids into the registered_inputflag_names attribute of the Template class and implement this method.
- wrap_one_example(example: openprompt.data_utils.utils.InputExample) List[Dict] [source]¶
Given an input example that contains input text, which can be referenced by the values of self.placeholder_mapping, this function processes the example into a list of dicts. Each dict functions as a group sharing the same properties, such as whether it is shortenable, whether it is the masked position, whether it is a soft token, etc. Since the text will be tokenized in the subsequent processing procedure, these attributes are broadcast along the tokenized sequence.
- Parameters
example (InputExample) – An InputExample object, which should have attributes that can be filled into the template.
- Returns
A list of dicts of the same length as self.text. e.g. [{"loss_ids": 0, "text": "It was"}, {"loss_ids": 1, "text": "<mask>"}]
- Return type
List[Dict]
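A hedged usage sketch with a concrete subclass, ManualTemplate (the exact structure of the wrapped output depends on the template and tokenizer):

```python
from openprompt.data_utils import InputExample
from openprompt.prompts import ManualTemplate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder": "text_a"} It was {"mask"}.',
)

example = InputExample(guid=0, text_a="The film is a masterpiece.")

# each wrapped piece carries its text plus flags such as loss_ids and
# shortenable_ids, which are later broadcast over the tokenized sub-words
wrapped = template.wrap_one_example(example)
print(wrapped)
```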
- abstract process_batch(batch)[source]¶
A template should override this method if it needs to process the batch input, e.g. to substitute embeddings.
- post_processing_outputs(outputs)[source]¶
Post-process the outputs of the language model according to the needs of the template. Most templates don't need post-processing. A template like SoftTemplate, which appends the soft template as a module (rather than a sequence of input tokens) to the input, should remove the outputs at these positions to keep seq_len the same.
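A minimal sketch of the SoftTemplate-style trimming described above (assumption: a soft prompt of length num_soft_tokens was prepended to the input, so its positions are dropped to restore the original seq_len; not the library's exact implementation):

```python
import torch

def post_processing_outputs_sketch(logits: torch.Tensor,
                                   num_soft_tokens: int) -> torch.Tensor:
    # logits: [batch_size, num_soft_tokens + seq_len, vocab_size]
    # drop the positions occupied by the prepended soft prompt
    return logits[:, num_soft_tokens:, :]

logits = torch.randn(2, 5 + 128, 30522)
trimmed = post_processing_outputs_sketch(logits, num_soft_tokens=5)
print(trimmed.shape)  # torch.Size([2, 128, 30522])
```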
- save(path: str, **kwargs) None [source]¶
A save method API.
- Parameters
path (str) – A path to save your template.
- property text¶
- safe_on_text_set() None [source]¶
With this wrapper function, setting text inside on_text_set() will not trigger on_text_set() again, preventing endless recursion.
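A minimal sketch of such a recursion guard, using a hypothetical _in_on_text_set flag (the library's actual mechanism may differ):

```python
class TemplateSketch:
    def __init__(self):
        self._text = None
        self._in_on_text_set = False

    @property
    def text(self):
        return self._text

    @text.setter
    def text(self, value):
        self._text = value
        if not self._in_on_text_set:
            self._in_on_text_set = True
            try:
                self.on_text_set()  # hook runs once per outer assignment
            finally:
                self._in_on_text_set = False

    def on_text_set(self):
        # the hook may normalize and re-assign text without recursing
        self.text = self._text.strip()

t = TemplateSketch()
t.text = '  {"placeholder": "text_a"} is {"mask"}.  '
print(t.text)  # stripped, set exactly once through the hook
```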
- abstract on_text_set()[source]¶
A hook to do something when the template text is set. The designer of the template should explicitly know what should be done when the template text is set.
- classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]¶
Load a template from the template's configuration node.
- Parameters
config (CfgNode) – the sub-configuration of the template, i.e. config[config.template] if config is a global config node.
kwargs – Other kwargs that might be used to initialize the template. The actual values should match the arguments of the __init__ function.
- class Verbalizer(tokenizer: Optional[transformers.tokenization_utils.PreTrainedTokenizer] = None, classes: Optional[Sequence[str]] = None, num_classes: Optional[int] = None)[source]¶
Base class for all the verbalizers.
- Parameters
tokenizer (PreTrainedTokenizer) – A tokenizer to appoint the vocabulary and the tokenization strategy.
classes (Sequence[str]) – A sequence of classes that need to be projected.
num_classes (int) – The number of classes; may be given instead of classes.
- property label_words¶
Label words are the words in the vocabulary projected by the labels. E.g. if we want to establish a projection in sentiment classification: positive \(\rightarrow\) {wonderful, good}, then wonderful and good are label words.
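A hedged illustration of such a projection using the concrete ManualVerbalizer subclass (class names and label words here are examples, not defaults):

```python
from openprompt.prompts import ManualVerbalizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
verbalizer = ManualVerbalizer(
    tokenizer=tokenizer,
    classes=["negative", "positive"],
    label_words={
        "negative": ["bad", "terrible"],
        "positive": ["wonderful", "good"],
    },
)
print(verbalizer.label_words)
```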
- property vocab: Dict¶
- abstract generate_parameters(**kwargs) List [source]¶
The verbalizer can be seen as an extra layer on top of the original pre-trained model. In the manual verbalizer, it is a fixed one-hot vector of dimension vocab_size, with the position of the label word being 1 and 0 everywhere else. In other situations, the parameters may be a continuous vector over the vocab, with each dimension representing a weight for that token. Moreover, the parameters may be set to trainable to allow label-word selection. Therefore, this function serves as an abstract method for generating the parameters of the verbalizer, and must be implemented in any derived class.
Note that the parameters need to be registered as part of a PyTorch module; this can be achieved by wrapping a tensor using nn.Parameter().
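A minimal sketch of the one-hot, manual-verbalizer case described above (the token ids below are illustrative, and real label words may span multiple sub-tokens):

```python
import torch
import torch.nn as nn

def generate_parameters_sketch(label_word_ids, vocab_size):
    # one fixed one-hot row over the vocabulary per label word
    one_hots = torch.zeros(len(label_word_ids), vocab_size)
    for row, token_id in enumerate(label_word_ids):
        one_hots[row, token_id] = 1.0
    # register as a module parameter; requires_grad=False keeps it fixed,
    # set it to True to make label-word selection trainable
    return nn.Parameter(one_hots, requires_grad=False)

params = generate_parameters_sketch([2204, 6659], vocab_size=30522)
print(params.shape)  # torch.Size([2, 30522])
```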
- register_calibrate_logits(logits: torch.Tensor)[source]¶
This function aims to register logits that need to be calibrated, and detach the original logits from the current graph.
- process_outputs(outputs: torch.Tensor, batch: Union[Dict, openprompt.data_utils.utils.InputFeatures], **kwargs)[source]¶
By default, the verbalizer will process the logits of the PLM’s output.
- Parameters
outputs (torch.Tensor) – The current logits generated by the pre-trained language model.
batch (Union[Dict, InputFeatures]) – The input features of the data.
- gather_outputs(outputs: transformers.file_utils.ModelOutput)[source]¶
Retrieve the useful output for the verbalizer from the whole model output. By default, it will only retrieve the logits.
- Parameters
outputs (ModelOutput) – the whole output of the pre-trained language model.
- Returns
The gathered output, which should be of shape (batch_size, seq_len, any).
- Return type
torch.Tensor
- static aggregate(label_words_logits: torch.Tensor) torch.Tensor [source]¶
Aggregate the logits of multiple label words into the label's logits. Basic aggregator: the mean of each label word's logits becomes the label's logit. Can be re-implemented in advanced verbalizers.
- Parameters
label_words_logits (torch.Tensor) – The logits of the label words only.
- Returns
The final logits calculated from the label words.
- Return type
torch.Tensor
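A minimal sketch of the basic mean aggregator (the shape convention is an assumption; the library may additionally mask out padded label-word slots):

```python
import torch

def aggregate_sketch(label_words_logits: torch.Tensor) -> torch.Tensor:
    # label_words_logits: [batch_size, num_classes, num_label_words]
    # average a label's word logits into one logit per label
    return label_words_logits.mean(dim=-1)

logits = torch.randn(4, 2, 3)          # 4 examples, 2 classes, 3 words each
print(aggregate_sketch(logits).shape)  # torch.Size([4, 2])
```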
- normalize(logits: torch.Tensor) torch.Tensor [source]¶
Given logits over the entire vocab, calculate the probabilities over the label words set by softmax.
- Parameters
logits (Tensor) – The logits of the entire vocab.
- Returns
The probability distribution over the label words set.
- Return type
Tensor
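A minimal sketch of this normalization, assuming single-token label words whose vocab indices are known (label_word_ids is a hypothetical name):

```python
import torch
import torch.nn.functional as F

def normalize_sketch(logits: torch.Tensor,
                     label_word_ids: torch.Tensor) -> torch.Tensor:
    # restrict the full-vocab logits to the label words, then softmax
    label_word_logits = logits[:, label_word_ids]  # [batch, num_label_words]
    return F.softmax(label_word_logits, dim=-1)

logits = torch.randn(4, 30522)
probs = normalize_sketch(logits, torch.tensor([2204, 6659]))
print(probs.sum(dim=-1))  # each row sums to 1
```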
- abstract project(logits: torch.Tensor, **kwargs) torch.Tensor [source]¶
This method receives input logits of shape [batch_size, vocab_size] and uses the parameters of this verbalizer to project the logits over the entire vocab into the logits of the label words.
- Parameters
logits (Tensor) – The logits over the entire vocab generated by the pre-trained language model, of shape [batch_size, max_seq_length, vocab_size].
- Returns
The normalized probabilities (summing to 1) of each label.
- Return type
Tensor
- handle_multi_token(label_words_logits, mask)[source]¶
Support multiple methods to handle the multiple tokens produced by the tokenizer. We suggest using 'first' or 'max' if some parts of the tokenization are not meaningful. Can broadcast to a 3-d tensor.
- Parameters
label_words_logits (torch.Tensor) – the logits of the label words, where each word may span multiple sub-tokens.
- Returns
torch.Tensor
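A minimal sketch of two such strategies, 'first' and 'max' (the shape convention and the use of mask to mark real sub-token positions are assumptions):

```python
import torch

def handle_multi_token_sketch(label_words_logits, mask, method="first"):
    # label_words_logits: [..., num_label_words, max_subtokens]
    if method == "first":
        # keep only the first sub-token's logit per label word
        return label_words_logits[..., 0]
    if method == "max":
        # push padded sub-token positions (mask == 0) to -inf-like values,
        # then take the max over sub-tokens
        masked = label_words_logits - 1000.0 * (1 - mask)
        return masked.max(dim=-1).values
    raise ValueError(f"unknown method: {method}")

logits = torch.randn(4, 2, 3)  # up to 3 sub-tokens per label word
mask = torch.tensor([[[1, 1, 0], [1, 0, 0]]]).float().expand(4, 2, 3)
print(handle_multi_token_sketch(logits, mask, "max").shape)  # [4, 2]
```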
- classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]¶
Load a verbalizer from the verbalizer's configuration node.
- Parameters
config (CfgNode) – the sub-configuration of the verbalizer, i.e. config[config.verbalizer] if config is a global config node.
kwargs – Other kwargs that might be used to initialize the verbalizer. The actual values should match the arguments of the __init__ function.
- from_file(path: str, choice: Optional[int] = 0)[source]¶
Load the predefined label words from a verbalizer file. Currently three file formats are supported:
1. a .jsonl or .json file containing a single verbalizer in dict format;
2. a .jsonl or .json file containing a list of verbalizers in dict format;
3. a .txt or .csv file, in which the label words of each class are listed on one line, separated by commas, and an empty line begins a new verbalizer. This format is recommended when you don't know the name of each class.
The details of verbalizer format can be seen in How to Write a Verbalizer?.
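A hedged illustration of the .txt format described above, written and selected via the choice argument (file name and contents are examples):

```python
# one class per line, label words separated by commas;
# an empty line starts a second verbalizer in the same file
verbalizer_txt = (
    "bad,terrible\n"    # class 0 of verbalizer 0
    "wonderful,good\n"  # class 1 of verbalizer 0
    "\n"
    "awful\n"           # class 0 of verbalizer 1
    "great\n"           # class 1 of verbalizer 1
)
with open("my_verbalizer.txt", "w") as f:
    f.write(verbalizer_txt)

# on an instantiated verbalizer, choice picks which verbalizer to load:
# verbalizer.from_file("my_verbalizer.txt", choice=0)
```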