$ fairseq-hydra-train -m --config-dir fairseq/examples/roberta/config/pretraining --config-name sinusoidal task.data=/home/jason-chou/Downloads/$DATA_DIR
Saturday, August 6, 2022
Train a sinusoidal one
Thursday, August 4, 2022
RoBERTa uses learned position embeddings!
args.encoder_learned_pos = safe_getattr(args, "encoder_learned_pos", True)
So...I need to find a variant that uses sinusoidal position embeddings to test longer context lengths.
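If I'm reading the fairseq code right, that flag just selects which module the encoder builds through the PositionalEmbedding factory, so a variant with encoder_learned_pos=False should be enough. A rough sketch (factory path and signature from memory, may differ across fairseq versions):

# Minimal sketch: how fairseq picks between learned and sinusoidal position
# embeddings. Import path/signature from memory; may differ by version.
from fairseq.modules import PositionalEmbedding

embed_positions = PositionalEmbedding(
    num_embeddings=512,  # max_positions / tokens_per_sample
    embedding_dim=768,   # encoder_embed_dim for roberta_base
    padding_idx=1,
    learned=False,       # encoder_learned_pos=False -> sinusoidal
)
print(type(embed_positions))  # expect SinusoidalPositionalEmbedding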
Is "Transformer Language Models without Positional Encodings Still Learn Positional Information" the only preprint/paper mentioning MLM ALiBi so far?
Wednesday, August 3, 2022
PyTorch checkpoint state dict
Sadly RobertaModel doesn't give you the full picture:
>>> from fairseq.models.roberta import RobertaModel
>>> roberta = RobertaModel.from_pretrained('./roberta.base/')
One has to load the checkpoint oneself:
>>> import torch
>>> path = './roberta.base/model.pt'
>>> with open(path, 'rb') as f:
...     state = torch.load(f, map_location=torch.device("cpu"))
...
>>> state['args']
Namespace(no_progress_bar=False, log_interval=25, log_format='json', tbmf_wrapper=False, seed=1, cpu=False, fp16=True, memory_efficient_fp16=True, fp16_init_scale=4, fp16_scale_window=128, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=1.0, user_dir=None, criterion='masked_lm', tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='polynomial_decay', task='masked_lm', num_workers=2, skip_invalid_size_inputs_valid_test=True, max_tokens=999999, max_sentences=16, required_batch_size_multiple=1, dataset_impl='mmap', train_subset='train', valid_subset='valid', validate_interval=1, disable_validation=False, only_validate=False, max_sentences_valid=16, curriculum=0, distributed_world_size=512, distributed_rank=0, distributed_backend='nccl', distributed_port=19812, device_id=0, distributed_no_spawn=False, ddp_backend='c10d', bucket_cap_mb=200, fix_batches_to_gpus=False, find_unused_parameters=True, arch='roberta_base', max_epoch=0, max_update=500000, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[0.0006], min_lr=-1, use_bmuf=False, global_sync_iter=10, restore_file='checkpoint_last.pt', reset_dataloader=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1, save_interval_updates=2000, keep_interval_updates=-1, keep_last_epochs=-1, no_save=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_save_optimizer_state=False, best_checkpoint_metric='loss', maximize_best_checkpoint_metric=False, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.01, force_anneal=None, warmup_updates=24000, end_learning_rate=0.0, power=1.0, total_num_update=500000, sample_break_mode='complete', tokens_per_sample=512, mask_prob=0.15, leave_unmasked_prob=0.1, random_token_prob=0.1, activation_fn='gelu', dropout=0.1, attention_dropout=0.1, encoder_embed_dim=768, encoder_layers=12, encoder_attention_heads=12, encoder_ffn_embed_dim=3072, pooler_activation_fn='tanh', max_positions=512, activation_dropout=0.0)
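The parameter tensors themselves sit under state['model'] (newer fairseq checkpoints also keep the config under state['cfg'] instead of state['args']). To tie this back to the position-embedding question, a quick peek at the learned table, hedging on the exact key names, which vary across fairseq versions:

import torch

# Peek at the learned position-embedding table inside the checkpoint.
# Key names vary across fairseq versions (older roberta checkpoints use a
# 'decoder.' prefix, newer ones 'encoder.'), hence the substring match.
with open('./roberta.base/model.pt', 'rb') as f:
    state = torch.load(f, map_location=torch.device("cpu"))
for key, tensor in state['model'].items():
    if 'embed_positions' in key:
        print(key, tuple(tensor.shape))  # roberta.base should show a 514 x 768 table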
Revisit roberta-large embeddings
def topk_similar_tokens(roberta, index, k, normalize=False, beta=100.):
    embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')
    if normalize:
        embed_tokens = embed_tokens / embed_tokens.norm(dim=1, keepdim=True)
    prob = (beta * embed_tokens[index] @ embed_tokens.T).softmax(dim=-1)
    values, indices = prob.topk(k)
    # Print the result
    print("\nTop predictions:\n")
    for value, index in zip(values, indices):
        print(f"{roberta.decode(index.unsqueeze(0)) if index.item() != roberta.task.source_dictionary.pad() else '<pad>'}: {100 * value.item():.2f}%")
>>> topk_similar_tokens(roberta, roberta.task.mask_idx, 10, normalize=True, beta=10.)
Top predictions:
<mask>: 0.54%
: 0.02%
the: 0.01%
and: 0.01%
,: 0.01%
to: 0.01%
.: 0.01%
that: 0.01%
in: 0.01%
GG: 0.01%
>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10, normalize=True, beta=10.)
Top predictions:
: 2.59%
<mask>: 0.02%
: 0.01%
.: 0.01%
the: 0.01%
,: 0.01%
a: 0.01%
!: 0.01%
。: 0.01%
?: 0.01%
>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True, beta=10.)
Top predictions:
<pad>: 74.20%
: 0.02%
channelAvailability: 0.02%
PsyNetMessage: 0.01%
guiIcon: 0.01%
NetMessage: 0.01%
: 0.01%
?????-?????-: 0.01%
: 0.01%
EStreamFrame: 0.01%
>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10, normalize=True, beta=10.)
Top predictions:
: 0.79%
.: 0.03%
<mask>: 0.03%
,: 0.02%
(: 0.02%
": 0.02%
and: 0.02%
the: 0.02%
The: 0.02%
-: 0.01%
Nothing shocking here. Other than <pad>, though, the softmax has to get quite cold in terms of beta before each token's own embedding stands out.
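(For reference, beta here is just an inverse softmax temperature, so "colder" means larger beta; a toy illustration with made-up similarity scores:)

import torch

# Larger beta (lower temperature) sharpens the softmax over similarities.
sims = torch.tensor([1.00, 0.30, 0.28])  # made-up cosine similarities
for beta in (10.0, 100.0):
    print(beta, (beta * sims).softmax(dim=-1).tolist())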
Tuesday, August 2, 2022
validate roberta.base with wikitext-103
$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 512
2022-08-02 14:16:30 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 14:16:30 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 14:43:30 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580
$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 1024
2022-08-02 15:04:04 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 15:04:04 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 15:31:16 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580
$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 256
2022-08-02 15:33:29 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 15:33:29 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid
Traceback (most recent call last):
  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 153, in <module>
    cli_main()
  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 147, in cli_main
    distributed_utils.call_main(
  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 93, in main
    itr = task.get_batch_iterator(
  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/tasks/fairseq_task.py", line 295, in get_batch_iterator
    batch_sampler = dataset.batch_by_size(
  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/base_wrapper_dataset.py", line 61, in batch_by_size
    return self.dataset.batch_by_size(
  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/fairseq_dataset.py", line 145, in batch_by_size
    return data_utils.batch_by_size(
  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/data_utils.py", line 340, in batch_by_size
    return batch_by_size_fn(
  File "fairseq/data/data_utils_fast.pyx", line 108, in fairseq.data.data_utils_fast.batch_by_size_fn
    cpdef list batch_by_size_fn(
  File "fairseq/data/data_utils_fast.pyx", line 123, in fairseq.data.data_utils_fast.batch_by_size_fn
    return batch_by_size_vec(indices, num_tokens_vec, max_tokens,
  File "fairseq/data/data_utils_fast.pyx", line 30, in fairseq.data.data_utils_fast.batch_by_size_vec
    assert max_tokens <= 0 or np.max(num_tokens_vec) <= max_tokens, (
AssertionError: Sentences lengths should not exceed max_tokens=256
The first two runs are probably equivalent: --max-tokens only caps how many tokens go into a batch, and the validation blocks themselves are at most 512 tokens, which is also why max_tokens=256 trips the assertion above. I need to figure out how to actually change the context length; a guess at the actual knob follows after these runs. The results are exactly the same with max_tokens=2048 and 4096 as well:
$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 2048
2022-08-02 15:48:54 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 15:48:55 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 16:16:02 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580
$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 4096
2022-08-02 16:18:01 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 16:18:01 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid
2022-08-02 16:45:08 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580
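If I'm reading the masked_lm task right, the block length comes from --tokens-per-sample (default 512, matching tokens_per_sample=512 in the checkpoint args above), so something like the following might be the actual knob, assuming validate.py forwards the task args (untested, and the 512-position learned embedding table would presumably still be the hard limit):
$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 4096 --tokens-per-sample 1024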
Monday, August 1, 2022
What tokens are most similar to <mask>, <s>, <pad>, and </s> in roberta-large?
>>> import torch
>>> roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
>>> def topk_similar_tokens(roberta, index, k, normalize=False):
...     embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')
...     if normalize:
...         embed_tokens = embed_tokens / embed_tokens.norm(dim=1, keepdim=True)
...     _, indices = (embed_tokens[index] @ embed_tokens.T).topk(k)
...     return indices
...
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.mask_idx, 10))
' TM JC CSI Zeus Karma CG BG GG MM Harmony'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10))
'.?*.,. *.!.,,.*+.-.'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 75, in decode
    sentences = [
  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 76, in <listcomp>
    self.bpe.decode(self.task.source_dictionary.string(s)) for s in sentences
  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in decode
    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]
  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in <listcomp>
    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]
ValueError: invalid literal for int() with base 10: '<pad>'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10)[1:])
'channelAvailability\x05EngineDebug<|endoftext|>PsyNetMessage 裏覚醒'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10))
' \u200b,, .... TM ..….. \u200e…… MM'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.mask_idx, 10, normalize=True))
'<mask> the and, to. that in GG'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10, normalize=True))
'<mask>. the, a!。?'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 75, in decode
    sentences = [
  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 76, in <listcomp>
    self.bpe.decode(self.task.source_dictionary.string(s)) for s in sentences
  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in decode
    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]
  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in <listcomp>
    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]
ValueError: invalid literal for int() with base 10: '<pad>'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True)[1:])
'channelAvailabilityPsyNetMessage guiIconNetMessage\x05?????-?????-\x1bEStreamFrame'
>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10, normalize=True))
'.<mask>, ( " and the The-'
Conclusion: in terms of their embeddings, they are most similar to meaningless tokens.
(Unnormalized inner product is informative here since untie_weights_roberta=False)
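(A quick way to double-check the tying; the lm_head parameter path below is an assumption and may differ by fairseq version:)

import torch

# If untie_weights_roberta=False, the LM head's output projection should be
# the very same tensor as the input embedding matrix.
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')
lm_head_weight = roberta.get_parameter('model.encoder.lm_head.weight')  # assumed path
print(lm_head_weight.data_ptr() == embed_tokens.data_ptr())  # expect True if tied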