Saturday, August 6, 2022

Train a sinusoidal one

 $ fairseq-hydra-train -m --config-dir fairseq/examples/roberta/config/pretraining --config-name sinusoidal task.data=/home/jason-chou/Downloads/$DATA_DIR
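Once a checkpoint comes out, a quick sanity check would be to confirm it really has no learned position-embedding weight. A minimal sketch, with the checkpoint path as a placeholder:

import torch

# Placeholder checkpoint path from the hydra run.

state = torch.load('checkpoints/checkpoint_last.pt', map_location='cpu')

# Expect no '...embed_positions.weight' key here: fairseq's sinusoidal position table is not a trainable parameter, unlike roberta.base's learned positions.

print([k for k in state['model'] if 'embed_positions' in k])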

Thursday, August 4, 2022

RoBERTa uses learned position embeddings!

 args.encoder_learned_pos = safe_getattr(args, "encoder_learned_pos", True)

So... I need to find a variant that uses sinusoidal position embeddings to test longer contexts.
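One way to get such a variant, I think, is to register a RoBERTa architecture that defaults encoder_learned_pos to False before base_architecture fills in the rest. A rough, untested sketch (the architecture name is mine):

from fairseq.models import register_model_architecture

from fairseq.models.roberta.model import base_architecture

from fairseq.utils import safe_getattr


@register_model_architecture("roberta", "roberta_base_sinusoidal")

def roberta_base_sinusoidal(args):

  # Default to sinusoidal positions so the encoder builds SinusoidalPositionalEmbedding instead of a learned table.

  args.encoder_learned_pos = safe_getattr(args, "encoder_learned_pos", False)

  base_architecture(args)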

Is "Transformer Language Models without Positional Encodings Still Learn Positional Information" the only preprint/paper mentioning MLM ALiBi so far?

Wednesday, August 3, 2022

PyTorch checkpoint state dict

Sadly, RobertaModel doesn't give you the full picture of the training arguments:

>>> from fairseq.models.roberta import RobertaModel

>>> roberta = RobertaModel.from_pretrained('./roberta.base/')

One has to load the checkpoint directly:

>>> import torch

>>> path = './roberta.base/model.pt'

>>> with open(path, 'rb') as f:

>>>   state = torch.load(f, map_location=torch.device("cpu"))

>>> state['args']

Namespace(no_progress_bar=False, log_interval=25, log_format='json', tbmf_wrapper=False, seed=1, cpu=False, fp16=True, memory_efficient_fp16=True, fp16_init_scale=4, fp16_scale_window=128, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=1.0, user_dir=None, criterion='masked_lm', tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='polynomial_decay', task='masked_lm', num_workers=2, skip_invalid_size_inputs_valid_test=True, max_tokens=999999, max_sentences=16, required_batch_size_multiple=1, dataset_impl='mmap', train_subset='train', valid_subset='valid', validate_interval=1, disable_validation=False, only_validate=False, max_sentences_valid=16, curriculum=0, distributed_world_size=512, distributed_rank=0, distributed_backend='nccl', distributed_port=19812, device_id=0, distributed_no_spawn=False, ddp_backend='c10d', bucket_cap_mb=200, fix_batches_to_gpus=False, find_unused_parameters=True, arch='roberta_base', max_epoch=0, max_update=500000, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[0.0006], min_lr=-1, use_bmuf=False, global_sync_iter=10, restore_file='checkpoint_last.pt', reset_dataloader=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1, save_interval_updates=2000, keep_interval_updates=-1, keep_last_epochs=-1, no_save=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_save_optimizer_state=False, best_checkpoint_metric='loss', maximize_best_checkpoint_metric=False, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.01, force_anneal=None, warmup_updates=24000, end_learning_rate=0.0, power=1.0, total_num_update=500000, sample_break_mode='complete', tokens_per_sample=512, mask_prob=0.15, leave_unmasked_prob=0.1, random_token_prob=0.1, activation_fn='gelu', dropout=0.1, attention_dropout=0.1, encoder_embed_dim=768, encoder_layers=12, encoder_attention_heads=12, encoder_ffn_embed_dim=3072, pooler_activation_fn='tanh', max_positions=512, activation_dropout=0.0)
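Relatedly, the model weights live under state['model'], so whether the checkpoint carries a learned position-embedding table can be checked directly (a quick sketch; I haven't double-checked the exact key names):

>>> [k for k in state['model'] if 'embed_positions' in k]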

Revisit roberta-large embeddings

def topk_similar_tokens(roberta, index, k, normalize=False, beta=100.):

  # Print the k tokens whose embeddings are most similar to that of the token at `index`, as softmax probabilities with inverse temperature `beta`.

  embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')

  if normalize:

    embed_tokens = embed_tokens / embed_tokens.norm(dim=1, keepdim=True)

  prob = (beta * embed_tokens[index] @ embed_tokens.T).softmax(dim=-1)

  values, indices = prob.topk(k)

  # Print the result

  print("\nTop predictions:\n")

  for value, index in zip(values, indices):

    print(f"{roberta.decode(index.unsqueeze(0)) if index.item() != roberta.task.source_dictionary.pad() else '<pad>'}: {100 * value.item():.2f}%")

>>> topk_similar_tokens(roberta, roberta.task.mask_idx, 10, normalize=True, beta=10.)


Top predictions:


<mask>: 0.54%

: 0.02%

 the: 0.01%

 and: 0.01%

,: 0.01%

 to: 0.01%

.: 0.01%

 that: 0.01%

 in: 0.01%

 GG: 0.01%

>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10, normalize=True, beta=10.)


Top predictions:


: 2.59%

<mask>: 0.02%

: 0.01%

.: 0.01%

 the: 0.01%

,: 0.01%

 a: 0.01%

!: 0.01%

。: 0.01%

?: 0.01%

>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True, beta=10.)


Top predictions:


<pad>: 74.20%

: 0.02%

channelAvailability: 0.02%

PsyNetMessage: 0.01%

 guiIcon: 0.01%

NetMessage: 0.01%

: 0.01%

?????-?????-: 0.01%

 0.01%

EStreamFrame: 0.01%

>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10, normalize=True, beta=10.)


Top predictions:


: 0.79%

.: 0.03%

<mask>: 0.03%

,: 0.02%

 (: 0.02%

 ": 0.02%

 and: 0.02%

 the: 0.02%

 The: 0.02%

-: 0.01%


Nothing shocking here. Other than for <pad>, though, beta has to get quite large (i.e., the softmax quite cold) before the aligned embedding stands out.
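To make the temperature point concrete, here is a small variation on the helper above that just returns the probability the softmax assigns to the query token itself (a sketch reusing the same setup; no outputs reproduced here):

def self_similarity_prob(roberta, index, beta, normalize=True):

  # Probability mass the beta-scaled similarity softmax puts on the query token itself.

  embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')

  if normalize:

    embed_tokens = embed_tokens / embed_tokens.norm(dim=1, keepdim=True)

  prob = (beta * embed_tokens[index] @ embed_tokens.T).softmax(dim=-1)

  return prob[index].item()


# e.g. [self_similarity_prob(roberta, roberta.task.mask_idx, beta) for beta in (10., 30., 100.)]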

Tuesday, August 2, 2022

Validate roberta.base with wikitext-103

$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 512

2022-08-02 14:16:30 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 14:16:30 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 14:43:30 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580


$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 1024


2022-08-02 15:04:04 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 15:04:04 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 15:31:16 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580


$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 256


2022-08-02 15:33:29 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 15:33:29 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

Traceback (most recent call last):

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 153, in <module>

    cli_main()

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 147, in cli_main

    distributed_utils.call_main(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/distributed/utils.py", line 369, in call_main

    main(cfg, **kwargs)

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 93, in main

    itr = task.get_batch_iterator(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/tasks/fairseq_task.py", line 295, in get_batch_iterator

    batch_sampler = dataset.batch_by_size(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/base_wrapper_dataset.py", line 61, in batch_by_size

    return self.dataset.batch_by_size(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/fairseq_dataset.py", line 145, in batch_by_size

    return data_utils.batch_by_size(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/data_utils.py", line 340, in batch_by_size

    return batch_by_size_fn(

  File "fairseq/data/data_utils_fast.pyx", line 108, in fairseq.data.data_utils_fast.batch_by_size_fn

    cpdef list batch_by_size_fn(

  File "fairseq/data/data_utils_fast.pyx", line 123, in fairseq.data.data_utils_fast.batch_by_size_fn

    return batch_by_size_vec(indices, num_tokens_vec, max_tokens,

  File "fairseq/data/data_utils_fast.pyx", line 30, in fairseq.data.data_utils_fast.batch_by_size_vec

    assert max_tokens <= 0 or np.max(num_tokens_vec) <= max_tokens, (

AssertionError: Sentences lengths should not exceed max_tokens=256


The first two runs are probably equivalent: --max-tokens only caps the number of tokens per batch, not the context length. I need to figure out how to actually change the context length (see the sketch at the end of this entry). The results are exactly the same with --max-tokens 2048 and 4096 as well:

$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 2048


2022-08-02 15:48:54 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 15:48:55 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 16:16:02 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580  


$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 4096


2022-08-02 16:18:01 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 16:18:01 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 16:45:08 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580  
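If the block length is really governed by the task's tokens_per_sample (512 in the checkpoint args above) rather than by --max-tokens, then an override along the following lines might be the way to change it. Just a guess, untested; the only thing I know exists here is checkpoint_utils.load_model_ensemble_and_task:

from fairseq import checkpoint_utils

# Untested guess: override the stored task args so that validation blocks get re-cut at a different length.

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(

  ['./roberta.base/model.pt'],

  arg_overrides={'tokens_per_sample': 256},

)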

Monday, August 1, 2022

What tokens are most similar to <mask>, <s>, <pad>, and </s> in roberta-large?

>>> import torch

>>> roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')

>>> def topk_similar_tokens(roberta, index, k, normalize=False):

...   embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')

...   if normalize:

...     embed_tokens = embed_tokens / embed_tokens.norm(dim=1, keepdim=True)

...   _, indices = (embed_tokens[index] @ embed_tokens.T).topk(k)

...   return indices

... 

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.mask_idx, 10))

' TM JC CSI Zeus Karma CG BG GG MM Harmony'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10))

'.?*.,. *.!.,,.*+.-.'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10))

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 75, in decode

    sentences = [

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 76, in <listcomp>

    self.bpe.decode(self.task.source_dictionary.string(s)) for s in sentences

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in decode

    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in <listcomp>

    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]

ValueError: invalid literal for int() with base 10: '<pad>'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10)[1:])

'channelAvailability\x05EngineDebug<|endoftext|>PsyNetMessage 裏覚醒'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10))

' \u200b,, .... TM ..….. \u200e…… MM'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.mask_idx, 10, normalize=True))

'<mask> the and, to. that in GG'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10, normalize=True))

'<mask>. the, a!。?'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True))

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 75, in decode

    sentences = [

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 76, in <listcomp>

    self.bpe.decode(self.task.source_dictionary.string(s)) for s in sentences

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in decode

    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in <listcomp>

    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]

ValueError: invalid literal for int() with base 10: '<pad>'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True)[1:])

'channelAvailabilityPsyNetMessage guiIconNetMessage\x05?????-?????-\x1bEStreamFrame'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10, normalize=True))

'.<mask>, ( " and the The-'


Conclusion: in terms of their embeddings, they are most similar to meaningless tokens. (The unnormalized inner product is informative here since untie_weights_roberta=False, i.e., the embedding table is tied to the LM head's output projection.)
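As a quick check on that parenthetical, the tie can be verified directly (a sketch; 'model.encoder.lm_head.weight' is my guess at the LM head's parameter path):

>>> embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')

>>> lm_head_weight = roberta.get_parameter('model.encoder.lm_head.weight')

>>> embed_tokens.data_ptr() == lm_head_weight.data_ptr()  # expect True when tied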