ELMoForManyLangs
About padding
Hello, when I use the elmo embeddings I would like to add the padding myself first, something like the following:
e = Embedder('/path/to/your/model/')
sents = [['今', '天', '天气', '真', '好', '<pad>', '<pad>', '<pad>'],
         ['潮水', '退', '了', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]
e.sents2elmo(sents)
What I want to ask is which keyword should be used for the padding token: is '<pad>' the right one?
There is no need to pad manually; the code does the padding automatically.
I need to concatenate the word vectors produced by elmo with the hidden states from my own LSTM, so I want the vectors returned by e.sents2elmo(sents) to all have the same length, which is why I want to pad them manually. If I don't do that, how can I make the returned vectors all the same length?
How about doing the padding after you get the elmo output?
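For reference, a minimal sketch of that suggestion, assuming sents2elmo returns one numpy array per sentence with shape (sentence_length, embedding_dim); the use of torch.nn.utils.rnn.pad_sequence here is just one way to do it, not part of this library:

import torch
from torch.nn.utils.rnn import pad_sequence
from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')
sents = [['今', '天', '天气', '真', '好'],
         ['潮水', '退', '了']]

# one numpy array per sentence, each of shape (sentence_length, embedding_dim)
embs = e.sents2elmo(sents)

# convert to tensors and zero-pad to the length of the longest sentence
padded = pad_sequence([torch.from_numpy(x) for x in embs], batch_first=True)
print(padded.shape)  # (num_sentences, max_sentence_length, embedding_dim)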
Hello, did you end up finding a good solution for this?
In the end I looked through the source code and found that when it returns the numpy arrays, each sentence is sliced to its actual length:
for w, c, lens, masks, texts in zip(test_w, test_c, test_lens, test_masks, test_text):
    output = self.model.forward(w, c, masks)
    for i, text in enumerate(texts):
        if self.config['encoder']['name'].lower() == 'lstm':
            data = output[i, 1:lens[i]-1, :].data
            if self.use_cuda:
                data = data.cpu()
            data = data.numpy()
        elif self.config['encoder']['name'].lower() == 'elmo':
            data = output[:, i, 1:lens[i]-1, :].data
            if self.use_cuda:
                data = data.cpu()
            data = data.numpy()

        if output_layer == -1:
            payload = np.average(data, axis=0)
        elif output_layer == -2:
            payload = data
        else:
            payload = data[output_layer]
        after_elmo.append(payload)

        cnt += 1
        if cnt % 1000 == 0:
            logging.info('Finished {0} sentences.'.format(cnt))
It's this line in the code: data = output[i, 1:lens[i]-1, :].data. Also, I ultimately want the output as tensors, so I made a small change to the source code and it works (really only a tiny modification):
def sents2elmo(self, sents, output_layer=-1):
    read_function = read_list

    if self.config['token_embedder']['name'].lower() == 'cnn':
        test, text = read_function(sents, self.config['token_embedder']['max_characters_per_token'])
    else:
        test, text = read_function(sents)

    # create test batches from the input data.
    test_w, test_c, test_lens, test_masks, test_text, recover_ind = create_batches(
        test, self.batch_size, self.word_lexicon, self.char_lexicon, self.config, text=text)

    cnt = 0
    with torch.no_grad():
        after_elmo = []
        for w, c, lens, masks, texts in zip(test_w, test_c, test_lens, test_masks, test_text):
            output = self.model.forward(w, c, masks)
            length = output.shape[2]
            for i, text in enumerate(texts):
                if self.config['encoder']['name'].lower() == 'lstm':
                    data = output[i, 1:length - 1, :].data
                elif self.config['encoder']['name'].lower() == 'elmo':
                    data = output[:, i, 1:length - 1, :].data

                if output_layer == -1:
                    payload = torch.mean(data, dim=0)
                elif output_layer == -2:
                    payload = data
                else:
                    payload = data[output_layer]
                after_elmo.append(payload)

                cnt += 1
                if cnt % 1000 == 0:
                    logging.info('Finished {0} sentences.'.format(cnt))

    after_elmo = recover(after_elmo, recover_ind)
    return torch.stack(after_elmo, dim=0)
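With this change, e.sents2elmo(sents) returns a single stacked tensor whose time dimension is the padded batch length, so it can be concatenated with LSTM hidden states of the same length. A hypothetical usage sketch (the 1024-dim elmo output and the lstm_out shape are assumptions for illustration, not taken from the library):

import torch

e = Embedder('/path/to/your/model/')
sents = [['今', '天', '天气', '真', '好'],
         ['潮水', '退', '了']]

elmo_out = e.sents2elmo(sents)             # roughly (num_sents, padded_len, 1024) with the modified sents2elmo
lstm_out = torch.randn(elmo_out.shape[0],  # stand-in for the hidden states of your own LSTM,
                       elmo_out.shape[1],  # padded to the same length as elmo_out
                       256)
combined = torch.cat([lstm_out, elmo_out], dim=-1)  # (num_sents, padded_len, 256 + 1024)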