
About padding

Open · A-Rain opened this issue 6 years ago · 5 comments

Hello, author. When using the elmo embeddings, I'd like to add the padding myself first, something like this:

e = Embedder('/path/to/your/model/')

sents = [['今', '天', '天气', '真', '好', '<pad>', '<pad>', '<pad>'],
['潮水', '退', '了',  '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]

e.sents2elmo(sents)

I'd like to ask: is the keyword for padding ''? And is it correct to add padding this way?

A-Rain (Mar 13 '19)

There is no need to pad manually; the code does the padding automatically.
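
For reference, plain usage without manual padding would look something like this (a minimal sketch; the model path is a placeholder and the shapes assume the default 1024-dimensional representations):

from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')  # placeholder path to a pretrained model

sents = [['今', '天', '天气', '真', '好'],
         ['潮水', '退', '了']]

# padding happens internally; each returned numpy array is sliced back
# to its own sentence length, so the shapes here are (5, 1024) and (3, 1024)
embs = e.sents2elmo(sents)
for emb in embs:
    print(emb.shape)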

Oneplus (Mar 14 '19)

Because I need to concatenate the word vectors from elmo with the hidden states from my LSTM, I want the vectors returned by e.sents2elmo(sents) to all have the same length, which is why I want to pad manually. If I don't do it this way, how can I get output vectors of uniform length?

A-Rain (Mar 14 '19)

How about doing the padding after you get the elmo output?
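
Concretely, that suggestion could look something like this (a sketch, not part of the library; it reuses e and sents from above and pads the per-sentence outputs with torch.nn.utils.rnn.pad_sequence):

import torch
from torch.nn.utils.rnn import pad_sequence

embs = e.sents2elmo(sents)  # list of numpy arrays, one per sentence: (len_i, 1024)
tensors = [torch.from_numpy(emb) for emb in embs]

# zero-pad every sentence to the length of the longest one
padded = pad_sequence(tensors, batch_first=True)  # (n_sents, max_len, 1024)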

Oneplus (Mar 19 '19)

Because I need to concatenate the word vectors from elmo with the hidden states from my LSTM, I want the vectors returned by e.sents2elmo(sents) to all have the same length, which is why I want to pad manually. If I don't do it this way, how can I get output vectors of uniform length?

Hello, may I ask whether you found a good solution in the end?

HqWu-HITCS (Mar 21 '19)

In the end I went and looked at the source code, and found that when it returns the numpy arrays it has already sliced them to the actual sentence lengths:

for w, c, lens, masks, texts in zip(test_w, test_c, test_lens, test_masks, test_text):
    output = self.model.forward(w, c, masks)
    for i, text in enumerate(texts):
        # slice each sentence back to its real length lens[i],
        # dropping the <bos>/<eos> positions and any padding
        if self.config['encoder']['name'].lower() == 'lstm':
            data = output[i, 1:lens[i]-1, :].data
            if self.use_cuda:
                data = data.cpu()
            data = data.numpy()
        elif self.config['encoder']['name'].lower() == 'elmo':
            data = output[:, i, 1:lens[i]-1, :].data
            if self.use_cuda:
                data = data.cpu()
            data = data.numpy()

        if output_layer == -1:
            # average over the layers
            payload = np.average(data, axis=0)
        elif output_layer == -2:
            # keep all layers
            payload = data
        else:
            # pick a single layer
            payload = data[output_layer]
        after_elmo.append(payload)

        cnt += 1
        if cnt % 1000 == 0:
            logging.info('Finished {0} sentences.'.format(cnt))

It's this line in the code: data = output[i, 1:lens[i]-1, :].data. Since what I ultimately wanted was tensors, I made a small change to the source code and it worked (really only a tiny change):

    def sents2elmo(self, sents, output_layer=-1):
        read_function = read_list

        if self.config['token_embedder']['name'].lower() == 'cnn':
            test, text = read_function(sents, self.config['token_embedder']['max_characters_per_token'])
        else:
            test, text = read_function(sents)

        # create test batches from the input data.
        test_w, test_c, test_lens, test_masks, test_text, recover_ind = create_batches(
            test, self.batch_size, self.word_lexicon, self.char_lexicon, self.config, text=text)

        cnt = 0

        with torch.no_grad():
            after_elmo = []
            for w, c, lens, masks, texts in zip(test_w, test_c, test_lens, test_masks, test_text):
                output = self.model.forward(w, c, masks)
                # padded sequence length; the time axis is shape[1] for the
                # lstm encoder and shape[2] for the elmo encoder, so index
                # from the end to cover both cases
                length = output.shape[-2]
                for i, text in enumerate(texts):
                    # slice to the shared padded length instead of lens[i],
                    # so every sentence in the batch has the same length
                    if self.config['encoder']['name'].lower() == 'lstm':
                        data = output[i, 1:length - 1, :].data
                    elif self.config['encoder']['name'].lower() == 'elmo':
                        data = output[:, i, 1:length - 1, :].data

                    if output_layer == -1:
                        payload = torch.mean(data, dim=0)
                    elif output_layer == -2:
                        payload = data
                    else:
                        payload = data[output_layer]
                    after_elmo.append(payload)

                    cnt += 1
                    if cnt % 1000 == 0:
                        logging.info('Finished {0} sentences.'.format(cnt))

            # note: torch.stack assumes every batch was padded to the same
            # length; with sentences spread across multiple batches this may
            # need an extra padding step
            after_elmo = recover(after_elmo, recover_ind)
            return torch.stack(after_elmo, dim=0)
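
With this change, e.sents2elmo(sents) should return a single stacked tensor rather than a list of per-sentence arrays; for output_layer=-1 that would be of shape (n_sents, padded_len, 1024), which can then be concatenated with the LSTM hidden states directly.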

A-Rain (Mar 22 '19)