Error while training LSTM
I've been frequently getting this error while training the net. Can someone tell me what the problem might be?
Index exceeds matrix dimensions.
Error in varObj>@(C)full(C(:,r)) (line 45) v = cellfun(@(C) full(C(:,r)), obj.v, 'UniformOutput', false);
Error in varObj/getmb (line 45) v = cellfun(@(C) full(C(:,r)), obj.v, 'UniformOutput', false);
Error in nnCostFunctionLSTM (line 13) Y = varObj(nn.Y.getmb(r), nn.defs, nn.defs.TYPES.OUTPUT);
Error in testSumNumbersGenerator>@(nn,r,newRandGen)nnCostFunctionLSTM(nn,r,newRandGen)
Error in gradientDescentAdaDelta (line 69) [J, dJdW, dJdB] = feval(f, nn, r, true);
Error in testSumNumbersGenerator (line 137) nn = gradientDescentAdaDelta(costFunc, nn, defs, [], [], [], [], 'Training Entire Network');
I've been trying to train a network that predicts the next number in a sequence from the sum of the previous two. However, I keep running into this error. If I increase the length of the sequence, the training completes, but I don't get the expected output when I try to rebuild the sequence. Am I doing something wrong? This should be a very simple sequence to train. Here is my code:
`clear; close all force;
addpath('../../nn_gui');
addpath('../../nn_core');
addpath('../../nn_core/cuda');
addpath('../../nn_core/mmx');
addpath('../../nn_core/Optimizers');
addpath('../../nn_core/Activations');
addpath('../../nn_core/Wrappers');
addpath('../../nn_core/ConvNet');
addpath('Text');
PRECISION = 'double';
% definitions(PRECISION, useGPU, whichThreads, plotOn)
defs = definitions(PRECISION, true, [1], true);
% % Load the Shakespeares training set
% Nchars = 50; % How many characters long for each sequence
% offset = 0; % How much to shift the labels from the input data
% txtpath = 'shakespeare_subset.txt';
% [X, vmap] = streamText2mat(txtpath, Nchars, offset);
% offset = 1;
% [Y, ~] = streamText2mat(txtpath, Nchars, offset);
seqLen = 2; N = 100000; offset = 0;
Xlong = 1:2;
for i=1:N
    Xlong(end+1) = Xlong(end) + Xlong(end-1);
    if abs(Xlong(end)) > 2
        Xlong(end) = -Xlong(end);
    end
    % Xlong(end+1) = Xlong(end) +1;
    % if (Xlong(end) > 10)
    %     Xlong(end) =1;
    % end
end
% x123 = findstr(Xlong, [0 -1]);
% Xlong(x123(1):x123(1)+5)
% length(x123)
[X, vmap] = getSparseMatrix(Xlong, seqLen, offset);
offset = 1;
[Y, ~] = getSparseMatrix(Xlong, seqLen, offset);
% end
[BinCount] = hist(Xlong,vmap);
input_size = size(X{1},1); output_size = input_size; T = numel(X); % Length of all time sequences
% Both X and Y must include T=0 and T=Tf+1 'boundary conditions' filled
% with zeros for convenience
X(2:end+1) = X(:); X{1} = 0*X{1}; X(end+1) = X(1);
Y(2:end+1) = Y(:); Y{1} = 0*Y{1}; Y(end+1) = Y(1);
%%%%%%%%%%%%%%%%%%%%% Fine tuning Parameters %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
params = struct();
params.maxIter = precision(3000,defs);
params.momentum = precision(0.9,defs);
params.maxnorm = precision(0,defs);
params.lambda = precision(0,defs);
params.alphaTau = precision(0.25*params.maxIter,defs); % alpha_i = alpha_tau/(tau+i) (see "A Stochastic Quasi-Newton Method for Online Convex Optimization", Eqn. 7)
params.denoise = precision(0,defs); % set to 0 to disable
params.dropout = precision(0.6,defs); % set to 1 to disable
params.miniBatchSize = precision(50,defs); % set to zero to disable mini-batches
params.tieWeights = false;
params.T = T;
params.Tos = 0; % This is the "offset" time before the cost starts accumulating for the LSTM output
% The idea here is that the LSTM can be fed with inputs
% for n time steps, and won't be penalized for predictions
% until t>=Tos. This helps by giving the LSTM context.
% Optimization routine parameters:
params.alpha = precision(.001,defs); % If this is non-zero, use this learning rate for the entire network
params.rho = precision(0.95, defs); % AdaDelta hyperparameter (don't generally need to modify)
params.eps = precision(1e-6, defs); % AdaDelta hyperparameter (don't generally need to modify)
params.cg.N = 10; % Max CG iterations before reset
params.cg.sigma0 = 0.01; % CG Secant step-method parameter
params.cg.jmax = 10; % Maximum CG Secant iterations
params.cg.eps = 1e-4; % Update threshold for CG
params.cg.mbIters = 10; % How many CG iterations per minibatch?
%%%%%%%%%%%%%%%%%%%%%%%%% Layer Setup %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
layers.af{1} = [];
layers.sz{1} = [input_size 1 1];
layers.typ{1} = defs.TYPES.INPUT;
layers.af{end+1} = tanh_af(defs, []);
layers.sz{end+1} = [128 1 1];
layers.typ{end+1} = defs.TYPES.LSTM;
layers.af{end+1} = softmax(defs, defs.COSTS.CROSS_ENTROPY);
layers.sz{end+1} = [output_size 1 1];
layers.typ{end+1} = defs.TYPES.FULLY_CONNECTED;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if defs.plotOn
    nnShow(423, layers, defs);
end
% Process Y such that first time sequence is stripped off and replaced by
% a null at the end. This will cause LSTM to predict next character in the
% sequence. The final prediction for a sequence should be null (zero).
X = varObj(X,defs,defs.TYPES.INPUT); Y = varObj(Y,defs,defs.TYPES.OUTPUT);
modelName = 'modelSumNumbers13.mat';
if ~exist(modelName)
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% TRAINING %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    nn = nnLayers(params, layers, X, Y, {}, {}, defs);
    nn.initWeightsBiases();
    costFunc = @(nn,r,newRandGen) nnCostFunctionLSTM(nn,r,newRandGen);
    nn = gradientDescentAdaDelta(costFunc, nn, defs, [], [], [], [], 'Training Entire Network');
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    save(modelName, 'nn','vmap');
else
    load(modelName);
end
% outputText = []; outputSeq = [];
%% Generate some text by 'sampling' from LSTM
% T_samp_temp = 1/35; % "Temperature" of random sampling process (higher temperatures lead to more randomness)
T_samp_temp = 1;
T_samp = 5000; % Length of sequence to generate
% seedText = 'ROMEO:';
% seedText = 'Rafael Iriya';
% seedMatrix = unidrnd(3,3,6);
seqLen = 2;
seedMatrix = Xlong(1:seqLen);
outputSeq = seedMatrix;
% Initial text (to provide some context)
% vsize = numel(vmap);
vsize = size(vmap,2);
% Xs = full(ascii2onehot(seedText, vmap));
Xs = full(vec2map(seedMatrix, vmap));
% Xs = [zeros(vsize,1) Xs];
% Xtmp = zeros(vsize,1,size(seedMatrix,2)+1);
Xtmp = zeros(vsize,1,size(seedMatrix,2));
Xtmp(:,1,:) = Xs;
Xs = Xtmp;
nn.disableCuda();
nn.A{1} = varObj(Xs, nn.defs);
preallocateMemory(nn, 1, size(seedMatrix,2)+1);
% % Load up the LSTM with the context
% for t=2:size(seedMatrix,2)+1
% feedforwardLSTM(nn, 1, t, false, true);
% [~,cout] = max(nn.A{end}.v(:,1,t));
% % outputText = [outputText vmap(cout)]
% outputSeq = [outputSeq vmap(:,cout)];
% %fprintf('%s', vmap(cout));
% end
for t=1:size(seedMatrix,2)+100
feedforwardLSTM(nn, 1, seqLen, false, true);
[value,cout] = max(nn.A{end}.v(:,1,seqLen));
outputSeq = [outputSeq vmap(:,cout)];
tmp = zeros(vsize,1);
tmp(cout) = 1;
for tt = 1:seqLen-1
nn.A{1}.v(:,1,tt) = nn.A{1}.v(:,1,tt+1);
end
nn.A{1}.v(:,1,end) = tmp;
end
% Start sampling characters by feeding output back into the input for the
% next time step
% for t=size(seedMatrix,2)+2:size(seedMatrix,2)+T_samp
% % Generate a random sample from the softmax probability distribution
% % First, adjust/scale the distribution by a "temperature" that controls
% % how likely we are to pick the maximum likelihood prediction
% % P_next_char = exp(1/T_samp_temp*nn.A{end}.v(:,1,t-1));
% % P_next_char = P_next_char./sum(P_next_char); % normalize distribution
% % cin = randsample(vsize,1,true,P_next_char);
%
% [value,cin] = max(nn.A{end}.v(:,1,t-1));
%
% % fprintf('%s', vmap(cin));
% % outputText = [outputText vmap(cin)]
% outputSeq = [outputSeq vmap(:,cin)];
%
% % Plot the distribution over characters
% %{
% figure(777);
% plot(P_next_char);
% set(gca, 'XTick',1:numel(P_next_char), 'XTickLabel',vmap)
% waitforbuttonpress;
% %}
%
% % Feed back the output to the input
% % Generate the input for the next time step
% tmp = zeros(vsize,1);
% tmp(cin) = 1;
% nn.A{1}.v(:,1,t) = tmp;
%
% % Step the RNN forward
% feedforwardLSTM(nn, 1, t, false, true);
%
% end
% disp('');
`
`function [X, vmap] = getSparseMatrix(Xlong, seqLen, offset)
Xlong = Xlong(:,1+offset:end); N = size(Xlong,2);
m = floor(N/seqLen);
Xlong = Xlong(1:m*seqLen);
dimX = size(Xlong,1);
Npad = size(Xlong,2);
vmap = unique(Xlong', 'rows')'; vocabSize = size(vmap,2);
Xmap = zeros(1,Npad);
% Map all ASCII values to the reduced set
for i=1:vocabSize
    indx = find(all(bsxfun(@eq, Xlong', vmap(:,i)'), 2));
    % [~,indx]=ismember(vmap(:,i)',Xlong','rows')
    Xmap(indx) = i;
end
Xmap = reshape(Xmap,seqLen,[]);
%
% Xmap= hankel(Xmap, 1:seqLen);
% Xmap = Xmap';%
% xreal = vmap(Xmap);
Xca = cell(seqLen,1);
I = speye(vocabSize);
for t=1:seqLen
Xt = Xmap(t,:);
Xca{t} = I(Xt(:),:)';
end
Xfull1 = full(Xca{1});
Xfull2 = full(Xca{2});
X = Xca;
`
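Regarding the `Index exceeds matrix dimensions` error, one thing that may be worth checking is whether `X` and `Y` end up with the same number of columns. `getSparseMatrix` drops `offset` leading entries and then keeps `floor(N/seqLen)` columns, so the two calls can disagree by one. The sketch below only repeats that arithmetic for the values in the script. It assumes (this is not confirmed by the code shown) that `getmb` indexes `X` and `Y` with the same column indices `r`.

```matlab
% Sketch: count the columns getSparseMatrix would keep for X and Y.
% Uses the same trimming rule as getSparseMatrix: m = floor(N/seqLen).
seqLen = 2;
N = 100000;
totalLen = N + 2;                        % Xlong starts as 1:2 and grows by one entry per iteration

colsX = floor((totalLen - 0) / seqLen);  % offset = 0 is used for X
colsY = floor((totalLen - 1) / seqLen);  % offset = 1 is used for Y

fprintf('X columns: %d, Y columns: %d\n', colsX, colsY);
% prints "X columns: 50001, Y columns: 50000"
```

If a minibatch index can reach column 50001, indexing `Y` there would fail in the way the trace shows. With `seqLen = 4` both counts come out as 25000, which would be consistent with training completing for some longer sequence lengths, though that alone does not prove this is the cause.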
I realized the problem is that the trained network is not taking its memory into account. If I enter [1 2], the output should be -3, but it gives me something else, and if I change the input to [3 2], [-1 2], or anything else, it gives me the same result. This means it is only taking the 2 into account and not what comes before it. How do I change the network so that it takes the memory into account?
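To make the expected behaviour concrete, here is a small sketch that rebuilds the first few values with the same update rule as the script and lists what follows each occurrence of the value 2. It uses only the rule from the script; the loop length of 10 is an arbitrary choice for illustration.

```matlab
% Rebuild the first 12 values with the same update rule as the script.
Xlong = 1:2;
for i = 1:10
    Xlong(end+1) = Xlong(end) + Xlong(end-1);
    if abs(Xlong(end)) > 2
        Xlong(end) = -Xlong(end);   % flip the sign once the magnitude exceeds 2
    end
end
disp(Xlong);   % 1 2 -3 -1 4 -3 1 -2 -1 3 2 -5

% List what follows each occurrence of the value 2 (skipping the first and last entries).
idx = find(Xlong(2:end-1) == 2) + 1;
for k = idx
    fprintf('previous = %d, current = 2, next = %d\n', Xlong(k-1), Xlong(k+1));
end
% prints "previous = 1, current = 2, next = -3"
% then   "previous = 3, current = 2, next = -5"
```

The same current input 2 is followed by -3 in one case and -5 in the other, so a model that returns one fixed output for a trailing 2 cannot be using the earlier time step. It may also be worth checking whether the generation loop ever processes that earlier step, since it calls `feedforwardLSTM` only at time index `seqLen`, whereas the commented-out loop stepped through each `t`. How `feedforwardLSTM` treats its time argument is not shown here, so this is only a guess.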