I am wondering whether there is a way to use different learning rates for different layers, as in Caffe. I am trying to modify a pre-trained model and use it for other tasks. What I want is to speed up training for the newly added layers and keep the trained layers at a low learning rate to prevent them from being distorted. For example, I have a pre-trained model with 5 conv layers. Now I add a new conv layer and fine-tune it. The first 5 layers would have a learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
It can be achieved quite easily with 2 optimizers:
    var_list1 = [variables from first 5 layers]
    var_list2 = [the rest of variables]
    train_op1 = tf.train.GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
    train_op2 = tf.train.GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
    # run both updates together in a single training op
    train_op = tf.group(train_op1, train_op2)
One disadvantage of this implementation is that it computes tf.gradients(.) twice inside the optimizers and thus it might not be optimal in terms of execution speed. This can be mitigated by explicitly calling tf.gradients(.), splitting the list into 2 and passing corresponding gradients to both optimizers.
Related question: Holding variables constant during optimizer
EDIT: Added more efficient but longer implementation:
    var_list1 = [variables from first 5 layers]
    var_list2 = [the rest of variables]
    opt1 = tf.train.GradientDescentOptimizer(0.00001)
    opt2 = tf.train.GradientDescentOptimizer(0.0001)
    # compute all gradients once, then split them per variable group
    grads = tf.gradients(loss, var_list1 + var_list2)
    grads1 = grads[:len(var_list1)]
    grads2 = grads[len(var_list1):]
    train_op1 = opt1.apply_gradients(zip(grads1, var_list1))
    train_op2 = opt2.apply_gradients(zip(grads2, var_list2))
    train_op = tf.group(train_op1, train_op2)
You can use tf.trainable_variables() to get all trainable variables and select the ones you need from them.
The difference is that in the first implementation tf.gradients(.) is called twice inside the optimizers. This may cause some redundant operations to be executed (e.g. gradients on the first layer can reuse some computations for the gradients of the following layers).
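To make the difference between the two implementations concrete, here is a small framework-free sketch in plain Python (it does not use TensorFlow). The quadratic loss, the `compute_gradients` helper and its call counter are all made up for illustration; only the two learning rates come from the code above. It suggests that both variants produce the same update, while the second one evaluates the gradients only once.

```python
# Toy stand-in for a model: loss = sum(p**2), so d(loss)/dp = 2*p.
# All names here are hypothetical illustration, not TensorFlow API.
calls = {"gradients": 0}


def compute_gradients(all_params, wanted):
    """Return gradients for the indices in `wanted` and count each call."""
    calls["gradients"] += 1
    return [2.0 * all_params[i] for i in wanted]


def apply_updates(params, group, grads, lr, out):
    """Plain gradient descent step for one group of parameters."""
    for i, g in zip(group, grads):
        out[i] = params[i] - lr * g


def step_two_optimizers(params, group1, group2, lr1, lr2):
    """First variant: each 'optimizer' computes its own gradients."""
    new = list(params)
    apply_updates(params, group1, compute_gradients(params, group1), lr1, new)
    apply_updates(params, group2, compute_gradients(params, group2), lr2, new)
    return new


def step_shared_gradients(params, group1, group2, lr1, lr2):
    """Second variant: one gradient pass, then split the list in two."""
    grads = compute_gradients(params, group1 + group2)
    grads1, grads2 = grads[:len(group1)], grads[len(group1):]
    new = list(params)
    apply_updates(params, group1, grads1, lr1, new)
    apply_updates(params, group2, grads2, lr2, new)
    return new


params = [1.0, -2.0, 0.5, 3.0]
group1, group2 = [0, 1, 2], [3]  # "pre-trained" vs. "new" parameters

calls["gradients"] = 0
a = step_two_optimizers(params, group1, group2, 0.00001, 0.0001)
calls_two = calls["gradients"]

calls["gradients"] = 0
b = step_shared_gradients(params, group1, group2, 0.00001, 0.0001)
calls_shared = calls["gradients"]

print(a == b, calls_two, calls_shared)
```

The call counter is only an analogy for the duplicated gradient subgraph; how much work is actually repeated in a real graph depends on how much of the backward pass the two variable groups share.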
Update Jan 22: the recipe below is only a good idea for GradientDescentOptimizer. Other optimizers that keep a running average apply the learning rate before the parameter update, so the recipe below won’t affect that part of the equation.
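As a rough illustration of that caveat, the following plain-Python sketch compares what happens when every gradient is multiplied by 2 under plain gradient descent and under an Adam-style update. Adam is my choice of example here (the note above does not name a specific optimizer), and the gradient values and hyperparameters are arbitrary. Under plain gradient descent the updates double; under the Adam-style rule they stay almost unchanged, because the gradient is divided by a running estimate of its own magnitude.

```python
import math


def sgd_updates(grads, lr):
    """Plain gradient descent: the update is lr * gradient."""
    return [lr * g for g in grads]


def adam_updates(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style updates with bias-corrected running averages."""
    m, v, updates = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


grads = [0.5, -1.0, 2.0, 0.25]       # arbitrary example gradients
doubled = [2.0 * g for g in grads]   # the same gradients scaled by 2

sgd_ratio = [d / s for d, s in zip(sgd_updates(doubled, 0.1), sgd_updates(grads, 0.1))]
adam_ratio = [d / s for d, s in zip(adam_updates(doubled, 0.1), adam_updates(grads, 0.1))]

print(sgd_ratio)   # ratios close to 2: scaling the gradient scales the step
print(adam_ratio)  # ratios close to 1: the scaling is normalized away
```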
In addition to Rafal’s approach, you could use the apply_gradients interface of Optimizer. For instance, here’s a toy network where I use 2x the learning rate for the second parameter:
    import tensorflow as tf

    x = tf.Variable(tf.ones([]))
    y = tf.Variable(tf.zeros([]))
    loss = tf.square(x-y)
    global_step = tf.Variable(0, name="global_step", trainable=False)

    opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    grads_and_vars = opt.compute_gradients(loss, [x, y])
    ygrad, _ = grads_and_vars[1]
    # keep x's gradient as is and double y's gradient (2x learning rate for y)
    train_op = opt.apply_gradients([grads_and_vars[0], (ygrad*2, y)], global_step=global_step)

    init_op = tf.initialize_all_variables()
    sess = tf.Session()
    sess.run(init_op)
    for i in range(5):
        sess.run([train_op, loss, global_step])
        print(sess.run([x, y]))
You should see:

    [0.80000001, 0.40000001]
    [0.72000003, 0.56]
    [0.68800002, 0.62400001]
    [0.67520005, 0.64960003]
    [0.67008007, 0.65984005]
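The numbers above can be checked by hand. The sketch below redoes the same arithmetic in plain Python, without TensorFlow: loss = (x - y)^2, learning rate 0.1, and the gradient for y multiplied by 2. The function name `simulate` and its parameters are my own, chosen for illustration.

```python
def simulate(steps=5, lr=0.1, y_multiplier=2.0):
    """Gradient descent on loss = (x - y)**2 with y's gradient scaled."""
    x, y = 1.0, 0.0                  # same starting values as tf.ones / tf.zeros
    history = []
    for _ in range(steps):
        grad_x = 2.0 * (x - y)       # d(loss)/dx
        grad_y = -2.0 * (x - y)      # d(loss)/dy
        # both updates use the gradients computed from the old x and y
        x, y = x - lr * grad_x, y - lr * y_multiplier * grad_y
        history.append((x, y))
    return history


for x, y in simulate():
    print(round(x, 5), round(y, 5))
```

Each step moves y twice as far as x, which is the 2x learning rate in effect. The values agree with the listed output up to float32 rounding.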
Collect learning rate multipliers for each variable like:

    self.lr_multipliers[var.op.name] = lr_mult

and then apply them before applying the gradients, like:
    def _train_op(self):
        tf.scalar_summary('learning_rate', self._lr_placeholder)
        opt = tf.train.GradientDescentOptimizer(self._lr_placeholder)
        grads_and_vars = opt.compute_gradients(self._loss)
        grads_and_vars_mult = []
        for grad, var in grads_and_vars:
            # scale each gradient by its variable's learning rate multiplier
            grad *= self._network.lr_multipliers[var.op.name]
            grads_and_vars_mult.append((grad, var))
            tf.histogram_summary('variables/' + var.op.name, var)
            tf.histogram_summary('gradients/' + var.op.name, grad)
        return opt.apply_gradients(grads_and_vars_mult)
You can find the whole example here.
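The bookkeeping behind the multipliers can be sketched in plain Python, independent of TensorFlow. The variable names (`conv1/weights` and so on), the helper `scale_gradients` and the default multiplier of 1.0 for unregistered names are assumptions of this sketch; the code above instead looks up every name directly and would fail on a missing one.

```python
def scale_gradients(grads_and_names, lr_multipliers, default=1.0):
    """Multiply each gradient by the multiplier registered for its name."""
    scaled = []
    for grad, name in grads_and_names:
        mult = lr_multipliers.get(name, default)  # unregistered names keep the base rate
        scaled.append((grad * mult, name))
    return scaled


# Hypothetical multipliers: slow down a pre-trained layer, leave the new one alone.
lr_multipliers = {"conv1/weights": 0.01, "conv6/weights": 1.0}

# Hypothetical (gradient, variable name) pairs.
grads = [(0.5, "conv1/weights"), (0.5, "conv6/weights"), (0.5, "conv6/biases")]

scaled = scale_gradients(grads, lr_multipliers)
print(scaled)
```

With a base learning rate of 0.001, a multiplier of 0.01 gives the pre-trained layer an effective rate of 0.00001 under plain gradient descent, which matches the values asked for in the question.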
The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
There is an easy way to do that using tf.stop_gradient.
Here is an example with 3 layers:
    x = layer1(input)
    x = layer2(x)
    output = layer3(x)
You can shrink your gradient in the first two layers by a ratio of 1/100:
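    x = layer1(input)
    x = layer2(x)
    # forward value is unchanged; only 1/100 of the gradient flows back
    # (float literals avoid integer division of 1/100 under Python 2)
    x = 1./100*x + (1-1./100)*tf.stop_gradient(x)
    output = layer3(x)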
At layer2, the “flow” is split into two branches: one, with a contribution of 1/100, computes its gradient as usual but with its magnitude shrunk by a factor of 100; the other provides the remaining “flow” without contributing to the gradient, because of the tf.stop_gradient operator. As a result, if you use a learning rate of 0.001 in your model's optimizer, the first two layers will effectively have a learning rate of 0.00001 (at least with plain gradient descent, where the update is proportional to the gradient).
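To check this reasoning without TensorFlow, here is a minimal sketch built on a tiny scalar autodiff class written just for this purpose. `Value`, `stop_gradient` and `gradients` are my own illustrative names, not TensorFlow API, and the two "layers" are single multiplications with arbitrary weights. The forward value and the gradient of the later layer stay the same, while the gradient reaching the earlier layer is scaled by the chosen ratio.

```python
class Value:
    """Tiny scalar reverse-mode autodiff node, for illustration only."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents  # pairs of (parent node, local derivative)

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # walk from the loss back to the inputs, accumulating gradients
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


def stop_gradient(node):
    """Same forward value, but cut off from the graph (no parents)."""
    return Value(node.data)


def gradients(ratio=None):
    w1, w2, inp = Value(3.0), Value(2.0), Value(1.5)
    h = w1 * inp                     # stands in for the pre-trained layers
    if ratio is not None:
        h = Value(ratio) * h + Value(1.0 - ratio) * stop_gradient(h)
    out = h * w2                     # stands in for the new layer
    loss = out * out
    loss.backward()
    return loss.data, w1.grad, w2.grad


plain = gradients()
shrunk = gradients(ratio=0.01)
print(plain)
print(shrunk)
```

Each result is a tuple of (loss, gradient of w1, gradient of w2). Only the middle entry should differ between the two runs, by the factor of 0.01.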