TensorFlow is a powerful open source software library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Its basic principle is simple: you first define in Python a graph of computations to perform (for example, the one in Figure 9-1), and then TensorFlow takes that graph and runs it efficiently using optimized C++ code.
Figure 9-1. A simple computation graph
Most importantly, it is possible to break up the graph into several chunks and run them in parallel across multiple CPUs or GPUs (as shown in Figure 9-2). TensorFlow also supports distributed computing, so you can train colossal neural networks on humongous training sets in a reasonable amount of time by splitting the computations across hundreds of servers (see Chapter 12). TensorFlow can train a network with millions of parameters on a training set composed of billions of instances with millions of features each. This should come as no surprise, since TensorFlow was developed by the Google Brain team and it powers many of Google’s large-scale services, such as Google Cloud Speech, Google Photos, and Google Search.
Figure 9-2. Parallel computation on multiple CPUs/GPUs/servers
When TensorFlow was open-sourced in November 2015, there were already many popular open source libraries for Deep Learning (Table 9-1 lists a few), and to be fair most of TensorFlow’s features already existed in one library or another. Nevertheless, TensorFlow’s clean design, scalability, flexibility, and great documentation (not to mention Google’s name) quickly boosted it to the top of the list. In this chapter, we will go through the basics of TensorFlow, from installation to creating, running, saving, and visualizing simple computational graphs. Mastering these basics is important before you build your first neural network (which we will do in the next chapter).
Installation
Let’s get started! Assuming you installed Jupyter and Scikit-Learn by following the installation instructions in Chapter 2, you can simply use pip3 to install TensorFlow (for example, by running pip3 install --upgrade tensorflow).
Creating Your First Graph and Running It in a Session
The following code creates the graph represented in Figure 9-1:
That’s all there is to it! The most important thing to understand is that this code does not actually perform any computation, even though it looks like it does (especially the last line). It just creates a computation graph. In fact, even the variables are not initialized yet. To evaluate this graph, you need to open a TensorFlow session and use it to initialize the variables and evaluate f. A TensorFlow session takes care of placing the operations onto devices such as CPUs and GPUs and running them, and it holds all the variable values. The following code creates a session, initializes the variables, evaluates f, and then closes the session (which frees up resources):
Having to repeat sess.run() all the time is a bit cumbersome, but fortunately there is a better way:
- with tf.Session() as sess:
-     x.initializer.run()
-     y.initializer.run()
-     result = f.eval()
Instead of manually running the initializer for every single variable, you can use the global_variables_initializer() function. Note that it does not actually perform the initialization immediately, but rather creates a node in the graph that will initialize all variables when it is run:
- init = tf.global_variables_initializer() # prepare an init node
- with tf.Session() as sess:
-     init.run() # actually initialize all the variables
-     result = f.eval()
A TensorFlow program is typically split into two parts: the first part builds a computation graph (this is called the construction phase), and the second part runs it (this is the execution phase). The construction phase typically builds a computation graph representing the ML model and the computations required to train it. The execution phase generally runs a loop that evaluates a training step repeatedly (for example, one step per mini-batch), gradually improving the model parameters. We will go through an example shortly.
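To make the split between the two phases concrete, here is a minimal pure-Python sketch of a deferred computation graph. It is only a toy: the Node, Const, and Op classes and the wrap() helper are hypothetical names invented for this illustration, not TensorFlow APIs, and the values 3 and 4 are arbitrary.

```python
# Toy illustration only: these classes are hypothetical and are not TensorFlow APIs.

class Node:
    """A node in a tiny computation graph; building it computes nothing."""
    def __add__(self, other):
        return Op(lambda a, b: a + b, self, wrap(other))

    def __mul__(self, other):
        return Op(lambda a, b: a * b, self, wrap(other))


class Const(Node):
    """A source node that simply holds a value."""
    def __init__(self, value):
        self.value = value

    def run(self):
        return self.value


class Op(Node):
    """An operation node that remembers its function and its input nodes."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self):
        # Execution phase: evaluate the inputs first, then apply the operation.
        return self.fn(*(node.run() for node in self.inputs))


def wrap(value):
    """Turn a plain Python number into a constant node."""
    return value if isinstance(value, Node) else Const(value)


# Construction phase: f is a graph node, not a number.
x = Const(3)
y = Const(4)
f = x * x * y + y + 2

# Execution phase: only now is anything computed.
result = f.run()
print(result)  # 42
```

The expression that defines f looks like ordinary arithmetic, yet it only records which operations to perform; the numbers are produced when run() is called, which plays the role of the session here.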
Managing Graphs
Any node you create is automatically added to the default graph:
In most cases this is fine, but sometimes you may want to manage multiple independent graphs. You can do this by creating a new Graph and temporarily making it the default graph inside a with block, like so:
Lifecycle of a Node Value
When you evaluate a node, TensorFlow automatically determines the set of nodes that it depends on and it evaluates these nodes first. For example, consider the following code:
- w = tf.constant(3)
- x = w + 2
- y = x + 5
- z = x * 3
- with tf.Session() as sess:
-     print(y.eval()) # 10
-     print(z.eval()) # 15
All node values are dropped between graph runs, except variable values, which are maintained by the session across graph runs (queues and readers also maintain some state, as we will see in Chapter 12). A variable starts its life when its initializer is run, and it ends when the session is closed. If you want to evaluate y and z efficiently, without evaluating w and x twice as in the previous code, you must ask TensorFlow to evaluate both y and z in just one graph run, as shown in the following code:
- with tf.Session() as sess:
-     y_val, z_val = sess.run([y, z])
-     print(y_val) # 10
-     print(z_val) # 15
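The difference between the two approaches can be checked with a small counting experiment. The sketch below is a toy model, not TensorFlow code: the graph dictionary and the run() function are hypothetical stand-ins that mimic the w, x, y, z graph above and keep node values only for the duration of one run.

```python
# Toy model (hypothetical, not TensorFlow code) that counts node evaluations.
from collections import Counter

eval_counts = Counter()

# name -> (function, input names); mirrors w = 3, x = w + 2, y = x + 5, z = x * 3
graph = {
    "w": (lambda: 3, []),
    "x": (lambda w: w + 2, ["w"]),
    "y": (lambda x: x + 5, ["x"]),
    "z": (lambda x: x * 3, ["x"]),
}


def run(fetches):
    """Evaluate the requested nodes in one 'graph run' with a per-run cache."""
    cache = {}  # discarded when the run ends, like non-variable node values

    def evaluate(name):
        if name not in cache:
            fn, inputs = graph[name]
            eval_counts[name] += 1
            cache[name] = fn(*(evaluate(i) for i in inputs))
        return cache[name]

    return [evaluate(name) for name in fetches]


# Two separate runs: w and x are each computed twice.
(y_val,) = run(["y"])
(z_val,) = run(["z"])
separate_counts = dict(eval_counts)

# One run fetching both nodes: w and x are each computed once.
eval_counts.clear()
y_val, z_val = run(["y", "z"])
joint_counts = dict(eval_counts)

print(y_val, z_val)              # 10 15
print(separate_counts["x"])      # 2
print(joint_counts["x"])         # 1
```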
Linear Regression with TensorFlow
TensorFlow operations (also called ops for short) can take any number of inputs and produce any number of outputs. For example, the addition and multiplication ops each take two inputs and produce one output. Constants and variables take no input (they are called source ops). The inputs and outputs are multidimensional arrays, called tensors (hence the name “tensor flow”). Just like NumPy arrays, tensors have a type and a shape. In fact, in the Python API tensors are simply represented by NumPy ndarrays. They typically contain floats, but you can also use them to carry strings (arbitrary byte arrays).
In the examples so far, the tensors just contained a single scalar value, but you can of course perform computations on arrays of any shape. For example, the following code manipulates 2D arrays to perform Linear Regression on the California housing dataset (introduced in Chapter 2). It starts by fetching the dataset; then it adds an extra bias input feature (x0 = 1) to all training instances (it does so using NumPy so it runs immediately); then it creates two TensorFlow constant nodes, X and y, to hold this data and the targets, and it uses some of the matrix operations provided by TensorFlow to define theta. These matrix functions (transpose(), matmul(), and matrix_inverse()) are self-explanatory, but as usual they do not perform any computations immediately; instead, they create nodes in the graph that will perform them when the graph is run. You may recognize that the definition of theta corresponds to the Normal Equation (see Chapter 4). Finally, the code creates a session and uses it to evaluate theta:
- import numpy as np
- import tensorflow as tf
- from sklearn.datasets import fetch_california_housing
- housing = fetch_california_housing()
- m, n = housing.data.shape
- housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]
- X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
- y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
- XT = tf.transpose(X)
- theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)
- with tf.Session() as sess:
-     theta_value = theta.eval()
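If you want to sanity-check the Normal Equation itself, the same formula can be written directly in NumPy. The sketch below is an assumption-laden stand-in: it uses small synthetic, noise-free data (the sizes and the true_theta values are made up for the example) rather than the housing dataset, so that the recovered parameters can be compared with known ones.

```python
import numpy as np

# Synthetic data (hypothetical values, not the housing dataset).
rng = np.random.default_rng(42)
m, n = 200, 3
features = rng.normal(size=(m, n))
true_theta = np.array([[4.0], [3.0], [-2.0], [0.5]])  # bias first, then weights

# Add the bias feature x0 = 1 to every instance, as in the TensorFlow code.
X = np.c_[np.ones((m, 1)), features]
y = X @ true_theta  # noise-free targets, so the solution should match exactly

# Normal Equation: theta = (X^T X)^-1 X^T y
XT = X.T
theta = np.linalg.inv(XT @ X) @ XT @ y

print(np.allclose(theta, true_theta))  # True
```

Unlike the TensorFlow version, every line here computes its result immediately, which makes it a convenient reference when debugging a graph.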
Implementing Gradient Descent
Let’s try using Batch Gradient Descent (introduced in Chapter 4) instead of the Normal Equation. First we will do this by manually computing the gradients, then we will use TensorFlow’s autodiff feature to let TensorFlow compute the gradients automatically, and finally we will use a couple of TensorFlow’s out-of-the-box optimizers.
Manually Computing the Gradients
The following code should be fairly self-explanatory, except for a few new elements:
- n_epochs = 1000
- learning_rate = 0.01
- # scaled_housing_data_plus_bias: the standardized housing features plus a bias column
- # (built with StandardScaler in the complete script later in this section)
- X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
- y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
- # random_uniform() creates a node that generates random values in [-1.0, 1.0)
- theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
- y_pred = tf.matmul(X, theta, name="predictions")
- error = y_pred - y
- mse = tf.reduce_mean(tf.square(error), name="mse")
- gradients = 2/m * tf.matmul(tf.transpose(X), error)
- # assign() creates a node that assigns a new value to theta: one Batch Gradient Descent step
- training_op = tf.assign(theta, theta - learning_rate * gradients)
- init = tf.global_variables_initializer()
- with tf.Session() as sess:
-     sess.run(init)
-     for epoch in range(n_epochs):
-         if epoch % 100 == 0:
-             print("Epoch", epoch, "MSE =", mse.eval())
-         sess.run(training_op)
-     best_theta = theta.eval()
The preceding code works fine, but it requires mathematically deriving the gradients from the cost function (MSE). In the case of Linear Regression, it is reasonably easy, but if you had to do this with deep neural networks you would get quite a headache: it would be tedious and error-prone. You could use symbolic differentiation to automatically find the equations for the partial derivatives for you, but the resulting code would not necessarily be very efficient. Fortunately, TensorFlow’s autodiff feature comes to the rescue: it can automatically and efficiently compute the gradients for you. Simply replace the gradients = ... line in the Gradient Descent code in the previous section with the following line, and the code will continue to work just fine:
- gradients = tf.gradients(mse, [theta])[0]
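Whichever way the gradients are obtained, it is worth confirming that they agree with the hand-derived formula. The NumPy sketch below compares the analytic gradient 2/m * X^T (X theta - y) with a central finite-difference estimate on made-up data. Note that finite differences are numerical differentiation, a different and much slower technique than the autodiff TensorFlow uses; it appears here only as an independent check.

```python
import numpy as np

# Small synthetic problem (hypothetical data, chosen only for the check).
rng = np.random.default_rng(0)
m, n = 50, 3
X = np.c_[np.ones((m, 1)), rng.normal(size=(m, n))]
y = rng.normal(size=(m, 1))
theta = rng.uniform(-1.0, 1.0, size=(n + 1, 1))


def mse(theta):
    """Mean squared error of the linear model X @ theta."""
    error = X @ theta - y
    return float(np.mean(error ** 2))


# Hand-derived gradient, same formula as the gradients = ... line above.
analytic = 2 / m * X.T @ (X @ theta - y)

# Central finite differences: perturb one parameter at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[i] = eps
    numeric[i] = (mse(theta + step) - mse(theta - step)) / (2 * eps)

max_diff = float(np.max(np.abs(analytic - numeric)))
print(max_diff < 1e-6)  # True
```

The loop calls mse() twice per parameter, which is fine for four parameters but hopeless for a network with millions of them; that cost is one reason autodiff is preferred.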
Table 9-2
Using an Optimizer
So TensorFlow computes the gradients for you. But it gets even easier: it also provides a number of optimizers out of the box, including a Gradient Descent optimizer. You can simply replace the preceding gradients = ... and training_op = ... lines with the following code, and once again everything will just work fine:
- optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
- training_op = optimizer.minimize(mse)
- # To use a different optimizer, change only the line that creates it, for example:
- optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)
Let’s try to modify the previous code to implement Mini-batch Gradient Descent. For this, we need a way to replace X and y at every iteration with the next mini-batch. The simplest way to do this is to use placeholder nodes. These nodes are special because they don’t actually perform any computation, they just output the data you tell them to output at runtime. They are typically used to pass the training data to TensorFlow during training. If you don’t specify a value at runtime for a placeholder, you get an exception.
To create a placeholder node, you must call the placeholder() function and specify the output tensor’s data type. Optionally, you can also specify its shape, if you want to enforce it. If you specify None for a dimension, it means “any size.” For example, the following code creates a placeholder node A, and also a node B = A + 5. When we evaluate B, we pass a feed_dict to the eval() method that specifies the value of A. Note that A must have rank 2 (i.e., it must be two-dimensional) and there must be three columns (or else an exception is raised), but it can have any number of rows:
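The shape rule described above (rank 2, exactly three columns, any number of rows) can be sketched without TensorFlow. The check_feed() helper below is hypothetical and only imitates the kind of validation a placeholder with shape (None, 3) performs on fed values; it is not how TensorFlow implements it.

```python
import numpy as np


def check_feed(value, dtype, shape):
    """Hypothetical helper: validate a fed value against a placeholder-style spec.

    A None entry in shape means that dimension may have any size.
    """
    array = np.asarray(value, dtype=dtype)
    if array.ndim != len(shape):
        raise ValueError(f"expected rank {len(shape)}, got rank {array.ndim}")
    for axis, (actual, expected) in enumerate(zip(array.shape, shape)):
        if expected is not None and actual != expected:
            raise ValueError(f"axis {axis}: expected size {expected}, got {actual}")
    return array


spec = (None, 3)  # any number of rows, exactly three columns

one_row = check_feed([[1, 2, 3]], np.float32, spec)
two_rows = check_feed([[4, 5, 6], [7, 8, 9]], np.float32, spec)
print(one_row + 5)   # [[6. 7. 8.]]
print(two_rows + 5)

# A value with only two columns is rejected.
try:
    check_feed([[1, 2]], np.float32, spec)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```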
To implement Mini-batch Gradient Descent, we only need to tweak the existing code slightly. First change the definition of X and y in the construction phase to make them placeholder nodes:
- X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
- y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
- # Then define the batch size and compute the total number of batches:
- batch_size = 100
- n_batches = int(np.ceil(m / batch_size))
- # Complete script (ch9_t101.py):
- #!/usr/bin/env python3
- import tensorflow as tf
- import numpy as np
- from sklearn.datasets import fetch_california_housing
- import numpy.random as rnd
- #tf.reset_default_graph()
- housing = fetch_california_housing()
- m, n = housing.data.shape
- n_epochs = 1000
- learning_rate = 0.01
- from sklearn.preprocessing import StandardScaler
- scaler = StandardScaler()
- scaled_housing_data = scaler.fit_transform(housing.data)
- scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]
- X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
- y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
- theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
- y_pred = tf.matmul(X, theta, name="predictions")
- error = y_pred - y
- mse = tf.reduce_mean(tf.square(error), name="mse")
- optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
- training_op = optimizer.minimize(mse)
- init = tf.global_variables_initializer()
- def fetch_batch(epoch, batch_index, batch_size):
-     rnd.seed(epoch * n_batches + batch_index)
-     indices = rnd.randint(m, size=batch_size)
-     X_batch = scaled_housing_data_plus_bias[indices]
-     y_batch = housing.target.reshape(-1, 1)[indices]
-     return X_batch, y_batch
- n_epochs = 10
- batch_size = 100
- n_batches = int(np.ceil(m / batch_size))
- with tf.Session() as sess:
-     sess.run(init)
-     for epoch in range(n_epochs):
-         for batch_index in range(n_batches):
-             X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
-             sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
-     best_theta = theta.eval()
-     print("Best theta:")
-     print(best_theta)
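Note that fetch_batch() in the script draws its indices with rnd.randint(), that is, with replacement, so within one epoch some instances may be picked several times and others not at all. If you would rather visit every instance exactly once per epoch, one possible alternative is to shuffle the indices and slice them. The iter_batches() function below is a hypothetical sketch of that idea, shown on made-up data; it is not part of the original script.

```python
import numpy as np


def iter_batches(X, y, batch_size, rng):
    """Yield mini-batches that together cover every instance exactly once."""
    m = X.shape[0]
    order = rng.permutation(m)  # a fresh random order for this epoch
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


# Made-up data: 250 instances with one feature, targets are twice the feature.
rng = np.random.default_rng(42)
X_demo = np.arange(250, dtype=np.float32).reshape(-1, 1)
y_demo = X_demo * 2

batches = list(iter_batches(X_demo, y_demo, batch_size=100, rng=rng))
print(len(batches))                     # 3, i.e. ceil(250 / 100)
print([len(Xb) for Xb, _ in batches])   # [100, 100, 50]
```

In the training loop, each (X_batch, y_batch) pair would be passed through feed_dict exactly as before; only the way the batches are chosen changes.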
Once you have trained your model, you should save its parameters to disk so you can come back to it whenever you want, use it in another program, compare it to other models, and so on. Moreover, you probably want to save checkpoints at regular intervals during training so that if your computer crashes during training you can continue from the last checkpoint rather than start over from scratch. TensorFlow makes saving and restoring a model very easy. Just create a Saver node at the end of the construction phase (after all variable nodes are created); then, in the execution phase, just call its save() method whenever you want to save the model, passing it the session and path of the checkpoint file:
- [...]
- theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
- [...]
- init = tf.global_variables_initializer()
- saver = tf.train.Saver()
- with tf.Session() as sess:
-     sess.run(init)
-     for epoch in range(n_epochs):
-         if epoch % 100 == 0: # checkpoint every 100 epochs
-             save_path = saver.save(sess, "/tmp/my_model.ckpt")
-         sess.run(training_op)
-     best_theta = theta.eval()
-     save_path = saver.save(sess, "/tmp/my_model_final.ckpt")
- # Restoring: call restore() at the start of the execution phase instead of running init
- with tf.Session() as sess:
-     saver.restore(sess, "/tmp/my_model_final.ckpt")
-     [...]
- # To save or restore only some variables, or under different names, pass a dict to the Saver
- saver = tf.train.Saver({"weights": theta})
So now we have a computation graph that trains a Linear Regression model using Mini-batch Gradient Descent, and we are saving checkpoints at regular intervals. Sounds sophisticated, doesn’t it? However, we are still relying on the print() function to visualize progress during training. There is a better way: enter TensorBoard. If you feed it some training stats, it will display nice interactive visualizations of these stats in your web browser (e.g., learning curves). You can also provide it the graph’s definition and it will give you a great interface to browse through it. This is very useful to identify errors in the graph, to find bottlenecks, and so on.
The first step is to tweak your program a bit so it writes the graph definition and some training stats—for example, the training error (MSE)—to a log directory that TensorBoard will read from. You need to use a different log directory every time you run your program, or else TensorBoard will merge stats from different runs, which will mess up the visualizations. The simplest solution for this is to include a timestamp in the log directory name. Add the following code at the beginning of the program:
- from datetime import datetime
- now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
- root_logdir = "tf_logs"
- logdir = "{}/run-{}/".format(root_logdir, now)
- # At the very end of the construction phase:
- mse_summary = tf.summary.scalar('MSE', mse)
- file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
Next you need to update the execution phase to evaluate the mse_summary node regularly during training (e.g., every 10 mini-batches). This will output a summary that you can then write to the events file using the file_writer. Here is the updated code:
- [...]
- for batch_index in range(n_batches):
-     X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
-     if batch_index % 10 == 0:
-         summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
-         step = epoch * n_batches + batch_index
-         file_writer.add_summary(summary_str, step)
-     sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
- [...]
Finally, you want to close the FileWriter at the end of the program:
- file_writer.close()
If you run the program a second time, you should see a second directory in the tf_logs/ directory. Great! Now it’s time to fire up the TensorBoard server. You need to activate your virtualenv environment if you created one, then start the server by running the tensorboard command, pointing it to the root log directory (for example, tensorboard --logdir tf_logs/). This starts the TensorBoard web server, listening on port 6006 (which is “goog” written upside down).
Next open a browser and go to http://0.0.0.0:6006/ (or http://localhost:6006/). Welcome to TensorBoard! In the SCALARS tab you should see MSE on the right. If you click on it, you will see a plot of the MSE during training, for both runs (Figure 9-3). You can check or uncheck the runs you want to see, zoom in or out, hover over the curve to get details, and so on.
Figure 9-3. Visualizing training stats using TensorBoard
Now click on the Graphs tab. You should see the graph shown in Figure 9-4.
Figure 9-4. Visualizing the graph using TensorBoard
Name Scopes
When dealing with more complex models such as neural networks, the graph can easily become cluttered with thousands of nodes. To avoid this, you can create name scopes to group related nodes. For example, let’s modify the previous code to define the error and mse ops within a name scope called "loss":
- with tf.name_scope("loss") as scope:
-     error = y_pred - y
-     mse = tf.reduce_mean(tf.square(error), name="mse")
In TensorBoard, the mse and error nodes now appear inside the loss namespace, which appears collapsed by default (Figure 9-5).
Figure 9-5. A collapsed namescope in TensorBoard
Modularity
Suppose you want to create a graph that adds the output of two rectified linear units (ReLU). A ReLU computes a linear function of the inputs, and outputs the result if it is positive, and 0 otherwise, as shown in Equation 9-1.
Equation 9-1. Rectified linear unit: h_{w,b}(X) = max(X · w + b, 0)
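Before building ReLUs as graph nodes, it may help to see the computation on concrete numbers. The NumPy sketch below applies the definition above to a tiny made-up input; the values of X, w, and b are arbitrary and chosen so that one row ends up positive and the other is clipped to 0.

```python
import numpy as np


def relu(X, w, b):
    """Linear function of the inputs, then clip negative results to 0."""
    return np.maximum(X @ w + b, 0.0)


# Arbitrary example values: two instances with three features each.
X = np.array([[1.0, 2.0, 3.0],
              [-1.0, -2.0, -3.0]])
w = np.array([[0.5], [0.5], [0.5]])
b = 0.0

output = relu(X, w, b)
print(output)  # first row gives 3.0, second row is clipped to 0.0
```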
The following code does the job, but it’s quite repetitive:
- n_features = 3
- X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
- w1 = tf.Variable(tf.random_normal((n_features, 1)), name="weights1")
- w2 = tf.Variable(tf.random_normal((n_features, 1)), name="weights2")
- b1 = tf.Variable(0.0, name="bias1")
- b2 = tf.Variable(0.0, name="bias2")
- z1 = tf.add(tf.matmul(X, w1), b1, name="z1")
- z2 = tf.add(tf.matmul(X, w2), b2, name="z2")
- relu1 = tf.maximum(z1, 0., name="relu1")
- relu2 = tf.maximum(z2, 0., name="relu2")
- output = tf.add(relu1, relu2, name="output")
- # The same kind of graph built with a helper function, here summing five ReLUs:
- def relu(X):
-     w_shape = (int(X.get_shape()[1]), 1)
-     w = tf.Variable(tf.random_normal(w_shape), name="weights")
-     b = tf.Variable(0.0, name="bias")
-     z = tf.add(tf.matmul(X, w), b, name="z")
-     return tf.maximum(z, 0., name="relu")
- n_features = 3
- X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
- relus = [relu(X) for i in range(5)]
- output = tf.add_n(relus, name="output")
Figure 9-6. Collapsed node series
Using name scopes, you can make the graph much clearer. Simply move all the content of the relu() function inside a name scope. Figure 9-7 shows the resulting graph. Notice that TensorFlow also gives the name scopes unique names by appending _1, _2, and so on.
- def relu(X):
-     with tf.name_scope("relu"):
-         [...]
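The renaming pattern mentioned above (relu, then relu_1, relu_2, and so on) can be imitated in a few lines. The NameRegistry class below is a hypothetical, simplified stand-in for the graph's name bookkeeping: it only counts how often a name was requested and does not handle every corner case a real graph must.

```python
class NameRegistry:
    """Hypothetical, simplified stand-in for a graph's name bookkeeping."""

    def __init__(self):
        self._used = {}  # requested name -> how many times it was requested

    def unique_name(self, name):
        """Return name on first use, then name_1, name_2, ... on later uses."""
        count = self._used.get(name, 0)
        self._used[name] = count + 1
        return name if count == 0 else f"{name}_{count}"


registry = NameRegistry()
scope_names = [registry.unique_name("relu") for _ in range(5)]
print(scope_names)  # ['relu', 'relu_1', 'relu_2', 'relu_3', 'relu_4']
```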
Sharing Variables
If you want to share a variable between various components of your graph, one simple option is to create it first, then pass it as a parameter to the functions that need it. For example, suppose you want to control the ReLU threshold (currently hardcoded to 0) using a shared threshold variable for all ReLUs. You could just create that variable first, and then pass it to the relu() function:
- def relu(X, threshold):
-     with tf.name_scope("relu"):
-         [...]
-         return tf.maximum(z, threshold, name="max")
- threshold = tf.Variable(0.0, name="threshold")
- X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
- relus = [relu(X, threshold) for i in range(5)]
- output = tf.add_n(relus, name="output")
- # Alternative: create the shared variable as an attribute of relu() on the first call
- def relu(X):
-     with tf.name_scope("relu"):
-         if not hasattr(relu, "threshold"):
-             relu.threshold = tf.Variable(0.0, name="threshold")
-         [...]
-         return tf.maximum(z, relu.threshold, name="max")
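- # Create the variable relu/threshold (raises an exception if it already exists)
- with tf.variable_scope("relu"):
-     threshold = tf.get_variable("threshold", shape=(),
-                                 initializer=tf.constant_initializer(0.0))
- # Reuse the existing variable (raises an exception if it does not exist)
- with tf.variable_scope("relu", reuse=True):
-     threshold = tf.get_variable("threshold")
- # Equivalent: switch the scope to reuse mode inside the block
- with tf.variable_scope("relu") as scope:
-     scope.reuse_variables()
-     threshold = tf.get_variable("threshold")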
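The create-or-reuse contract of get_variable() is easy to get wrong, so here is a toy model of it. In TensorFlow, asking for a variable that already exists without reuse=True raises an exception, and asking to reuse one that was never created raises one as well. The VariableStore class below is hypothetical and reproduces only that contract with a plain dictionary; it is not TensorFlow's implementation.

```python
class VariableStore:
    """Hypothetical dictionary-backed imitation of the get_variable() contract."""

    def __init__(self):
        self._vars = {}

    def get_variable(self, scope, name, initial_value=None, reuse=False):
        key = f"{scope}/{name}"
        if reuse:
            # Reuse mode: the variable must already exist.
            if key not in self._vars:
                raise ValueError(f"{key} does not exist, cannot reuse it")
            return self._vars[key]
        # Creation mode: the variable must not exist yet.
        if key in self._vars:
            raise ValueError(f"{key} already exists; pass reuse=True to share it")
        self._vars[key] = {"name": key, "value": initial_value}
        return self._vars[key]


store = VariableStore()
created = store.get_variable("relu", "threshold", initial_value=0.0)
shared = store.get_variable("relu", "threshold", reuse=True)
print(shared is created)  # True: both names refer to the same variable

# Creating it a second time without reuse is an error.
try:
    store.get_variable("relu", "threshold", initial_value=0.0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```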
Now you have all the pieces you need to make the relu() function access the threshold variable without having to pass it as a parameter:
- def relu(X):
-     with tf.variable_scope("relu", reuse=True):
-         threshold = tf.get_variable("threshold") # reuse existing variable
-         [...]
-         return tf.maximum(z, threshold, name="max")
- X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
- with tf.variable_scope("relu"): # create the variable
-     threshold = tf.get_variable("threshold", shape=(),
-                                 initializer=tf.constant_initializer(0.0))
- relus = [relu(X) for relu_index in range(5)]
- output = tf.add_n(relus, name="output")
It is somewhat unfortunate that the threshold variable must be defined outside the relu() function, where all the rest of the ReLU code resides. To fix this, the following code creates the threshold variable within the relu() function upon the first call, then reuses it in subsequent calls. Now the relu() function does not have to worry about name scopes or variable sharing: it just calls get_variable(), which will create or reuse the threshold variable (it does not need to know which is the case). The rest of the code calls relu() five times, making sure to set reuse=False on the first call, and reuse=True for the other calls.
- def relu(X):
-     threshold = tf.get_variable("threshold", shape=(),
-                                 initializer=tf.constant_initializer(0.0))
-     [...]
-     return tf.maximum(z, threshold, name="max")
- X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
- relus = []
- for relu_index in range(5):
-     with tf.variable_scope("relu", reuse=(relu_index >= 1)) as scope:
-         relus.append(relu(X))
- output = tf.add_n(relus, name="output")
Figure 9-9. Five ReLUs sharing the threshold variable
This concludes this introduction to TensorFlow. We will discuss more advanced topics as we go through the following chapters, in particular many operations related to deep neural networks, convolutional neural networks, and recurrent neural networks as well as how to scale up with TensorFlow using multithreading, queues, multiple GPUs, and multiple servers.
Supplement
* Derivation of the Normal Equation for linear regression
* Neural Networks and Deep Learning 1 - Up and Running with TensorFlow