## Tuesday, January 17, 2017

### My first impressions after a week of using TensorFlow

Last week I went through the TensorFlow (TF) Tutorials here. I found that I hadn't understood some important points about TensorFlow execution when I read the TensorFlow paper. I am noting them here while they are fresh, to capture my experience as a beginner. (As one gathers more experience with a platform, the baffling introductory concepts start to seem obvious and trivial.)

The biggest realization I had was to see a dichotomy in TensorFlow between two phases. The first phase defines a computation graph (e.g., a neural network to be trained and the operations for doing so). The second phase executes the computation/dataflow graph defined in Phase1 on a set of available devices. This deferred execution model enables optimizations in the execution phase by using global information about the computation graph: graph rewriting can be done to remove redundancies, better scheduling decisions can be made, etc. Another big benefit is the flexibility to explore and experiment in the execution phase through partial executions of subgraphs of the defined computation graph.

In the rest of this post, I first talk about Phase1 (graph construction) and Phase2 (graph execution), then give a very brief overview of TensorFlow distributed execution, and conclude with a discussion on visualizing and debugging in TensorFlow.

## Phase1: Graph construction

This first phase, where you design the computation graph, is where most of your effort is spent. Essentially the computation graph consists of the neural network (NN) to be trained and the operations to train it. Here you lay out the computation/dataflow graph brick by brick using TensorFlow operations and tensors. But what you are designing is just a blueprint; nothing gets built yet.

Since you are designing the computation graph, you use placeholders for input and output. Placeholders denote what type of input is expected. For example, x may correspond to your training data, and y_ may be your training labels, and you may define them as follows using placeholders.

    import tensorflow as tf

    # None leaves the number of rows (the batch size) open until execution time
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])

This says that x will later be instantiated with an unspecified number of rows (you use 'None' to tell TensorFlow this), each a vector of 784 float32 values. This setup lets you feed the training data to the NN in batches, and it gives you flexibility in the graph execution phase to instantiate multiple workers with the computational graph/NN and train them in parallel by feeding them different batches of your input data.
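To make the batching point concrete, here is a minimal sketch that feeds two batches of different sizes through the same placeholder. It assumes the TF1-style API is reached through `tensorflow.compat.v1` (so that it also runs on newer installs), and `row_sums` is just a hypothetical stand-in for a real network.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF1-style API via the compat module

tf.disable_eager_execution()  # keep the graph/session model described in this post

graph = tf.Graph()
with graph.as_default():
    # Phase1: describe the input and a computation on it; nothing runs here.
    x = tf.placeholder(tf.float32, [None, 784])
    row_sums = tf.reduce_sum(x, axis=1)  # hypothetical stand-in for a network

# Phase2: the same graph accepts batches with different numbers of rows.
with tf.Session(graph=graph) as sess:
    small = sess.run(row_sums, feed_dict={x: np.ones((2, 784), dtype=np.float32)})
    large = sess.run(row_sums, feed_dict={x: np.ones((5, 784), dtype=np.float32)})

print(small.shape, large.shape)
```

The output shape follows whatever batch size was fed, which is what the 'None' dimension buys you.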

## Phase2: Graph execution using sessions

After you get the computation graph designed to perfection, you switch to the second phase where the graph execution is done. Graph/subgraph execution is done using sessions. A session encapsulates the runtime environment in which graphs/subgraphs instantiate and execute.

When you open a session, you first initialize the variables by calling "tf.global_variables_initializer().run()". Surprise! In Phase1 you had assigned variables initial values, but those were not actually assigned until you got to Phase2 and called "tf.global_variables_initializer". For example, let's say you asked b to be initialized as a vector of size 10 with all zeros, "b = tf.Variable(tf.zeros([10]))", in Phase1. That didn't take effect until you opened a session and called "tf.global_variables_initializer". If you had typed "print(b.eval())" in Phase1 right after you wrote "b = tf.Variable(tf.zeros([10]))", you would get an error: "ValueError: Cannot evaluate tensor using eval(): No default session is registered. Use with sess.as_default() or pass an explicit session to eval(session=sess)".

This is because b.eval() maps to session.run(b) on the default session, and you don't have any session in Phase1. On the other hand, if you try print(b.eval()) in Phase2 after you call "tf.global_variables_initializer", the initialization takes effect and you get the output [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.].
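The two behaviors can be put side by side in one small script. This is a sketch, assuming the TF1-style API is imported from `tensorflow.compat.v1`; the `eval_failed` flag is a hypothetical name used only to record what happened.

```python
import tensorflow.compat.v1 as tf  # assumption: TF1-style API via the compat module

tf.disable_eager_execution()

graph = tf.Graph()
with graph.as_default():
    # Phase1: declare the variable and the op that will initialize it.
    b = tf.Variable(tf.zeros([10]))
    init = tf.global_variables_initializer()

# Still Phase1: there is no default session, so eval() cannot run.
try:
    b.eval()
    eval_failed = False
except ValueError:
    eval_failed = True

# Phase2: inside the session the initializer runs, then b has a value.
with tf.Session(graph=graph) as sess:
    sess.run(init)
    value = b.eval()  # the `with` block makes sess the default session

print(eval_failed, value)
```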

Each invocation of the Session API is called a step, and TensorFlow supports multiple concurrent steps on the same graph. In other words, the Session API allows multiple calls to Session.run() in parallel to improve throughput. This is basically performing dataflow programming over the symbolic computation graph built in Phase1.

In Phase2, you can open sessions and close sessions to your heart's content. Tearing down a session and reopening a session has several benefits. This way you instruct the TensorFlow runtime to forget about the previous values assigned to the variables in the computation graph, and start again with a clean slate (which can be useful for hyperparameter tuning). When you close a session you release that state, and when you open a session you initialize the graph again and start from scratch. You can even have multiple sessions open concurrently in theory, and that may even be useful for avoiding variable naming clashes.
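Here is a minimal sketch of the clean-slate pattern, again assuming the TF1-style API from `tensorflow.compat.v1`. The names `run_trial` and `step_size` are hypothetical; `step_size` plays the role of a hyperparameter that changes between trials while the graph stays the same.

```python
import tensorflow.compat.v1 as tf  # assumption: TF1-style API via the compat module

tf.disable_eager_execution()

graph = tf.Graph()
with graph.as_default():
    step_size = tf.placeholder(tf.float32, [], name="step_size")
    counter = tf.Variable(0.0, name="counter")
    increment = tf.assign_add(counter, step_size)
    init = tf.global_variables_initializer()


def run_trial(step, num_steps):
    """Open a fresh session, run num_steps increments, return the final value."""
    with tf.Session(graph=graph) as sess:
        sess.run(init)  # variables start from their initial values again
        for _ in range(num_steps):
            sess.run(increment, feed_dict={step_size: step})
        return float(sess.run(counter))


first = run_trial(1.0, 3)
second = run_trial(0.5, 2)  # does not see the state left by the first trial
print(first, second)
```

The graph is built once in Phase1, and each trial only pays for a new session and a re-initialization.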

An important concept for Phase2 is partial graph execution. When I read the TensorFlow paper for the first time, I hadn't understood the importance of partial graph execution, but it turns out to be important and useful. The API for executing a graph allows the client to specify the subgraph that should be executed. The client selects zero or more edges to feed input tensors into the dataflow, and one or more edges to fetch output tensors from the dataflow. Then the runtime prunes the graph to contain the necessary set of operations.

Partial graph execution is useful for training parts of the NN at a time. However, it is commonly exercised in a more mundane way in basic training of NNs. When you are training the NN, every K iterations you may want to test with the validation/test set. You defined those evaluation operations in Phase1 as part of the computation graph, but the validation/test evaluation subgraphs are only included and executed every K iterations, when you ask sess.run() to evaluate them. This reduces the overhead in execution. Another example is the tf.summary operators, which I will talk about in the visualizing and debugging section. The tf.summary operators are defined as peripheral operations to collect logs from computation graph operations. You can think of them as an overlay graph. If you want to execute the tf.summary operations, you explicitly mention this in sess.run(). When you leave them out, the tf.summary operations (that overlay graph) are pruned out and don't get executed. Mundane it is, but it provides a lot of computation optimization as well as flexibility in execution.
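The pruning idea can be illustrated with a toy model in plain Python. To be clear, this is not TensorFlow's implementation: the graph is a hypothetical dictionary from node names to their inputs, and `prune` is a made-up helper that walks backwards from the fetched nodes and stops at fed nodes.

```python
def prune(graph, fetches, feeds=()):
    """Return the set of nodes needed to compute `fetches`.

    `graph` maps each node name to the list of nodes it reads from.
    Nodes in `feeds` are supplied by the caller, so their inputs are skipped.
    """
    needed = set()
    stack = list(fetches)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        if node not in feeds:
            stack.extend(graph[node])
    return needed


# Hypothetical training graph with evaluation and summary nodes on the side.
graph = {
    "x": [],
    "y_": [],
    "logits": ["x"],
    "loss": ["logits", "y_"],
    "train_step": ["loss"],
    "accuracy": ["logits", "y_"],
    "loss_summary": ["loss"],
}

train_only = prune(graph, ["train_step"], feeds=["x", "y_"])
with_eval = prune(graph, ["train_step", "accuracy"], feeds=["x", "y_"])
print(sorted(train_only))
print(sorted(with_eval))
```

In this toy, "accuracy" and "loss_summary" drop out of the first run because nothing fetched depends on them, which mirrors how the evaluation and summary subgraphs only cost something when you ask for them.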

This deferred execution model in TensorFlow is very different from the traditional instant-gratification, instant-evaluation execution model. But it serves a purpose. The main idea of Phase2 is that, after you have painstakingly constructed the computation graph in Phase1, this is where you try to get as much mileage as possible out of that computation graph.

## Brief overview of TensorFlow distributed execution

A TensorFlow cluster is a set of tasks (named processes that can communicate over a network) that each contain one or more devices (such as CPUs or GPUs). Typically a subset of those tasks is assigned as parameter-server (PS) tasks, and others as worker tasks.

Tasks are run as (Docker) containers in jobs managed by a cluster scheduling system (Kubernetes). After device placement, a subgraph is created per device. Send/Receive node pairs that communicate across worker processes use remote communication mechanisms such as TCP or RDMA to move data across machine boundaries.

Since the TensorFlow computation graph is flexible, it is possible to easily allocate subgraphs to devices and machines. Therefore distributed execution is mostly a matter of computation subgraph placement and scheduling. Of course there are many complicating factors, such as heterogeneity of devices, communication overheads, just-in-time scheduling (to reduce overhead), etc. The Google TensorFlow papers mention that they perform graph rewriting and infer just-in-time scheduling from the computation graphs.
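As a rough illustration of per-device subgraphs and Send/Receive pairs, here is a toy partitioner in plain Python. It is only a sketch of the idea under simplifying assumptions, not TensorFlow's placement or partitioning code: the placement is given by hand, and `partition` plus the `send:`/`recv:` node names are hypothetical.

```python
def partition(edges, placement):
    """Split a graph into one edge list per device.

    An edge that crosses devices is replaced by a send node on the
    producer's device and a matching recv node on the consumer's device.
    """
    subgraphs = {device: [] for device in sorted(set(placement.values()))}
    for src, dst in edges:
        src_dev, dst_dev = placement[src], placement[dst]
        if src_dev == dst_dev:
            subgraphs[src_dev].append((src, dst))
        else:
            send_node = "send:" + src + "->" + dst_dev
            recv_node = "recv:" + src + "<-" + src_dev
            subgraphs[src_dev].append((src, send_node))
            subgraphs[dst_dev].append((recv_node, dst))
    return subgraphs


PS = "/job:ps/task:0"
WORKER = "/job:worker/task:0"

# Hypothetical graph: the weights live on the parameter server,
# everything else runs on the worker.
edges = [("W", "matmul"), ("x", "matmul"), ("matmul", "loss")]
placement = {"W": PS, "x": WORKER, "matmul": WORKER, "loss": WORKER}

subgraphs = partition(edges, placement)
for device, device_edges in subgraphs.items():
    print(device, device_edges)
```

Only the edge from W to matmul crosses the machine boundary here, so it is the only one that turns into a send/recv pair.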

I haven't started delving into distributed TensorFlow, and haven't experimented with it yet. After I experiment with it, I will provide a longer write-up.

## Visualizing and debugging

The tf.summary operations provide a way to collect and visualize TensorFlow execution information. The tf.summary operators are peripheral operators; they attach to other variables/tensors in the computation graph and capture their values. Again, remember the two-phase dichotomy in TensorFlow. In Phase1, you define and describe these tf.summary operations for the computational graph, but they don't get executed. They only get executed in Phase2, where you create a session, execute the graph, and explicitly ask to execute the tf.summary graph as well.

If you use tf.summary.FileWriter, you can write the values that the tf.summary operations collected during a sess.run() into a log file. Then you can point the TensorBoard tool at the log directory to visualize the computational graph, as well as how the values evolved over time.
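A minimal sketch of that workflow follows, assuming the TF1-style API from `tensorflow.compat.v1`. The temporary log directory and the `squared` tensor are hypothetical; a real training script would attach summaries to its loss or weights instead.

```python
import glob
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF1-style API via the compat module

tf.disable_eager_execution()

log_dir = tempfile.mkdtemp()  # hypothetical log location

graph = tf.Graph()
with graph.as_default():
    # Phase1: define a value and attach a summary op to it.
    step_value = tf.placeholder(tf.float32, [], name="step_value")
    squared = tf.square(step_value)
    tf.summary.scalar("squared", squared)
    merged = tf.summary.merge_all()

with tf.Session(graph=graph) as sess:
    writer = tf.summary.FileWriter(log_dir, sess.graph)
    for step in range(5):
        # Phase2: the summary op only runs because it is listed in the fetches.
        summary, _ = sess.run([merged, squared],
                              feed_dict={step_value: float(step)})
        writer.add_summary(summary, global_step=step)
    writer.close()

event_files = glob.glob(os.path.join(log_dir, "events.out.tfevents.*"))
print(log_dir, len(event_files))
```

Running `tensorboard --logdir` with the printed directory should then show both the graph and the "squared" curve.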

I didn't get much use from the TensorBoard visualization. Maybe it is because I am a beginner. I don't find the graphs useful even after having a basic understanding of how to read them. Maybe they become useful for very large computation graphs.

The Google TensorFlow whitepaper says that there is also a performance tracing tool called EEG, but it is not included in the open-source release.