Thursday, June 15, 2017

Paper Summary: DeepXplore, Automated Whitebox Testing of Deep Learning Systems

This paper was put on arXiv in May 2017, and is authored by Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana from Columbia and Lehigh Universities.

The paper proposes a framework to automatically generate inputs that trigger/cover different parts of a Deep Neural Network (DNN) for inference and identify incorrect behaviors.

It is easy to see the motivation for high-coverage testing of DNNs. We use DNN inference for safety-critical tasks such as self-driving cars. A DNN gives us results, but we don't know how it works, or how well it works. DNN inference is opaque, and we have no guarantee that it will not mess up spectacularly on an input only slightly different from the ones it succeeded on. There are too many corner cases for input-based testing to consider, and rote testing will not be able to cover all bases.

DeepXplore goes about DNN inference testing in an intelligent manner. It shows that finding inputs triggering differential behaviors while achieving high neuron coverage for DL algorithms can be represented as a joint optimization problem and solved efficiently using gradient-based optimization techniques. (Gradients of DNNs with respect to inputs can be calculated accurately and can be used to solve this joint optimization problem efficiently.)

DeepXplore also leverages multiple DL systems with similar functionality as cross-referencing oracles, and thus avoids manual checking for erroneous behaviors. For example, use Uber, Google, Waymo for the driving video, and compare outputs. Majority voting determines the correct behavior. DeepXplore counts on a majority of the independently trained DNNs not being susceptible to the bug. This is similar to N-version programming for building resilience against software bugs.
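The majority-voting oracle is straightforward to sketch. A minimal illustration (the `majority_vote` helper and the label strings are made up for this example, not from the paper):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most models, plus the indices of
    models that disagree with the majority (the suspects)."""
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    suspects = [i for i, p in enumerate(predictions) if p != label]
    return label, suspects

# Three independently trained models classify the same frame;
# model 2 disagrees, so its behavior is flagged as potentially buggy.
label, suspects = majority_vote(["left", "left", "right"])
```

As in N-version programming, the oracle is only as good as the assumption that failures across the independently trained models are mostly uncorrelated.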

Here is DeepXplore workflow for generating test images. DeepXplore takes unlabeled test inputs as seeds and generates new test inputs that cover a large number of different neurons (i.e., activates them to a value above a customizable threshold) while causing the tested DNNs to behave differently.
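Neuron coverage, as the paper defines it, is the fraction of neurons activated above a customizable threshold by at least one test input. A minimal sketch of computing it (layer names and activation values below are made up):

```python
import numpy as np

def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons whose activation exceeds `threshold` for at
    least one test input. `activations` maps layer name to an array of
    shape (n_inputs, n_neurons)."""
    covered, total = 0, 0
    for acts in activations.values():
        # A neuron counts as covered if any input activates it
        # above the threshold.
        covered += int(np.sum((acts > threshold).any(axis=0)))
        total += acts.shape[1]
    return covered / total

# Two toy layers, two inputs each; 2 of the 4 neurons exceed 0.5.
acts = {"fc1": np.array([[0.9, -0.2], [0.1, -0.5]]),
        "fc2": np.array([[-1.0, 0.3], [-0.7, 0.8]])}
cov = neuron_coverage(acts, threshold=0.5)
```

The test generation loop then favors mutations of the seed input that raise this coverage number while also making the DNNs disagree.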

Figure 6 shows how "gradient ascent" can be employed in this joint optimization problem. This is a walk uphill towards less certain scoring, hence gradient ascent rather than gradient descent. Starting from a seed input, DeepXplore performs a gradient-guided search in the input space of two similar DNNs that are supposed to handle the same task, until it uncovers test inputs that lie between the decision boundaries of the two DNNs. Such test inputs are classified differently by the two DNNs.
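To make the gradient-ascent idea concrete, here is a toy sketch (not the paper's actual objective or models): two logistic scorers stand in for two DNNs trained for the same task, and we walk a seed input uphill on a joint objective that rewards disagreement between the models plus the activation of an uncovered "neuron". All weights and hyperparameters are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two toy "DNNs": logistic scorers with different weights.
w1 = np.array([1.0, -0.5])
w2 = np.array([0.8, 0.3])
v  = np.array([0.2, 1.0])   # weight vector of an uncovered "neuron"

def objective(x, lam=0.1):
    # Joint objective: make the two models disagree (first two terms)
    # while also driving an uncovered neuron's activation up
    # (the neuron-coverage term, weighted by lam).
    return sigmoid(w1 @ x) - sigmoid(w2 @ x) + lam * (v @ x)

def grad(x, lam=0.1):
    # Analytic gradient of the objective with respect to the input x.
    s1, s2 = sigmoid(w1 @ x), sigmoid(w2 @ x)
    return s1 * (1 - s1) * w1 - s2 * (1 - s2) * w2 + lam * v

seed = np.array([0.1, 0.1])
x = seed.copy()
for _ in range(200):        # gradient *ascent*: walk uphill
    x = x + 0.05 * grad(x)
```

In the real system the gradients are taken through the full DNNs (via the DL framework's autodiff), and domain-specific constraints (e.g., only change lighting) restrict the search directions.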

The team implemented DeepXplore using the TensorFlow 1.0.1 and Keras 2.0.3 DL frameworks. They used TensorFlow's implementation of gradient computations in the joint optimization process. TensorFlow also supports creating subDNNs by marking any arbitrary neuron's output as the subDNN's output while keeping the input the same as the original DNN's input. They used this feature to intercept and record the output of neurons in the intermediate layers of a DNN and compute the corresponding gradients with respect to the DNN's input. All the experiments were run on a Linux laptop with 16GB RAM. I guess since this is inference rather than training, a laptop sufficed for the experiments.

A criticism of the paper could be this. Yes, DeepXplore catches a bad classification on an image; that is good and useful. But the self-driving application probably already has built-in tolerance to occasional misclassifications. For example, temporal continuity can help: the previous and next images correctly classify the road, so an interim misclassification would not be very bad. Moreover, application-specific invariants can also act as a safety net, e.g., do not steer very sharply, and use a Kalman filter. It would be interesting to also run evaluations in an end-to-end application setting.

UPDATE (6/17/2018): I have received clarification from Yinzhi Cao, one of the authors, about these points. Here are his comments:

First, our light effect (or other changes) can be added constantly over the time domain, and thus DeepXplore should be able to fool the decision engine all the time.  That is, the previous images and next images will also lead to incorrect decisions.

Second, DeepXplore can ask the decision engine to gradually switch the steering so that a Kalman filter may not help.  For example, the switch from left to right, or vice versa, is not that sudden, so a Kalman filter cannot rule out the decision.

Wednesday, June 14, 2017

Scalability, but at what COST

This paper is by Frank McSherry, Michael Isard, and Derek G. Murray and appeared in HotOS 2015. The authors are all listed as unaffiliated because this is around the time when the Microsoft Research Silicon Valley lab, where they used to work, was closed. Michael and Derek are at Google working on the TensorFlow framework, but Frank McSherry is still at large and unaffiliated. Frank has a great blog, where you will learn more than you ever wanted to know about dataflow, Rust, differential privacy, and the art of influencing people and making friends.

COST, defined per system for a given problem, is the hardware configuration (number of cores) required before the system outperforms a competent single-threaded implementation. The authors show that many big data systems have surprisingly large COST, often hundreds of cores.

Let's repeat that: some single-threaded implementations were found to be more than an order of magnitude faster than published results (at SOSP/OSDI!) for systems using hundreds of cores.
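The COST metric itself is easy to operationalize. A sketch, with hypothetical timing numbers (not the paper's measurements):

```python
def cost(single_thread_secs, system_secs_by_cores):
    """COST: the smallest core count at which the distributed system
    beats the single-threaded implementation; None if it never does."""
    for cores in sorted(system_secs_by_cores):
        if system_secs_by_cores[cores] < single_thread_secs:
            return cores
    return None

# Made-up scaling curve: the single thread takes 300s; the "scalable"
# system needs 128 cores before it finally wins.
timings = {16: 1200, 32: 700, 64: 400, 128: 250, 256: 180}
system_cost = cost(300, timings)
```

Note that a system can scale beautifully (halving its time as cores double) and still have an enormous COST, which is exactly the paper's point.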

The paper's goal is to shed light on this issue so that "future research is directed toward distributed systems whose scalability comes from advances in system design rather than poor baselines and low expectations." (That has gotta be one of the snarkiest lines in a computer science paper. Well, discounting those from Dijkstra, that is.)

What do better baselines mean? They mean using better graph layout and better algorithms for performance. The paper gives the label propagation algorithm as an example. The paper argues that label propagation is used for graph connectivity not because it is a good algorithm, but because it fits within the "think like a vertex" computational model, whose implementations scale well. The paper claims the appealing scaling properties are largely due to the algorithm's sub-optimality: label propagation simply does more work than better algorithms.
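Here is a minimal sketch of label propagation for connected components, which shows why it scales well (each step is a local, per-edge update that fits "think like a vertex") yet can do much more total work than a single-threaded union-find; the graph below is a made-up example:

```python
def label_propagation_cc(edges, n):
    """Connected components via label propagation: vertices repeatedly
    adopt the minimum label seen among themselves and their neighbors.
    Converges when a full sweep makes no change; long paths can take
    many rounds, which is where the wasted work comes from."""
    labels = list(range(n))
    changed, rounds = True, 0
    while changed:
        changed = False
        rounds += 1
        for u, v in edges:
            m = min(labels[u], labels[v])
            if labels[u] != m or labels[v] != m:
                labels[u] = labels[v] = m
                changed = True
    return labels, rounds

# Path graph 0-1-2-3 plus an isolated vertex 4.
labels, rounds = label_propagation_cc([(0, 1), (1, 2), (2, 3)], 5)
```

A competent single-threaded union-find does near-linear total work on the same input, which is why it can beat a cluster running label propagation.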

Yes, and on the other hand, I can also see the appeal in the Giraph "think like a vertex" approach (or for that matter the MapReduce and Spark approaches). Giraph optimized for simplicity and ease of development. If you make it simple and easy to use, people will be happy to use it, adapt it, and throw cluster resources at it when needed. One may argue this is a good tradeoff. Instead of making people think harder and making programming harder for them, make it easy but wasteful of computing resources. After all, humans are much more expensive than computers, and scalability in terms of human cost is also a factor for practical/industrial systems. A similar argument has been made for blockchains, arguing that social scalability is more important than computational efficiency or even computational scalability.

Of course, this can be a false dichotomy: there are (and will be) systems/frameworks that provide both scalability in terms of human cost (by being easy to develop with) and computational efficiency. And we should strive to design such systems.

The analysis and evaluation were done in the context of graph algorithms: PageRank and connected components. For embarrassingly parallel algorithms, such as SGD, this analysis and these results would not apply.

Here are Hacker News discussions on this paper.

Tuesday, June 13, 2017

Paper Summary: Neurosurgeon, collaborative intelligence between the cloud and mobile edge

This paper is by Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang from University of Michigan, and appeared at ASPLOS 17.

In Deep Learning (DL), you have a long, computation-intensive training phase where you micro-fiddle/fudge the model parameters until you get the desired accuracy. Then you deploy these optimized model parameters (i.e., the Deep Neural Network [DNN]) for inference with real-world inputs. The paper is about this inference/serving layer of DL.

In the serving layer, the input goes through the DNN with the tuned model parameters, activating some subset of neurons at each layer and finally activating the correct neuron[s] at the output layer. This can still be a computation-intensive process as the model has millions of parameters, and you apply matrix multiplication layer after layer. So this serving layer still has many juicy problems to work on.

A very relevant problem is that executing inference on the mobile can be slow because of the computational and energy limitations of the mobile. Executing at the cloud backend server is fast, but how do you get the input there? Uploading the input to the cloud can be slow, especially if the input is a large image and the connection is slow. So there is a tradeoff.

In Section 4, the paper shows how beneficial it can be to perform a proper DL inference partitioning. For image processing/computer vision (CV), e.g., AlexNet, partitioning at a middle layer is optimal for both latency and energy. Since the input image is large (512Mb is used), uploading it to the cloud is both time- and energy-consuming. However, if you execute the convolutional layers followed by the pooling on the mobile, you reduce the size of the intermediate output data, and it becomes time- and energy-efficient to upload this to the cloud. The rest of the computation, carried out on the cloud, consists of processing the fully connected layers, which are computation-intensive. If we were to execute them on the mobile as well, we would be stuck waiting for the mobile CPU/GPU to finish, whereas uploading the intermediate output to the cloud and executing the remaining layers there finishes earlier.

The paper also finds that, for Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) applications, usually the best approach is to execute everything at the mobile.

Enter Neurosurgeon

Are we done here then? Why do we need a Neurosurgeon tool, if a static lookup can do the trick? At this point, the paper makes another argument. You can't just make this one-time static observation per application class (CV, ASR, NLP) and be done with it. The best partition point for a DNN architecture depends on the DNN's topology, which manifests itself in the computation and data size variations of each layer. Moreover, connectivity conditions change, so you need to monitor and adjust your decision based on the current network quality.

(The paper also argues that changing cloud backend conditions are a factor, but I am not convinced by the datacenter-can-get-busy/overloaded argument. The evaluation experiments for that part are done synthetically.)

The proposed system to address this problem, Neurosurgeon, consists of a deployment phase and a runtime system that manages the partitioned execution of an inference application. Figure 10 shows the design of Neurosurgeon.

As part of the deployment stage, Neurosurgeon runs once per mobile and server platform to produce performance prediction models. This is application- and NN-independent. It profiles different NN layer types on these mobile and server platforms and fits regression models with respect to varying configuration parameters.
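The flavor of these prediction models can be sketched with a simple least-squares fit; the profiling numbers below are hypothetical, and the real system fits per-layer-type models with richer configuration features:

```python
import numpy as np

# Hypothetical profiling data: fully-connected layer latency (ms) on the
# mobile as a function of multiply-accumulates (in millions). Neurosurgeon
# fits such per-layer-type regressions once per platform at deployment.
macs = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
latency_ms = np.array([2.1, 3.9, 8.2, 15.8, 32.5])

# Least-squares linear fit: latency ~= a * MACs + b.
a, b = np.polyfit(macs, latency_ms, 1)

def predict_latency(m):
    """Predict latency (ms) for a layer with m million MACs."""
    return a * m + b
```

At runtime, only cheap evaluations of these fitted models are needed, one per layer, which is what makes the dynamic partitioning decision fast.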

The runtime stage is where the Neurosurgeon uses the layer performance prediction models produced in the deployment stage to dynamically choose the best DNN partition models. Neurosurgeon analyzes the target DNN’s constituent layers, and uses the prediction models to estimate, for each layer, the latency on mobile and cloud, and power consumption on the mobile. As Algorithm 1 shows, this is a simple evaluation of the conditions to choose the partition point.
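The partition-point search in Algorithm 1 can be sketched as follows; all latency, data-size, and bandwidth numbers below are hypothetical, chosen to mimic the AlexNet-style case where cutting after the pooling layer wins:

```python
def best_partition(mobile_ms, cloud_ms, out_bytes, bandwidth):
    """mobile_ms[i], cloud_ms[i]: predicted latency of layer i on each side.
    out_bytes[p]: bytes to upload if we cut before layer p
    (out_bytes[0] is the raw input size; cut p=0 means cloud-only,
    p=n means mobile-only). bandwidth is in bytes/ms.
    Returns (best cut point, total latency in ms)."""
    n = len(mobile_ms)
    best_p, best_t = None, float("inf")
    for p in range(n + 1):
        upload = 0.0 if p == n else out_bytes[p] / bandwidth
        total = sum(mobile_ms[:p]) + upload + sum(cloud_ms[p:])
        if total < best_t:
            best_p, best_t = p, total
    return best_p, best_t

# Toy 5-layer net: conv1, conv2, pool, fc1, fc2. The convolutions shrink
# the data; the fully connected layers dominate compute.
mobile_ms = [20, 15, 5, 60, 60]
cloud_ms  = [2, 1.5, 0.5, 6, 6]
out_bytes = [500000, 200000, 100000, 20000, 4000]
cut, total_ms = best_partition(mobile_ms, cloud_ms, out_bytes, 1000)
```

With these made-up numbers the best cut is after the pooling layer (p = 3): the mobile runs the convolutions and pooling, ships the small intermediate output, and the cloud runs the fully connected layers.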

Figures 11 and 12 show results for latency and energy-efficiency improvements achieved using Neurosurgeon.

What about Maui?

Maui is a general smartphone-to-cloud offloading framework and appeared in MobiSys 2010. Maui is control-centric: its partition points are at the procedure/method invocation level, whereas Neurosurgeon is data-centric, partitioning at the NN layer level. While Maui requires runtime profiling of the application, Neurosurgeon makes decisions based on one-time, deployment-stage observations of the DNN topology, without requiring any runtime profiling.

Figure 13 shows the comparison results. The paper says Maui makes incorrect offloading choices for more complicated scenarios (e.g., VGG, FACE, DIG, and ASR). This is because Maui relies on past invocations of a certain DNN layer type to predict the latency and data size of future invocations of that layer type, leading to mispredictions. But why not make Maui's control points per-layer method invocations? If, instead of per layer type, we made the control points per layer number, things might improve for Maui.