Incomplete Dot Products for Dynamic Computation Scaling in Neural Network Inference

Bradley McDanel, Surat Teerapittayanon, H.T. Kung
Harvard University, Cambridge, MA, USA
Email: mcdanel@fas.harvard.edu, steerapi@seas.harvard.edu, kung@harvard.edu

Abstract—We propose the use of incomplete dot products (IDP) to dynamically adjust the number of input channels used in each layer of a convolutional neural network during feedforward inference. IDP adds monotonically non-increasing coefficients, referred to as a "profile", to the channels during training. The profile orders the contribution of each channel in non-increasing order. At inference time, the number of channels used can be dynamically adjusted to trade off accuracy for lowered power consumption and reduced latency by selecting only a beginning subset of channels. This approach allows a single network to dynamically scale over a computation range, as opposed to training and deploying multiple networks to support different levels of computation scaling. Additionally, we extend the notion to multiple profiles, each optimized for a specific range of computation scaling. We present experiments on the computation and accuracy trade-offs of IDP for popular image classification models and datasets. We demonstrate that, for MNIST and CIFAR-10, IDP reduces computation significantly, e.g., by 75%, without significantly compromising accuracy. We argue that IDP provides a convenient and effective means for devices to lower computation costs dynamically to reflect the current computation budget of the system. For example, VGG-16 with 50% IDP (using only the first 50% of channels) achieves 70% accuracy on the CIFAR-10 dataset, compared to the standard network, which achieves only 35% accuracy when using the same reduced channel set.

Fig. 1: Contrasting computation using complete dot product (CDP) in standard networks and incomplete dot product (IDP) proposed in this paper for a convolutional layer. Under a standard CNN, for each filter at a given layer, all N input channels are used in the dot product computation (CDP) to compute the corresponding output channel. Under, for example, 50% IDP, the filter uses only the first 50% of the input channels to compute the corresponding output channel, which is an approximation of the CDP. Furthermore, only 50% of the filters are used, since the output channels of the other filters will not be utilized in the next layer. This leads to a 75% reduction in computation for 50% IDP.

I. INTRODUCTION

Inference with deep Convolutional Neural Networks (CNNs) on end or edge devices has received increasing attention as more applications begin to use sensor data as input for models running directly on the device (see, e.g., [1]). However, each trained CNN model has a fixed accuracy, size, and latency profile determined by the number of layers and parameters in each layer. The static nature of these models can be problematic in dynamic contexts, such as when running on a mobile device, where the power and latency requirements of CNN inference may change based on the current battery life or computation latency allowance of the device.
One approach to address these dynamic contexts is to train multiple CNN models of varying sizes, such as by varying the number of parameters in each layer as in MobileNet [2], and then selecting the appropriate model based on the current system requirements. However, this kind of approach requires storing multiple models of different sizes to support the desired computation flexibility of the system. Ideally, we would instead train a single CNN that could dynamically scale across a computation range in high resolution, trading off accuracy for lower power consumption and latency as desired.

To provide a solution for dynamic scaling over a computation range, we propose to modify CNN training by adding a profile of monotonically non-increasing coefficients to the channels in a given layer. These coefficients provide an ordering of the channels so that channels with small coefficients can be dropped during inference to save computation. By training a CNN with such a profile, dot products performed during forward propagation may use only some of the beginning channels in each filter without compromising accuracy significantly. We call such a dot product an incomplete dot product (IDP). As we show in our evaluation in Section IV, using IDP on networks trained with a properly chosen profile achieves much higher accuracy than using IDP on standard CNNs (trained without channel coefficients). Using the IDP scheme, we are able to train a single CNN that can dynamically scale across a computation range.

Figure 1 contrasts forward propagation in a standard CNN layer using all input channels, which we refer to as complete dot product (CDP), to that of using only half of the input channels (50% IDP). IDP has two sources of computation reduction: 1) each output channel is computed with a filter using only 50% of the input channels (so the output is an approximation of the one computed with 100% of the channels), and 2) half of the filters are unused, since their corresponding output channels are never consumed in the next layer. Therefore, 50% IDP leads to a 75% reduction in computation cost. In general, the IDP computation cost is p^2 times the original computation cost, where p is the fraction of channels used in IDP: the layer evaluates only a p fraction of its filters, each over a p fraction of the input channels, so p = 0.5 yields a 75% reduction. As we show in our analysis in Section IV, IDP reduces the computation (number of active input channels used in forward propagation) of several conventional network structures while still maintaining similar accuracy when all channels are used.

In the next section, we describe how IDP is integrated into multiple layer types (fully connected, convolution, and depthwise separable convolution) in order to train networks where only a subset of channels may be used during forward propagation. We call such networks incomplete neural networks.

Fig. 2: The profiles evaluated in this paper. Each channel coefficient index on the x-axis (1 to 16) corresponds to an input channel (as shown in Figures 3 and 4). The y-axis shows the value of each coefficient for a profile. The all-one profile (all coefficients equal 1) corresponds to a standard network layer without channel coefficients. The other profiles provide different schemes for the coefficients. For instance, in half-exp, an exponential decay is used for the second half of the coefficients.

II. INCOMPLETE NEURAL NETWORKS

An incomplete neural network consists of one or more incomplete layers. An incomplete layer is similar to a standard neural network layer except that incomplete dot product (IDP) operations replace conventional complete dot product (CDP) operations during forward propagation. We begin this section by describing IDP during forward propagation and then explain how IDP is added to fully connected, convolutional, and depthwise separable convolutional layers.

A. Incomplete Dot Product

IDP adds a profile consisting of monotonically non-increasing coefficients γ to the components of the dot product computations of forward propagation. These coefficients order the importance of the components from most to least important. Mathematically, the incomplete dot product (IDP) of two vectors (a_1, a_2, ..., a_N)^T and (b_1, b_2, ..., b_N)^T is a truncated version of the expression

∑_{i=1}^{N} γ_i a_i b_i = γ_1 a_1 b_1 + γ_2 a_2 b_2 + ··· + γ_N a_N b_N,

which keeps only some number of the beginning terms, where γ_1, γ_2, ..., γ_N are the monotonically non-increasing profile coefficients.

To compute IDP with a target IDP percentage, the beginning components, starting with γ_1 a_1 b_1, are accumulated until the target percentage of components is reached. The contribution of the remaining components, which have smaller coefficients, is ignored, making the dot product incomplete. This is mathematically equivalent to setting the unused coefficients to 0.
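For concreteness, the following NumPy sketch (our own illustration; the paper does not provide code, and the function name is ours) computes an IDP of two vectors for a given profile and target IDP percentage.

```python
import numpy as np

def incomplete_dot_product(a, b, gamma, idp_percent):
    """IDP of two vectors: keep only the first idp_percent of the
    profile-weighted terms gamma_i * a_i * b_i; dropping the rest is
    equivalent to setting their coefficients to 0."""
    n = len(a)
    k = int(np.ceil(idp_percent / 100.0 * n))   # number of beginning terms kept
    return float(np.sum(gamma[:k] * a[:k] * b[:k]))

# Example: 50% IDP keeps only the first half of the terms.
a = np.random.randn(16)
b = np.random.randn(16)
gamma = 1.0 - np.arange(1, 17) / 16.0           # linear profile (see Section II-B)
print(incomplete_dot_product(a, b, gamma, 50))
```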
B. Choice of Profile

There are multiple ways to set the profile γ = {γ_1, γ_2, ..., γ_i, ..., γ_N}, corresponding to various policies which may favor dynamic computation scaling for certain ranges at the expense of other regions. In Section IV-B, we discuss the performance implications of the various profiles. The profiles evaluated in this paper, shown in Figure 2, are:

1) all-one: corresponds to a standard network layer without a profile,
   γ_i = 1 for i = 1, ..., N.

2) harmonic: a harmonic decay,
   γ_i = 1/i for i = 1, ..., N.

3) linear: a linear decay,
   γ_i = 1 − i/N for i = 1, ..., N.

4) half-exp: all-one for the first half of the terms and an exponential decay for the latter half,
   γ_i = 1 if i < N/2, and γ_i = exp(N/2 − i − 1) otherwise, for i = 1, ..., N.
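As an illustrative sketch (ours, not the authors' code), the four profiles can be generated directly from the formulas above:

```python
import numpy as np

def make_profile(name, n):
    """Return the coefficient vector (gamma_1, ..., gamma_n) for a named profile."""
    i = np.arange(1, n + 1, dtype=float)        # channel indices 1..N
    if name == "all-one":
        return np.ones(n)
    if name == "harmonic":
        return 1.0 / i
    if name == "linear":
        return 1.0 - i / n
    if name == "half-exp":
        # all-one for the first half, exponential decay for the latter half
        return np.where(i < n / 2, 1.0, np.exp(n / 2 - i - 1))
    raise ValueError("unknown profile: " + name)

print(make_profile("linear", 16))
```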
C. Incomplete Layers

In this section, we describe how IDP is integrated into standard neural network layers, which we refer to as incomplete layers.

1) Incomplete Fully Connected Layer: A standard linear layer does the following computation:

y_j = ∑_{i=1}^{N} w_{ji} x_i,

where j ∈ {1, ..., M}, M is the number of output components, N is the number of input components, w_{ji} is the layer weight corresponding to the j-th output component and i-th input component, x_i is the i-th input component, and y_j is the j-th output component. An incomplete linear layer does the following computation instead:

y_j = ∑_{i=1}^{N} γ_i w_{ji} x_i,

where γ_i is the i-th profile coefficient.

2) Incomplete Convolution Layer: A standard convolution layer does the following computation:

y_j = ∑_{i=1}^{N} f_{ji} ∗ x_i,

where j ∈ {1, ..., M}, M is the number of output channels, N is the number of input channels, f_{ji} is the i-th channel of the j-th filter, x_i is the i-th input channel, and y_j is the j-th output channel. When the input data is 2D, f_{ji} is a 2D kernel and f_{ji} ∗ x_i is a 2D convolution. For this paper, we use 2D input data in all experiments. An incomplete convolution layer does the following computation instead:

y_j = ∑_{i=1}^{N} γ_i (f_{ji} ∗ x_i),

where γ_i is the profile coefficient for the i-th channel of a filter, as illustrated in Figure 3.

Fig. 3: IDP computation for a filter f (green) consisting of N 3×3 kernels f_1, f_2, ..., f_N and a patch of the input x (green) consisting of slices from N input channels, x_1, x_2, ..., x_N, highlighted by the dashed black lines. On the right, the filter and input are shown in vector form. The coefficients γ_1, ..., γ_N correspond to each input channel. The vertical dashed line (red) signifies an IDP where only the first two channels are used to compute the dot product γ_1(f_1 · x_1) + γ_2(f_2 · x_2).

3) Incomplete Depthwise Separable Convolution Layer: A depthwise separable convolution [2] consists of a depthwise convolution followed by a pointwise convolution. To simplify the presentation in this work, IDP is only applied to the pointwise convolution. A standard depthwise separable convolution layer does the following computation:

y_j = ∑_{i=1}^{N} g_{ji} ∗ (f_i ∗ x_i),

where j ∈ {1, ..., M}, M is the number of output channels, N is the number of input channels, g_{ji} is the i-th channel of the j-th pointwise filter, f_i is the i-th channel of the depthwise filter, x_i is the i-th input channel, and y_j is the j-th output channel. An incomplete depthwise separable convolution layer does the following computation instead:

y_j = ∑_{i=1}^{N} γ_i (g_{ji} ∗ (f_i ∗ x_i)),

where γ_i is the profile coefficient for the i-th channel of the pointwise filter, as illustrated in Figure 4.

Fig. 4: Incomplete depthwise separable convolution consists of a depthwise convolution followed by a pointwise convolution. In this illustration, IDP is only applied to the pointwise convolution. Note that IDP can also be applied to the depthwise convolution, as illustrated in Figure 3.
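To make the layer definitions concrete, here is a minimal NumPy/SciPy sketch of our own (not the authors' implementation) of the incomplete fully connected and incomplete convolution computations; scipy's correlate2d stands in for the 2D convolution written as ∗ above, and the function names are illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def incomplete_fc(x, w, gamma, idp_percent):
    """y_j = sum_i gamma_i * w_ji * x_i, truncated to the first
    idp_percent of the N input components (w has shape (M, N))."""
    k = int(np.ceil(idp_percent / 100.0 * len(x)))
    return w[:, :k] @ (gamma[:k] * x[:k])

def incomplete_conv(x, f, gamma, idp_percent):
    """y_j = sum_i gamma_i * (f_ji * x_i) for input x of shape (N, H, W)
    and filters f of shape (M, N, kh, kw), truncated to the first
    idp_percent of the N input channels."""
    n = x.shape[0]
    k = int(np.ceil(idp_percent / 100.0 * n))
    # With p% IDP, only the first p% of the M filters would be evaluated
    # at inference, since the next layer ignores the remaining outputs.
    return np.stack([
        sum(gamma[i] * correlate2d(x[i], f[j, i], mode="valid") for i in range(k))
        for j in range(f.shape[0])
    ])
```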
D. Incomplete Blocks

An incomplete block consists of one or more incomplete layers, batch normalization, and an activation function. In this paper, Conv denotes an incomplete convolution block containing an incomplete convolution layer, batch normalization, and an activation function. Similarly, S-Conv denotes an incomplete depthwise separable convolution block. FC denotes an incomplete fully connected block. B-Conv denotes an incomplete binary convolution block, where the layer weight values are either -1 or 1, and B-FC denotes an incomplete binary fully connected block.

III. MULTIPLE-PROFILE INCOMPLETE NEURAL NETWORKS

Multiple profiles, each focusing on a specific IDP range, can be used in a single network to efficiently cover a larger computation range. Figure 5 provides an overview of the training process for a network with two profiles (the process generalizes to three or more profiles). In the example, the first profile operates on the 0-50% IDP range and the second profile operates on the 50-100% IDP range. Training is performed in multiple stages, first training profile 1 and then fixing the learned weights and training profile 2. Specifically, when training profile 1, only the channels corresponding to the 0-50% IDP range are learned. After training profile 1, profile 2 is trained in a similar manner, but it only learns weights in the 50-100% IDP range while still utilizing the previously learned weights of profile 1.

Fig. 5: Training an incomplete neural network with two profiles. First, profile 1 is trained using only the first 50% of channels in each filter for every IDP layer. Then profile 2 is trained using all channels in every IDP layer, but it only updates the latter 50% of channels in each filter. Each profile has an independent first and last layer. In this example, linear profiles (see Figure 2) are used.

At inference time, a profile is selected based on the current IDP percentage requirement. The IDP percentage is chosen by the application power management policy, which dictates computation scaling based on, e.g., the current battery budget of the device. For the two-profile example in Figure 5, when IDP is between 0% and 50%, profile 1 is used; otherwise profile 2 is used.

While the middle IDP layers are shared across all profiles, each profile maintains a separate first and last layer. This design choice is optional. Experimentally, we found that maintaining a separate first and last layer helps improve the accuracy of each profile by adapting the profile to the subset of channels present in its IDP range. Generally, these separate layers add a small amount of memory overhead to the base model. For instance, for the VGG-16 [3] model used in our evaluation in Section IV, the additional profile adds a convolutional layer with 64 3x3 filters (first layer) and a 10-neuron fully connected layer for the profile classifier (last layer). In this case, the second profile's layers translate into a 3% increase in total model size over a single-profile VGG-16 network.
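A hedged sketch (our own, framework-agnostic; the paper does not publish training code, and the function name is ours) of the gradient masking that the staged training implies, for a filter bank of shape (M, N, kh, kw):

```python
import numpy as np

def mask_grad_to_profile_range(grad_w, lo_frac, hi_frac):
    """Zero the gradient outside the input-channel range owned by the
    profile currently being trained, so channels already learned by
    earlier profiles (e.g., profile 1's 0-50% range) stay fixed."""
    n = grad_w.shape[1]                          # number of input channels
    lo, hi = int(lo_frac * n), int(hi_frac * n)
    masked = np.zeros_like(grad_w)
    masked[:, lo:hi] = grad_w[:, lo:hi]
    return masked

# Stage 1: train profile 1 at 50% IDP, passing gradients through
#          mask_grad_to_profile_range(grad_w, 0.0, 0.5) before each update.
# Stage 2: train profile 2 at 100% IDP with the mask (0.5, 1.0), while
#          each profile keeps its own independent first and last layer.
```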
IV. EVALUATION

To show the general versatility of the approach, we incorporate IDP into several well-studied models and evaluate them on two datasets (MNIST [4] and CIFAR-10 [5]).

A. Network Models

The network models used to evaluate IDP are shown in Figure 6. These networks cover several network types, including MLPs, CNNs, depthwise separable CNNs (MobileNet [2]), and Binarized Networks.

1) MLP: This single-hidden-layer MLP model highlights the effects of IDP on a simple network. We use the MNIST dataset when evaluating this network structure.

2) VGG-16: VGG-16 is a large CNN; therefore it is of interest to see how IDP performs when a large number of input channels are not used at each layer. We use the CIFAR-10 dataset when evaluating this network structure.

3) MobileNet: This network structure is interesting because of the 1x1 convolution layer over the depth dimension, which is computationally expensive when the depth is large, making it a natural target for IDP. We use the CIFAR-10 dataset when evaluating this network structure.

4) Binarized Networks: Binarized Networks [7] are useful in low-power and embedded device applications. We combine these networks with a device and cloud layer segmentation as shown in Figure 6. We use the MNIST dataset when evaluating this network structure.

Fig. 6: The networks used to evaluate IDP: (a) MLP, (b) VGG-16, (c) MobileNet, and (d) Binarized Network. The complete dot product (CDP) layers (blue) are standard layers. The IDP layers (red) are trained with a profile, described in Section II-B, in order to use a subset of dot product components during inference across a dynamic computation scaling range. A multiplier (e.g., 2x) denotes multiple repeated layers of the same type. For MobileNet, each S-Conv layer is a depthwise separable convolution layer as described in [2]. For the Binarized Network model, B-Conv and B-FC refer to a binary convolution layer and a binary fully connected layer, respectively. For this model, we segment the layers between the device and the cloud, with the lower layers on the device and the higher layers in the cloud, as in [6]. In the evaluation, when an IDP percentage is specified, such as 50% IDP, all IDP layers shown use this percentage of input channels during forward propagation.

B. Impact of Profile Selection

The choice of profile has a large impact on the IDP range of a trained model. Figure 7 shows the effect of different profiles for the MLP. The MLP is used for this comparison since its smaller model size magnifies the differences in behavior between networks trained with the various profiles. Using the all-one profile (shown in red) is equivalent to training a standard network without IDP. We see that this all-one model achieves the highest accuracy when the dot product is complete (100% IDP), but its performance falls off quickly as the IDP percentage is decreased. This shows that the standard model does not support a large dynamic computation range: by weighting the contribution of each channel equally, the quality of the network deteriorates more quickly as fewer of the channels are used.

The other profiles weight the early channels more and the latter channels less. This allows a network to deteriorate more gradually as the channels with smaller contributions are left unused by the IDP. As an example, the MLP model trained with the linear profile is able to maintain higher accuracy from 60-100% IDP. At 100% IDP, the linear profile model still achieves the same accuracy as the standard model. We therefore compare the standard network (all-one profile) to the linear profile for the remaining analysis in the paper.

Compared to the linear profile, the harmonic profile places a much higher weight on the beginning subset of channels. This translates to a larger dynamic computation range, from 30-100% IDP, for a model trained with the harmonic profile. However, this model performs about 1% worse than the standard model when all channels are used (100% IDP). Fortunately, as described in Section IV-D, we can generally use multiple profiles in a single network to mitigate this problem of lowered accuracy in the high IDP percentage regions.

Fig. 7: A comparison of the classification accuracy of the MLP structure in Figure 6 (a), trained using the four profiles shown in Figure 2, on the MNIST dataset. The x-axis shows IDP (%), which is the percentage of components used in the IDP layer during forward propagation. The all-one profile is equivalent to the standard network (without channel coefficients).
C. Single-Profile Incomplete Neural Networks

In this section, we evaluate the performance of the networks presented in Figure 6 trained with the linear profile compared to standard networks. Figure 8 compares the dynamic scaling behavior of incomplete dot products for standard networks and incomplete networks over four different network structures: MLP, VGG-16, MobileNet, and Binarized Network.

Fig. 8: A comparison of the performance of dynamic scaling with IDP for standard networks (trained without a profile) and incomplete networks (trained with the linear profile), for each network structure described in Section IV-A. Top-left is MLP. Top-right is VGG-16. Bottom-left is MobileNet and bottom-right is Binarized Network.

For each network structure, we observe that using the linear profile allows the network to achieve a larger dynamic computation range compared to the standard network. For the VGG-16 (CIFAR-10) model, at 50% IDP, the model with the linear profile achieves an accuracy of 70%, as opposed to 35% for the all-one (standard) model. Additionally, for each network, the linear IDP network is able to achieve the same accuracy as the standard network when all channels are used.

D. Multiple-Profile Incomplete Neural Networks

In this section, we explore the performance of multiple-profile incomplete neural networks, as described in Section III. Figure 9 shows how each profile in a multiple-profile incomplete network scales across the IDP percentage range for the MLP and VGG-16 networks. For the MLP, a three-profile scheme is used, where the first profile is trained for the 0-20% range, the second profile for the 20-40% range, and the third profile for the 40-100% range. For VGG-16, a two-profile scheme is used, where the first profile is trained for the 0-30% range and the second profile for the 30-100% range. The profiles are trained incrementally, starting with the first profile. Each profile only learns the weights in its specified range (e.g., 30-100%), but utilizes the weights learned by earlier profiles (e.g., 0-30%). In this way, the later profiles can use weights learned from the earlier profiles without affecting the performance of the earlier profiles. Training the network in this multi-stage fashion enables a single set of weights to support multiple profiles.

For the first profile of the VGG-16 model, we observe that the accuracy does not improve when an IDP of greater than 30% is used. Since 30% IDP is the maximum for this profile, it does not learn the channels in the 30-100% IDP range and therefore cannot use the higher ranges during inference. The second profile is able to achieve a higher final accuracy than the first profile but performs worse in the lower part of the IDP range. Together, the two profiles achieve a higher accuracy across the entire IDP range than the single-profile network shown in Figure 8.

By training the profiles in a multi-stage fashion, the first profile is restricted to learning a classifier using only the first 30% of channels.
This improves the accuracy of the model in the lower IDP regions compared to the single-profile case. For instance, profile 1 achieves a 70% classification accuracy at 30% IDP, compared to only 30% accuracy at 30% IDP in the single-profile case. While profile 2 does not update the channels learned by profile 1, it still utilizes them during training. Note that profile 2 can still achieve similar performance to the single-profile model in the 80-100% IDP region.

For the MLP model, we observe similar trends, but applied to a three-profile case. As more profiles are added, we see that the final profile (profile 3) is still able to achieve a high final accuracy, even though the beginning 40% of input channels in the IDP layer are shared between all three profiles.

Fig. 9: A comparison of the performance of dynamic IDP scaling under three profiles for MLP (left) and two profiles for VGG-16 (right). All profiles for a given model share the same network weights in the IDP layers. Both networks are trained using the linear coefficients.

V. RELATED WORK

In this section, we first compare IDP to methods that are similar in style but have the objective of preventing overfitting rather than providing a dynamic mechanism to scale over a computation range. Dropout [8] is a popular technique for dealing with overfitting by using only a subset of features in the model for any given training batch. At test time, dropout uses the full network, whereas IDP allows the network to dynamically adjust its computation by using only a subset of channels. DropConnect [9] is similar to Dropout, but instead of dropping out the output channels, it drops out the network weights. This approach is similar to IDP, as IDP is also a mechanism for removing channels in order to support dynamic computation scaling. However, IDP adds a profile to directly order the contribution of the channels and does not randomly drop them during training. DeCov [10] is a more recent technique for reducing overfitting which directly penalizes redundant features in a layer during training. At a high level, this work shares a similar goal with multiple-profile IDP, by aiming to create a set of non-redundant channels that generalizes well given the restricted computation range of a single profile.

There is a growing body of work on CNN models that have a smaller memory footprint [11] and are more power efficient. One of the driving innovations is depthwise separable convolution [12], which decouples a convolutional layer into a depthwise and a pointwise layer, as shown in Figure 4. This approach is a generalization of the inception module first introduced in GoogLeNet [13]-[15]. MobileNet [2] utilized depthwise separable convolutions with the objective of performing state-of-the-art CNN inference on mobile devices. ShuffleNet [16] extends the concept of depthwise separable convolution and divides the input into groups to further improve the efficiency of the model. Structured Sparsity Learning (SSL) [17] constrains the structure of the learned filters to reduce the overall complexity of the model. IDP is orthogonal to these approaches and can be used in conjunction with them, as we show by incorporating IDP into MobileNet.

VI. FUTURE WORK

A profile targeted at a specific application can use any number of channels in its IDP range. The current implementation approach, as described in Section III, aims to have the profiles share the same coefficients on overlapping channels. For instance, in Figure 5, the first half of the profile 2 coefficients are the profile 1 coefficients. To this end, we train the weights incrementally, starting with the innermost profile and extending to the outermost profile. In the future, we want to study a more general setting, where this coefficient-sharing constraint could be relaxed by jointly training both the network weights and the profile coefficients.
VII. CONCLUSION

This paper proposes incomplete dot product (IDP), a novel way to dynamically adjust inference costs based on the current computation budget of a device, for instance to conserve battery power or to reduce application latency. IDP enables this by introducing profiles of monotonically non-increasing channel coefficients to the layers of CNNs. This allows a single network to scale the amount of computation at inference time by adjusting the number of channels used (the IDP percentage). As illustrated in Figure 1, IDP has two sources of computation saving, and their effects are multiplicative; for example, 50% IDP reduces computation by 75%.

Additionally, we can improve the flexibility and effectiveness of IDP at inference time by introducing multiple profiles. Each profile is trained to target a specific IDP range. At inference time, the current profile is chosen based on the target IDP range, which is selected by the application or according to a power management policy. By using multiple profiles, we are able to train a network which can run over a wide computation scaling range while maintaining high accuracy (see Figure 9).

To the best of our knowledge, the dynamic adaptation approach of IDP as well as the notion of multiple profiles and network training for these profiles are novel. As CNNs are increasingly run on devices ranging from smartphones to IoT devices, we believe methods that provide dynamic scaling such as IDP will become more important. We hope that this paper can inspire further work in dynamic neural network reconfiguration, including new IDP designs, training methods, and associated methodologies.
VIII. ACKNOWLEDGEMENTS

This work is supported in part by gifts from the Intel Corporation and in part by the Naval Supply Systems Command award under the Naval Postgraduate School Agreement No. N00244-16-1-0018.

REFERENCES

[1] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, "Federated learning of deep networks using model averaging," CoRR, vol. abs/1602.05629, 2016. [Online]. Available: http://arxiv.org/abs/1602.05629
[2] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[4] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," 1998.
[5] A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," 2014.
[6] S. Teerapittayanon, B. McDanel, and H. Kung, "Distributed deep neural networks over the cloud, the edge and end devices," in 37th International Conference on Distributed Computing Systems (ICDCS 2017), 2017.
[7] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[8] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[9] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, "Regularization of neural networks using DropConnect," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
[10] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra, "Reducing overfitting in deep networks by decorrelating representations," arXiv preprint arXiv:1511.06068, 2015.
[11] B. McDanel, S. Teerapittayanon, and H. Kung, "Embedded binarized neural networks," arXiv preprint arXiv:1709.02260, 2017.
[12] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint arXiv:1610.02357, 2016.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[14] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[15] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in AAAI, 2017, pp. 4278–4284.
[16] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," arXiv preprint arXiv:1707.01083, 2017.
[17] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.