Incomplete Dot Products for Dynamic Computation Scaling in Neural Network Inference

Bradley McDanel, Surat Teerapittayanon, H.T. Kung
Harvard University, Cambridge, MA, USA
Email: mcdanel@fas.harvard.edu, steerapi@seas.harvard.edu, kung@harvard.edu

Abstract—We propose the use of incomplete dot products (IDP) to dynamically adjust the number of input channels used in each layer of a convolutional neural network during feedforward inference. IDP adds monotonically non-increasing coefficients, referred to as a "profile", to the channels during training. The profile orders the contribution of each channel in non-increasing order. At inference time, the number of channels used can be dynamically adjusted to trade off accuracy for lowered power consumption and reduced latency by selecting only a beginning subset of channels. This approach allows a single network to dynamically scale over a computation range, as opposed to training and deploying multiple networks to support different levels of computation scaling. Additionally, we extend the notion to multiple profiles, each optimized for a specific range of computation scaling. We present experiments on the computation and accuracy trade-offs of IDP for popular image classification models and datasets. We demonstrate that, for MNIST and CIFAR-10, IDP reduces computation significantly, e.g., by 75%, without significantly compromising accuracy. We argue that IDP provides a convenient and effective means for devices to lower computation costs dynamically to reflect the current computation budget of the system. For example, VGG-16 with 50% IDP (using only the first 50% of channels) achieves 70% accuracy on the CIFAR-10 dataset, compared to the standard network, which achieves only 35% accuracy when using the same reduced channel set.

Fig. 1: Contrasting computation using complete dot product (CDP) in standard networks and incomplete dot product (IDP) proposed in this paper for a convolutional layer. Under a standard CNN, for each filter at a given layer, all N input channels are used in the dot product computation (CDP) to compute the corresponding output channel. Under, for example, 50% IDP, the filter uses only the first 50% of the input channels to compute the corresponding output channel, which is an approximation of the CDP. Furthermore, only 50% of the filters are used, since the output channels of the other filters will not be utilized in the next layer. This leads to a 75% reduction in computation for 50% IDP.

I. INTRODUCTION

Inference with deep Convolutional Neural Networks (CNNs) on end or edge devices has received increasing attention as more applications begin to use sensor data as input for models running directly on the device (see, e.g., [1]). However, each trained CNN model has a fixed accuracy, size, and latency profile determined by the number of layers and parameters in each layer. The static nature of these models can be problematic in dynamic contexts, such as when running on a mobile device, where the power and latency requirements of CNN inference may change based on the current battery life or computation latency allowance of the device.
One approach to address these dynamic contexts is to train multiple CNN models of varying sizes, such as by varying the number of parameters in each layer as in MobileNet [2], and then selecting the appropriate model based on the current system requirements. However, this kind of approach requires storing multiple models of different sizes to support the desired computation flexibility of the system. Ideally, we would instead train a single CNN that could dynamically scale across a computation range in high resolution, trading off accuracy for lower power consumption and latency as desired.

To provide a solution for dynamic scaling over a computation range, we propose to modify CNN training by adding a profile of monotonically non-increasing coefficients to the channels in a given layer. These coefficients provide an ordering of the channels so that channels with small coefficients can be dropped during inference to save computation. By training a CNN with such a profile, dot products performed during forward propagation may use only some of the beginning channels in each filter without compromising accuracy significantly. We call such a dot product an incomplete dot product (IDP). As we show in our evaluation in Section IV, using IDP on networks trained with a properly chosen profile achieves much higher accuracy than using IDP on standard CNNs (trained without channel coefficients). Using the IDP scheme, we are able to train a single CNN that can dynamically scale across a computation range.

Figure 1 contrasts forward propagation in a standard CNN layer using all input channels, which we refer to as complete dot product (CDP), to that of using only half of the input channels (50% IDP). IDP has two sources of computation reduction: 1) each output channel is computed with a filter using only 50% of the input channels (so the output is an approximation of the one computed with 100% of the channels), and 2) half of the filters are unused, since their corresponding output channels are never consumed in the next layer. Therefore, 50% IDP leads to a 75% reduction in computation cost. In general, the IDP computation cost is p^2 times the original computation cost, where p is the fraction of channels used in IDP: the layer evaluates only a p fraction of its filters, each over a p fraction of the input channels, so p = 0.5 yields a 75% reduction. As we show in our analysis in Section IV, IDP reduces the computation (number of active input channels used in forward propagation) of several conventional network structures while still maintaining similar accuracy when all channels are used.

In the next section, we describe how IDP is integrated into multiple layer types (fully connected, convolution, and depthwise separable convolution) in order to train networks where only a subset of channels may be used during forward propagation. We call such networks incomplete neural networks.

Fig. 2: The profiles evaluated in this paper. Each channel coefficient index on the x-axis (1 to 16) corresponds to an input channel (as shown in Figures 3 and 4). The y-axis shows the value of each coefficient for a profile. The all-one profile (all coefficients equal 1) corresponds to a standard network layer without channel coefficients. The other profiles provide different schemes for the coefficients. For instance, in half-exp, an exponential decay is used for the second half of the coefficients.

II. INCOMPLETE NEURAL NETWORKS

An incomplete neural network consists of one or more incomplete layers. An incomplete layer is similar to a standard neural network layer except that incomplete dot product (IDP) operations replace conventional complete dot product (CDP) operations during forward propagation. We begin this section by describing IDP during forward propagation and then explain how IDP is added to fully connected, convolutional, and depthwise separable convolutional layers.

A. Incomplete Dot Product

IDP adds a profile consisting of monotonically non-increasing coefficients γ to the components of the dot product computations of forward propagation. These coefficients order the importance of the components from most to least important. Mathematically, the incomplete dot product (IDP) of two vectors (a_1, a_2, ..., a_N)^T and (b_1, b_2, ..., b_N)^T is a truncated version of the expression

∑_{i=1}^{N} γ_i a_i b_i = γ_1 a_1 b_1 + γ_2 a_2 b_2 + ··· + γ_N a_N b_N,

which keeps only some number of the beginning terms, where γ_1, γ_2, ..., γ_N are the monotonically non-increasing profile coefficients.

To compute IDP with a target IDP percentage, the beginning components, starting with γ_1 a_1 b_1, are accumulated until the target percentage of components is reached. The contribution of the remaining components, which have smaller coefficients, is ignored, making the dot product incomplete. This is mathematically equivalent to setting the unused coefficients to 0.
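For concreteness, the following NumPy sketch (our own illustration; the paper does not provide code, and the function name is ours) computes an IDP of two vectors for a given profile and target IDP percentage.

```python
import numpy as np

def incomplete_dot_product(a, b, gamma, idp_percent):
    """IDP of two vectors: keep only the first idp_percent of the
    profile-weighted terms gamma_i * a_i * b_i; dropping the rest is
    equivalent to setting their coefficients to 0."""
    n = len(a)
    k = int(np.ceil(idp_percent / 100.0 * n))   # number of beginning terms kept
    return float(np.sum(gamma[:k] * a[:k] * b[:k]))

# Example: 50% IDP keeps only the first half of the terms.
a = np.random.randn(16)
b = np.random.randn(16)
gamma = 1.0 - np.arange(1, 17) / 16.0           # linear profile (see Section II-B)
print(incomplete_dot_product(a, b, gamma, 50))
```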
B. Choice of Profile

There are multiple ways to set the profile γ = {γ_1, γ_2, ..., γ_i, ..., γ_N}, corresponding to various policies which may favor dynamic computation scaling for certain ranges at the expense of other regions. In Section IV-B, we discuss the performance implications of the various profiles. The profiles evaluated in this paper, shown in Figure 2, are:

1) all-one: corresponds to a standard network layer without a profile,
   γ_i = 1 for i = 1, ..., N.

2) harmonic: a harmonic decay,
   γ_i = 1/i for i = 1, ..., N.

3) linear: a linear decay,
   γ_i = 1 − i/N for i = 1, ..., N.

4) half-exp: all-one for the first half of the terms and an exponential decay for the latter half,
   γ_i = 1 if i < N/2, and γ_i = exp(N/2 − i − 1) otherwise, for i = 1, ..., N.
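As an illustrative sketch (ours, not the authors' code), the four profiles can be generated directly from the formulas above:

```python
import numpy as np

def make_profile(name, n):
    """Return the coefficient vector (gamma_1, ..., gamma_n) for a named profile."""
    i = np.arange(1, n + 1, dtype=float)        # channel indices 1..N
    if name == "all-one":
        return np.ones(n)
    if name == "harmonic":
        return 1.0 / i
    if name == "linear":
        return 1.0 - i / n
    if name == "half-exp":
        # all-one for the first half, exponential decay for the latter half
        return np.where(i < n / 2, 1.0, np.exp(n / 2 - i - 1))
    raise ValueError("unknown profile: " + name)

print(make_profile("linear", 16))
```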
C. Incomplete Layers

In this section, we describe how IDP is integrated into standard neural network layers, which we refer to as incomplete layers.

1) Incomplete Fully Connected Layer: A standard linear layer does the following computation:

y_j = ∑_{i=1}^{N} w_{ji} x_i,

where j ∈ {1, ..., M}, M is the number of output components, N is the number of input components, w_{ji} is the layer weight corresponding to the j-th output component and i-th input component, x_i is the i-th input component, and y_j is the j-th output component. An incomplete linear layer does the following computation instead:

y_j = ∑_{i=1}^{N} γ_i w_{ji} x_i,

where γ_i is the i-th profile coefficient.

2) Incomplete Convolution Layer: A standard convolution layer does the following computation:

y_j = ∑_{i=1}^{N} f_{ji} ∗ x_i,

where j ∈ {1, ..., M}, M is the number of output channels, N is the number of input channels, f_{ji} is the i-th channel of the j-th filter, x_i is the i-th input channel, and y_j is the j-th output channel. When the input data is 2D, f_{ji} is a 2D kernel and f_{ji} ∗ x_i is a 2D convolution. For this paper, we use 2D input data in all experiments. An incomplete convolution layer does the following computation instead:

y_j = ∑_{i=1}^{N} γ_i (f_{ji} ∗ x_i),

where γ_i is the profile coefficient for the i-th channel of a filter, as illustrated in Figure 3.

Fig. 3: IDP computation for a filter f (green) consisting of N 3×3 kernels f_1, f_2, ..., f_N and a patch of the input x (green) consisting of slices from N input channels, x_1, x_2, ..., x_N, highlighted by the dashed black lines. On the right, the filter and input are shown in vector form. The coefficients γ_1, ..., γ_N correspond to each input channel. The vertical dashed line (red) signifies an IDP where only the first two channels are used to compute the dot product γ_1(f_1 · x_1) + γ_2(f_2 · x_2).

3) Incomplete Depthwise Separable Convolution Layer: A depthwise separable convolution [2] consists of a depthwise convolution followed by a pointwise convolution. To simplify the presentation in this work, IDP is only applied to the pointwise convolution. A standard depthwise separable convolution layer does the following computation:

y_j = ∑_{i=1}^{N} g_{ji} ∗ (f_i ∗ x_i),

where j ∈ {1, ..., M}, M is the number of output channels, N is the number of input channels, g_{ji} is the i-th channel of the j-th pointwise filter, f_i is the i-th channel of the depthwise filter, x_i is the i-th input channel, and y_j is the j-th output channel. An incomplete depthwise separable convolution layer does the following computation instead:

y_j = ∑_{i=1}^{N} γ_i (g_{ji} ∗ (f_i ∗ x_i)),

where γ_i is the profile coefficient for the i-th channel of the pointwise filter, as illustrated in Figure 4.

Fig. 4: Incomplete depthwise separable convolution consists of a depthwise convolution followed by a pointwise convolution. In this illustration, IDP is only applied to the pointwise convolution. Note that IDP can also be applied to the depthwise convolution, as illustrated in Figure 3.
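To make the layer definitions concrete, here is a minimal NumPy/SciPy sketch of our own (not the authors' implementation) of the incomplete fully connected and incomplete convolution computations; scipy's correlate2d stands in for the 2D convolution written as ∗ above, and the function names are illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def incomplete_fc(x, w, gamma, idp_percent):
    """y_j = sum_i gamma_i * w_ji * x_i, truncated to the first
    idp_percent of the N input components (w has shape (M, N))."""
    k = int(np.ceil(idp_percent / 100.0 * len(x)))
    return w[:, :k] @ (gamma[:k] * x[:k])

def incomplete_conv(x, f, gamma, idp_percent):
    """y_j = sum_i gamma_i * (f_ji * x_i) for input x of shape (N, H, W)
    and filters f of shape (M, N, kh, kw), truncated to the first
    idp_percent of the N input channels."""
    n = x.shape[0]
    k = int(np.ceil(idp_percent / 100.0 * n))
    # With p% IDP, only the first p% of the M filters would be evaluated
    # at inference, since the next layer ignores the remaining outputs.
    return np.stack([
        sum(gamma[i] * correlate2d(x[i], f[j, i], mode="valid") for i in range(k))
        for j in range(f.shape[0])
    ])
```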
D. Incomplete Blocks

An incomplete block consists of one or more incomplete layers, batch normalization, and an activation function. In this paper, Conv denotes an incomplete convolution block containing an incomplete convolution layer, batch normalization, and an activation function. Similarly, S-Conv denotes an incomplete depthwise separable convolution block. FC denotes an incomplete fully connected block. B-Conv denotes an incomplete binary convolution block, where the layer weight values are either -1 or 1, and B-FC denotes an incomplete binary fully connected block.

III. MULTIPLE-PROFILE INCOMPLETE NEURAL NETWORKS

Multiple profiles, each focusing on a specific IDP range, can be used in a single network to efficiently cover a larger computation range. Figure 5 provides an overview of the training process for a network with two profiles (the process generalizes to three or more profiles). In the example, the first profile operates on the 0-50% IDP range and the second profile operates on the 50-100% IDP range. Training is performed in multiple stages, first training profile 1 and then fixing the learned weights and training profile 2. Specifically, when training profile 1, only the channels corresponding to the 0-50% IDP range are learned. After training profile 1, profile 2 is trained in a similar manner, but it only learns weights in the 50-100% IDP range while still utilizing the previously learned weights of profile 1.

Fig. 5: Training an incomplete neural network with two profiles. First, profile 1 is trained using only the first 50% of channels in each filter for every IDP layer. Then profile 2 is trained using all channels in every IDP layer, but it only updates the latter 50% of channels in each filter. Each profile has an independent first and last layer. In this example, linear profiles (see Figure 2) are used.

At inference time, a profile is selected based on the current IDP percentage requirement. The IDP percentage is chosen by the application power management policy, which dictates computation scaling based on, e.g., the current battery budget of the device. For the two-profile example in Figure 5, when IDP is between 0% and 50%, profile 1 is used; otherwise profile 2 is used.

While the middle IDP layers are shared across all profiles, each profile maintains a separate first and last layer. This design choice is optional. Experimentally, we found that maintaining a separate first and last layer helps improve the accuracy of each profile by adapting the profile to the subset of channels present in its IDP range. Generally, these separate layers add a small amount of memory overhead to the base model. For instance, for the VGG-16 [3] model used in our evaluation in Section IV, the additional profile adds a convolutional layer with 64 3x3 filters (first layer) and a 10-neuron fully connected layer for the profile classifier (last layer). In this case, the second profile's layers translate into a 3% increase in total model size over a single-profile VGG-16 network.
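A hedged sketch (our own, framework-agnostic; the paper does not publish training code, and the function name is ours) of the gradient masking that the staged training implies, for a filter bank of shape (M, N, kh, kw):

```python
import numpy as np

def mask_grad_to_profile_range(grad_w, lo_frac, hi_frac):
    """Zero the gradient outside the input-channel range owned by the
    profile currently being trained, so channels already learned by
    earlier profiles (e.g., profile 1's 0-50% range) stay fixed."""
    n = grad_w.shape[1]                          # number of input channels
    lo, hi = int(lo_frac * n), int(hi_frac * n)
    masked = np.zeros_like(grad_w)
    masked[:, lo:hi] = grad_w[:, lo:hi]
    return masked

# Stage 1: train profile 1 at 50% IDP, passing gradients through
#          mask_grad_to_profile_range(grad_w, 0.0, 0.5) before each update.
# Stage 2: train profile 2 at 100% IDP with the mask (0.5, 1.0), while
#          each profile keeps its own independent first and last layer.
```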
IV. EVALUATION

To show the general versatility of the approach, we incorporate IDP into several well-studied models and evaluate them on two datasets (MNIST [4] and CIFAR-10 [5]).

A. Network Models

The network models used to evaluate IDP are shown in Figure 6. These networks cover several network types, including MLPs, CNNs, depthwise separable CNNs (MobileNet [2]), and Binarized Networks.

1) MLP: This single-hidden-layer MLP model highlights the effects of IDP on a simple network. We use the MNIST dataset when evaluating this network structure.

2) VGG-16: VGG-16 is a large CNN; therefore it is of interest to see how IDP performs when a large number of input channels are not used at each layer. We use the CIFAR-10 dataset when evaluating this network structure.

3) MobileNet: This network structure is interesting because of the 1x1 convolution layer over the depth dimension, which is computationally expensive when the depth is large, making it a natural target for IDP. We use the CIFAR-10 dataset when evaluating this network structure.

4) Binarized Networks: Binarized Networks [7] are useful in low-power and embedded device applications. We combine these networks with a device and cloud layer segmentation as shown in Figure 6. We use the MNIST dataset when evaluating this network structure.

Fig. 6: The networks used to evaluate IDP: (a) MLP, (b) VGG-16, (c) MobileNet, and (d) Binarized Network. The complete dot product (CDP) layers (blue) are standard layers. The IDP layers (red) are trained with a profile, described in Section II-B, in order to use a subset of dot product components during inference across a dynamic computation scaling range. A multiplier (e.g., 2x) denotes multiple repeated layers of the same type. For MobileNet, each S-Conv layer is a depthwise separable convolution layer as described in [2]. For the Binarized Network model, B-Conv and B-FC refer to a binary convolution layer and a binary fully connected layer, respectively. For this model, we segment the layers between the device and the cloud, with the lower layers on the device and the higher layers in the cloud, as in [6]. In the evaluation, when an IDP percentage is specified, such as 50% IDP, all IDP layers shown use this percentage of input channels during forward propagation.

B. Impact of Profile Selection

The choice of profile has a large impact on the IDP range of a trained model. Figure 7 shows the effect of different profiles for the MLP. The MLP is used for this comparison since its smaller model size magnifies the differences in behavior between networks trained with the various profiles. Using the all-one profile (shown in red) is equivalent to training a standard network without IDP. We see that this all-one model achieves the highest accuracy when the dot product is complete (100% IDP), but its performance falls off quickly as the IDP percentage is decreased. This shows that the standard model does not support a large dynamic computation range: by weighting the contribution of each channel equally, the quality of the network deteriorates more quickly as fewer of the channels are used.

The other profiles weight the early channels more and the latter channels less. This allows a network to deteriorate more gradually as the channels with smaller contributions are left unused by the IDP. As an example, the MLP model trained with the linear profile is able to maintain higher accuracy from 60-100% IDP. At 100% IDP, the linear profile model still achieves the same accuracy as the standard model. We therefore compare the standard network (all-one profile) to the linear profile for the remaining analysis in the paper.

Compared to the linear profile, the harmonic profile places a much higher weight on the beginning subset of channels. This translates to a larger dynamic computation range, from 30-100% IDP, for a model trained with the harmonic profile. However, this model performs about 1% worse than the standard model when all channels are used (100% IDP). Fortunately, as described in Section IV-D, we can generally use multiple profiles in a single network to mitigate this problem of lowered accuracy in the high IDP percentage regions.

Fig. 7: A comparison of the classification accuracy of the MLP structure in Figure 6 (a), trained using the four profiles shown in Figure 2, on the MNIST dataset. The x-axis shows IDP (%), which is the percentage of components used in the IDP layer during forward propagation. The all-one profile is equivalent to the standard network (without channel coefficients).
C. Single-Profile Incomplete Neural Networks

In this section, we evaluate the performance of the networks presented in Figure 6 trained with the linear profile compared to standard networks. Figure 8 compares the dynamic scaling behavior of incomplete dot products for standard networks and incomplete networks over four different network structures: MLP, VGG-16, MobileNet, and Binarized Network.

Fig. 8: A comparison of the performance of dynamic scaling with IDP for standard networks (trained without a profile) and incomplete networks (trained with the linear profile), for each network structure described in Section IV-A. Top-left is MLP. Top-right is VGG-16. Bottom-left is MobileNet and bottom-right is Binarized Network.

For each network structure, we observe that using the linear profile allows the network to achieve a larger dynamic computation range compared to the standard network. For the VGG-16 (CIFAR-10) model, at 50% IDP, the model with the linear profile achieves an accuracy of 70%, as opposed to 35% for the all-one (standard) model. Additionally, for each network, the linear IDP network is able to achieve the same accuracy as the standard network when all channels are used.

D. Multiple-Profile Incomplete Neural Networks

In this section, we explore the performance of multiple-profile incomplete neural networks, as described in Section III. Figure 9 shows how each profile in a multiple-profile incomplete network scales across the IDP percentage range for the MLP and VGG-16 networks. For the MLP, a three-profile scheme is used, where the first profile is trained for the 0-20% range, the second profile for the 20-40% range, and the third profile for the 40-100% range. For VGG-16, a two-profile scheme is used, where the first profile is trained for the 0-30% range and the second profile for the 30-100% range. The profiles are trained incrementally, starting with the first profile. Each profile only learns the weights in its specified range (e.g., 30-100%), but utilizes the weights learned by earlier profiles (e.g., 0-30%). In this way, the later profiles can use weights learned from the earlier profiles without affecting the performance of the earlier profiles. Training the network in this multi-stage fashion enables a single set of weights to support multiple profiles.

For the first profile of the VGG-16 model, we observe that the accuracy does not improve when an IDP of greater than 30% is used. Since 30% IDP is the maximum for this profile, it does not learn the channels in the 30-100% IDP range and therefore cannot use the higher ranges during inference. The second profile is able to achieve a higher final accuracy than the first profile but performs worse in the lower part of the IDP range. Together, the two profiles achieve a higher accuracy across the entire IDP range than the single-profile network shown in Figure 8.

By training the profiles in a multi-stage fashion, the first profile is restricted to learning a classifier using only the first 30% of channels.
This improves the accuracy of the model in the lower IDP regions compared to the single-profile case. For instance, profile 1 achieves a 70% classification accuracy at 30% IDP, compared to only 30% accuracy at 30% IDP in the single-profile case. While profile 2 does not update the channels learned by profile 1, it still utilizes them during training. Note that profile 2 can still achieve similar performance to the single-profile model in the 80-100% IDP region.

For the MLP model, we observe similar trends, but applied to a three-profile case. As more profiles are added, we see that the final profile (profile 3) is still able to achieve a high final accuracy, even though the beginning 40% of input channels in the IDP layer are shared between all three profiles.

Fig. 9: A comparison of the performance of dynamic IDP scaling under three profiles for MLP (left) and two profiles for VGG-16 (right). All profiles for a given model share the same network weights in the IDP layers. Both networks are trained using the linear coefficients.

V. RELATED WORK

In this section, we first compare IDP to methods that are similar in style but have the objective of preventing overfitting rather than providing a dynamic mechanism to scale over a computation range. Dropout [8] is a popular technique for dealing with overfitting by using only a subset of features in the model for any given training batch. At test time, dropout uses the full network, whereas IDP allows the network to dynamically adjust its computation by using only a subset of channels. DropConnect [9] is similar to Dropout, but instead of dropping out the output channels, it drops out the network weights. This approach is similar to IDP, as IDP is also a mechanism for removing channels in order to support dynamic computation scaling. However, IDP adds a profile to directly order the contribution of the channels and does not randomly drop them during training. DeCov [10] is a more recent technique for reducing overfitting which directly penalizes redundant features in a layer during training. At a high level, this work shares a similar goal with multiple-profile IDP, by aiming to create a set of non-redundant channels that generalizes well given the restricted computation range of a single profile.

There is a growing body of work on CNN models that have a smaller memory footprint [11] and are more power efficient. One of the driving innovations is depthwise separable convolution [12], which decouples a convolutional layer into a depthwise and a pointwise layer, as shown in Figure 4. This approach is a generalization of the inception module first introduced in GoogLeNet [13]-[15]. MobileNet [2] utilized depthwise separable convolutions with the objective of performing state-of-the-art CNN inference on mobile devices. ShuffleNet [16] extends the concept of depthwise separable convolution and divides the input into groups to further improve the efficiency of the model. Structured Sparsity Learning (SSL) [17] constrains the structure of the learned filters to reduce the overall complexity of the model. IDP is orthogonal to these approaches and can be used in conjunction with them, as we show by incorporating IDP into MobileNet.

VI. FUTURE WORK

A profile targeted at a specific application can use any number of channels in its IDP range. The current implementation approach, as described in Section III, aims to have the profiles share the same coefficients on overlapping channels. For instance, in Figure 5, the first half of the profile 2 coefficients are the profile 1 coefficients. To this end, we train the weights incrementally, starting with the innermost profile and extending to the outermost profile. In the future, we want to study a more general setting, where this coefficient-sharing constraint could be relaxed by jointly training both the network weights and the profile coefficients.
VII. CONCLUSION

This paper proposes incomplete dot product (IDP), a novel way to dynamically adjust inference costs based on the current computation budget of a device, for instance to conserve battery power or to reduce application latency. IDP enables this by introducing profiles of monotonically non-increasing channel coefficients to the layers of CNNs. This allows a single network to scale the amount of computation at inference time by adjusting the number of channels used (the IDP percentage). As illustrated in Figure 1, IDP has two sources of computation saving, and their effects are multiplicative; for example, 50% IDP reduces computation by 75%.

Additionally, we can improve the flexibility and effectiveness of IDP at inference time by introducing multiple profiles. Each profile is trained to target a specific IDP range. At inference time, the current profile is chosen based on the target IDP range, which is selected by the application or according to a power management policy. By using multiple profiles, we are able to train a network which can run over a wide computation scaling range while maintaining high accuracy (see Figure 9).

To the best of our knowledge, the dynamic adaptation approach of IDP as well as the notion of multiple profiles and network training for these profiles are novel. As CNNs are increasingly run on devices ranging from smartphones to IoT devices, we believe methods that provide dynamic scaling such as IDP will become more important. We hope that this paper can inspire further work in dynamic neural network reconfiguration, including new IDP designs, training methods, and associated methodologies.
VIII. ACKNOWLEDGEMENTS

This work is supported in part by gifts from the Intel Corporation and in part by the Naval Supply Systems Command award under the Naval Postgraduate School Agreement No. N00244-16-1-0018.

REFERENCES

[1] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, "Federated learning of deep networks using model averaging," CoRR, vol. abs/1602.05629, 2016. [Online]. Available: http://arxiv.org/abs/1602.05629
[2] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[4] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," 1998.
[5] A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," 2014.
[6] S. Teerapittayanon, B. McDanel, and H. Kung, "Distributed deep neural networks over the cloud, the edge and end devices," in 37th International Conference on Distributed Computing Systems (ICDCS 2017), 2017.
[7] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[8] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[9] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, "Regularization of neural networks using DropConnect," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
[10] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra, "Reducing overfitting in deep networks by decorrelating representations," arXiv preprint arXiv:1511.06068, 2015.
[11] B. McDanel, S. Teerapittayanon, and H. Kung, "Embedded binarized neural networks," arXiv preprint arXiv:1709.02260, 2017.
[12] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint arXiv:1610.02357, 2016.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[14] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[15] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in AAAI, 2017, pp. 4278–4284.
[16] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," arXiv preprint arXiv:1707.01083, 2017.
[17] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.