An Adaptive Greedy Algorithm With Application to Nonlinear Communications

Greedy algorithms form an essential tool for compressed sensing. However, their inherent batch mode of operation discourages their use in time-varying environments, due to significant complexity and storage requirements. In this paper, two powerful greedy schemes from the literature are converted into an adaptive algorithm, which is then applied to the estimation of a class of nonlinear communication systems. Performance is assessed via computer simulations on a variety of linear and nonlinear channels; all confirm significant improvements over conventional methods.


I. INTRODUCTION
Many real-life systems admit sparse representations, that is, they are characterized by a small number of non-zero coefficients. Sparse systems arise in many signal processing [3] and communications applications [4]-[6]. For instance, in High Definition Television the significant echoes form a cluster, yet the interarrival times between different clusters can be very long [4]. In wireless multipath channels there is a relatively small number of clusters of significant paths [5]. Finally, underwater acoustic channels exhibit long time delays between the multipath terms, due to reflections off the sea surface or sea floor [6].
Two major algorithmic approaches to compressive sensing are ℓ1-minimization (basis pursuit) and greedy algorithms (matching pursuit). Basis pursuit methods solve a convex minimization problem, which replaces the ℓ0 quasi-norm by the ℓ1 norm. The convex minimization problem can be solved using linear programming methods, and is thus executed in polynomial time [7]. Greedy algorithms, on the other hand, iteratively compute the support set of the signal and construct an approximation to it, until a halting condition is met [1], [2], [8]-[11]. Both approaches have their own advantages and disadvantages. ℓ1-minimization methods provide theoretical performance guarantees, but they lack the speed of greedy techniques. Recently developed greedy algorithms, such as those of [1], [2], [10], deliver some of the same guarantees as ℓ1-minimization approaches, at lower computational cost and storage.
Many signal processing applications [4]-[6] require adaptive estimation with minimal complexity and small memory requirements. Existing approaches to sparse adaptive estimation use the ℓ1-minimization technique in order to improve the performance of conventional algorithms. Chen et al. [12] incorporated two different sparsity constraints (the ℓ1 and the log-sum penalty functions) into the quadratic cost of the standard Least Mean Squares (LMS) algorithm, to improve the filtering performance on sparse systems. In [13], Angelosante et al. developed a recursive subgradient-based approach for solving the batch Lasso estimator. An ℓ1-regularized RLS-type algorithm based on a low-complexity Expectation-Maximization procedure is derived in [14] by Babadi et al. Sparse adaptive ℓ1-regularized algorithms based on Kalman filtering and Expectation-Maximization are reported in [15] by Kalouptsidis et al. In contrast to the above work on adaptive sparse identification, this paper focuses on the greedy viewpoint. Greedy algorithms, in their ordinary mode of operation, have an inherent batch structure and hence are not suitable for time-varying environments. This paper establishes a conversion procedure that turns greedy algorithms into adaptive schemes for sparse system identification. In particular, a Sparse Adaptive Orthogonal Matching Pursuit (SpAdOMP) algorithm of linear complexity is developed, based on existing greedy algorithms [1], [2] that provide optimal performance guarantees. The steady-state Mean Square Error (MSE) of the SpAdOMP algorithm is also studied analytically. The developed algorithm is used to estimate ARMA and nonlinear ARMA channels. It is shown that channel inversion for these channels preserves sparsity and is equivalent to channel estimation. Computer simulations reveal that the proposed algorithm outperforms most existing adaptive algorithms for sparse channel estimation.
The paper is structured as follows. The problem formulation and literature review are addressed in Section II. Section III describes the proposed algorithm, the steady-state error analysis, and applications to nonlinear communication channels. Computer simulations are presented in Section IV. Conclusions and future work are discussed in Section V.

II. GREEDY METHODS AND THE COSAMP ALGORITHM
Consider the noisy representation of a vector y(n),

y(n) = Φ(n)c + η(n),   (1)

where c is the parameter vector and η(n) is the additive noise. The measurement matrix Φ(n) ∈ C^{n×N} is often referred to as a dictionary, and the parameter vector c is assumed to be sparse, i.e., ‖c‖0 ≪ N, where ‖·‖0 = |supp(·)| is the ℓ0 quasi-norm. We call the parameter vector s-sparse when it contains at most s non-zero entries.
Recovery of the unknown parameter vector c can be pursued by finding the sparsest estimate of c which satisfies an ℓ2-norm error tolerance δ:

(P0)   min ‖c‖0   subject to   ‖y(n) − Φ(n)c‖2 ≤ δ.

Convex relaxation methods cope with the intractability of the above formulation by approximating the ℓ0 quasi-norm by the convex ℓ1 norm. The set of resulting techniques is referred to as ℓ1-minimization. The ℓ2 constraint can be interpreted as a noise removal mechanism when δ ≥ ‖η(n)‖2. The ℓ1-minimization approach is a convex optimization problem and can be solved by linear programming methods [7], [16], projected gradient methods [17] and iterative thresholding [18]. The exact conditions for retrieving the sparse vector rely either on the coherence of the measurement matrix [19] or on the Restricted Isometry Property (RIP) [16]. A measurement matrix Φ(n) satisfies the RIP for δs(n) ∈ (0, 1) if

(1 − δs(n)) ‖c‖2^2 ≤ ‖Φ(n)c‖2^2 ≤ (1 + δs(n)) ‖c‖2^2   (2)

for all s-sparse c. When δs(n) is small, the RIP implies that the columns of Φ(n) approximately form an orthonormal system.
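Certifying the RIP exactly is combinatorial (it quantifies over all supports), but the near-isometry it expresses is easy to probe empirically. The sketch below (illustrative only; the dimensions are arbitrary choices, not taken from the paper) samples the ratio ‖Φc‖2^2/‖c‖2^2 over random s-sparse vectors for a Gaussian dictionary:

```python
import numpy as np

def sparse_ratio_range(Phi, s, trials=200, seed=1):
    """Monte Carlo probe of the RIP ratio ||Phi c||^2 / ||c||^2 over random
    s-sparse vectors. This samples the ratio rather than certifying the RIP
    constant, which would require checking every s-column submatrix."""
    rng = np.random.default_rng(seed)
    n, N = Phi.shape
    ratios = []
    for _ in range(trials):
        c = np.zeros(N)
        support = rng.choice(N, size=s, replace=False)
        c[support] = rng.standard_normal(s)
        ratios.append(np.linalg.norm(Phi @ c) ** 2 / np.linalg.norm(c) ** 2)
    return min(ratios), max(ratios)

# Gaussian dictionary scaled by 1/sqrt(n) so that E||Phi c||^2 = ||c||^2
rng = np.random.default_rng(0)
n, N, s = 400, 1000, 10
Phi = rng.standard_normal((n, N)) / np.sqrt(n)
lo, hi = sparse_ratio_range(Phi, s)
print(lo, hi)  # both close to 1 when n is large relative to s*log(N/s)
```

Observed ratios concentrating around 1 are consistent with a small RIP constant, though a Monte Carlo probe cannot prove one.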

A. The CoSaMP greedy algorithm
Greedy algorithms provide an alternative approach to ℓ1-minimization. For the recovery of a sparse signal in the presence of noise, greedy algorithms iteratively improve the current estimate of the parameter vector c by modifying one or more parameters, until a halting condition is met. The basic principle behind greedy algorithms is to iteratively find the support set of the sparse vector and reconstruct the signal using the restricted-support Least Squares (LS) estimate. The computational complexity of these algorithms depends on the number of iterations required to find the correct support set. One of the earliest algorithms proposed for sparse signal recovery is Orthogonal Matching Pursuit (OMP) [8], [9], [19]. At each iteration, OMP finds the column of Φ(n) most correlated with the residual, v(n) = y(n) − Φ(n)ĉ, using the proxy signal p(n) = Φ*T(n)v(n) (where A*T denotes the conjugate transpose of the matrix A ∈ C^{n×N}), and adds it to the support set Λ. Then, it solves the least squares problem restricted to the current support,

ĉ = arg min_{supp(z)⊆Λ} ‖y(n) − Φ(n)z‖2,

and finally updates the residual by removing the contribution of the selected column. By repeating this procedure a total of s times, the support set of c is recovered. Although OMP is quite fast, it is unknown whether it succeeds on noisy measurements.
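The OMP loop described above can be sketched in a few lines. This is an illustrative real-valued implementation (not the authors' code), with a noiseless sanity check whose dimensions are arbitrary:

```python
import numpy as np

def omp(Phi, y, s):
    """Orthogonal Matching Pursuit: repeatedly pick the column most
    correlated with the residual, then re-solve least squares on the
    enlarged support (real-valued illustrative sketch)."""
    N = Phi.shape[1]
    support = []
    residual = y.copy()
    for _ in range(s):
        proxy = Phi.T @ residual              # p = Phi^T v (real case)
        k = int(np.argmax(np.abs(proxy)))
        if k not in support:
            support.append(k)
        sol, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        c_hat = np.zeros(N)
        c_hat[support] = sol                  # restricted-support LS estimate
        residual = y - Phi @ c_hat            # remove selected columns' contribution
    return c_hat

# Noiseless sanity check: recover a 5-sparse vector from 100 measurements
rng = np.random.default_rng(0)
n, N, s = 100, 256, 5
Phi = rng.standard_normal((n, N)) / np.sqrt(n)
c = np.zeros(N)
c[rng.choice(N, s, replace=False)] = rng.uniform(1, 2, s) * rng.choice([-1, 1], s)
c_hat = omp(Phi, Phi @ c, s)
```

In the noiseless, well-conditioned regime above, OMP recovers c exactly; the text's caveat concerns its behavior on noisy measurements.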
An alternative algorithm, called Stagewise OMP (StOMP), was proposed in [11]. Unlike OMP, it selects all components of the proxy signal whose values are above a certain threshold. Due to this multiple-selection step, StOMP achieves better runtime than OMP. However, parameter tuning in StOMP can be difficult, and the rigorous results available for it are only asymptotic.
A more sophisticated algorithm has recently been developed by Needell and Vershynin, known as Regularized OMP (ROMP) [10]. ROMP chooses the s largest components of the proxy signal, followed by a regularization step, to ensure that not too many incorrect components are selected. For a measurement matrix Φ(n) with RIP constant δ2s ≤ 0.03/√(log s), ROMP provides uniform and stable recovery results. The recovery bounds obtained in [10] are optimal up to a logarithmic factor. Tighter recovery bounds, which avoid the logarithmic factor, are obtained by Needell and Tropp via the Compressive Sampling Matching Pursuit (CoSaMP) algorithm [1]. CoSaMP provides recovery bounds tighter than those of ROMP, optimal up to a constant factor (which is a function of the RIP constants). An algorithm similar to CoSaMP was presented by Dai and Milenkovic and is known as Subspace Pursuit (SP) [2].
As with most greedy algorithms, CoSaMP exploits the fact that the measurement matrix Φ(n) is approximately orthonormal (Φ*T(n)Φ(n) is close to the identity). Hence, the largest components of the signal proxy p(n) = Φ*T(n)Φ(n)c are most likely to correspond to the non-zero entries of c. The algorithm adds the largest components of the signal proxy to the running support set and performs least squares to obtain an estimate of the signal. Finally, it prunes the least squares estimate and updates the error residual. The main ingredients of the CoSaMP algorithm are outlined below:
1) Identification of the largest 2s components of the proxy signal;
2) Support merger, which forms the union of the set of newly identified components with the set of indices corresponding to the s largest components of the least squares estimate obtained in the previous iteration;
3) Estimation via least squares on the merged set of components;
4) Pruning, which restricts the LS estimate to its s largest components;
5) Sample update, which updates the error residual.
The above steps are repeated until a halting criterion is met. The main difference between CoSaMP and SP is in the identification step, where the SP algorithm chooses the s largest components.
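The five steps above can be sketched compactly. The following is an illustrative real-valued implementation (dimensions and iteration count are arbitrary choices, not the reference code):

```python
import numpy as np

def cosamp(Phi, y, s, iters=15):
    """CoSaMP sketch following the five steps listed above (real-valued case)."""
    N = Phi.shape[1]
    c_hat = np.zeros(N)
    residual = y.copy()
    for _ in range(iters):
        proxy = Phi.T @ residual                                  # signal proxy
        omega = np.argsort(np.abs(proxy))[-2 * s:]                # 1) identification
        T = np.union1d(omega, np.flatnonzero(c_hat)).astype(int)  # 2) support merger
        sol, *_ = np.linalg.lstsq(Phi[:, T], y, rcond=None)
        b = np.zeros(N)
        b[T] = sol                                                # 3) estimation
        keep = np.argsort(np.abs(b))[-s:]                         # 4) pruning
        c_hat = np.zeros(N)
        c_hat[keep] = b[keep]
        residual = y - Phi @ c_hat                                # 5) sample update
    return c_hat

# Noiseless sanity check
rng = np.random.default_rng(0)
n, N, s = 120, 300, 6
Phi = rng.standard_normal((n, N)) / np.sqrt(n)
c = np.zeros(N)
c[rng.choice(N, s, replace=False)] = rng.uniform(1, 2, s) * rng.choice([-1, 1], s)
c_hat = cosamp(Phi, Phi @ c, s)
```

Replacing the 2s-term identification with an s-term one would turn this sketch into the SP variant mentioned in the text.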
In a time-varying environment, the estimates must be updated adaptively to take system variations into consideration. In such cases, the use of existing greedy algorithms on a measurement block requires that the system remain constant within that block. Moreover, the cost of repeatedly applying a greedy algorithm after each new block arrives becomes enormous. Adaptive algorithms, on the other hand, allow online operation. Therefore, our primary goal is to convert existing greedy algorithms into an adaptive mode, while maintaining their superior performance gains. We demonstrate below that the conversion is feasible with linear complexity. We focus our analysis on CoSaMP/SP due to their superior performance, but similar ideas are applicable to other greedy algorithms as well.

III. SPARSE ADAPTIVE ORTHOGONAL MATCHING PURSUIT ALGORITHM

This section starts by converting the CoSaMP and SP algorithms [1], [2] into an adaptive scheme. The derived algorithm is then used to estimate sparse nonlinear ARMA channels.
The proposed algorithm relies on three modifications to the CoSaMP/SP structure: the proxy identification, the estimation, and the error residual update. The error residual is now evaluated by

ε(n) = y(n) − φ|Λ^T(n) w|Λ(n − 1),   (3)

where Λ = Λ(n − 1) is the support from the previous iteration. The above formula involves the current sample only, in contrast to the CoSaMP/SP scheme, which requires all the previous samples. Eq. (3) requires s complex multiplications, whereas the cost of the sample update in CoSaMP/SP is sn multiplications. A new proxy signal that is more suitable for the adaptive mode can be defined as

p(n) = Σ_{k=1}^{n} λ^{n−k} φ*(k) ε(k),   (4)

and is updated recursively by

p(n) = λ p(n − 1) + φ*(n) ε(n),   (5)

where the forgetting factor λ ∈ (0, 1] is incorporated in order to give less weight to the past and more weight to recent data. In this way the derived algorithm is capable of capturing variations in the support set of the parameter vector c. In the case of a time-invariant environment, λ should be set to 1. The addition of the forgetting factor mechanism requires redefining the Restricted Isometry Property as follows:

Definition 1. A measurement matrix Φ(n) satisfies the Exponentially-weighted Restricted Isometry Property (ERIP) for λ ∈ (0, 1] and δs(λ, n) ∈ (0, 1) if

(1 − δs(λ, n)) ‖c‖2^2 ≤ ‖D^{1/2}(n) Φ(n) c‖2^2 ≤ (1 + δs(λ, n)) ‖c‖2^2

for all s-sparse c, where D(n) := diag(λ^{n−1}, λ^{n−2}, …, λ, 1).

The last modification concerns the estimation step. The estimate w(n) is updated by standard adaptive algorithms such as LMS and RLS [20]. LMS is one of the most widely used algorithms in adaptive filtering, due to its simplicity, robustness and low complexity. The RLS algorithm, on the other hand, is an order of magnitude costlier but significantly improves on the convergence speed of LMS. The LMS algorithm replaces the exact signal statistics by instantaneous approximations, whereas RLS updates the inverse covariance matrix. The update rule for RLS cannot be directly restricted to the index support set Λ.
Hence, a more sophisticated mechanism is required, like the one proposed in [14]. For reasons of simplicity and low complexity, we focus on the LMS algorithm. At each iteration, the current regressor φ(n) and the previous estimate w(n − 1) are restricted to the instantaneous support originating from the support merging step.
The resulting algorithm, the Sparse Adaptive Orthogonal Matching Pursuit (SpAdOMP), is presented in Table I. Note that φ|Λ and w|Λ denote the sub-vectors corresponding to the index set Λ, max(|a|, s) returns the indices of the s largest elements of |a|, and Λc represents the complement of the set Λ. An important point about step 5 of Table I is that, although it is simple to implement, it is difficult to choose a step-size parameter µ that assures convergence. The Normalized LMS (NLMS) update addresses this issue by scaling with the input power:

w|Λ(n) = w|Λ(n − 1) + (µ / (ϵ + ‖φ|Λ(n)‖2^2)) φ*|Λ(n) ε(n),

where 0 < µ < 2 and ϵ is a small positive constant (to avoid division by small numbers, for stability purposes). NLMS may be viewed as an LMS with time-varying step size. This observation justifies the superior tracking ability of NLMS with respect to LMS in non-stationary environments.
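The per-sample recursion can be sketched as follows. Since the displayed equations of Table I are not reproduced in this text, the residual, proxy and NLMS formulas below are plausible reconstructions of the steps described above, not the authors' exact updates; the test dimensions and step size are likewise arbitrary choices:

```python
import numpy as np

def spadomp(Phi, y, s, mu=0.5, lam=1.0, eps=1e-6):
    """Schematic SpAdOMP-style recursion (real-valued case): per sample,
    compute a residual from the current sample only, update an exponentially
    weighted proxy, merge its 2s largest indices with the current support,
    take one NLMS step restricted to the merged support, and prune to s."""
    n, N = Phi.shape
    w = np.zeros(N)
    p = np.zeros(N)
    for i in range(n):
        phi = Phi[i]
        err = y[i] - phi @ w                      # residual from current sample only
        p = lam * p + phi * err                   # exponentially weighted proxy
        omega = np.argsort(np.abs(p))[-2 * s:]    # identification (2s largest)
        Lam = np.union1d(omega, np.flatnonzero(w)).astype(int)  # support merger
        g = phi[Lam]
        w_new = np.zeros(N)
        w_new[Lam] = w[Lam] + mu * err * g / (eps + g @ g)  # NLMS on merged support
        keep = np.argsort(np.abs(w_new))[-s:]     # pruning to s components
        w = np.zeros(N)
        w[keep] = w_new[keep]
    return w

# Identify a 5-sparse system from streaming noisy measurements
rng = np.random.default_rng(0)
N, s, T = 100, 5, 3000
c = np.zeros(N)
c[rng.choice(N, s, replace=False)] = np.array([1.5, -2.0, 1.0, -1.2, 1.8])
Phi = rng.standard_normal((T, N))
y = Phi @ c + 0.01 * rng.standard_normal(T)
w_hat = spadomp(Phi, y, s)
```

Each iteration touches at most a few multiples of s coefficients, which is the linear-complexity property claimed for SpAdOMP.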

A. Compressed Sensing Matrices satisfying the ERIP
We find it useful to provide an example of measurement matrices satisfying the ERIP, before proceeding with the steady-state analysis of SpAdOMP. Consider an n × N matrix Φ(n) whose rows are i.i.d. samples from a random Gaussian vector process distributed according to N(0, R). Let Λ := supp(c), and consider the matrix Ψ_Λ(n) := Φ_Λ*T(n) D(n) Φ_Λ(n), where Φ_Λ(n) is the sub-matrix of Φ(n) corresponding to the index set Λ. The matrix Ψ_Λ(n) appears in the definition of the ERIP, and its eigen-distribution is of interest. For simplicity, we assume that R = σ_φ^2 I, so that the elements of each row of Φ(n) are i.i.d. and distributed according to N(0, σ_φ^2). Hence, the set {φ|i(k)}_{i∈Λ} consists of i.i.d. zero-mean Gaussian random variables with variance σ_φ^2. The exponentially weighted random matrix Ψ_Λ(n) formed by the set {φ|i(k)}_{i∈Λ} can be identified as the empirical estimate of the covariance matrix through an exponentially weighted moving average. Such random matrices often arise in portfolio optimization applications (see, for example, [21]). In [21], using the resolvent technique (see, for example, [22]), the eigen-distribution of such matrices is studied and compared to that of Wishart ensembles. The main result of [21] implies that in the limit of N → ∞ and λ → 1, with β := s/N < 1 and Q := 1/(s(1 − λ)) fixed, and n → ∞, the eigenvalues of the matrix (1 − λ)Ψ_Λ(n) are distributed according to a density whose support is determined by the solution of a non-algebraic equation given in [21]. For example, by solving this equation numerically for Q = 100 and σ_φ = 1, the range of the eigen-distribution of (1 − λ)Ψ_Λ(n) is found to be [0.8652, 1.1482]. By appropriately scaling the elements of Φ(n), e.g. σ_φ^2 = 1/N, one can obtain an upper bound of δs(λ, n) ≤ 0.1482 on the ERIP constant, as n → ∞ and λ → 1 while β and Q are fixed and β < 1/Q. As shown in [21], for finite but large values of N and n, and λ close enough to 1, the empirical eigen-distribution is very similar to the asymptotic limit. Therefore, by the standard continuity characteristics of the eigen-distribution of random matrices, one expects to have δs(λ, n) ≤ 0.1482 for finite but large values of n and N, and λ sufficiently close to 1, with overwhelming probability. Note that the above concentration result can be extended to the case of correlated input sequences, which is studied in [22].
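The concentration of the eigenvalues of (1 − λ)Ψ_Λ(n) can be checked numerically. The snippet below is an illustrative simulation (with s, λ and n chosen so that Q = 100, matching the example above) that builds Ψ_Λ(n) recursively and inspects its empirical eigenvalue range:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 10        # dimension of Psi (size of the support)
lam = 0.999   # forgetting factor, so Q = 1/(s*(1 - lam)) = 100
n = 20000     # long enough that lam**n is negligible

# Psi(n) = sum_k lam**(n-k) * phi(k) phi(k)^T, built recursively
Psi = np.zeros((s, s))
for _ in range(n):
    phi = rng.standard_normal(s)  # i.i.d. N(0, 1) entries (sigma_phi = 1)
    Psi = lam * Psi + np.outer(phi, phi)

eigs = np.linalg.eigvalsh((1 - lam) * Psi)
# The empirical spread is comparable to the asymptotic range [0.8652, 1.1482]
# quoted above for Q = 100
print(eigs.min(), eigs.max())
```

A single realization with s = 10 fluctuates around the asymptotic support, consistent with the finite-size remarks of [21].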
In parallel to the above limit process for the matrix Ψ_Λ(n), one can consider the alternative limit process of λ = 1, with s/n and N/n fixed, and n → ∞. This limit process gives rise to the well-known Wishart ensemble, whose eigen-distribution is known [23]. In fact, as argued in [21], in the first limit process the parameter 1/log(1/λ) can be intuitively interpreted as the effective row dimension of Φ_Λ(n) as n → ∞. Simulation results in [21] show that the eigen-distribution of the exponentially weighted random matrix Ψ_Λ(n) is indeed very similar to that of the corresponding Wishart ensemble, when 1/log(1/λ) is taken as the effective row dimension.
The above example reveals a close connection between the RIP and ERIP conditions (by interpreting 1/log(1/λ) as the effective row dimension). The RIP constant of Gaussian measurement matrices has been extensively studied by Blanchard et al. [24]. The above parallelism suggests that one might be able to extend such results on the RIP of random measurement matrices to matrices satisfying the ERIP. However, the study of the eigen-distribution of exponentially weighted matrices appears more difficult than that of their non-weighted counterparts.

B. Steady-State MSE of SpAdOMP
The following theorem establishes the steady-state MSE performance of the SpAdOMP algorithm.

Theorem 1. Under the assumptions stated in the Appendix, the SpAdOMP algorithm, for large n, produces an s-sparse approximation ĉ(n) to the parameter vector c that satisfies the following steady-state bound:

‖c − ĉ(n)‖2 ⪅ C1(n) ‖D^{1/2}(n) η(n)‖2 + C2(n) ‖φ(n)‖2 |e_o(n)|,   (8)

where e_o(n) is the estimation error of the optimum Wiener filter, and C1(n) and C2(n) are constants independent of c (explicitly given in the Appendix) that are functions of λM > 0 (the minimum eigenvalue of R), the ERIP constants δs(λ, n), δ2s(λ, n), …, δ4s(λ, n), and the step size µ. The approximation in the above inequality is in the sense of the direct-averaging technique [20] employed in simplifying the LMS iteration.
The proof is supplied in the Appendix. The above bound can be further simplified if one considers the normalization ‖φ(n)‖2^2 = 1 for all n. Such a normalization is implicitly assumed in the above example on the i.i.d. Gaussian measurement matrix as n, N → ∞ with σ_φ^2 = 1/N. In this case, ‖φ(n)‖2 ≤ 1, and thus the second term of the error bound simplifies to C2(n)|e_o(n)|. Note that for large values of n, the isometry constants can be controlled. As shown in the example above, for a suitably random input sequence (e.g., i.i.d. Gaussian input) and for n large enough, the restricted isometry constants can be made sufficiently small. For example, if for n large enough δ4s(λ, n) ≤ 0.01 and µλM = 0.75, then C1(n) ≈ 38.6 and C2(n) ≈ 7.7. The corresponding coefficient for the CoSaMP algorithm would be C1(n) ≈ 6.1. Hence, the parameters C1(n) and C2(n) can be well controlled by feeding a sufficient number of measurements to the SpAdOMP algorithm.
The first term on the right-hand side of Eq. (8) is analogous to the steady-state error of the CoSaMP/SP algorithm, corresponding to a batch of data of size n. The second term is the steady-state error induced by performing a single LMS iteration instead of using the LS estimate. This error term does not exist in the error expression of the CoSaMP/SP algorithm. However, this excess MSE is the price paid for the significant complexity reduction obtained by removing the LS estimation stage. Note that the promising support tracking behavior of the CoSaMP/SP algorithm is inherited by the LMS iteration, where only the sub-vectors of φ(n) and w(n) corresponding to Λ(n) participate in the LMS iteration, and hence in the error term. In other words, SpAdOMP enjoys the low-complexity virtue of LMS, as well as the support detection superiority of CoSaMP/SP. Indeed, this observation is evident in the simulation results, where the MSE curve of SpAdOMP is shifted from that of LMS towards that of the genie-aided LS estimate (see Section IV).

C. Sparse NARMA identification
The nonlinear model that we consider constitutes a generalization of the class of linear ARMA models [25] and is known as Nonlinear AutoRegressive Moving Average (NARMA) [26]. The output of a NARMA model depends on past and present values of the input as well as on past values of the output:

y_i = f(y_{i−1}, …, y_{i−M_y}, x_i, x_{i−1}, …, x_{i−M_x}) + η_i,   (9)

where y_i, x_i and η_i are the system output, input and noise, respectively; M_y, M_x denote the output and input memory orders; η_i is Gaussian and independent of x_i; and f(·) is a sparse polynomial function in several variables with degree of nonlinearity p. Known linearization criteria [25] provide sufficient conditions for the Bounded Input Bounded Output stability of (9). Using Kronecker products and the pth-order Kronecker powers of the output and input regressors, Eq. (9) can be written as a linear regression model

y_i = φ^T(i) c + η_i,   (10)

where the regressor φ(i) collects all monomials of the past outputs and present/past inputs up to degree p, and the sparse vector c collects the corresponding coefficients of f(·). It must be noted that in NARMA models the input sequence is non-linearly related to the measurement matrix Φ(n), through the multi-fold Kronecker product procedure. Thus, the effective measurement matrix generated by an i.i.d. input sequence will not necessarily maintain the i.i.d. structure. Nevertheless, in the case of linear models, by invoking the frequently adopted independence assumption [20], the i.i.d. property of the input sequence is carried over to the corresponding measurement matrix, and thus one might be able to guarantee analytically provable, controlled ERIP constants for the measurement matrix (as in the example of Section III-A). Although we have not mathematically established any results regarding the isometry of such structured matrices, simulation results reveal that input sequences which give rise to measurement matrices satisfying the ERIP in linear models (e.g., i.i.d. Gaussian) also perform well in conjunction with nonlinear models (see Section IV). Nevertheless, the problem of designing input sequences with mathematical guarantees on the ERIP of the corresponding measurement matrices in the nonlinear models is of interest and remains open.
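The construction of the polynomial regressor can be illustrated directly. The sketch below enumerates each distinct monomial once (the raw Kronecker power repeats symmetric terms); the toy memory orders and values are arbitrary:

```python
import numpy as np
from itertools import combinations_with_replacement

def narma_regressor(y_past, x_past, p):
    """Build a degree-p polynomial regressor from past outputs and
    present/past inputs: all monomials of degree 1..p, so that the NARMA
    map becomes linear in its coefficient vector c."""
    base = np.concatenate([y_past, x_past])
    feats = []
    for d in range(1, p + 1):
        for idx in combinations_with_replacement(range(len(base)), d):
            feats.append(np.prod(base[list(idx)]))
    return np.array(feats)

# Toy example: memory M_y = 2, M_x = 2, degree p = 2
y_past = np.array([0.5, -0.2])       # y_{i-1}, y_{i-2}
x_past = np.array([1.0, 0.3, -0.7])  # x_i, x_{i-1}, x_{i-2}
phi = narma_regressor(y_past, x_past, p=2)
# phi has 5 + 15 = 20 entries; a sparse c selects the active monomials,
# so y_i = phi^T c + eta_i is linear in c
```

With M_y = 50 and M_x = 50 as in the experiments, the dictionary dimension N grows rapidly with p, which is precisely why sparsity-aware estimation pays off.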

D. Equalization/Predistortion in nonlinear communication channels
Nonlinearities in communication channels are caused by Power Amplifiers (PA) operating near saturation [27] and are addressed by channel inversion. Right inverses are called predistorters and are placed at the transmitter side; left inverses are termed equalizers and are part of the receiver. Predistorters are the preferred solution in single-transceiver, multiple-receiver systems, such as a base station serving multiple GSM receivers.
Channel inversion is conveniently effected when Eq. (9) is restricted to the form

y_i = b_0 x_i + g(y_{i−1}, …, y_{i−M_y}, x_{i−1}, …, x_{i−M_x}),   (11)

in which the present input sample enters linearly. If x_i entered polynomially, inversion would require finding the roots of a polynomial, which does not always yield a unique solution and is computationally expensive. The inverse of Eq. (11) is given by

x_i = (1/b_0) [ y_i − g(y_{i−1}, …, y_{i−M_y}, x_{i−1}, …, x_{i−M_x}) ].   (12)

Note that, modulo the scaling by b_0, the system and its inverse are generated by the same function. Hence, estimation of the direct process is equivalent to estimation of the reverse process.
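The equivalence can be seen in a toy simulation. The channel below has the restricted form of Eq. (11) with a short memory; the particular nonlinearity g and the coefficient b0 are arbitrary illustrative choices, not taken from the paper:

```python
import numpy as np

# Toy channel y_i = b0*x_i + g(y_{i-1}, x_{i-1}): the present input enters
# linearly, so the channel can be inverted sample by sample
b0 = 2.0

def g(y_prev, x_prev):
    # arbitrary illustrative nonlinearity
    return 0.5 * y_prev - 0.3 * x_prev + 0.1 * x_prev ** 2

def channel(x):
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = b0 * x[i] + g(y[i - 1] if i else 0.0, x[i - 1] if i else 0.0)
    return y

def inverse(y):
    # x_i = (y_i - g(y_{i-1}, x_{i-1})) / b0: same g, scaled by 1/b0
    x = np.zeros_like(y)
    for i in range(len(y)):
        x[i] = (y[i] - g(y[i - 1] if i else 0.0, x[i - 1] if i else 0.0)) / b0
    return x

x = np.random.default_rng(0).standard_normal(50)
x_rec = inverse(channel(x))  # recovers the input sample by sample
```

Since the forward map and its inverse share the same g, identifying the channel coefficients immediately yields the predistorter/equalizer, as stated in the text.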

IV. EXPERIMENTAL RESULTS
In this section we compare, through computer simulations, the performance of existing algorithms and of the algorithm proposed in this paper. Experiments were conducted on both linear and nonlinear channel setups. In all experiments the output sequence is disturbed by additive white Gaussian noise, for various SNR levels ranging from 5 to 26 dB. SNR is the ratio of the noiseless channel output power to the power of the noise corrupting the output signal (σ_y^2/σ_η^2). The Normalized Mean Square Error, defined as

NMSE(n) = ‖c − ĉ(n)‖2^2 / ‖c‖2^2,

is used to assess performance.

A. Sparse ARMA channel identification
In the first experiment, sparse ARMA channel estimation is considered. The channel memory is M_y = M_x = 50. The input sequence is drawn from a complex Gaussian distribution, CN(0, 1/5). To reduce the realization dependency, the parameter estimates were averaged over 30 Monte Carlo runs. Program (P0) is solved using CoSaMP [1], OMP [8], [9], log-LMS [12] and SpAdOMP. Moreover, two conventional methods were used, namely Least Squares (LS) and the LMS algorithm. The number of samples processed was 500. The sparsity tuning parameter required by log-LMS is summarized in Table II. The step sizes for the conventional LMS and for SpAdOMP were set to µ_LMS = 1 × 10^−2 and µ_SpAdOMP = 7 × 10^−2. Note that the choice of the step size µ is made near the instability point of each algorithm, to provide the maximum possible convergence speed.
Fig. 1(a) shows the excellent performance match between the genie-aided LS, CoSaMP and OMP, all of which have quadratic complexity. The LMS, log-LMS and SpAdOMP have an order of magnitude lower computational complexity, but only SpAdOMP achieves a performance close to the genie-aided LS (9 dB less). If we repeat this experiment for a fixed SNR level of 23 dB and process 2000 samples, then, as shown in Fig. 1(b), log-LMS improves by 20 dB; however, it still achieves 4 dB less gain than SpAdOMP.
To demonstrate the support tracking ability of SpAdOMP, we rerun this experiment and, after 300 iterations, set a_1 to zero. This time, since we have a support-varying environment, the forgetting factor is set to λ = 0.8 in SpAdOMP. Fig. 2 illustrates the time evolution of the estimates of a_1. We note from Fig. 2 that the conventional LMS does not take sparsity into account, and hence its estimates are nonzero, while log-LMS and SpAdOMP succeed in estimating the zero entries. Moreover, SpAdOMP has much faster support tracking behavior in the estimation of the zero entries compared to log-LMS.

B. Sparse NARMA channel identification
In the second experiment, a sparse NARMA channel is considered. The experiment is based on 30 Monte Carlo runs, and the input sequence is generated from a complex Gaussian distribution, CN(0, 1/4), consisting of 1000 samples. This time the methods used are CoSaMP, OMP and SpAdOMP, along with the standard LMS algorithm and least squares. The step-size parameters µ_LMS = 6 × 10^−3 and µ_SpAdOMP = 0.3 are used for the conventional LMS and SpAdOMP. OMP and SpAdOMP lag behind the genie-aided LS by 5 dB and 12 dB, respectively. It is worth pointing out that SpAdOMP obtains an average gain of nearly 19 dB over the conventional LMS. Note that this significant NMSE gain is the product of both the denoising mechanism and the compressed sampling virtue of the CoSaMP algorithm, which are lacking in conventional adaptive algorithms such as LMS.

C. Sparse nonlinear multipath channel identification
In this channel setup, a cubic baseband Hammerstein wireless channel with four Rayleigh fading rays (two on the linear and two on the cubic part) is employed; all rays fade at the same Doppler frequency of f_D = 80 Hz, with sampling period T_s = 0.8 µs. The channel memory length is M_1 = M_3 = 50 (for both the linear and cubic parts), and the positions of the fading rays are randomly chosen. In this experiment, 2000 samples from a complex Gaussian distribution CN(0, 1/4) were processed. Fig. 3(b) illustrates that SpAdOMP provides an average gain of 11 dB over the conventional LMS and of 5 dB over the log-LMS developed in [12].

V. CONCLUSIONS
In this paper, an adaptive algorithm for sparse approximation with linear complexity was developed, using the underlying principles of existing batch greedy algorithms. Analytical bounds on the steady-state MSE were obtained, which highlight the superior performance of the proposed algorithm. The algorithm was applied to sparse NARMA identification and, in particular, to NARMA channel equalization/predistortion. Simulation results validated the superior performance of the new algorithm. Future research will focus on blind algorithms for sparse system identification.

APPENDIX PROOF OF THEOREM 1
Note that, unlike CoSaMP, the iterations of SpAdOMP are not applied to a fixed batch of measurements. Hence, we need to revisit the error analysis of CoSaMP taking the time variations into account. Recall the LMS update for w|Λ(n)(n) (step 5 of Table I). Suppose that the estimate at time n is given by ĉ(n), and let w_o|Λ(n) denote the optimum Wiener solution restricted to the set Λ(n), defined through R := E{φ(n)φ*T(n)} and r := E{φ*(n)y(n)}. The error can then be written in terms of e_o(n), the estimation error of the optimum Wiener filter. Invoking the direct-averaging approximation [20], one can substitute the instantaneous covariance φ(n)φ*T(n) by its mean R, where λM denotes the minimum eigenvalue of R. Here we assume that the covariance matrix R is non-singular, i.e., λM > 0.
Note that the direct-averaging method yields a reasonable approximation, particularly when µ ≪ 1 [28]. A more direct and rigorous convergence analysis of the LMS algorithm is possible, but it is much more complicated [29]. Hence, for the sake of simplicity and clarity of the analysis, we proceed with the direct-averaging approach.
In order to obtain a closed set of difference equations for ε1(n) and ε2(n), we need to express the third and fourth terms of Eq. (17) in terms of ε1(n) and ε2(n) (and time-shifts thereof). First, we consider the third term. Note that w(n − 1) is supported on the index set Λ(n − 1); hence, the corresponding term can be split over ∆, the symmetric difference of Λ(n) and Λ(n − 1). The key here is the fact that the support estimates Λ(n − 1) and Λ(n) contain most of the energy of the true vector c, due to the restricted isometry of the measurement matrix and the construction of the proxy signal. The effective noise vector η′(n) consists of two parts: the first term is the exponentially weighted additive noise vector, and the second term is the excess error due to the adaptive update of the proxy signal (in contrast to the batch construction used in the CoSaMP algorithm). Note that the isometry constants δs(λ, n), …, δ4s(λ, n) are all functions of n, since the matrix Φ(n) depends on n. If the input sequence is generated by a stationary source, then for n large enough one can approximate the covariance matrix R by the exponentially weighted sample covariance Φ*T(n)D(n)Φ(n). Using this approximation, the ℓ2-norm of δ(n − 1) can be bounded. Moreover, using Lemmas 4.2, 4.3 and 4.4 of [1], one can express ‖c − b(n)‖2 in terms of ε1(n) and η′(n), where the last line of Eq. (31) is obtained from the second line by adding and subtracting w_o|Λ(n) from b(n) − w(n), and using the triangle inequality. The last term on the right-hand side of Eq. (31) denotes the difference between the optimum Wiener solution and the LS solution, both restricted to the index set Λ(n). As mentioned earlier, one can approximate the covariance matrix R by the exponentially weighted sample covariance Φ*T(n)D(n)Φ(n), and the correlation vector r by the corresponding exponentially weighted sample cross-correlation. In this case we have w_o|Λ(n) ≈ b(n), and hence the contribution of the last term on the right-hand side of Eq. (31) to the steady-state error becomes negligible. Also, by construction, the estimate w(n) is supported on the index set Λ(n). Hence, the second term of Eq. (31) can be identified as 4‖w_o|Λ(n) − w|Λ(n)(n)‖2 = 4ε2(n). With the abovementioned simplifications, one arrives at the set of non-linearly coupled difference equations (32) for ε1(n), ε2(n) and ε3(n). Although this set of difference equations is sufficient to obtain the error measures ε1(n), ε2(n) and ε3(n) for all n, the solution is non-trivial for general n, due to its high nonlinearity. However, for large n, it is possible to obtain the steady-state solution. It is straightforward to express ε3(n) in terms of ε1(n). Also, for large enough n, the arguments of the max{·, ·} operator do not vary significantly with n; hence, we can substitute the maximum with its second argument. The steady-state values of ε1(n) and ε2(n) can then be obtained accordingly. Note that the contribution of the proxy error term in η′(n) becomes negligible for large n, due to the effect of the forgetting factor and the fact that the estimates ĉ(n) do not vary much with n. Hence, we can approximate η′(n) by D^{1/2}(n)η(n) for large n. In particular, the asymptotic solution for ε1(n) is given by Eq. (37). Note that a sufficient condition for the above bound to hold is ∆(n) > 0.
The cross-term regressor φ_yx(i) contains all possible Kronecker product combinations of y_i and x_i of degree up to p. The components of c = [c_y^T c_x^T c_yx^T]^T correspond to the coefficients of the polynomial f(·). Hence, if we collect n successive observations, recovery of the sparsest parameter vector can be accomplished by solving the mathematical program (P0).

Fig. 1. NMSE of the channel estimates versus SNR on a linear channel

Fig. 2. Time evolution of the a_1 signal entry on a linear ARMA channel
Fig. 3.
